Education 257 HW4
April 9, 2005 (Due April 18, 2005)

Note: from lecture 3/30, it is good to review the HW3 #8 problem and solution.

===================================
Variable Selection and Model Building
=====================================

The Job proficiency data described on p. 356-7 (ver4), p. 377 (ver5) of NWK
reside in jobprof.dat. From the problem description:

  A personnel officer in a governmental agency administered four newly
  developed aptitude tests to each of 25 applicants for entry-level
  clerical positions in the agency. For purposes of the study, all 25
  applicants were accepted for positions irrespective of their test
  scores. After a probationary period, each applicant was rated for
  proficiency on the job.

The scores on the four tests X1-X4 are in columns c2-c5 and the job
proficiency score Y is in c1. Note: the data display in NWK has Y in the
rightmost column; that's not the way the data are stored.

======================================================================
Prob 1.
---------
For these data complete the following (parts of) problems:

  8.11 (ver4)   9.10 (ver5)   parts a-c
  8.12 (ver4)   9.11 (ver5)   a, b (use BREG)
  8.19 (ver4)   9.18 (ver5)   a, b

-----------------------------------------------------------
Prob 2
---------
Instead of considering the aptitude tests in Prob 1 as separate candidate
predictors, let's see if some composite of the 4 tests is useful. (This is
probably not a realistic opportunity, as it's unlikely all 4 tests would be
given routinely.) Construct 2 composite measures:

  1. sum (or mean) of the four tests (reasonable since they are on the
     same scale), and
  2. standardized sum (i.e. standardize and add 'em up; e.g., use the
     Minitab CENTER command).

Which of the composite measures is the best predictor? How does it compare
with some of the prediction eq's in Prob 1?
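For those working outside Minitab, the two composites in Prob 2 can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the scores are synthetic stand-ins generated on the spot (not the values in jobprof.dat), and the helper names `composites` and `r_squared` are hypothetical, not part of any course material.

```python
import numpy as np

def composites(X):
    """Return (raw sum, standardized sum) across the columns of X."""
    X = np.asarray(X, dtype=float)
    raw_sum = X.sum(axis=1)
    # Standardize each test to mean 0, SD 1, then add the z-scores up.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    return raw_sum, Z.sum(axis=1)

def r_squared(x, y):
    """R-squared from the simple regression of y on a single predictor x."""
    return float(np.corrcoef(x, y)[0, 1] ** 2)

# ASSUMPTION: synthetic data standing in for the 25 applicants and 4 tests.
rng = np.random.default_rng(257)
X = rng.normal(loc=100.0, scale=[10.0, 20.0, 8.0, 12.0], size=(25, 4))
y = 0.3 * X[:, 0] + 0.6 * X[:, 2] + 0.4 * X[:, 3] + rng.normal(0.0, 6.0, size=25)

raw_sum, std_sum = composites(X)
r2_raw = r_squared(raw_sum, y)
r2_std = r_squared(std_sum, y)
print(f"R-sq, raw sum:          {r2_raw:.3f}")
print(f"R-sq, standardized sum: {r2_std:.3f}")
```

Using the mean of the four tests instead of the sum gives the same R-squared, since the mean is just a rescaling of the sum. The two composites differ only when the tests have unequal standard deviations, because the raw sum then weights the more variable tests more heavily.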
Consider also forming a composite from the 1st principal component of the
aptitude tests and using that as a predictor.

-----------------------------------------------------------------
Prob 3
-------
Refer to the Course Example pca257 (data in pcamarks on the web-page).
Extend the analysis in the course example by doing the following. Use the
principal components (e.g. obtained using MINITAB pca) for the six graded
homework assignments as potential predictors of the final exam in 250B,
along with the final exam for 250A and the midterm in 250B. Carry out an
appropriate variable selection procedure to build a prediction equation.
What is the most attractive prediction eq? Why? How competitive is the
next most attractive equation?

==============
Prob 4

In the file 'gpa.dat' are two sets of data, each on 100 cases. The first
set is contained in the first three columns and the second set in the next
three columns. For each individual the three observations are VerbalSAT,
MathSAT, and GPA.

The problem asks you to construct a simple cross-validation procedure. For
the first 100 cases, predict GPA using the two SAT scores. This yields
estimated regression parameters and a squared multiple correlation. Now
let's turn to the second sample of 100 cases. Use the regression
coefficients from the first sample to form a predicted outcome for each of
the 100 individuals in the second sample. Compute an imitation R-squared
and compare it with that for an actual multiple regression for the second
sample. Which is larger? Why?

You could of course reverse this process by starting with the second sample
instead of the first.

---------------------------------------------------------
Advanced Topics (you can treat these as optional but interesting)

Prob 5.
Path analysis

Consider the published path analysis depicted in
http://www.stanford.edu/class/ed260/allisonWebex1.jpg
Write out the indicated multiple regression equations from this path
analysis diagram. From the 5x5 correlation matrix

  Correlation Matrix
  class      1.00
  famsize    -.33   1.00
  ability     .39   -.33   1.00
  esteem      .14   -.14    .19   1.00
  achieve     .43   -.28    .67    .22   1.00

obtain standardized path coefficients and propose a substantive
interpretation.

---------------
Problem 6. Multilevel data (NELS data from Kreft text)

Data summaries for the 10 school example are given below. Fit Math score on
Homework regressions. From these data summaries obtain the three regression
slopes discussed in contextual analysis: total, between-school, and pooled
within-school. Verify the Duncan-Cuzort-Duncan relationship.

Table 1  Ten selected schools from NELS-88: within-school means

  School   Size   Math mean   Homework mean
  1        23     45.8        1.39
  2        20     42.2        2.35
  3        24     53.2        1.83
  4        22     43.6        1.64
  5        22     49.7        0.86
  6        20     46.4        1.15
  7        67     62.8        3.30
  8        21     49.6        2.10
  9        21     46.3        1.33
  10       20     47.8        1.60

Table 1 gives the mean math score (number correct) and the mean amount of
homework (in hours per week) for each school.

Table 2  Ten selected schools from NELS-88: within-school dispersions
         and correlations

  School   Dispersion          Correlation
  A         55.2    -4.24      -0.52
            -4.24    1.19
  B         65.1    -4.65      -0.45
            -4.65    1.63
  C        126.3     9.62       0.77
             9.62    1.22
  D         94.1    11.9        0.84
            11.9     2.14
  E         69.2    -2.71      -0.43
            -2.71    0.57
  F         17.0    -1.56      -0.48
            -1.56    0.63
  G         31.2     3.24       0.34
             3.24    2.92
  H        101.1     7.94       0.71
             7.94    1.22
  I         86.6     4.61       0.56
             4.61    0.79
  J        120.9    12.3        0.80
            12.3     1.94

========================================
end
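The Duncan-Cuzort-Duncan relationship named in Problem 6 says that the total slope is a weighted average of the between-school and pooled within-school slopes, with weight eta-squared (the between-school share of the total sum of squares of the predictor): b_total = eta2 * b_between + (1 - eta2) * b_within. The sketch below checks that identity numerically. It is an illustration only: it uses synthetic individual-level data for three hypothetical schools rather than the NELS summaries, and the function name `dcd_slopes` is made up for this example. Assembling the same sums of squares and cross-products from Tables 1 and 2 should work the same way, assuming you account for whichever divisor (n or n-1) the tabled dispersions use.

```python
import numpy as np

def dcd_slopes(x, y, g):
    """Total, between-group, and pooled within-group slopes of y on x."""
    x, y, g = np.asarray(x, float), np.asarray(y, float), np.asarray(g)
    _, inv = np.unique(g, return_inverse=True)
    n_g = np.bincount(inv)
    # Group means, repeated so each individual carries their school's mean.
    xbar = (np.bincount(inv, weights=x) / n_g)[inv]
    ybar = (np.bincount(inv, weights=y) / n_g)[inv]
    # Total, between, and within sums of squares / cross-products.
    tx, ty = x - x.mean(), y - y.mean()
    bx, by = xbar - x.mean(), ybar - y.mean()
    wx, wy = x - xbar, y - ybar
    T_xx, T_xy = tx @ tx, tx @ ty
    B_xx, B_xy = bx @ bx, bx @ by
    W_xx, W_xy = wx @ wx, wx @ wy
    return {
        "total": T_xy / T_xx,
        "between": B_xy / B_xx,
        "within": W_xy / W_xx,
        "eta2": B_xx / T_xx,
    }

# ASSUMPTION: synthetic homework (x) and math (y) scores for 3 schools.
rng = np.random.default_rng(88)
g = np.repeat([0, 1, 2], [20, 25, 30])
x = rng.normal(loc=np.array([1.0, 2.0, 3.0])[g], scale=1.0)
y = 40.0 + 5.0 * np.array([0.0, 1.0, 3.0])[g] - 1.5 * x + rng.normal(0.0, 4.0, g.size)

s = dcd_slopes(x, y, g)
weighted = s["eta2"] * s["between"] + (1.0 - s["eta2"]) * s["within"]
print(f"total={s['total']:.4f}  between={s['between']:.4f}  "
      f"within={s['within']:.4f}  eta2={s['eta2']:.4f}")
print(f"eta2*between + (1-eta2)*within = {weighted:.4f}")
```

The identity holds exactly (up to floating-point error) because the total sum of squares and the total cross-product each split into a between-school part and a within-school part.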