Solutions HOMEWORK 4 Ed257 D Rogosa April 17, 2005 -------------------------------------------------------------- Problem 1 (ver4 problem numbers used) MTB > read 'jobprof.dat' c1-c5 25 ROWS READ ROW C1 C2 C3 C4 C5 1 88 86 110 100 87 2 80 62 97 99 100 3 96 110 107 103 103 4 76 101 117 93 95 . . . ****Note which column outcome measures resides in************************ MTB > stem c2-c5 Stem-and-leaf of C2 N = 25 Leaf Unit = 1.0 1 6 2 3 7 48 6 8 467 10 9 1468 (7) 10 0144569 8 11 02 6 12 000 3 13 3 2 14 0 1 15 0 Stem-and-leaf of C3 N = 25 Leaf Unit = 1.0 1 7 3 2 7 7 4 8 13 6 8 59 7 9 4 8 9 7 10 10 12 11 10 7 (3) 11 034 11 11 789 8 12 01112 3 12 599 Stem-and-leaf of C4 N = 25 Leaf Unit = 1.0 1 8 0 1 8 5 9 0133 12 9 5556789 (4) 10 0023 9 10 56789 4 11 34 2 11 56 Stem-and-leaf of C5 N = 25 Leaf Unit = 1.0 1 7 4 2 7 8 5 8 034 10 8 57889 11 9 0 (3) 9 557 11 10 022334 5 10 5889 1 11 0 MTB > describe c2-c5 N MEAN MEDIAN TRMEAN STDEV SEMEAN C2 25 103.36 104.00 103.13 20.30 4.06 C3 25 106.72 113.00 107.22 17.29 3.46 C4 25 100.80 100.00 101.04 8.86 1.77 C5 25 94.68 95.00 94.91 10.68 2.14 MIN MAX Q1 Q3 C2 62.00 150.00 89.00 116.00 C3 73.00 129.00 91.50 121.00 C4 80.00 116.00 95.00 107.50 C5 74.00 110.00 86.00 103.50 8.11. the second predictor c3 is pretty strongly skew (bunched toward the top) see also from DESCRIBE above. we'll look at its scatterplot below. MTB > plot c1 c2 - 125+ * * - C1 - * * - * * * - * 100+ * * * - * * - * - * - * * * * 75+ * * - * - * * - * - 50+ --+---------+---------+---------+---------+---------+----C2 64 80 96 112 128 144 MTB > plot c1 c3 - 125+ * * - C1 - ** - * * * - * 100+ * * * - * * - * - * - * * * * 75+ * * - * - * * - * - 50+ ----+---------+---------+---------+---------+---------+--C3 72 84 96 108 120 132 MTB > plot c1 c4 - 125+ * * - C1 - * * - * * * - * 100+ * * * - * * - * - * - * 2 * 75+ * * - * - ** - * - 50+ --------+---------+---------+---------+---------+--------C4 84.0 91.0 98.0 105.0 112.0 MTB > plot c1 c5 - 125+ * * - C1 - * * - * * * - * 100+ * * * - * * - * - * - * * * * 75+ * * - * - * * - * - 50+ --------+---------+---------+---------+---------+--------C5 77.0 84.0 91.0 98.0 105.0 again c3 has a scatterplot which may have a little curvature. in a full analysis one might want to consider a transformation of c3 instead of c3 as a predictor. MTB > corr c1-c5 C1 C2 C3 C4 C2 0.514 C3 0.497 0.102 C4 0.897 0.181 0.519 C5 0.869 0.327 0.397 0.782 correlations among predictors (c2-c5) are not at all near large enough to create any collinearity problems (at least what we can see from zero- order corr's, which may not tell the whole story). Could use VIF. MTB > brief 3 MTB > regress c1 4 c2-c5 c10 c11 The regression equation is C1 = - 124 + 0.296 C2 + 0.0483 C3 + 1.31 C4 + 0.520 C5 Predictor Coef Stdev t-ratio p Constant -124.382 9.941 -12.51 0.000 C2 0.29573 0.04397 6.73 0.000 C3 0.04829 0.05662 0.85 0.404 C4 1.3060 0.1641 7.96 0.000 C5 0.5198 0.1319 3.94 0.001 s = 4.099 R-sq = 96.3% R-sq(adj) = 95.5% Analysis of Variance SOURCE DF SS MS F p Regression 4 8718.0 2179.5 129.74 0.000 Error 20 336.0 16.8 Total 24 9054.0 SOURCE DF SEQ SS C2 1 2395.9 C3 1 1807.0 C4 1 4254.5 C5 1 260.7 Obs. C2 C1 Fit Stdev.Fit Residual St.Resid 1 86 88.000 82.188 1.307 5.812 1.50 2 62 80.000 79.914 2.430 0.086 0.03 3 110 96.000 101.375 1.158 -5.375 -1.37 4 101 76.000 81.978 1.823 -5.978 -1.63 5 100 80.000 79.883 0.998 0.117 0.03 6 78 73.000 70.525 1.687 2.475 0.66 7 120 58.000 57.771 2.478 0.229 0.07 8 105 116.000 117.079 1.876 -1.079 -0.30 9 112 104.000 107.504 1.244 -3.504 -0.90 10 120 99.000 102.956 1.725 -3.956 -1.06 11 87 64.000 68.543 1.736 -4.543 -1.22 12 133 126.000 124.464 1.752 1.536 0.41 13 140 94.000 94.504 2.248 -0.504 -0.15 14 84 71.000 74.451 2.039 -3.451 -0.97 15 106 111.000 110.906 1.601 0.094 0.02 16 109 109.000 103.435 2.144 5.565 1.59 17 104 100.000 94.004 1.847 5.996 1.64 18 150 127.000 122.598 2.224 4.402 1.28 19 98 99.000 101.067 1.476 -2.067 -0.54 20 120 82.000 86.499 1.338 -4.499 -1.16 21 74 67.000 66.376 2.158 0.624 0.18 22 96 109.000 111.939 1.666 -2.939 -0.78 23 104 78.000 72.943 2.117 5.057 1.44 24 94 115.000 113.512 1.730 1.488 0.40 25 91 83.000 78.586 2.001 4.414 1.23 MTB > describe c10-c11 N MEAN MEDIAN TRMEAN STDEV SEMEAN C10 25 0.016 0.026 0.017 1.016 0.203 C11 25 92.20 94.00 92.29 19.06 3.81 MIN MAX Q1 Q3 C10 -1.628 1.639 -0.934 0.948 C11 57.77 124.46 76.52 109.20 MTB > plot c10 c11 C10 - - * * - * * 1.2+ * * - - * - * * - * 0.0+ * 2 * - * * - * - * - * * * -1.2+ * * - * - * - --+---------+---------+---------+---------+---------+----C11 60 72 84 96 108 120 seems like a darn good prediction. plot of st. resids vs fit shows no pattern. not even any large residuals identified. ------------- ----------------------------------------- start 8.12 MTB > breg c1 c2-c5; SUBC> best 4. Best Subsets Regression of C1 Adj. C C C C Vars R-sq R-sq C-p s 2 3 4 5 1 80.5 79.6 84.2 8.7676 X 1 75.6 74.5 110.6 9.8039 X 1 26.5 23.3 375.3 17.014 X 1 24.7 21.4 384.8 17.217 X 2 93.3 92.7 17.1 5.2512 X X 2 87.7 86.6 47.2 7.1073 X X 2 81.5 79.8 80.6 8.7193 X X 2 80.6 78.8 85.5 8.9336 X X 3 96.2 95.6 3.7 4.0720 X X X 3 93.4 92.5 18.5 5.3306 X X X 3 87.9 86.2 48.2 7.2237 X X X 3 84.5 82.3 66.3 8.1653 X X X 4 96.3 95.5 5.0 4.0986 X X X X a. best sets by me (using adjR^2 and/or Cp) in order would be {c2,c4,c5}, {c2,c3,c4,c5}, {c2, c4}, {c2,c3,c4} For reference remember X1 is in c2, X2 in c3 etc ...... b Cp and adjR^2 say {c2,c4,c5} best, but {c2,c4} is close enough that economics of testing might indicate just giving & using those two tests. In Problem 4 below we use the {c2,c4} (X1,X3) predictor set in comparison with the validation sample--would be at least equally reasonable to use {c2,c4,c5} in that comparison instead. note: interested parties may want to go on and play with the exercises in 12.15 ------------- start 8.19 MTB > stepwise c1 c2-c5; SUBC> fenter = 4.0; SUBC> fremove = 3.9. STEPWISE REGRESSION OF C1 ON 4 PREDICTORS, WITH N = 25 STEP 1 2 3 CONSTANT -106.1 -127.6 -124.2 C4 1.97 1.82 1.36 T-RATIO 9.74 14.81 8.94 C2 0.348 0.296 T-RATIO 6.49 6.78 C5 0.52 T-RATIO 3.95 S 8.77 5.25 4.07 R-SQ 80.47 93.30 96.15 MORE? (YES, NO, SUBCOMMAND, OR HELP) SUBC> n So stepwise finds {c2,c4,c5} which was the best from breg. Note: the MTB STEPWISE command (p 7-18) does have a 'best' subcommand which doesn't provide the same info as breg just gives the best alternatives at each step. you might want to try out this same stepwise with looser criteria--fenter=1. ============================================================= ----------------------------- Problem 2 Construct simple composite measures--mean (or sum) and mean of standarized measures MTB > rmean c2-c5 c21 MTB > center c2-c5 c32-c35 MTB > rmean c32-c35 c22 (note: if you prefer sum multiply c22 by 4) (Note-- C12 the first principal component, it's construction at the end of this problem ) Correlations with outcome, C1 MTB > corr c1 c12 c21 c22 C1 C12 C21 C12 -0.950 C21 0.897 -0.937 C22 0.946 -0.987 0.980 Each composite has a pretty healthy correlation with outcome; sum of standarized measures has a higher correlation with outcome than the simple sum of the measures. (First Principal component even slightly better) Regression of C1 on C22 would have an R^2 of .895; not quite as good as the best prediction equations using the set of four tests as candidate predictors separately (even more properly using AdjR^2 to compare). Fits from the diff eq's would be very close. -------------------------------------------- -- Additional notes on Principal Components First lets get principal component scores for the 4 predictors. MTB > pca c2-c5; SUBC> scores c12-c15. Eigenanalysis of the Correlation Matrix Eigenvalue 2.2529 0.9422 0.6134 0.1914 Proportion 0.563 0.236 0.153 0.048 Cumulative 0.563 0.799 0.952 1.000 Variable PC1 PC2 PC3 PC4 C2 -0.279 0.900 -0.308 0.130 C3 -0.459 -0.390 -0.785 -0.145 C4 -0.600 -0.183 0.310 0.714 C5 -0.593 0.064 0.439 -0.672 Note: see the 1stpc picks up most of variance--other eigenv < 1. pc scores are uncorrelated & their variances are the eigenvalues MTB > corr c12-c16 C12 C13 C14 C15 C13 0.000 C14 0.000 -0.000 C15 -0.000 0.000 -0.000 C16 * * * * * NOTE * No data in column MTB > describe c12-c15 N MEAN MEDIAN TRMEAN STDEV SEMEAN C12 25 0.000 0.269 -0.034 1.501 0.300 C13 25 -0.000 0.037 -0.010 0.971 0.194 C14 25 0.000 -0.078 0.012 0.783 0.157 C15 25 0.0000 0.0404 0.0107 0.4375 0.0875 MIN MAX Q1 Q3 C12 -2.326 3.119 -1.409 1.088 C13 -1.546 1.778 -0.835 0.852 C14 -1.606 1.340 -0.495 0.687 C15 -0.8927 0.6475 -0.3142 0.3756 Regress outcome on first principal component MTB > brief 2 MTB > regress c1 1 c12 The regression equation is C1 = 92.2 - 12.3 C12 Predictor Coef Stdev t-ratio p Constant 92.200 1.244 74.09 0.000 C12 -12.2873 0.8462 -14.52 0.000 s = 6.222 R-sq = 90.2% R-sq(adj) = 89.7% Analysis of Variance SOURCE DF SS MS F p Regression 1 8163.5 8163.5 210.85 0.000 Error 23 890.5 38.7 Total 24 9054.0 Unusual Observations Obs. C12 C1 Fit Stdev.Fit Residual St.Resid 4 0.27 76.00 88.88 1.27 -12.88 -2.11R 23 2.23 78.00 64.80 2.26 13.20 2.28R R denotes an obs. with a large st. resid. Well R-sq(adj) = .897 is a bit lower than the .95 obtained from the best eq's with the tests as separate predictors, but decent. ------------------------ 3. Course Example pca257 MTB > # read in data by cut-and-paste from website data file MTB > name c9='finb' c8='mtb' c7='fina' MTB > info Information on the Worksheet Column Count Name C1 18 C2 18 C3 18 C4 18 C5 18 C6 18 C7 18 fina C8 18 mtb C9 18 finb MTB > pca c1-c6; SUBC> scores c11-c16. Principal Component Analysis Eigenanalysis of the Correlation Matrix Eigenvalue 2.2380 1.3818 1.2031 0.6652 0.4026 0.1093 Proportion 0.373 0.230 0.201 0.111 0.067 0.018 Cumulative 0.373 0.603 0.804 0.915 0.982 1.000 Variable PC1 PC2 PC3 PC4 PC5 PC6 C1 -0.429 0.365 0.017 0.702 0.436 -0.005 C2 -0.358 0.310 -0.472 -0.610 0.391 0.168 C3 -0.086 -0.180 -0.825 0.294 -0.377 -0.227 C4 0.091 -0.765 -0.127 0.081 0.606 0.132 C5 -0.584 -0.282 0.132 0.030 -0.384 0.643 C6 -0.575 -0.273 0.250 -0.204 -0.026 -0.700 MTB > name c11='hwpc1' MTB > name c12='hwpc2' MTB > name c13='hwpc3' MTB > info Information on the Worksheet Column Count Name C1 18 C2 18 C3 18 C4 18 C5 18 C6 18 C7 18 fina C8 18 mtb C9 18 finb C11 18 hwpc1 C12 18 hwpc2 C13 18 hwpc3 C14 18 C15 18 C16 18 MTB > breg c9 c7 c8 c11 c12 c13; SUBC> best 5. Best Subsets Regression Response is finb h h h f w w w i m p p p Adj. n t c c c Vars R-Sq R-Sq C-p s a b 1 2 3 1 72.7 71.0 3.3 10.823 X 1 53.7 50.8 15.4 14.107 X 1 23.6 18.8 34.5 18.121 X 1 2.2 0.0 48.0 20.498 X 1 0.5 0.0 49.1 20.678 X 2 77.1 74.0 2.6 10.254 X X 2 74.2 70.7 4.4 10.876 X X 2 73.9 70.4 4.6 10.938 X X 2 73.6 70.0 4.8 11.005 X X 2 56.4 50.6 15.7 14.136 X X 3 80.2 75.9 2.6 9.8695 X X X 3 77.8 73.1 4.1 10.437 X X X 3 77.7 73.0 4.1 10.458 X X X 3 75.2 69.9 5.7 11.041 X X X 3 75.1 69.8 5.8 11.053 X X X 4 80.7 74.8 4.2 10.096 X X X X 4 80.4 74.4 4.4 10.168 X X X X 4 78.6 72.0 5.6 10.641 X X X X 4 76.0 68.6 7.2 11.269 X X X X 4 58.4 45.6 18.4 14.832 X X X X 5 81.1 73.2 6.0 10.410 X X X X X The breg output is more interesting than I expected it to be. The very best prediction eq (3 preds) appears to be using fina hwpc1 hwpc3. Many of the good eqs use two principal components from HW as predictors. My illustrative eq in the Course Example (using fina, mtb, and PC1) is the fifth best of the 3 pred eqs (but not that far behind). Do others agree? ================================================ problem 4 This is an exercise in cross-validation. R-squared really cannot tell you how well you are doing if the same data are used to create the fit AND evaluate how good the fit is. 100 ROWS READ ROW C1 C2 C3 C4 C5 C6 1 623 509 2.6 545 643 3.0 2 454 471 2.3 558 602 2.3 3 643 700 2.4 544 665 2.0 4 585 719 3.0 646 573 2.0 . . . MTB > name c1 'VSAT1' c2 'MSAT1' c3 'GPA1' c4 'VSAT2' c5 'MSAT2' c6 'GPA2' MTB > regress 'GPA1' 2 'VSAT1' 'MSAT1' The regression equation is GPA1 = 0.471 + 0.00356 VSAT1 +0.000158 MSAT1 Predictor Coef Stdev t-ratio p Constant 0.4706 0.5433 0.87 0.388 VSAT1 0.0035628 0.0007350 4.85 0.000 MSAT1 0.0001576 0.0008514 0.19 0.854 s = 0.5018 R-sq = 23.5% R-sq(adj) = 22.0% Analysis of Variance SOURCE DF SS MS F p Regression 2 7.5137 3.7568 14.92 0.000 Error 97 24.4227 0.2518 Total 99 31.9364 SOURCE DF SEQ SS VSAT1 1 7.5051 MSAT1 1 0.0086 Unusual Observations Obs. VSAT1 GPA1 Fit Stdev.Fit Residual St.Resid 2 454 2.3000 2.1624 0.1544 0.1376 0.29 X 40 490 1.2000 2.3269 0.1149 -1.1269 -2.31R 54 592 2.4000 2.6493 0.1863 -0.2493 -0.54 X 89 361 2.4000 1.8517 0.1682 0.5483 1.16 X R denotes an obs. with a large st. resid. X denotes an obs. whose X value gives it large influence. From the model generated to describe the first 100 observations, the prediction of GPA from VSAT and MSAT is: GPA = 0.471 + 0.00356*VSAT + 0.000158*MSAT We can predict GPA for the second 100 observations using the model derived from the first 100: MTB > name c7 'PredGPA2' MTB > let c7 = .471 + (.00356*'VSAT2') + (.000158*'MSAT2') Remembering that one definition of R-squared is the square of the correlation between observed and fitted Y values, we can compute an 'Cross-validation or imitation R-sq' by simply correlating the predicted GPA's for observations in the second group (PredGPA2) with the values actually observed for this group (GPA2). MTB > plot 'PredGPA2' 'GPA2' - * - * * 3.20+ - * * PredGPA2- * * ** * - * * 2 * * - ** 2* ** * 2* * * * 2.80+ 2 ** 2 * * * - * * * 2* * * - * * *2 * *** *2 2 - * * 2* 2 * * * ** - 2 * ** ** 2 2 2.40+ * 3 * - ** * * ** - * 2 - * - * --------+---------+---------+---------+---------+--------GPA2 0.70 1.40 2.10 2.80 3.50 MTB > corr 'PredGPA2' 'GPA2' Correlation of PredGPA2 and GPA2 = 0.155 Thus our 'cross-validation R-squared' is .024; not too high. The model based on the first 100 observations does not seem to predict the GPA of the second 100 observations very well. Below is a model for the second set of GPA's based on data from that set. MTB > regress 'GPA2' 2 'VSAT2' 'MSAT2' The regression equation is GPA2 = 1.11 + 0.00112 VSAT2 + 0.00121 MSAT2 Predictor Coef Stdev t-ratio p Constant 1.1136 0.7092 1.57 0.120 VSAT2 0.0011166 0.0008298 1.35 0.182 MSAT2 0.0012070 0.0008728 1.38 0.170 s = 0.5791 R-sq = 4.1% R-sq(adj) = 2.1% Analysis of Variance SOURCE DF SS MS F p Regression 2 1.3981 0.6991 2.08 0.130 Error 97 32.5303 0.3354 Total 99 33.9284 SOURCE DF SEQ SS VSAT2 1 0.7568 MSAT2 1 0.6414 Unusual Observations Obs. VSAT2 GPA2 Fit Stdev.Fit Residual St.Resid 21 780 1.3000 2.8198 0.1667 -1.5198 -2.74R 31 578 0.3000 2.4941 0.0664 -2.1941 -3.81R 36 760 1.1000 2.8192 0.1554 -1.7192 -3.08R R denotes an obs. with a large st. resid. Note that the prediction of GPA from VSAT and MSAT for group 2 is not very strong even when we use a model based on those data. The prediction of Group 2's GPA using the model based on Group 2's data is slightly stronger than that using a model generated on Group 1's data however. (Compare the 'imitation R-squared' of .024 with the R-squared for the regression above, .041.) Using the fit from the second sample to predict the outcome in the first sample yields a cross-validation R-squared = .18, less than .235 from the direct fit to the first sample. This reduction is often termed "shrinkage". ------------------------------------------------------ Advanced topics (lecture April 6) ----------------------------------------------------- ------------------------------------------- problem 5 Multilevel Regression Kreft text Chap 2 Table 3.2 OLS regressian lines af 'MathAchievement' on HomeWork' over 10 schools Intercept Slope School Estimate SE Estimate SE N 1 50.70 2.24 -3.55 1.27 23 2 48.88 3.56 -2.86 1.33 20 3 38.80 2.94 7.87 1.37 24 4 34.49 1.76 5.58 0.80 22 5 53.82 2.55 -4.74 2.22 22 6 49.30 1.51 -2.49 1.08 20 7 59.10 1 .42 1.11 0.38 67 8 36.05 3.46 6.46 1.46 21 9 38.56 3.19 5.81 1.99 21 10 37.71 2.36 6.33 1.11 20 Total 44.05 0.98 3.57 0.39 260 Between regression Intercept Slope Estimate SE Estimate SE 37.1 4.03 7.0 1.84 within-pooled, relative standing regression Intercept Slope Estimate SE Estimate SE 2.1 .43 ------------------------------------------------- Problem 6. Path analysis Solution via CALIS (SAS from Paul Allison's class notes) This is a just-identified model. All possible relationships are present. Correlation matrix: Class FamSize Ability Esteem Achieve Class 1.00 FamSize -.33 1.00 Ability .39 -.33 1.00 Esteem .14 -.14 .19 1.00 Achieve .43 -.28 .67 .22 1.00 ----------------------- SAS code -------------------- DATA pathan(TYPE=CORR); INPUT _NAME_ $ class famsize ability esteem achieve; _TYPE_='CORR'; DATALINES; class 1.00 . . . . famsize -.33 1.00 . . . ability .39 -.33 1.00 . . esteem .14 -.14 .19 1.00 . achieve .43 -.28 .67 .22 1.00 ; PROC CALIS DATA=pathan; LINEQS achieve= b1 esteem + b2 ability + b3 class + b4 famsize + e3, esteem= b5 ability + b6 class + b7 famsize + e2, ability = b8 class + b9 famsize + e1; STD e1-e3=v1-v3; RUN; Covariance Structure Analysis: Maximum Likelihood Estimation Manifest Variable Equations with Standardized Estimates Ability = 0.3155*Class + -0.2259*FamSize + 0.8958 e1 b8 b9 Esteem = 0.1423*Ability + 0.0604*Class + -0.0731*FamSize+ 0.9769 e2 b5 b6 b7 Achieve = 0.5754*Ability + 0.0820*Esteem + 0.1887*Class+ -0.0164*FamSize + 0.7146 e3 b2 b1 b3 b4 Squared Multiple Correlations Error Total Variable Variance Variance R-Square 1 Ability 0.80243 1.00000 0.1976 2 Esteem 0.95427 1.00000 0.0457 3 Achieve 0.51070 1.00000 0.4893 ------------------- Mathematica Notebook excerpt, Solution for path regressions, correlation input (* normal eqs from walker-lev ch13 for correls *) (* do data adventure 3 feb '04 *) (* self-esteem eq 3 predictors *) rX = { { 1, .39, -.33}, {.39,1, -.33}, {-.33, -.33, 1}} {{1,0.39,-0.33},{0.39,1,-0.33},{-0.33,-0.33,1}} rY ={.19, .14, -.14} {0.19,0.14,-0.14} bmat = rY.Inverse[rX] {0.142331,0.0603643,-0.0731104} (* checks with p.9 Calis outputin Allison for esteem coeffs Abil Class FamSize *) rY.bmat 0.0457294 (* checks with squared mult correl from Walker-Lev *) (* now do ther eqs *) rX = { { 1, -.33,.39, .14}, {-.33, 1, -.33, -.14}, {.39,-.33, 1, .19}, {.14, -.14, .19, 1.}} {{1,-0.33,0.39,0.14},{-0.33,1,-0.33,-0.14},{0.39,-0.33, 1,0.19},{0.14,-0.14,0.19,1.}} rY ={.43,-.28, .67, .22} {0.43,-0.28,0.67,0.22} bmat2 = rY.Inverse[rX] {0.188708,-0.0163591,0.575433,0.0819582} rY.bmat2 0.489296 rX = { { 1, -.33}, { -.33, 1}} {{1,-0.33},{-0.33,1}} rY = {.39, -.33} {0.39,-0.33} bmat3 = rY.Inverse[rX] {0.315453,-0.225901} rY.bmat3 0.197574 ------------------------------------ END