Education 257 SOLUTIONS HW3 3/08/05 --------------------- 1. This problem is concerned with using hospital stay (X) to estimate prognosis (Y). 15 ROWS READ ROW C1 C2 1 2 54 2 5 50 3 7 45 4 10 37 . . . ****** Straight-line Model ****** MTB > name c1 'Days' MTB > name c2 'P-Index' MTB > plot c2 c1 P-Index - * - * - 45+ * - - - * * - 30+ - - * - * - * 15+ * - * * - * * - * - * 0+ +---------+---------+---------+---------+---------+------Days 0 12 24 36 48 60 MTB > name c22 'Resids' MTB > regress c2 1 c1 c20 c21; SUBC> residuals c22. The regression equation is P-Index = 46.5 - 0.753 Days Predictor Coef Stdev t-ratio Constant 46.460 2.762 16.82 Days -0.75251 0.07502 -10.03 s = 5.891 R-sq = 88.6% R-sq(adj) = 87.7% Analysis of Variance SOURCE DF SS MS Regression 1 3492.1 3492.1 Error 13 451.2 34.7 Total 14 3943.3 MTB > plot c22 c1 - * Resids - * - * - 5.0+ - * * - * - - * 0.0+ - * - * - * - -5.0+ * * - - * * * - - +---------+---------+---------+---------+---------+------Days 0 12 24 36 48 60 Straight-line Model Prediction SUBC> predict 10; SUBC> predict 60. Fit Stdev.Fit 95% C.I. 95% P.I. 38.94 2.18 ( 34.23, 43.64) ( 25.36, 52.51) 1.31 2.67 ( -4.46, 7.08) ( -12.67, 15.29) From the plot of Prognosis Index on Days and that of Residuals a straight-line model is not your best bet. (even though healthy R-square for the straight-line fit) TRANSFORMATION TO STRAIGHTEN From the plot of Prognosis Index vs Days it appears that a log transformation or a square-root transformation on x may "straighten" the relationship between the two. recall Rule of Bulge would indicate Xdown or Ydown ****** Log Transformation ****** MTB > let c4 = loge(c1) here for use in the predict subcommand we need to define our stated X-values (hospital stay) into the new metric MTB > let k1 = loge(10) MTB > let k2 = loge(60) MTB > name c4 'ln-Days' MTB > plot c2 c4 P-Index - * - * - 45+ * - - - * * - 30+ - - * - * - * 15+ * - * * - * * - * - * 0+ ----+---------+---------+---------+---------+---------+--ln-Days 0.70 1.40 2.10 2.80 3.50 4.20 MTB > regress c2 1 c4 c20 c21; SUBC> residuals c22. The regression equation is P-Index = 72.3 - 16.0 ln-Days Predictor Coef Stdev t-ratio Constant 72.283 2.704 26.73 ln-Days -15.9662 0.8386 -19.04 s = 3.241 R-sq = 96.5% R-sq(adj) = 96.3% Analysis of Variance SOURCE DF SS MS Regression 1 3806.8 3806.8 Error 13 136.5 10.5 Total 14 3943.3 Unusual Observations Obs. ln-Days P-Index Fit Stdev.Fit Residual St.Resid 1 0.69 54.000 61.216 2.159 -7.216 -2.99RX R denotes an obs. with a large st. resid. X denotes an obs. whose X value gives it large influence. MTB > plot c22 c4 - * - 3.5+ * * - Resids - * * - * - * 0.0+ * * - * - * * - - * -3.5+ * - - - - -7.0+ * - ----+---------+---------+---------+---------+---------+--ln-Days 0.70 1.40 2.10 2.80 3.50 4.20 SUBC> predict k1; SUBC> predict k2. Fit Stdev.Fit 95% C.I. 95% P.I. 35.520 1.054 ( 33.243, 37.796) ( 28.156, 42.883) 6.912 1.202 ( 4.315, 9.509) ( -0.557, 14.381) The plot is closer to straight-line. But the residual plot still shows a pattern. Nontheless the model works better than straight-line fit to the original data. ****** Square-root transformation ****** just for illustration we'll also show you the same stuff for a square-root transformation of hospital stay ( a reasonable choice also) MTB > let c5 = sqrt(c1) MTB > let k3 = sqrt(10) MTB > let k4 = sqrt(60) MTB > name c5 'Days^.5' MTB > plot c2 c5 P-Index - * - * - 45+ * - - - * * - 30+ - - * - * - * 15+ * - * * - * * - * - * 0+ --------+---------+---------+---------+---------+--------Days^.5 1.5 3.0 4.5 6.0 7.5 MTB > regress c2 1 c5 c20 c21; SUBC> residuals c22. The regression equation is P-Index = 63.2 - 7.75 Days^.5 Predictor Coef Stdev t-ratio Constant 63.225 2.271 27.84 Days^.5 -7.7481 0.4097 -18.91 s = 3.262 R-sq = 96.5% R-sq(adj) = 96.2% Analysis of Variance SOURCE DF SS MS Regression 1 3805.0 3805.0 Error 13 138.3 10.6 Total 14 3943.3 MTB > plot c22 c5 - * Resids - - * - * 3.0+ - * - * - * - * * 0.0+ * - - - * - * -3.0+ * - * - * * - - --------+---------+---------+---------+---------+--------Days^.5 1.5 3.0 4.5 6.0 7.5 SUBC> predict k3; SUBC> predict k4. Fit Stdev.Fit 95% C.I. 95% P.I. 38.723 1.171 ( 36.193, 41.254) ( 31.235, 46.212) 3.209 1.357 ( 0.276, 6.141) ( -4.425, 10.842) The residual plot still shows a curvilinear pattern. ====================================== b. Try adding in a quadratic term in Days: ******** Quadratic Model ****** MTB > let c3 = c1*c1 MTB > name c3 'Days^2' MTB > regress c2 2 c1 c3 c20 c21; SUBC> residuals c22. The regression equation is P-Index = 55.8 - 1.71 Days + 0.0148 Days^2 Predictor Coef Stdev t-ratio Constant 55.822 1.649 33.85 Days -1.7103 0.1248 -13.70 Days^2 0.014807 0.001868 7.93 s = 2.455 R-sq = 98.2% R-sq(adj) = 97.9% Analysis of Variance SOURCE DF SS MS Regression 2 3871.0 1935.5 Error 12 72.3 6.0 Total 14 3943.3 SOURCE DF SEQ SS Days 1 3492.1 Days^2 1 378.9 MTB > plot c22 c1 - Resids - * - - * 2.5+ * - - * - * * - * 0.0+ * - - * * * - * - -2.5+ * - * - * - - +---------+---------+---------+---------+---------+------Days 0 12 24 36 48 60 To obtain predictions for 10 and 60 days use PREDICT subc in regress Note here you need to enter x and x-squared for predict to work. SUBC> predict 10 100; SUBC> predict 60 3600. Fit Stdev.Fit 95% C.I. 95% P.I. 40.200 0.921 ( 38.194, 42.207) ( 34.485, 45.915) 6.511 1.292 ( 3.695, 9.328) ( 0.465, 12.558) Examination of the residuals and the squared multiple correlation show that this model fits the data much better. Compare the predictions and interval estimates from the straight-line fit to the original data. The interval estimates for the quadratic model are about one half the width of those from the straight-line model. Moreover for an extreme X value (e.g. 60) the point estimate from the straight-line model is too small as the model does not account for the curvature in the relationship. Fitting the quadratic seems to work slightly better than the transformations. ------------------------------------------------------ 2. Bodyfat data revisited I redid the analyses for this, which are fully available from the text example or from bodyfat.out Just to take up more bytes, recreate the relevant parts here. MTB > read 'bodyfat.dat' c1-c4 20 ROWS READ ROW C1 C2 C3 C4 1 19.5 43.1 29.1 11.9 2 24.7 49.8 28.2 22.8 3 30.7 51.9 37.0 18.7 4 29.8 54.3 31.1 20.1 . . . MTB > name c1 'triceps' c2 'thigh' c3 'midarm' c4 'bodyfat' To find the single best predictor of bodyfat, look at the correlations of each predictor separately with bodyfat: MTB > corr c1-c4 triceps thigh midarm thigh 0.924 midarm 0.458 0.085 bodyfat 0.843 0.878 0.142 The correlations show that thigh circumference is the best single predictor (i.e., the regression of bodyfat on thigh will have a higher R-squared than will the regression of bodyfat on either of the remaining predictors.) Triceps skinfold thickness is a close second. Regression of bodyfat on thigh: MTB > regress 'bodyfat' 1 'thigh' The regression equation is bodyfat = - 23.6 + 0.857 thigh Predictor Coef Stdev t-ratio p Constant -23.634 5.657 -4.18 0.001 thigh 0.8565 0.1100 7.79 0.000 s = 2.510 R-sq = 77.1% R-sq(adj) = 75.8% The regression coefficient for thigh in the single-predictor equation is 0.8565, with a standard error of 0.1100. The corresponding t-statistic is 7.79, (which the p-value of 0.000 shows is significant at any Type I error rate we might choose.) Here's the regression of bodyfat on triceps: MTB > regress 'bodyfat' 1 'triceps' The regression equation is bodyfat = - 1.50 + 0.857 triceps Predictor Coef Stdev t-ratio p Constant -1.496 3.319 -0.45 0.658 triceps 0.8572 0.1288 6.66 0.000 s = 2.820 R-sq = 71.1% R-sq(adj) = 69.5% Triceps also looks like a good predictor when it is used by itself. Regression using both triceps and thigh as predictors: MTB > regress 'bodyfat' 2 'triceps' 'thigh' The regression equation is bodyfat = - 19.2 + 0.222 triceps + 0.659 thigh Predictor Coef Stdev t-ratio p Constant -19.174 8.361 -2.29 0.035 triceps 0.2224 0.3034 0.73 0.474 thigh 0.6594 0.2912 2.26 0.037 s = 2.543 R-sq = 77.8% R-sq(adj) = 75.2% Now the coefficient for thigh is 0.6594, which is fairly close to the value of 0.8565 from the one-predictor equation. The standard error has increased from 0.1100 to 0.2912 (variance increased by a factor of 7), resulting in a decrease in the t-statistic from 7.79 to 2.26. Thus the t-statistic from the 2-predictor equation is almost 3.5 times smaller than that from the 1-predictor equation. The p-value reveals that the coefficient for thigh would not be considered significantly different from 0 if we used a familywise Type I error rate of .05. The coefficient for triceps changes substantially when we move from the 1-predictor to the 2-predictor equation. It decreases from 0.8572 to 0.2224, and its standard error increases from 0.1288 to 0.3034 (variance increases by a factor of about 5.5). The t-statistic decreases from 6.66 to 0.73, so the new t-statistic is 9 times smaller than that from the single-predictor equation. Triceps becomes nonsignificant (p=0.474). Here's the regression using all three predictors. MTB > regress 'bodyfat' 3 'triceps' 'thigh' 'midarm' * NOTE * triceps is highly correlated with other predictor variables * NOTE * thigh is highly correlated with other predictor variables * NOTE * midarm is highly correlated with other predictor variables The regression equation is bodyfat = 117 + 4.33 triceps - 2.86 thigh - 2.19 midarm Predictor Coef Stdev t-ratio p Constant 117.08 99.78 1.17 0.258 triceps 4.334 3.016 1.44 0.170 thigh -2.857 2.582 -1.11 0.285 midarm -2.186 1.595 -1.37 0.190 s = 2.480 R-sq = 80.1% R-sq(adj) = 76.4% Now the coefficient for thigh circumference has become negative: -2.857, as compared with 0.8565 from the single-predictor equation and 0.6594 from the 2-predictor equation. Its standard error is much larger than before (2.582), resulting in a t-statistic that is quite small: -1.11, as compared with 7.79 and 2.26 from the previous two equations. So even though the coefficient is negative, we can't say that it's significantly different from zero (p=0.285). In other words, thigh appears to be a useless predictor once we include midarm and triceps, even though it was the single best predictor of bodyfat when used alone. The coefficient for triceps remained positive (4.334), and is substantially larger than the coefficients for triceps in the 1- and 2- predictor equations (0.8572 and 0.2224, respectively). Its standard error increased quite a bit as well (3.016, compared with 0.1288 and 0.3034 in the previous 2 equations). Its t-statistic is 1.44, which is larger than that for the 2-predictor equation (0.73) but still not significant (p=0.170) despite the increase in the size of the coefficient. ----------------------------------------------------------------- 3. Patient Satisfaction for reference (cf NWK ): C1 = Y (patient satisfaction) C2 = X1 (patient's age) C3 = X2 (severity of illness) C4 = X3 (anxiety level) 6.15 part a MTB > read '/usr/class/ed257/patient.dat' into c1-c4 23 ROWS READ ROW C1 C2 C3 C4 1 48 50 51 2.3 2 57 36 46 2.3 3 66 40 48 2.2 4 70 41 44 1.8 . . . MTB > stem c2-c4 Stem-and-leaf of C2 N = 23 Leaf Unit = 1.0 5 2 89999 5 3 7 3 33 8 3 4 10 3 66 11 3 8 (2) 4 01 10 4 233 7 4 45 5 4 5 4 9 4 5 0 3 5 23 1 5 5 Stem-and-leaf of C3 N = 23 Leaf Unit = 1.0 1 4 3 2 4 4 4 4 66 9 4 88899 (6) 5 000111 8 5 23 6 5 445 3 5 6 2 5 8 1 6 1 6 2 Stem-and-leaf of C4 N = 23 Leaf Unit = 0.010 2 18 00 3 19 0 4 20 0 6 21 00 10 22 0000 (5) 23 00000 8 24 0000 4 25 0 3 26 3 27 3 28 3 29 000 Looking at these stem-and-leaf plots we see that X2 (severity of illness) is approximately normally distributed, X1 (patient's age) is approximately uniformly distributed, and X3 (anxiety level) has several large extreme points at 29. We should remain aware of how these points are influencing the regression equation. 6.15c fit the multiple regression with all linear terms MTB > regress c1 3 c2 c3 c4 c5 c6; SUBC> residuals c7. The regression equation is C1 = 163 - 1.21 C2 - 0.666 C3 - 8.6 C4 Predictor Coef Stdev t-ratio p Constant 162.88 25.78 6.32 0.000 C2 -1.2103 0.3015 -4.01 0.001 C3 -0.6659 0.8210 -0.81 0.427 C4 -8.61 12.24 -0.70 0.490 s = 10.29 R-sq = 67.3% R-sq(adj) = 62.1% Analysis of Variance SOURCE DF SS MS F p Regression 3 4133.6 1377.9 13.01 0.000 Error 19 2011.6 105.9 Total 22 6145.2 SOURCE DF SEQ SS C2 1 3678.4 C3 1 402.8 C4 1 52.4 The estimate for beta(2) in this equation (i.e. -.666) is the sample coefficient for X2 when X1 X2 X3 are used to predict Y. It can be interpreted (through partial regression plots, adjusted variables) as the slope of the regression line produced by the regression of outcome variable (Y) on the variable X2 dot X1 X3, i.e. the adjusted variable X2 with the associations between X2 and X1 and X3 removed. In order to understand this statement more clearly, let's derive the partial regression coefficient for X2, using the procedure shown in class (and the marks examples). The partial regression slope which we are looking for in this case is c1 c3 dot c2 c4 (read Y on X2 dot X1 X3). First step in this procedure will be to regress the outcome variable in c1 upon the predictor variables in c2 and c4. MTB > regress c1 on 2 c2 c4; SUBC> residuals c8. The regression equation is C1 = 147 - 1.24 C2 - 15.9 C4 Predictor Coef Stdev t-ratio p Constant 147.08 16.73 8.79 0.000 C2 -1.2434 0.2961 -4.20 0.000 C4 -15.891 8.256 -1.92 0.069 s = 10.20 R-sq = 66.1% R-sq(adj) = 62.7% Our next step is to regress X2 onto X1 and X3. Residuals from this regression represent the part of in the X2 values not predictable by X1 and X3 values. MTB > regress c3 2 c2 c4; SUBC> residuals c9. The regression equation is C3 = 23.7 + 0.0496 C2 + 10.9 C4 Predictor Coef Stdev t-ratio p Constant 23.728 4.597 5.16 0.000 C2 0.04962 0.08135 0.61 0.549 C4 10.929 2.268 4.82 0.000 s = 2.802 R-sq = 63.8% R-sq(adj) = 60.2% In order to get the partial regression slope, c1 c3 dot c2 c4, we simply regress the residuals from the first regression onto the residuals from the second regression. A plot of c8 vs c9 is the scatterplot for this regression coefficient. MTB > regress c8 1 c9 The regression equation is C8 = 0.00 - 0.666 C9 Predictor Coef Stdev t-ratio p Constant 0.000 2.041 0.00 1.000 C9 -0.6659 0.7809 -0.85 0.403 s = 9.787 R-sq = 3.3% R-sq(adj) = 0.0% You can note that the coefficient of c9 in this equation is identical to the coefficient of c3 in the original full regression equation. The small R-sq for this regression (i.e. .03) suggests that in combination with X1 and X3, X2 is pretty useless in providing additional information about Y. 6.15d In NWK 6.15d, we are asked to produce a boxplot of the residuals from the full regression model. Here we use the simple residuals ei. Recall, that we placed these residuals in c7. MTB > boxp c7 ------------------------ ----------------I + I-------------- ------------------------ +---------+---------+---------+---------+---------+------C7 -18.0 -12.0 -6.0 0.0 6.0 12.0 MTB > stem c7 Stem-and-leaf of C7 N = 23 Leaf Unit = 1.0 1 -1 6 5 -1 4321 7 -0 76 10 -0 310 (6) 0 012234 7 0 577 4 1 2334 Both the boxpot and the stem-and-leaf plot suggest that there are no outliers. 6.16a NWK 6.16a asks us to examine whether or not there is a general regression relation between the three predictor variables (X1, X2 and X3) and the outcome variable (Y). Our null hypothesis in this case is that parameters beta(1), beta(2), and beta(3) all are equal to zero. The alternative hypothesis is that any one of these three coefficients is not equal to zero. The test statistic for this hypothesis test is calculated as follows: MSReg 1377.9 ----- = ------ = 13.01 MSE 105.9 As you may have noticed, minitab calculates this test statistic for you. It should be compared with critical value F (.10;3,19). You can get an exact value for this using the minitab invcdf command. MTB > invcdf .90; SUBC> f 3, 19. 0.9000 2.3970 Since 13.01 > 2.397, we can reject the null hypothesis of no regression relation. As was pointed out in class, this isn't telling us anything very exciting. This test doesn't tell us anything about the individual predictor variables (X1, X2, X3). 6.16b NWK 6.16b asks us to calculate interval estimates for beta(1), beta(2), and beta(3) using an overall confidence coefficient .90. Using the Bonferroni approximation, the confidence coefficient for each interval would be (1 - .10/3) = 1 - .0333 = .9667. Once again we can use the invcdf command to obtain the appropriate critical value: t(.0167; 19). MTB > invcdf .9833; SUBC> t 19. 0.9833 2.293 The interval estimates for each coefficient would be of the form: betahat(k) +/- (2.293) * s.e. betahat(k) The estimates for beta(k) and values for s.e. beta(k) can be gotten directly from the first section of the minitab regression output. The interval estimates are as follows (endpoints rounded to 3 significant digits): For beta(1): -1.21 +/- (2.293)(.3015) = (-1.90, -.520) For beta(2): -.666 +/- (2.293)(.8210) = (-2.55, 1.22) For beta(3): -8.61 +/- (2.293)(12.24) = (-36.7, 19.5) the only CI that does not contain 0 is for beta(1) (as also seen from t-statistics in the reg output). 6.16c NWK 6.16c asks you to calculate the coefficient of multiple correlation. Once again, we can get this information from the minitab output. Minitab gives us the squared multiple correlation (i.e. R-sq); For this regression, R-sq is .673 What NWK call The coefficient of multiple correlation is the positive square root of this value so the coefficient of multiple correlation is r = .82. This is the simple correlation between Y and Yhat, the value you would get from correlating c1 and c6 above. 6.17a NWK 6.17a asks us to calculate a 90% confidence interval for the mean response of Y when X1 = 35, X2 = 45, and X3 = 2.2. A point estimate can be gotten by plugging the specified values for X1, X2, and X3 into the regression equation. Y(hat) = 162.88 - (1.2103)(35) - (.6659)(45) - (8.61)(2.2) = 162.88 - 42.3605 - 29.9655 - 18.942 = 71.612 The easiest way to compute the standard error of the fit (also the point estimate, compare numerical accuracy above with value below) for this set of X values is to use the minitab 'predict' subcommand. MTB > regress c1 3 c2 c3 c4; SUBC> predict 35 45 2.2. The regression equation is C1 = 163 - 1.21 C2 - 0.666 C3 - 8.6 C4 Predictor Coef Stdev t-ratio p Constant 162.88 25.78 6.32 0.000 C2 -1.2103 0.3015 -4.01 0.001 C3 -0.6659 0.8210 -0.81 0.427 C4 -8.61 12.24 -0.70 0.490 s = 10.29 R-sq = 67.3% R-sq(adj) = 62.1% Fit Stdev.Fit 95% C.I. 95% P.I. 71.60 4.44 ( 62.30, 80.90) ( 48.14, 95.06) This automatically gives you a 95% confidence interval for the mean value of Y when X1 = 35, X2 = 45, and X3 = 2.2. However, we wanted a 90% confidence interval. For the 90% confidence interval, we use as the critical value the 95th percentile of the t-distribution with 19 df--i.e. t(.10/2, 19) or t(.05, 19). This value is 1.729. The interval estimate is as follows: Y(hat) +/- t(crit) * s.e.(fit) = 71.6 +/- (1.729)(4.444) = (63.92, 79.28) With 90% confidence, we can say that the mean response of Y when X1 = 35, X2 = 45, and X3 = 2.2 falls between 63.92 and 79.28. ******************************************* NWK 9.7 v4 asks us to prepare a partial regression plot for each of the independent variables (i.e. X1, X2, X3) in the regression equation. Since we already have the two needed sets of residuals for the partial regression plot for X2 as a predictor of Y, using X1 X2 X3, we can simply give the plot command. MTB > plot c8 c9 - * - 12+ * * - * * C8 - * - 2 * - * * * 0+ * * * - - * * - - * * -12+ * - * - * * - - ------+---------+---------+---------+---------+---------+C9 -4.0 -2.0 0.0 2.0 4.0 6.0 We can obtain the regression plots for the other two independent variables in a similar way. For X1, we first need to regress the outcome variable (Y) on X2 and X3 to obtain the set of residuals MTB > regress c1 2 c3 c4; SUBC> residuals c10. The regression equation is C1 = 164 - 1.11 C3 - 20.2 C4 Predictor Coef Stdev t-ratio p Constant 164.23 34.15 4.81 0.000 C3 -1.111 1.078 -1.03 0.315 C4 -20.23 15.76 -1.28 0.214 s = 13.63 R-sq = 39.5% R-sq(adj) = 33.4% Unusual Observations Obs. C3 C1 Fit Stdev.Fit Residual St.Resid 11 48.0 89.00 62.33 5.26 26.67 2.12R 17 56.0 79.00 51.42 4.62 27.58 2.15R R denotes an obs. with a large st. resid. Next we need to regress X1 onto X2 and X3. MTB > regress c2 2 c3 c4; SUBC> residuals c11. The regression equation is C2 = - 1.1 + 0.368 C3 + 9.60 C4 Predictor Coef Stdev t-ratio p Constant -1.12 19.12 -0.06 0.954 C3 0.3681 0.6034 0.61 0.549 C4 9.599 8.823 1.09 0.290 s = 7.632 R-sq = 26.1% R-sq(adj) = 18.8% We can get the partial regression plot for X1 as a predictor by plotting these two sets of residuals. MTB > plot c10 c11 - * C10 - * - - 16+ * - * - * 2 * - - 0+ * - * * * - * - * * 2 * * * - * * -16+ - * - ------+---------+---------+---------+---------+---------+C11 -10.0 -5.0 0.0 5.0 10.0 15.0 Looks like a reasonable scatterplot with a strong negative slope Similarly, for X3 we first regress Y on X1 and X2. MTB > regress c1 2 c2 c3; SUBC> residuals c12. The regression equation is C1 = 167 - 1.26 C2 - 1.09 C3 Predictor Coef Stdev t-ratio p Constant 166.59 24.91 6.69 0.000 C2 -1.2605 0.2892 -4.36 0.000 C3 -1.0893 0.5514 -1.98 0.062 s = 10.16 R-sq = 66.4% R-sq(adj) = 63.1% Next, we regress X3 onto X1 and X2. MTB > regress c4 2 c2 c3; SUBC> residuals c13. The regression equation is C4 = - 0.431 + 0.00582 C2 + 0.0492 C3 Predictor Coef Stdev t-ratio p Constant -0.4314 0.4608 -0.94 0.360 C2 0.005821 0.005350 1.09 0.290 C3 0.04916 0.01020 4.82 0.000 s = 0.1880 R-sq = 65.2% R-sq(adj) = 61.7% Unusual Observations Obs. C2 C4 Fit Stdev.Fit Residual St.Resid 6 49.0 2.9000 2.5085 0.0600 0.3915 2.20R R denotes an obs. with a large st. resid. MTB > plot c12 c13 - * - * 12+ * * - C12 - * * - * * * - * * ** 0+ * - * * - - * - * -12+ * 2 - * - * - - ------+---------+---------+---------+---------+---------+C13 -0.30 -0.15 0.00 0.15 0.30 0.45 Of the three partial regression plots, only the one for X1 reveals a somewhat strong negative association. The other two reveal weak negative associations. This may suggest that either or both of X2 and X3 may not be very useful in improving prediction of Y. From the full multiple regression note that X1 is the only predictor with a big t-statistic. ----------------------------------------------------------------------- 4. a) MTB > read 'readiq.dat' c1-c4 60 ROWS READ ROW C1 C2 C3 C4 1 1 16 41 90 2 1 15 36 84 3 1 15 40 86 4 1 15 35 85 . . . MTB > name c1 'readabil' c2 'attn' c3 'spatial' c4 'wisciq' MTB > lplot c2 c3 c1 40+ - C C attn - C C C C - C C C C C - C 2 C C 2 C 30+ 3 C C C 2 C - C B - B - C B B B - A 2 2 A 20+ A B B B 2 2 2 - B A B - A A A A A - - A 10+ - ----+---------+---------+---------+---------+---------+--spatial 24.5 28.0 31.5 35.0 38.5 42.0 This plot shows no strong linear relationship between the two subscale scores, for all three reading groups combined. There may be seen some visual evidence of a slight negative linear relationship, but it's not very big. Within groups there is also little evidence of strong linear relationship between the subscales. Normal readers tend to perform better on the attention/concentration scale than do the other two groups, but may be at a slight disadvantage on the spatial scale. b) First, select observations for students classified as normal. MTB > copy c2-c4 c5-c7; SUBC> use c1=3. MTB > name c5 'attnnorm' c6 'spatnorm' c7 'iqnorm' MTB > regress c7 2 c5 c6 c8 c9; SUBC> resids c10; SUBC> predict 32 30. The regression equation is iqnorm = 41.6 + 0.748 attnnorm + 1.18 spatnorm Predictor Coef Stdev t-ratio p Constant 41.61 12.87 3.23 0.003 attnnorm 0.7479 0.2878 2.60 0.015 spatnorm 1.1822 0.2602 4.54 0.000 s = 4.541 R-sq = 47.9% R-sq(adj) = 44.1% Analysis of Variance SOURCE DF SS MS F p Regression 2 512.69 256.34 12.43 0.000 Error 27 556.68 20.62 Total 29 1069.37 SOURCE DF SEQ SS attnnorm 1 87.11 spatnorm 1 425.57 Unusual Observations Obs.attnnorm iqnorm Fit Stdev.Fit Residual St.Resid 4 29.0 92.000 101.128 1.182 -9.128 -2.08R 21 24.0 95.000 92.659 2.561 2.341 0.62 X R denotes an obs. with a large st. resid. X denotes an obs. whose X value gives it large influence. Fit Stdev.Fit 95% C.I. 95% P.I. 101.007 0.844 ( 99.274,102.740) ( 91.528,110.486) Coefficient for attention is 0.7479, and for spatial is 1.1822. Squared multiple correlation is 0.479. MTB > plot c10 c9 - * - * * 6.0+ - * * * C10 - * * - * * * - * * 0.0+ * * * - * *** - * * * - * * - * -6.0+ * * - * - - * - +---------+---------+---------+---------+---------+------C9 91.0 94.5 98.0 101.5 105.0 108.5 This looks like random scatter, indicating that the straight-line fit was probably appropriate. There is no sign of unequal error variances, and the only point that seems to be an outlier is the one that Minitab flagged, with fitted value of 101. The variability appears to be greater near the middle of the data, but with so few data points it is difficult to detect whether heteroscedasticity is a problem. 95% prediction interval is obtained from the predict subcommand above. The fit is 101.007, and the 95% P.I. is ( 91.528,110.486). ---------------------------------------------------------------------- 5. Prob #5 part a. Below you will find a summary table of group indicator values for each of the four different treatments. Treatment G1 G2 G3 1 0 0 0 2 1 0 0 3 0 1 0 4 0 0 1 We know that the expected value for any given treatment is simply the population mean for that treatment. Therefore, E(Y|G1=0, G2=0, G3=0) = mu(1) E(Y|G1=1, G2=0, G3=0) = mu(2) E(Y|G1=0, G2=1, G3=0) = mu(3) E(Y|G1=0, G2=0, G3=1) = mu(4) Furthermore, we can represent the expected value for each treatment in terms of general linear model parameters, by plugging the appropriate values for G1, G2, and G3 into the given model. Namely, E(Y|G1,G2,G3) = beta0 + beta1*G1 + beta2*G2 + beta3*G3 Treatment 1 ----------- E(Y|G1=0, G2=0, G3=0) = beta0 + beta1*0 + beta2*0 + beta3*0 = beta0 Treatment 2 ----------- E(Y|G1=1, G2=0, G3=0) = beta0 + beta1*1 + beta2*0 + beta3*0 = beta0 + beta1 Treatment 3 ----------- E(Y|G1=0, G2=1, G3=0) = beta0 + beta1*0 + beta2*1 + beta3*0 = beta0 + beta2 Treatment 4 ----------- E(Y|G1=0, G2=0, G3=1) = beta0 + beta1*0 + beta2*0 + beta3*1 = beta0 + beta3 Hence, mu(1) = beta0 mu(2) = beta0 + beta1 mu(3) = beta0 + beta2 mu(4) = beta0 + beta3 Using the information that mu(1) = 7, mu(2) = 9, mu(3) = 6, mu(4) = 15, we conclude: beta0 = mu(1) = 7 beta1 = mu(2) - mu(1) = 9 - 7 = 2 beta2 = mu(3) - mu(1) = 6 - 7 = -1 beta3 = mu(4) - mu(1) = 15 - 7 = 8 --------------------------- part b. mu(3) - mu(2) = beta0 + beta2 - (beta0 + beta1) = beta2 - beta1 = -1 - 2 = -3 Double-checking these answers using our population cell means: mu(3) - mu(2) = 6 - 9 = -3 It checks out. -------------------------------------------------------------- -------------------------------------------------------------------------- PROBLEM 6 read 'salary.dat' c1-c4 46 ROWS READ ROW C1 C2 C3 C4 1 1 1 1 13876 2 1 3 0 11608 3 1 3 1 18701 4 1 2 0 11283 . . . MTB > name c1 'exp' c2 'ed' c3 'man' c4 'salary' note: to look at categorical variables, MTB table or tally commands will give you relevant counts, relative frequencies etc. Kind of the equivalent of describe command for measured variables. MTB > table c2 ROWS: ed COUNT 1 14 2 19 3 13 ALL 46 MTB > # We need to dummy code the education categories so as to represent three different groups (not levels of an interval variable) make HS the "base" category MTB > code (1 3) 0 (2) 1 c2 c5; > BS=1, Otherwise = 0. MTB > name c5 'ed BS' MTB > table c5 ROWS: ed BS COUNT 0 27 1 19 ALL 46 MTB > code (1 2) 0 (3) 1 c2 c6 MTB > name c6 'ed Ad Deg' >Adv = 1, 0 otherwise MTB > name c6 'ed Adv' To consider the single best predictor of salary, we look at the correlations MTB > corr c1 c3 c4 exp man man -0.051 salary 0.539 0.730 From the above, Corr(salary, management) = 0.730 note that corr(salary, management) is a point biserial correlation (measured variable, salary, correlated with a truly dichotomous variable, managment or not). And for the dummy variable, education, we look at MTB > regress c4 2 c5 c6 The regression equation is salary = 14942 + 3345 ed BS + 3351 ed Adv Predictor Coef Stdev t-ratio Constant 14942 1217 12.27 ed BS 3345 1604 2.09 ed Adv 3351 1754 1.91 s = 4554 R-sq = 10.9% R-sq(adj) = 6.8% Analysis of Variance SOURCE DF SS MS Regression 2 109134646 54567323 Error 43 891962928 20743324 Total 45 1001097576 SOURCE DF SEQ SS ed BS 1 33425874 ed Adv 1 75708771 Unusual Observations Obs. ed BS salary Fit Stdev.Fit Residual St.Resid 42 1.00 27837 18286 1045 9551 2.15R R denotes an obs. with a large st. resid. The correlation here is the sqrt(0.109) = 0.330 Thus it looks like management status is the best predictor of salary, at least among these variables. Predicting salary from all three variables, MTB > regress c4 4 c1 c3 c5 c6 The regression equation is salary = 8036 + 546 exp + 6884 man + 3144 ed BS + 2996 ed Adv Predictor Coef Stdev t-ratio Constant 8035.6 386.7 20.78 exp 546.18 30.52 17.90 man 6883.5 313.9 21.93 ed BS 3144.0 362.0 8.69 ed Adv 2996.2 411.8 7.28 s = 1027 R-sq = 95.7% R-sq(adj) = 95.3% Analysis of Variance SOURCE DF SS MS Regression 4 957816856 239454214 Error 41 43280720 1055627 Total 45 1001097576 And adding two interaction terms for management by education MTB > let c7=c5*c3 MTB > let c8=c6*c3 MTB > name c7 'BS*man' MTB > name c8 'Adv*man' MTB > regress c4 6 c1 c3 c5 c6 c7 c8 The regression equation is salary = 9473 + 497 exp + 3981 man + 1382 ed BS + 1731 ed Adv + 4903 BS*man + 3066 Adv*man Predictor Coef Stdev t-ratio Constant 9472.69 80.34 117.90 exp 496.987 5.566 89.28 man 3981.4 101.2 39.35 ed BS 1381.67 77.32 17.87 ed Adv 1730.7 105.3 16.43 BS*man 4902.5 131.4 37.32 Adv*man 3066.0 149.3 20.53 s = 173.8 R-sq = 99.9% R-sq(adj) = 99.9% Analysis of Variance SOURCE DF SS MS Regression 6 999919408 166653234 Error 39 1178168 30209 Total 45 1001097576 The model with 4 predictors has R-sq = 0.957 The model with 6 predictors has R-sq = 0.999 The change in R-sq = 0.042 The test statistic to use is F = [delta(R-sq)/# variables added]/[(1-R-sq(full))/(N- #variables in full-1)] = (0.042/2) / (1 - 0.999)/(46 - 6 -1) = 819 The critical value is F(0.95, 2,39) = 3.24 So we conclude that the interaction terms add to the prediction of salary significantly. The regression of salary on experience, management and education is MTB > regress c4 4 c1 c3 c5 c6 The regression equation is salary = 8036 + 546 exp + 6884 man + 3144 ed BS + 2996 ed Adv Predictor Coef Stdev t-ratio Constant 8035.6 386.7 20.78 exp 546.18 30.52 17.90 man 6883.5 313.9 21.93 ed BS 3144.0 362.0 8.69 ed Adv 2996.2 411.8 7.28 s = 1027 R-sq = 95.7% R-sq(adj) = 95.3% Analysis of Variance SOURCE DF SS MS Regression 4 957816856 239454214 Error 41 43280720 1055627 Total 45 1001097576 We answer the last two parts using the model without those interaction terms which is basically an arbitrary choice of convenience. Would also be reasonable to use the full models and calculate this increment separately for managers and nonmanagers etc. From the regression coefficient, we see that one more year of experience is worth, on average, $546.18 A 95% confidence interval is thus given by 546.18 +/- t(0.975, 41)*30.52 ===> (484.50, 607.86) The "value" of an advanced degree is $2996.2, so a 95% confidence interval for this estimate is 2996.2 +/- t(0.025, 41)*411.8 ===> (2163.95, 3828.46) Note; Value is calibrated according to the coding of the educational groups. The phrasing of the final question "Repeat for an advanced degree in addition to the BS" can be subject easily to a different reading than I had intended. The comparison between advanced and H.S, is what is done above (and what I had in mind). ---------------------------------------------------------------------- Prob 7. The prediction equation to work from is S = 1900 + 230*B + 18*A + 100*E + 490*D + 190*Y + 50*T - 2400*X The coefficient of T in the regression equation is 50. Note that salary is in an arbitrary 'funny-money' unit. It tells us that there will be a 50 unit decrease in salary for someone whose student evaluations change from very good (T=1) to poor (T=0). This is true no matter what the values of the other variables are (of course this conclusion applies only to this model, with its set of predictors). partial regression coefficient of T here could be interpreted as the average change in salary between the two levels of student evaluation, which is 50. -------------------- To do this problem the HARD way, you can plug in the values of the other variables that the question provides for us (B=A=E=D=X=1 and Y=5) into the regression equation: S = 1900 + 230(1) + 18(1) + 100(1) + 490(1) + 190(5) + 50(T) - 2400(1) = 1288 + 50(T) For a professor with good student evaluations (T = 1): S = 1288 + 50(1) = 1338 For a professor with bad student evaluations (T = 0): S = 1288 So the expected change in salary is 1288-1338 = -50. (That is, a decrease of 50). ------------------------------------ Part b. model for the S on X regression is E(S|X) = beta(0) + beta(1)*X To calculate the value of the estimate of the slope (beta(hat)(1)), first look at the expected values of X within each group and realize that they are equal to the means of S within each group. sample value of E(S|X=0) = beta(hat)(0) = mu(hat)(0) E(S|X=1) = beta(hat)(0) + beta(hat)(1) = mu(hat)(1) Where: mu(hat)(0) = the mean salary for males = 16,100 mu(hat)(1) = the mean salary for females = 11,200 To calculate the slope, beta(hat)(1), we subtract the two group means: beta(hat)(1) = mu(hat)(1) - mu(hat)(0) = 11,200 - 16,100 = -4900 Another way to do this problem is to realize that the line for the regression of S on X will pass through the points (0, 16100) and (1, 11200). The slope = (change in S)/(change in X) = (16100 -11200)/(0 - 1) = -4900. ----------------------------------------------------------------- PROBLEM 8 To calculate the slope for the C1 on C2 regression for the incentive group, we examine the general model: E(C1|C2,C3) = beta(0) + beta(1)*C3 + beta(2)*C2 + beta(3)*(C2*C3) We have the results of this regression in the output for the regression of C1 on C2, C3, and C4 (i.e., C2*C3). >From substituting 0 and 1 for C3, we know that: E(C1|C2, C3=0) = beta(0) + beta(2)*C2 E(C1|C2, C3=1) = [beta(0) + beta(1)] + [beta(2) + beta(3)]*C2 Thus the sample value of the slope for the incentive group (C3=1) is beta(hat)(2) + beta(hat)(3). slope = .267 + (-.06) = .207 To calculate the correlation between C1 and C2 for the incentive group, we use the equation: slope(C1 on C2) = r(C1,C2)*(s(C1)/s(C2)) You've seen this equation written as: b(yx) = r(xy)*(s(y)/s(x)). The question tells us that the standard deviations of C1 and C2 for the incentive group are 1.782 and 5.84, respectively. Solving for r, we get: .207 = r(C1,C2)*(1.782/5.84) or r(C1,C2) = .207*(5.84/1.782) = .678 Treatment Effect Confidence Interval There are a couple of ways to do this problem. (i) The treatment effect is represented by the coefficient of C3 (the group membership variable) in the regression of C1 on C3 and C2. However, it is blocked out in the output and so all we know is the standard error of the treatment effect (.4926). However, we also know that the treatment effect is the difference between the adjusted means in the ancova output. This is 9.9167 - 5.8333 = 4.0834. A 99% confidence interval is (4.0834) +/- (.4926)(t(.995, df = #obs-#parameters) In NWK we look up t(.995, 21) = 2.831. The 99% confidence interval has endpoints 4.0834 +/- 1.395 or (2.689, 5.478). (ii) Another way to calculate the value of the treatment effect is to notice that the F-value for C3 in the ancova output is the square of the t-ratio for C3 in the regression of C1 on C3 and C2. F = MS(C3)/MS(Error) = 100.042/1.456 = 68.71 t = (68.71)**.5 = 8.29 Since we know the standard error of the coefficient of C3 in the regression of C1 on C3 and C2 and its t-ratio, we can solve for the coefficient: coefficient of C3 = treatment effect = 8.29*.4926 = 4.0836 Now we can calculate the confidence interval as we had done above. (iii) use formula for adjusted mean difference: Ybar(1) - Ybar(0) - betahatpooled( Xbar(1) - Xbar(0)) betahatpooled, pooled slope from ancova routine printout is .2367. Ybar() Xbar(), within-group means on outcome and covariate from describe. Substitute values. PART II-- Comparing Nonparallel Regesssion Lines Analysis MTB > read 'retention.dat' c1-c3 24 ROWS READ ROW C1 C2 C3 1 3 5 0 2 4 5 0 3 5 5 0 4 4 10 0 . . . Do a simple description. Note covariate means are identical. Should be close in random assignment of subjects. MTB > describe c1-c2; SUBC> by c3. C3 N MEAN MEDIAN TRMEAN STDEV SEMEAN C1 0 12 5.833 5.500 5.800 1.850 0.534 1 12 9.917 10.000 9.900 1.782 0.514 C2 0 12 12.50 12.50 12.50 5.84 1.69 1 12 12.50 12.50 12.50 5.84 1.69 C3 MIN MAX Q1 Q3 C1 0 3.000 9.000 4.250 7.750 1 7.000 13.000 8.250 11.000 C2 0 5.00 20.00 6.25 18.75 1 5.00 20.00 6.25 18.75 Form interaction term for regression MTB > let c4 = c2*c3 Do the general regression allowing different within group slopes The subcommands pick up useful quantities for the procedures below. Note within group slopes are nearly identical. Therefore treatment effect is only weakly a function of the covariate. MTB > brief 3 MTB > regress c1 3 c3 c2 c4; SUBC> mse k1; SUBC> coefficients c10; SUBC> xpxinv m1. The regression equation is C1 = 2.50 + 4.83 C3 + 0.267 C2 - 0.0600 C4 Predictor Coef Stdev t-ratio p Constant 2.5000 0.8646 2.89 0.009 C3 4.833 1.223 3.95 0.001 C2 0.26667 0.06314 4.22 0.000 C4 -0.06000 0.08929 -0.67 0.509 s = 1.223 R-sq = 82.7% R-sq(adj) = 80.1% Analysis of Variance SOURCE DF SS MS F p Regression 3 142.725 47.575 31.82 0.000 Error 20 29.900 1.495 Total 23 172.625 SOURCE DF SEQ SS C3 1 100.042 C2 1 42.008 C4 1 0.675 Obs. C3 C1 Fit Stdev.Fit Residual St.Resid 1 0.00 3.000 3.833 0.591 -0.833 -0.78 2 0.00 4.000 3.833 0.591 0.167 0.16 3 0.00 5.000 3.833 0.591 1.167 1.09 4 0.00 4.000 5.167 0.387 -1.167 -1.01 5 0.00 5.000 5.167 0.387 -0.167 -0.14 6 0.00 6.000 5.167 0.387 0.833 0.72 7 0.00 5.000 6.500 0.387 -1.500 -1.29 8 0.00 6.000 6.500 0.387 -0.500 -0.43 9 0.00 8.000 6.500 0.387 1.500 1.29 10 0.00 7.000 7.833 0.591 -0.833 -0.78 11 0.00 8.000 7.833 0.591 0.167 0.16 12 0.00 9.000 7.833 0.591 1.167 1.09 13 1.00 7.000 8.367 0.591 -1.367 -1.28 14 1.00 8.000 8.367 0.591 -0.367 -0.34 15 1.00 9.000 8.367 0.591 0.633 0.59 16 1.00 9.000 9.400 0.387 -0.400 -0.34 17 1.00 10.000 9.400 0.387 0.600 0.52 18 1.00 11.000 9.400 0.387 1.600 1.38 19 1.00 8.000 10.433 0.387 -2.433 -2.10R 20 1.00 11.000 10.433 0.387 0.567 0.49 21 1.00 12.000 10.433 0.387 1.567 1.35 22 1.00 10.000 11.467 0.591 -1.467 -1.37 23 1.00 11.000 11.467 0.591 -0.467 -0.44 24 1.00 13.000 11.467 0.591 1.533 1.43 R denotes an obs. with a large st. resid. MTB > print k1 K1 1.49500 Here's the variance-covariance matrix for the betahat MTB > multiply k1 m1 m2 MTB > print m2 MATRIX M2 0.74750 -0.74750 -0.04983 0.04983 -0.74750 1.49500 0.04983 -0.09967 -0.04983 0.04983 0.00399 -0.00399 0.04983 -0.09967 -0.00399 0.00797 Now D(X), the sample estimate of the treatment effect as a function of the covariate is D(X) = 4.83 - 0.0600*X which you can see does not strongly change for the various values of X (which ranges between 5 and 20). (4.5 to 3.6) ------------------ Inferences for D(X) Pick-a-point for X=12.5 (note that this also corresponds to ancova estimate as Ca = 12.5-- e.g. .09967/.00797) I di the arithmetic and substitutions in Mathematica (rather than a calculator) so I'd a record to include in the solutions (Here d[X] is our D(X) etc) In[3]:= d[X] Out[3]= 4.833 - 0.06 X In[13]:= d[12.5] Out[13]= 4.083 In[4]:= vard[X] 2 Out[4]= 0.248562 + 0.00797 (-12.5056 + X) In[14]:= Sqrt[vard[12.5]] Out[14]= 0.49856 So 99% confidence interval for Delta(12.5) has endpoints 4.083 +/- 2.845*.49856 4.083 +/- 1.418 ------------------------ Construct Simultaneous confidence interval for Delta(X) Confidence bands have form D(X) +/- Sqrt[2*fCrit220*varD[X]] In[4]:= vard[X] 2 Out[4]= 0.248562 + 0.00797 (-12.5056 + X) In[7]:= fCrit220 = 3.49 Out[7]= 3.49 I used these table command to get the values of the confidence functions for X values from 0 to 30. In[9]:= Table[{ix, d[ix] + Sqrt[2*fCrit220*vard[ix]]}, {ix, 0, 30, 5.}] In[11]:= Table[{ix, d[ix] - Sqrt[2*fCrit220*vard[ix]]}, {ix, 0, 30, 5.}] X D(X) Lower CI Upper CI 0 4.833 1.60266 8.06334 5. 4.533 2.32644 6.73956 10. 4.233 2.78931 5.67669 15. 3.933 2.4904 5.3756 20. 3.633 1.42858 5.83742 25. 3.333 0.105091 6.56091 30. 3.033 -1.29838 7.36438 Throughout the range of these data the confidence bands for Delta(X) do not include 0. I.e. all observed values of X (5 to 20) are in the Region of Significance. As we saw from part I ancova properly reports a strong treatment effect For reference the ancova routine for these data MTB > ancova c1 = c3; SUBC> covariates c2. Factor Levels Values C3 2 0 1 Analysis of Covariance for C1 Source DF ADJ SS MS F P Covariates 1 42.008 42.008 28.85 0.000 C3 1 100.042 100.042 68.71 0.000 Error 21 30.575 1.456 Total 23 172.625 Covariate Coeff Stdev t-value P C2 0.2367 0.0441 5.371 0.000 -------------------- -------------------------------------------------------------------- END HW3 solutions