Education 257 HW3    Feb 15 2005    ("Due" March 7 2005)

Part I. Multiple Regression

1. The file 'hospital.dat' contains data on days hospitalized (X in C1) and a prognosis index (Y in C2) for 15 severely injured patients. A hospital administrator wants to develop a prediction equation for the long term prognosis using the length of the hospital stay.

(a) Develop a prediction equation by straightening the scatterplot and using a straight-line fit. Give the fit and an interval estimate for a patient hospitalized 10 days. Repeat for 60 days hospitalization.

(b) For this same problem develop a prediction equation for the long term prognosis by fitting a polynomial. Compare the fits and the interval estimates for expected prognosis for a patient hospitalized 10 days from the two approaches: polynomial fit vs. straightening the scatterplot and using a straight-line fit as in part (a). Repeat the comparison for 60 days hospitalization.

===============================================================
2. Bodyfat data revisited

By referring to file bodyfat.out and/or to the output in NWK (or by redoing the analyses), let's use this example to once again illustrate the vagaries of multiple regression coefficients (and improper attempts to interpret them).

Which of the three predictors--triceps X1, thigh X2 or midarm X3--is the best single predictor of bodyfat? What is the regression coefficient for that predictor in a single predictor equation? What is the corresponding t-statistic for that coefficient?

Now consider the regression using both triceps and thigh as predictors. Compare the coefficients (and their t-statistics) from this multiple regression with the corresponding single predictor equations.

Now consider the multiple regression using all three predictors. For triceps and thigh, compare the coefficients (and their t-statistics) from this multiple regression with the results from the previous regression equations.

To decrease bodyfat does one puff up one's thighs?
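Illustration (not part of the assignment): the sketch below shows, on a small made-up data set, how a predictor's coefficient can change sign between a single predictor fit and a two-predictor fit when the predictors are highly correlated. The eight cases, the variable names, and the helper functions are all assumptions invented for this illustration; they are not the bodyfat data, and the sketch is not a substitute for the Minitab analyses asked for above.

```python
# Illustrative only: these eight cases are invented, NOT the bodyfat data.
# x2 tracks x1 closely, and y is built exactly as 2 - 1*x1 + 3*x2.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
wiggle = [0.5, -0.5, 0.3, -0.3, 0.4, -0.4, 0.2, -0.2]
x2 = [a + d for a, d in zip(x1, wiggle)]
y = [2.0 - 1.0 * a + 3.0 * b for a, b in zip(x1, x2)]


def mean(v):
    return sum(v) / len(v)


def cross(u, v):
    """Sum of cross-products of deviations from the means."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


def simple_slope(x, y):
    """Slope of the straight-line fit of y on a single predictor x."""
    return cross(x, y) / cross(x, x)


def two_predictor_slopes(x1, x2, y):
    """Least-squares slopes for y on x1 and x2 (2x2 normal equations)."""
    s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)
    s1y, s2y = cross(x1, y), cross(x2, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


slope_x1_alone = simple_slope(x1, y)
b1, b2 = two_predictor_slopes(x1, x2, y)
corr_12 = cross(x1, x2) / (cross(x1, x1) * cross(x2, x2)) ** 0.5

print(f"corr(x1, x2)         = {corr_12:.3f}")
print(f"x1 slope, x1 alone   = {slope_x1_alone:.3f}")
print(f"x1 slope, x1 and x2  = {b1:.3f}")
print(f"x2 slope, x1 and x2  = {b2:.3f}")
```

In this made-up example x1 alone has a positive slope because it stands in for x2, while its coefficient with x2 held fixed is negative. Neither number by itself says what would happen if x1 were changed for an individual.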
------------------------------------------------------------
3. Patient Satisfaction

Data are described and listed in Problem NWK 6.15, p.254-5 ver4 (p.251 ver5). The data reside in file patient.dat.

From NWK 6.15: A hospital administrator wished to study the relation between patient satisfaction Y (in C1) and X1 patient's age (in C2), X2 an index of severity of illness (in C3), and X3 anxiety level (in C4), where larger values of Y, X2, X3 indicate more satisfaction, more severe illness, and more anxiety.

Do the following parts of problems:
   6.15  a c d
   6.16  a b c
   6.17  a
   9.7 p.393 ver4 or 10.7 p.414 ver5, parts a b

Also, for the fit from 6.15c, verify that the regression coefficients can be obtained from straight line fits to the corresponding partial regression plots. Use the coefficient for X2 as your example.
--------------------------------------------------------------
4. IQ scores and reading ability

The file readiq.dat contains data (from a text) on 60 elementary school boys, 30 of whom were rated as poor or very poor readers--at least 2 years below grade level. The remaining 30 boys read normally, but otherwise resembled the poor readers in terms of schools, age, family background, and other variables. The 30 boys with reading problems consisted of 11 "very poor" readers and 19 who were merely "poor" readers. In the data file c1 = 1 for very poor; c1 = 2 for poor; c1 = 3 for normal.

The relation of reading disability to IQ measures is currently seen not to be as simple as "poor readers have lower intelligence". We have in column c4 the full-scale WISC-R IQ score. In c2 we have the attention/concentration sub-scale score (composed of arithmetic, digit-span, coding subtests). In c3 we have the spatial ability sub-scale score (composed of picture completion, block design, object assembly subtests).

a) Obtain a scatterplot for the attention/concentration and spatial ability scores with the reading ability level (1,2,3) in c1 used to identify each individual (e.g.
c1 = 1 gets an "A" label, etc.).

b) For the normal readers, use the subscale scores in c2 and c3 to form a prediction equation for the full-scale WISC-R scores in c4. What are the coefficients and squared multiple correlation for this regression fit? Plot the residuals versus the fits for this regression. Obtain a 95% prediction interval for the full-scale score for an individual having attention/concentration score of 32 and spatial ability score of 30.
----------------------------------------------------------------
Part II HW3 after next lecture cycle (2/28-)

Regression with Group Membership Variables
------------------------------------------
5. Consider a one-way classification with four levels (I = 4). We are given the population cell means (mu(1) through mu(4)) as: 7, 9, 6, 15. Consider the general linear model setup (with 3 group membership indicators)

   E(Y|G1,G2,G3) = beta0 + beta1*G1 + beta2*G2 + beta3*G3

where
   G1 = 1 if treatment 2,  G1 = 0 otherwise
   G2 = 1 if treatment 3,  G2 = 0 otherwise
   G3 = 1 if treatment 4,  G3 = 0 otherwise

a. Determine the values for the 4 betas in the regression model.
b. Express mu(3) - mu(2) in terms of the betas. Check by numerical substitution.
---------------------------------------------------------------
6. File salary.dat contains data from a salary survey discussed in lecture: c1 is experience, c2 is education level (1 for HS, 2 for BS, 3 for advanced degree), c3 indicates management position (=1) or not, and c4 is the outcome measure salary.

First, code the 3 levels of education using 2 group membership indicators (so that education is not used as an interval scale). In the solutions we use HS as the base (0 0 code).

What is the single best predictor of salary?
Predict salary using experience, education, and management.
Add to the model two management-education interaction terms. Do these terms add significantly to the prediction?
Give an interval estimate of the value of an additional year of experience.
Repeat for an advanced degree in addition to the BS. (That is, the comparison asked for here is between advanced degree and H.S.; the wording is *not* meant to indicate that I want a differential between advanced degree and B.S. That is a harder thing to do in this coding, although it can be done.)
--------------------------------------------------------------------
7. (former quiz question) A study of several hundred professors' salaries in a large American university in 1969 (AER, 1973, p.469) yielded the following prediction equation:

   S = 1900 + 230*B + 18*A + 100*E + 490*D + 190*Y + 50*T - 2400*X

where S is annual salary, B is number of books written, A number of ordinary articles, E number of excellent articles, D number of Ph.D.'s supervised, Y years experience, T = 1 if student evaluations above median, 0 otherwise, X = 1 if female, 0 otherwise.

For a prof with B=A=E=D=X=1 and Y=5, what's the expected change in salary if she goes from very good to poor student evaluations?

Mean salaries were $16,100 for males and $11,200 for females. What is the value of the slope from a simple S on X regression?
-------------------------------------------------------------------
Analysis of Covariance and Extension
--------------------------------------------------------------------
8. A researcher is studying the effect of an incentive on the retention of subject matter and is also interested in the role of time devoted to study. Subjects are randomly assigned to two groups, one receiving (C3 = 1) and the other not receiving (C3 = 0) an incentive. Within these groups, subjects are randomly assigned to 5, 10, 15, or 20 minutes of study (C2) of a passage specifically prepared for the experiment. At the end of the study period, a test of retention (C1) is administered. We treat the study time as a covariate for investigating the differential effects of the incentive.

Part I: ANCOVA

Use the Minitab output below to answer the following questions.
(This is a quiz question from prior year) (for reference raw data are in file retention.dat)

What is the slope of the C1 on C2 regression line for the 12 subjects in the incentive group?
What is the correlation between C1 and C2 for the incentive group?
Construct a 99% confidence interval for the analysis of covariance treatment effect.

MTB > ancova c1 = c3;
SUBC> covariates c2;
SUBC> means c3.

Analysis of Covariance for C1

Source       DF    ADJ SS        MS
Covariates    1    42.008    42.008
C3            1   100.042   100.042
Error        21    30.575     1.456
Total        23   172.625

Covariate    Coeff    Stdev   t-value
C2          0.2367   0.0441     5.371

ADJUSTED MEANS
C3     N       C1
 0    12   5.8333
 1    12   9.9167

MTB > describe c1-c2;
SUBC> by c3.

        C3    N     MEAN   MEDIAN   STDEV
C1       0   12    5.833    5.500   1.850
         1   12    9.917   10.000   1.782
C2       0   12    12.50    12.50    5.84
         1   12    12.50    12.50    5.84

MTB > let c4 = c2*c3
MTB > regress c1 3 c3 c2 c4

The regression equation is
C1 = 2.50 + 4.83 C3 + 0.267 C2 - 0.0600 C4

Predictor       Coef     Stdev
Constant      2.5000    0.8646
C3             4.833     1.223
C2           0.26667   0.06314
C4          -0.06000   0.08929

MTB > regress c1 2 c3 c2

The regression equation is
C1 = 2.87 + ???? C3 + ????? C2

Predictor       Coef     Stdev
Constant      2.8750    0.6517
C3            ??????    0.4926
C2           ???????   0.04406
----------------------------------------
Part II CNRL analysis (optional, more next cycle)

Now let's look at these data from scratch. The full data are in file retention.dat (as described above). Carry out a full comparing nonparallel regression lines analysis. CNRL paper linked on course outline.

Obtain a 99% confidence interval for the effect of the incentive for 12.5 minutes of study. ("pick-a-point" procedure)
Obtain a 95% simultaneous interval for the effect of the incentive over the entire range of study times. (simultaneous J-N procedure)
================================
END HW3