Stat209/Ed260                                        May 29, 2007
D Rogosa

Take Home Problems #2

Due Tuesday June 5 in class, or if you are away, by Statistics Department Fax (address to Rogosa)

Department of Statistics -- Sequoia Hall
390 Serra Mall
Stanford University
Stanford, CA 94305-4065
Phone: (650) 723-2620
Fax: (650) 725-8977
-----------------------------
Usual Honor Code procedures: You may use any inanimate resources--no collaboration. This work is done under Stanford's Honor Code.

Solutions for these problems are to be submitted in hard-copy form. Given that these problems are untimed, some care should be taken in presentation, clarity, format. Especially important is to give full and clear answers to questions, not just to submit unannotated computer output, although relevant output should be included.

Added note: Please start each problem on a new page and keep all material for a problem contiguous. It's fine, for example, to blend notebook paper with printed output; just keep it all together.

There are three problems in this take-home exam.
------------------------------

Problem 1. Comparing regressions
The question "Do statisticians have prestige?" is left unanswered.

The data for this question are obtained from the Prestige data set in the car package.

Prestige of Canadian Occupations
Source: Canada (1971) Census of Canada. Vol. 3, Part 6. Statistics Canada [pp. 19-1–19-21].

Description
The Prestige data frame has 102 rows and 6 columns. The observations are occupations (ranging from cooks to lawyers).
-------------------------
This data frame contains the following columns:

education  Average education of occupational incumbents, years, in 1971.
income     Average income of incumbents, dollars, in 1971.
women      Percentage of incumbents who are women.
prestige   Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960s.
census     Canadian Census occupational code.
type       Type of occupation.
           A factor with levels: bc, Blue Collar; prof, Professional, Managerial, and Technical; wc, White Collar.

This original data set (before any modifications indicated below) is in the class directory as
http://www-stat.stanford.edu/~rag/stat209/prestigeorig.dat
--------------------
In the output below I eliminated the 4 rows (occupations) that had missing type, and I combined prof and wc into pwc (so type now has two levels: bc, pwc).

From the output below (or check with your own analyses) answer the following, which may also require some additional analyses:

a. What is the observed difference between mean prestige in the bc and pwc groups? What is the analysis of covariance estimate (using education as the covariate) of the difference in prestige between bc and pwc? Comment on the similarities or differences between the two results.

b. Does the difference in prestige for bc vs pwc appear to depend on the level of education? Explain.

c. Allowing the difference in prestige for bc vs pwc to depend on the level of education, give a point and interval estimate of the difference at education = 10. Why do you think I picked the value of 10?
--------------------
OUTPUT

> table(type)
type
 bc pwc
 44  54
> as.numeric(type)   #makes pwc = 2, bc = 1
 [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2
[39] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[77] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1
> G = as.numeric(type) - 1
> reg2 = lm(prestige ~ G + education)
> summary(reg2)

Call:
lm(formula = prestige ~ G + education)

Residuals:
     Min       1Q   Median       3Q      Max
-22.2519  -5.6830   0.8985   5.7192  16.3343

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.8248     4.5295  -3.935 0.000158 ***
G            -6.7978     2.8611  -2.376 0.019513 *
education     6.3823     0.5204  12.265  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
0.1 ' ' 1

Residual standard error: 8.378 on 95 degrees of freedom
Multiple R-Squared: 0.7648,     Adjusted R-squared: 0.7598
F-statistic: 154.4 on 2 and 95 DF,  p-value: < 2.2e-16

> tapply(education, G, summary)
$"0"
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  6.380   7.570   8.350   8.359   8.923  10.930

$"1"
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   9.17   11.15   12.43   12.78   14.52   15.97

> reg4 = lm(prestige ~ G + education+ I(G*education))
> summary(reg4)

Call:
lm(formula = prestige ~ G + education + I(G * education))

Residuals:
     Min       1Q   Median       3Q      Max
-19.7095  -6.0449   0.7366   6.3012  16.1411

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)        -4.294      9.166  -0.468   0.6406
G                 -26.337     11.885  -2.216   0.0291 *
education           4.764      1.086   4.386 3.02e-05 ***
I(G * education)    2.089      1.234   1.693   0.0938 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.297 on 94 degrees of freedom
Multiple R-Squared: 0.7717,     Adjusted R-squared: 0.7644
F-statistic: 105.9 on 3 and 94 DF,  p-value: < 2.2e-16

> reg5a = lm(prestige[type =="pwc"] ~ education[type == "pwc"])
> summary(reg5a)

Call:
lm(formula = prestige[type == "pwc"] ~ education[type == "pwc"])

Residuals:
     Min       1Q   Median       3Q      Max
-19.5157  -5.9485   0.7724   6.0602  15.7194

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
(Intercept)              -30.6307     7.4539  -4.109 0.000141 ***
education[type == "pwc"]   6.8525     0.5767  11.882  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.174 on 52 degrees of freedom
Multiple R-Squared: 0.7308,     Adjusted R-squared: 0.7256
F-statistic: 141.2 on 1 and 52 DF,  p-value: < 2.2e-16

> reg5b = lm(prestige[type =="bc"] ~ education[type == "bc"])
> summary(reg5b)

Call:
lm(formula = prestige[type == "bc"] ~ education[type == "bc"])

Residuals:
     Min       1Q   Median       3Q      Max
-19.7095  -6.0923   0.5828   6.4920  16.1411

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)               -4.294      9.331  -0.460    0.648
education[type == "bc"]    4.764      1.106   4.308  9.7e-05 ***
---
Signif.
codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.447 on 42 degrees of freedom
Multiple R-Squared: 0.3064,     Adjusted R-squared: 0.2899
F-statistic: 18.56 on 1 and 42 DF,  p-value: 9.709e-05
----------------------------------------------------------------------

Problem 2. Matching exercise with Panel Study of Income Dynamics (PSID). Data used in Lab 3.

Mroz87    U.S. Women's Labor Force Participation

The Mroz87 data frame contains data on 753 married women. These data are collected within the "Panel Study of Income Dynamics" (PSID). Of the 753 observations, the first 428 are for women with positive hours worked in 1975, while the remaining 325 observations are for women who did not work for pay in 1975.

I took these data from an R package and placed them in the file
http://www-stat.stanford.edu/~rag/stat209/Mroz87.dat
or you can obtain them by installing the micEcon package.
-------------------------------
Format
This data frame contains the following columns:

lfp       Dummy variable for labor-force participation.
hours     Wife's hours of work in 1975.
kids5     Number of children 5 years old or younger.
kids618   Number of children 6 to 18 years old.
age       Wife's age.
educ      Wife's educational attainment, in years.
wage      Wife's average hourly earnings, in 1975 dollars.
repwage   Wife's wage reported at the time of the 1976 interview.
hushrs    Husband's hours worked in 1975.
husage    Husband's age.
huseduc   Husband's educational attainment, in years.
huswage   Husband's wage, in 1975 dollars.
faminc    Family income, in 1975 dollars.
mtr       Marginal tax rate facing the wife.
motheduc  Wife's mother's educational attainment, in years.
fatheduc  Wife's father's educational attainment, in years.
unem      Unemployment rate in county of residence, in percentage points.
city      Dummy variable = 1 if live in large city, else 0.
exper     Actual years of wife's previous labor market experience.
nwifeinc  Non-wife income.
wifecoll  Dummy variable for wife's college attendance.
huscoll   Dummy variable for husband's college attendance.
-------------------------
As a matching exercise, try to match the lfp groups (no hours worked vs some hours worked) on the set of covariates. We will limit this exercise to considering 7 matching variables:

    age  educ  hushrs  huswage  faminc  mtr  motheduc

a. First examine group (lfp, no lfp) differences on these 7 background variables.

b. Compute a propensity score on lfp status for each of the 753 women using these 7 background variables. Display side-by-side boxplots or other numerical comparisons of the distributions of the propensity scores in each group. Divide the 753 propensity scores into quintiles (~150 each) and display the number of lfp and non-lfp women in each quintile. Is there considerable overlap in the propensity score in the two groups?

c. How well does the propensity score function to reduce pre-existing differences on the background variables? Consider two of the matching variables: huswage and motheduc. For each propensity quintile, compute the lfp/no-lfp differences on these two variables. Are the lfp/no-lfp differences on these two variables that existed in the full data set reduced by stratifying on the propensity score? Give some numerical measures.
-------------------------------------------------------------------

Problem 3. Short(er) Questions

Part I. Non-compliance. Class example 5/15 (handout).

Adapted from: An introduction to instrumental variables for epidemiologists, Sander Greenland, International Journal of Epidemiology 2000;29:722-729

Greenland discusses randomized trials with non-compliance where Z indicates treatment assignment, which is randomized; X indicates treatment received, which is affected but not fully determined by assignment Z. To illustrate, Greenland presents in his Table 1 individual one-year mortality data from a cluster-randomized trial of vitamin A supplementation in childhood.
Of 450 villages, 229 were assigned to a treatment in which village children received two oral doses of vitamin A; children in the 221 control villages were assigned none. This protocol resulted in 12,094 children assigned to the treatment (Z = 1) and 11,588 assigned to the control (Z = 0). Only children assigned to treatment received the treatment; that is, no one had Z = 0 and X = 1. Unfortunately, 2419 (20%) of those assigned to the treatment did not receive the treatment (had Z = 1 and X = 0), resulting in only 9675 receiving treatment (X = 1).

Class 5/15 handout has a depiction and Greenland's table of results. Use as the outcome measure Y the Deaths per 100,000 within one year (labeled Risk in Greenland's Table 1).

a. Give the ITT (intent-to-treat) estimate of the effect of vitamin A on Risk.

b. What is the compliance rate in the treatment group (Z=1)? In the control group (Z=0)?

c. What is the instrumental variables estimate (following Angrist, Imbens, and Rubin) of the effect of vitamin A on Risk? What interpretation is given to this estimate (cf. Booil Jo presentation material)? Compare with part (a) result and comment.
----------------------------------
Part II. Consider the simple regression model

    y = beta_0 + beta_1 x + u

and let z be a binary instrumental variable for x (z=1 for those in group 1 and z=0 for those in group 0). Show that the IV estimator for beta_1 can be written as

    mean(y | z=1) - mean(y | z=0)
    _____________________________
    mean(x | z=1) - mean(x | z=0)

This estimator, known as the grouping estimator in econometrics, dates back to Wald (1940).

Extra credit: create a small artificial data set that illustrates/confirms this mathematical result.
----------------------------------
END TH2