Education 257 FINAL PROBLEMS, Spring 2005 JUNE 1, 2005
Solutions for these problems are to be submitted in hard-copy
form. Given that these problems are untimed, some care should be
taken in presentation, clarity, and format. It is especially important
to give full and clear answers to questions, not just to submit
unannotated computer output, although relevant output should
be included.
You may use any inanimate resources--no collaboration. This
work is done under Stanford's Honor Code.
Please read the questions carefully and answer the question that
is asked.
Papers will be scored into 3 categories: "Excellent" indicates
successful completion of all parts of all questions (within
perhaps one or two very trivial arithmetic errors);
"Satisfactory" indicates a good attempt was made at all parts of
all problems, but there were some serious errors or omissions;
"Incomplete" indicates inadequate effort or performance.
Place the completed hard copy in Rogosa's Cubberley or Sequoia Hall mailbox
by 5PM Friday 6/10.
Data sets: I took the extra effort to link data directly
from this assignment document. The data reside in the class HW directory;
the URL is http://www-stat.stanford.edu/~rag/ed257/hw/[file]
One loose end: Course Evaluations for students not in the School
of Education. I will place a few blank forms in my Sequoia mailbox
and these can be returned to the Registrar's office, if you have not
already done one.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Problem 1, Model Building, Variable Selection
Can anyone do math? Was it your parents' fault?
Relation of educational achievement of students to the home
environment. Data on average mathematics proficiency (MATHPROF) and
the home environment variables were obtained from the 1990 National
Assessment of Educational Progress for 37 states, the District of
Columbia, Guam, and the Virgin Islands.
File mathnaep.dat in the course HW directory contains
the educational achievement of eighth-grade
students in mathematics and the following five explanatory variables
(all state-level variables):
C1 MATHPROF average mathematics proficiency
C2 PARENTS percentage of eighth-grade students with both parents living at home
C3 HOMELIB percentage of eighth-grade students with three or more types of
reading materials at home (books, encyclopedias, magazines, newspapers)
C4 READING percentage of eighth-grade students who read more than 10
pages a day
C5 TVWATCH percentage of eighth-grade students who watch TV for six
hours or more per day
C6 ABSENCES percentage of eighth-grade students absent three days or
more last month
a. Start with basic data analysis due diligence. Examine scatterplots
for anomalous observations and for curvature. Are any transformations needed?
Obtain a correlation matrix of the predictor variables and outcome.
What is the single best predictor of MATHPROF?
b. Use best-subsets regression methods to identify useful prediction models.
What is your best candidate? Compare with the second best candidate.
For your best model, comment on the observations with the largest
standardized residuals.
c. Compare your results in part b with the use of Forward Stepwise regression
to determine a prediction model.
d. For the full set of predictor variables, are there any logical candidates
for data reduction (i.e., forming composites)? Will any improvements in the
regression fits be obtained from using a composite?
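
As a rough illustration of the correlation screening in part a, here is a
minimal Python sketch. The column names Y, X1, X2 and their values are
hypothetical placeholders, not the mathnaep.dat data, and the layout of that
file is not assumed here. The function picks the predictor with the largest
absolute correlation with the outcome, which is also the variable that a
standard forward stepwise procedure would enter first.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Nested dict of pairwise correlations for a dict of named columns."""
    names = list(columns)
    return {r: {c: pearson(columns[r], columns[c]) for c in names} for r in names}


def best_single_predictor(columns, outcome):
    """Predictor with the largest absolute correlation with the outcome."""
    corr = correlation_matrix(columns)
    candidates = [name for name in columns if name != outcome]
    return max(candidates, key=lambda name: abs(corr[outcome][name]))


# Hypothetical values, used only to exercise the functions.
demo = {
    "Y": [1, 2, 3, 4, 5],
    "X1": [1, 2, 3, 5, 4],
    "X2": [5, 3, 4, 1, 2],
}
for name, row in correlation_matrix(demo).items():
    print(name, {k: round(v, 2) for k, v in row.items()})
print("best single predictor:", best_single_predictor(demo, "Y"))
```

A correlation matrix only screens pairwise linear association; it does not
replace the scatterplot checks for curvature and anomalous observations.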
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Problem 2 "Simple" Contingency Tables
Cheating Father Time. In the SF Chronicle of May 20, 2001, the feature
"Cheating Father Time: Training, nutrition and medical advances
prolonging careers" provides the following data on the increasing
longevity of professional athletes.
1990
                                  Number of players   Percent of players
League                            35 and older        35 and older
Major League Baseball              94                  8.4%
National Football League           12                  1%
National Basketball Association    14                  3.6%
National Hockey League             14                  2.4%

2000
                                  Number of players   Percent of players
League                            35 and older        35 and older
Major League Baseball             162                 11.7%
National Football League           44                  2.7%
National Basketball Association    41                  9.3%
National Hockey League             56                  7.8%
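
The tables give counts and percentages but not league sizes. One way to back
out the missing cell of a 2x2 table is total = count / (percent / 100). The
Python sketch below does this and then forms a relative risk. Two assumptions
are mine, not the article's: totals are rounded to the nearest whole player
(the published percentages are themselves rounded), and the relative risk is
taken as 2000 relative to 1990. The short league labels are also my own.

```python
# Counts and percentages copied from the two tables above.
# Each entry: league -> (players aged 35+, percent of league aged 35+).
DATA_1990 = {"MLB": (94, 8.4), "NFL": (12, 1.0), "NBA": (14, 3.6), "NHL": (14, 2.4)}
DATA_2000 = {"MLB": (162, 11.7), "NFL": (44, 2.7), "NBA": (41, 9.3), "NHL": (56, 7.8)}


def implied_total(count_35_plus, percent_35_plus):
    """Approximate league size implied by a count and its percentage."""
    return round(count_35_plus / (percent_35_plus / 100.0))


def two_by_two(league):
    """Rows: year (1990, 2000); columns: (35 and older, under 35)."""
    rows = []
    for data in (DATA_1990, DATA_2000):
        older, pct = data[league]
        total = implied_total(older, pct)
        rows.append((older, total - older))
    return rows


def relative_risk(table):
    """Proportion aged 35+ in 2000 divided by the proportion in 1990."""
    (a, b), (c, d) = table
    return (c / (c + d)) / (a / (a + b))


for league in DATA_1990:
    table = two_by_two(league)
    print(league, table, round(relative_risk(table), 2))
```

Because the totals are reconstructed from rounded percentages, the under-35
counts are approximate, and that should be noted in any write-up.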
a. For each of the four leagues construct a 2x2 table: player age
(35 and older, under 35) and year (1990, 2000). For each table
calculate the relative risk of playing (at or) past 35 in the
two decades.
b. Consider the year 2000 data. For the 2x4 table of player
age by sport, test the null hypothesis of independence. Explain
what that null hypothesis actually is saying. Construct a
display of actual counts, expected counts under independence,
and adjusted residuals from the independence model for each cell
in the 2x4 structure.
c. Calculate the following probability:
Given that a professional athlete in one of these four
leagues is still playing in the year 2000 at age 35 or over,
what's the probability he's a baseball player? Do you
have all the information you need to calculate this
probability?
d. Let's do a meta-analysis. Consider the four leagues as four
separate studies. Estimate the overall odds ratio for the 2x2 tables
in part a. Give a point estimate of the overall odds ratio and carry
out a test that the overall odds ratio is different from 1.0
(independence of year and playing past 35).
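
For the display requested in part b, the quantities can be computed by hand.
A minimal Python sketch follows. It uses hypothetical 2x3 counts rather than
the league data, and the function name is my own. Expected counts are
(row total)(column total)/n, and the adjusted residual divides the raw
residual by its estimated standard error under independence.

```python
import math


def independence_summary(table):
    """Expected counts, Pearson chi-square, df, and adjusted residuals.

    `table` is a list of rows of observed counts (any r x c shape).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    expected, adjusted = [], []
    chi_square = 0.0
    for i, row in enumerate(table):
        exp_row, adj_row = [], []
        for j, observed in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            # Standard error of the raw residual under independence.
            se = math.sqrt(e * (1 - row_totals[i] / n) * (1 - col_totals[j] / n))
            exp_row.append(e)
            adj_row.append((observed - e) / se)
            chi_square += (observed - e) ** 2 / e
        expected.append(exp_row)
        adjusted.append(adj_row)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return expected, chi_square, df, adjusted


# Hypothetical 2x3 counts, used only to exercise the function.
demo = [[10, 20, 30], [20, 20, 20]]
expected, chi_square, df, adjusted = independence_summary(demo)
print(expected)
print(round(chi_square, 3), df)
print([[round(r, 2) for r in row] for row in adjusted])
```

Adjusted residuals larger than about 2 in absolute value point to cells that
depart from independence more than chance alone would suggest.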
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Problem 3 Modeling Multivariate Categorical Data
But would you want to matriculate?
We consider data on admissions for Fall 1973 graduate study at
U.C. Berkeley in the six largest departments. These data among others
were the subject of extensive litigation on gender discrimination
a few years back.
The data on each applicant consist of the applicant's gender (G),
whether admitted (A), and major department (D).
        Whether admitted, male    Whether admitted, female
Dept    Yes      No               Yes      No
a       512      313               89       19
b       353      207               17        8
c       120      205              202      391
d       138      279              131      244
e        53      138               94      299
f        22      351               24      317
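
For computation, the table above can be entered directly. The Python sketch
below collapses the counts over department and computes a sample odds ratio
with a Wald interval on the log scale. The variable and function names are
mine, and the 1.96 multiplier assumes a 95% interval; this is one common
recipe, not the only acceptable one.

```python
import math

# Counts copied from the table above:
# department -> (male admitted, male rejected, female admitted, female rejected)
BERKELEY = {
    "a": (512, 313, 89, 19),
    "b": (353, 207, 17, 8),
    "c": (120, 205, 202, 391),
    "d": (138, 279, 131, 244),
    "e": (53, 138, 94, 299),
    "f": (22, 351, 24, 317),
}


def marginal_ag(data):
    """Collapse over department: [[male yes, male no], [female yes, female no]]."""
    sums = [sum(cells[k] for cells in data.values()) for k in range(4)]
    return [[sums[0], sums[1]], [sums[2], sums[3]]]


def odds_ratio_with_ci(table, z=1.96):
    """Sample odds ratio and Wald interval computed on the log scale."""
    (a, b), (c, d) = table
    estimate = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = math.exp(math.log(estimate) - z * se_log)
    high = math.exp(math.log(estimate) + z * se_log)
    return estimate, low, high


ag = marginal_ag(BERKELEY)
print(ag)
print(tuple(round(v, 3) for v in odds_ratio_with_ci(ag)))
```

The odds ratio here is the odds of admission for males divided by the odds
for females; swapping the rows inverts it.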
a. To start, construct the marginal AG table (a 2x2 table of gender by admit
status). Carry out a test for independence and obtain a point and
interval estimate of the odds ratio for admittance for this marginal AG table.
What might this result be taken to indicate about gender equity etc. in the
admit process? Are you outraged yet?
b. Now use the breakdown by department. Obtain the odds ratio for admittance
within each of the 6 departments. Does Simpson's paradox appear to be present
in these data? Why or why not?
c. Use Cochran-Mantel-Haenszel procedures to:
test whether conditional independence holds for AG
estimate a common odds ratio for the six departments
use the Breslow-Day statistic to test whether the AG odds-ratio
is the same for the 6 departments
d. For the possible A G D log-linear models, which model terms
would indicate gender discrimination?
e. Fit the set of A G D log-linear models using a procedure such as
SAS Proc Genmod, and identify what you regard as the most appropriate
model. Does this model confirm gender discrimination in admissions?
Examine the log-likelihood chi-square and table the fits and adjusted
residuals for this model. Are you satisfied with this model?
f. Set aside department a and rerun the log-linear model analysis.
Interpret your preferred model in terms of gender discrimination
in admissions. Also comment on the admissions preferences in dept a.
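
To make the stratified calculations in parts b and c concrete, here is a
minimal Python sketch of the within-stratum odds ratios, the Mantel-Haenszel
common odds ratio, and the Cochran-Mantel-Haenszel chi-square (computed here
without a continuity correction, which is my choice). The two strata are
hypothetical counts constructed so that each stratum has odds ratio 1 while
the collapsed table does not, which is the pattern behind Simpson's paradox.
They are not the Berkeley data.

```python
def stratum_odds_ratios(tables):
    """Sample odds ratio a*d / (b*c) for each 2x2 table [[a, b], [c, d]]."""
    return [(a * d) / (b * c) for (a, b), (c, d) in tables]


def collapse(tables):
    """Sum the strata cell by cell to form the marginal 2x2 table."""
    a = sum(t[0][0] for t in tables)
    b = sum(t[0][1] for t in tables)
    c = sum(t[1][0] for t in tables)
    d = sum(t[1][1] for t in tables)
    return [[a, b], [c, d]]


def mantel_haenszel(tables):
    """Return (MH common odds ratio, CMH chi-square, no continuity correction)."""
    num = den = 0.0
    diff = var = 0.0
    for (a, b), (c, d) in tables:
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
        row1, row2, col1, col2 = a + b, c + d, a + c, b + d
        diff += a - row1 * col1 / n  # observed minus expected in cell (1,1)
        var += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    return num / den, diff * diff / var


# Hypothetical strata, used only to exercise the functions.
strata = [[[80, 20], [8, 2]], [[2, 8], [20, 80]]]
print(stratum_odds_ratios(strata))              # within-stratum odds ratios
print(stratum_odds_ratios([collapse(strata)]))  # odds ratio after collapsing
print(mantel_haenszel(strata))
```

The CMH statistic is referred to a chi-square distribution with 1 degree of
freedom. The Breslow-Day test and the log-linear fits are not sketched here;
a package such as SAS is the practical route for those.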
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Problem 4 Prediction of Binary Outcomes
If you're not crazy yet, you'll do ok.
A psychologist conducted a study to examine the nature of the relation,
if any, between an employee's emotional stability (C2) and the
employee's ability to perform in a task group (C1). Data on 27 employees
are in file stable.dat in the course HW directory.
Emotional stability was measured by a written test, and ability to perform
in a task group (C1 = 1 if able, C1 = 0 if unable) was evaluated by the
supervisor.
a. From an OLS fit for a straight-line relation for predicting C1 from C2,
what level of emotional stability seems necessary for a probability of
successful performance of .70?
b. Carry out a fit of a logistic response function to these data.
What is the predicted probability of success for an employee with the
median value of emotional stability?
For the logistic fit, what level of emotional stability seems necessary
for a probability of successful performance of .75?
c. For both the OLS regression and the logistic curve estimation,
list the fitted-values for probability of success using the emotional
stability values in these data (C2). Comment on the
similarity of these two fits.
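
Both fits can be inverted to find the predictor value that gives a target
probability p. For the straight line, x = (p - b0) / b1; for the logistic
curve, x = (log(p / (1 - p)) - b0) / b1. A minimal Python sketch follows.
The small data set and the logistic coefficients are hypothetical and are
not taken from stable.dat; the logistic coefficients themselves would come
from a package fit.

```python
import math


def ols_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sxy / sxx
    return my - b1 * mx, b1


def x_for_probability_linear(b0, b1, p):
    """Predictor value at which the fitted line equals p."""
    return (p - b0) / b1


def logistic_probability(b0, b1, x):
    """Fitted probability from a logistic response function."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def x_for_probability_logistic(b0, b1, p):
    """Predictor value at which the logistic curve equals p."""
    return (math.log(p / (1.0 - p)) - b0) / b1


# Hypothetical 0/1 outcome and predictor, used only to exercise the functions.
x_demo = [0, 1, 2, 3]
y_demo = [0, 0, 1, 1]
b0, b1 = ols_line(x_demo, y_demo)
print(b0, b1, x_for_probability_linear(b0, b1, 0.70))

# Hypothetical logistic coefficients (not estimated from any data here).
g0, g1 = -3.0, 0.5
x75 = x_for_probability_logistic(g0, g1, 0.75)
print(x75, logistic_probability(g0, g1, x75))
```

Note that the straight-line fit can give fitted values below 0 or above 1,
while the logistic curve cannot; that difference is worth checking when
comparing the two sets of fitted values.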
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
END 257 !