Education 351B  Autumn 2011
    Statistical issues in testing and assessment


David Rogosa, Sequoia 224, rag{AT}stat{DOT}stanford{DOT}edu
Course web page: http://www-stat.stanford.edu/~rag/ed351B/

Registrar's information
EDUC 351B: Statistical issues in testing and assessment
Seminar
Units: 2-3
Room: 50-52H
Schedule: Monday 3:15-5:05pm
Grading Basis: Letter-S/NC

Course Description:
The new book by Howard Wainer, "Uneducated Guesses: Using Evidence to Uncover 
Misguided Education Policies" is the basis for this seminar. Also included will 
be supporting research literature and data analysis activities for topics 
such as college admissions, methods for missing data, assessment of 
achievement gaps, and the use of value-added analysis.
see http://www-stat.stanford.edu/~rag/ed351B/ 


Text:
Uneducated Guesses: Using Evidence to Uncover Misguided Education Policies.   Howard Wainer (Author)
Amazon page (available in paper and Kindle):
http://www.amazon.com/Uneducated-Guesses-Evidence-Misguided-Education/dp/0691149283
Publisher: Princeton University Press (August 21, 2011), ISBN-10: 0691149283 ISBN-13: 978-0691149288


Week 1  9/26.   Organization; meet and greet.

In the news
 10/24 (chap 3 content)
 Washington Post: The problems with the PSAT and National Merit program
 NYU Exiting National Merit Scholarship Citing Test Process
          Background: ETS PSAT page
  Anecdotal NYC on placement tests as instruments of upward mobility (and gender equity)---Up-and-Crammers
Matt Damon doesn't much like testing either
Chap 8. 13 More Arrested in SAT Cheating Scandal
Chap 9. In Tennessee, Following the Rules for Evaluations Off a Cliff, New York Times, Nov 6
        Teachers Are Put to the Test: More States Tie Tenure, Bonuses to New Formulas for Measuring Test Scores, Wall St Journal, Sept 13
12/7 for Chap 1 PACE report: State Standards, the SAT, and Admission to the University of California

Unit 1
Wainer Chapters 1-3 (and Chap 5, continuation of Chap 2)-- Rebuttal to Report of the Commission on the Use of Standardized Tests in Undergraduate Admission September 2008
In the words of Wainer:
a report, commissioned by the National Association for College Admission Counseling, that was critical of the current college admission exams, the SAT and the ACT. 
The commission was chaired by William R. Fitzsimmons, the dean of admissions and financial aid at Harvard.
The report was reasonably wide-ranging and drew many conclusions while offering alternatives. 
Although well-meaning, many of the suggestions only make sense if you say them fast.
Among their conclusions were:
    Schools should consider making their admissions "SAT optional," that is allowing their applicants to submit their SAT/ACT scores if they wish, 
    but they should not be mandatory. The commission cites the success that pioneering schools with this policy have had in the past as proof of concept.
    
    Schools should consider eliminating the SAT/ACT altogether and substituting instead achievement tests. They cite the unfair effect of coaching as the 
    motivation for this -- they weren't naive enough to suggest that because there was no coaching for achievement tests now that, if they became more high stakes 
    coaching for them would not be offered. Rather, they argued that such coaching would be related to schooling and hence more beneficial to education than is 
    coaching that focuses on test-taking skills.
    
    That the use of the PSAT with a rigid qualification cut-score for such scholarship programs as the Merit Scholarships be immediately halted.
Wainer posting of an early form (Feb 2009) of the Chap 1 materials
"Comparing the Incomparable": early version (Dec 1999) of Chap 5
Source Material: "Report of the Commission on the Use of Standardized Tests in Undergraduate Admission" September 2008, National Association for College Admission Counseling.
More NACAC:
    Preparation for College Admission Exams, National Association for College Admission Counseling
    Foundations of Standardized Admission Testing, National Association for College Admission Counseling
Some commentary on the NACAC report:
Dramatic Challenge to SAT and ACT
In Defense of the SAT, Columbia U
Standardized Tests: Fair or Unfair?

Dick Atkinson on College Admissions testing:
Reflections on a Century of College Admissions Tests  Educational Researcher, Vol. 38, No. 9, pp. 665-676  
      cited Univ of Calif report Validity Of High-School Grades In Predicting Student Success Beyond The Freshman Year: High-School Record vs. Standardized Tests as Indicators of Four-Year College Outcomes
The New SAT: A Test at War with Itself   invited presidential address at the annual meeting of the American Educational Research Association held in San Diego, California on April 15, 2009

Chapter 4 resources
Using the PSAT/NMSQT and Course Grades in Predicting Success in the Advanced Placement Program, Wayne Camara and Roger Millsap, College Board Report No. 98-4
Changes in Advanced Placement Test Taking in California High Schools 1998-2003, Richard S. Brown, 01-01-2005

Chapter 6-7 resources
On Examinee Choice in Educational Testing. Howard Wainer (Educational Testing Service) and David Thissen (University of North Carolina at Chapel Hill). Review of Educational Research 1994, 64: 159
Item Response Theory Models Applied to Data Allowing Examinee Choice. Eric T. Bradlow and Neal Thomas. Journal of Educational and Behavioral Statistics 1998, 23: 236
Problem Choice by Test Takers, RL Linn, 1998, CRESST Tech Report 485
Quick review of performance assessments
Also from Laura Hamilton, SUSE Ph.D.:
    An Investigation of Students' Affective Responses to Alternative Assessment Formats
    Construct validity of constructed-response assessments: Male and female high school science performance
    The Search for Value-Added: Assessing and Validating Selected Higher Education Outcomes

Chapter 8
In the MiscPsycho package, the functions cheat and wrongprob: a method to detect excessive similarity in student test responses

Chapter 9
Other versions of the Chap 9 materials: Value-Added Models to Evaluate Teachers: A Cry For Help, H Wainer, Chance, 2011; Journal of Consumer Research Vol. 32, No. 2, Sept 2005
More Value-added analysis.
Journal of Educational and Behavioral Statistics Vol. 29, No. 1, Spring, 2004 Value-Added Assessment Special Issue
Value-Added Measures of Education Performance: Clearing Away the Smoke and Mirrors, PACE
LA Times Teacher Ratings, summer 2010; NEPC vs LATimes
J.R. Lockwood, Harold Doran, and Daniel F. McCaffrey. Using R for estimating longitudinal student achievement models. R News, 3(3):17-23, December 2003.
Fitting Value-Added Models in R  Harold C. Doran and J.R. Lockwood
Andrew Gelman on value-added arithmetic: It's no fun being graded on a curve
More: NY Principals rebel against 'value-added' evaluation
Missing Data and Chap 9 stories
A. Wald from the Boeing Math Group (good pictures, pp.20-24)
R packages and resources.  CRAN Task View: Multivariate Statistics.      AMELIA from Gary King

Chapter 10
Bradley-Terry methods for rankings in R-package BradleyTerry2
Background on forming composites, the classic by H Wainer-- Estimating Coefficients in Linear Models: It Don't Make No Nevermind

Chapter 11
Collection of resources at The International Association for Computerized and Adaptive Testing (IACAT)      original David Weiss CAT site
Computerized Adaptive Testing: A Primer, Howard Wainer

Computing and Technical Resources
The Psychometrics Task View provides an annotated listing of more than you really want for R-packages relevant to educational testing.
One good place to focus is the psych package by Revelle, who also has a draft text covering standard statistics plus specialized measurement topics; Chap 7 is test reliability and Chap 8 is IRT.
For an IRT intro, also see ltm: An R Package for Latent Variable Modeling and Item Response Theory Analyses, Dimitris Rizopoulos, Journal of Statistical Software, November 2006, Volume 17, Issue 5.
The 'MiscPsycho' package by Harold Doran has a number of functions useful for standard psychometrics and some of the topics in the Wainer text.
For more basic R-resources see the Stat209 page.
Technical topics:
   Accuracy of Student Test Scores. See the collection at How Accurate are the STAR Scores for Individual Students? An interpretive guide; see the Shoe Shopping Example (esp Bundy version), 1999 Accuracy Guide, and NY Times column for illustrations of accuracy vs traditional reliability.
   Comparability calculation (CA AB265) relevant to Wainer Ch.2/5. Test equating: from the MiscPsycho package, the SL function for the Stocking-Lord Equating Procedure
    Medical Diagnosis, Stat141ex, Bayes Thm relevant for cut-scores and scholarships, Wainer Ch.3.
               Bayes Thm and diagnosis, Jim Berger, Objective Bayes Berger Talk 2005;    Univ Chicago lectures 2011     2006 publication
    SAT coaching (re NACAC and Wainer Ch.2), Stat209, week 8, Ben Hansen analyses: Full matching in an observational study of coaching for the SAT (Scholastic Assessment Test), Ben Hansen, Journal of the American Statistical Association, 9/1/2004.
A local example of cut-scores and testing: California High School Exit Exam. An old set of HSEE cut-score calculations

Exercises
1. Errors in variables examples.
1a. Generate observed scores with reliability .8, then with reliability .5. Let true scores be distributed N(500, 75).
1b. For a score with reliability .9, plot a scatterplot of observed vs true scores.
   For a true score at the 50th percentile, what is the range of observed scores? Repeat for a true score at the 75th percentile.
   What is the conditional distribution of observed scores at these values?
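
A possible starting point for Exercise 1, offered as a sketch rather than a model answer. It assumes the classical test theory setup (observed = true + independent normal error, reliability = var(true)/var(observed)) and reads the 75 in N(500, 75) as a standard deviation; neither assumption is stated in the exercise. The function names are illustrative only. Plotting is left to whatever tool you prefer; the simulated pairs from simulate() can be passed straight to a scatterplot routine.

```python
import random
from statistics import NormalDist, pvariance

TRUE_MEAN, TRUE_SD = 500.0, 75.0  # assumption: 75 is the SD, not the variance


def error_sd(reliability, true_sd=TRUE_SD):
    """Error SD implied by reliability = var(T) / (var(T) + var(E))."""
    return true_sd * ((1.0 - reliability) / reliability) ** 0.5


def simulate(reliability, n=20000, seed=351):
    """Return (true, observed) score lists with observed = true + error."""
    rng = random.Random(seed)
    sd_e = error_sd(reliability)
    true = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(n)]
    observed = [t + rng.gauss(0.0, sd_e) for t in true]
    return true, observed


def empirical_reliability(true, observed):
    """Sample analogue of var(true) / var(observed)."""
    return pvariance(true) / pvariance(observed)


def observed_interval(percentile, reliability, coverage=0.95):
    """Central interval of observed scores given a true score at `percentile`.

    Under the assumptions above, observed | true ~ N(true, error_sd^2).
    """
    t = NormalDist(TRUE_MEAN, TRUE_SD).inv_cdf(percentile)
    conditional = NormalDist(t, error_sd(reliability))
    tail = (1.0 - coverage) / 2.0
    return conditional.inv_cdf(tail), conditional.inv_cdf(1.0 - tail)


if __name__ == "__main__":
    for rho in (0.8, 0.5):
        true, observed = simulate(rho)
        print(rho, round(empirical_reliability(true, observed), 3))
    for pct in (0.50, 0.75):
        print(pct, observed_interval(pct, 0.9))
```

With reliability .9 the implied error SD is 25, so "range" here is interpreted as a central 95% interval of the conditional normal distribution; other definitions of range are equally defensible.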
2. Traditional reliability, coefficient alpha. Recreate Table 4, Revelle Ch.7, pp.219-220.
3. Test Equating. Consider Form X with scores found to be distributed N(520, 75) and Form Y with scores N(450, 60) for students in population P   (following the setup in Braun-Holland chapter and handout). Compare equipercentile equating (inverting the cdfs) with linear equating (scale-translation to mean 0, sd 1 and back) for a sample of 1000 students taking X and 1000 taking Y.
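
One way to set up the comparison in Exercise 3, as a hedged sketch. It assumes the second parameter in N(520, 75) and N(450, 60) is a standard deviation and that the two samples are independent random draws from population P. The empirical-cdf inversion below uses plain linear interpolation between order statistics, with no presmoothing; that is a simplification and not necessarily the procedure in the Braun-Holland chapter. All function names are illustrative.

```python
import random
from bisect import bisect_right
from statistics import mean, stdev


def draw_forms(n=1000, seed=351):
    """Simulate n scores on Form X and n scores on Form Y (assumed normal)."""
    rng = random.Random(seed)
    xs = [rng.gauss(520.0, 75.0) for _ in range(n)]
    ys = [rng.gauss(450.0, 60.0) for _ in range(n)]
    return xs, ys


def linear_equate(x, xs, ys):
    """Map x to the Y scale by matching sample means and SDs."""
    z = (x - mean(xs)) / stdev(xs)
    return mean(ys) + stdev(ys) * z


def equipercentile_equate(x, xs_sorted, ys_sorted):
    """Map x to the Y score with the same empirical percentile rank."""
    p = bisect_right(xs_sorted, x) / len(xs_sorted)  # empirical cdf of X at x
    pos = p * (len(ys_sorted) - 1)                   # fractional index into sorted Y
    lo = int(pos)
    hi = min(lo + 1, len(ys_sorted) - 1)
    frac = pos - lo
    return ys_sorted[lo] + frac * (ys_sorted[hi] - ys_sorted[lo])


if __name__ == "__main__":
    xs, ys = draw_forms()
    xs_sorted, ys_sorted = sorted(xs), sorted(ys)
    for x in (370, 445, 520, 595, 670):
        lin = linear_equate(x, xs, ys)
        eqp = equipercentile_equate(x, xs_sorted, ys_sorted)
        print(x, round(lin, 1), round(eqp, 1))
```

Because both populations are normal, the two methods agree in the population; differences in the output reflect sampling noise, which tends to be larger in the tails for the equipercentile method.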
4. Cut-scores, diagnosis. Consider a test with scores having error variance 64. For a student whose true score is 2 pts below the cut-off, what is the probability of success for that student? For a test with reliability .9, what proportion of students who succeed did not "deserve such"? What additional specifications/assumptions did you make to do the calculation?
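
A sketch of one route through Exercise 4, with its added assumptions made explicit since the exercise asks for them. For the first part it assumes normal measurement error and defines success as observed score at or above the cut. For the second part it assumes true and observed scores are jointly normal, works in observed-score SD units, and takes the location of the cut (cut_z, a hypothetical parameter) as an input; the exercise does not specify where the cut sits, and the answer depends on it. "Did not deserve" is read as true score below the cut.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()
SEM = sqrt(64.0)  # standard error of measurement from error variance 64


def pass_probability(true_minus_cut, sem=SEM):
    """P(observed >= cut | true score), with normal error of SD `sem`."""
    return 1.0 - STD_NORMAL.cdf(-true_minus_cut / sem)


def false_pass_share(cut_z, reliability, steps=4000):
    """P(true < cut | observed >= cut), observed scores scaled to mean 0, SD 1.

    cut_z is the cut-off in observed-score SD units (an assumed input).
    The joint probability is a trapezoid-rule integral over true scores.
    """
    sd_true = sqrt(reliability)
    sd_err = sqrt(1.0 - reliability)
    true_dist = NormalDist(0.0, sd_true)
    lower = -8.0 * sd_true  # far enough into the tail to ignore the rest
    h = (cut_z - lower) / steps
    total = 0.0
    for i in range(steps + 1):
        t = lower + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * true_dist.pdf(t) * (1.0 - STD_NORMAL.cdf((cut_z - t) / sd_err))
    joint = total * h                        # P(true < cut and observed >= cut)
    passing = 1.0 - STD_NORMAL.cdf(cut_z)    # P(observed >= cut)
    return joint / passing


if __name__ == "__main__":
    print(round(pass_probability(-2.0), 4))  # true score 2 points below the cut
    for cut_z in (-1.0, 0.0, 1.0):
        print(cut_z, round(false_pass_share(cut_z, 0.9), 4))
```

Under these assumptions the student 2 points below the cut succeeds with probability about 0.40, and with the cut placed at the mean roughly 10% of those who succeed have true scores below the cut. Moving the cut changes the second figure, which is the point of the exercise's last question.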