Education 351B Autumn 2012
Statistical issues in testing and assessment
David Rogosa
Sequoia 224, rag{AT}stat{DOT}stanford{DOT}edu
Course web page: http://www-stat.stanford.edu/~rag/ed351B/
For Autumn 2011 complete materials
go here
Registrar's information
EDUC 351B: Statistical issues in testing and assessment
Seminar
Units: 2-3
Room: ECON139 Landau Economics Building https://webviewer.collegenet.com/wv3_servlet/stanford/urd/run/wv_space.Detail?RoomID=29
Schedule: Monday 3:15-5:05pm
Grading Basis: Letter-S/NC
Course Description:
The new book by Howard Wainer, "Uneducated Guesses: Using Evidence to Uncover
Misguided Education Policies" is the basis for this seminar. Also included will
be supporting research literature and data analysis activities for topics
such as college admissions, methods for missing data, assessment of
achievement gaps, and the use of value-added analysis.
see http://www-stat.stanford.edu/~rag/ed351B/
Text:
Uneducated Guesses: Using Evidence to Uncover Misguided Education Policies.
Howard Wainer (Author)
amazon page available in paper and Kindle
http://www.amazon.com/Uneducated-Guesses-Evidence-Misguided-Education/dp/0691149283
Publisher: Princeton University Press (August 21, 2011), ISBN-10: 0691149283 ISBN-13: 978-0691149288
Wainer youtube video from PUPress--Howard Wainer critiques misguided education policies
Week 1 9/24. Organization; meet and greet.
Followup to Discussion in class mtg of reliability vs accuracy
Shoe Shopping Example (esp Bundy version),
Materials on Accuracy of Student Test Scores. See collection at How Accurate are the STAR Scores for Individual Students? An interpretive guide
see 1999 Accuracy Guide, and NY Times column for illustrations of accuracy vs traditional reliability.
In the news, 9/24
SAT reading scores hit a four-decade low (Wash Post); SAT Scores Show Many Students Are Not Ready for College (USNews); Debating the Value of SAT Scores (USNews)
SAT Report: Only 43 Percent of 2012 College-Bound Seniors Are College Ready College Board
Week 2 10/1
Begin Unit 1. Wainer vs NACAC. Wainer Chap 1 (or online), SAT optional?, Full Unit 1 materials (lots) below
Week 3 10/8
Continue the 'Begin Unit 1'. Wainer vs NACAC. Wainer Chap 1 (or online), SAT optional?, Try to start Wainer Chap 2/5 achievement tests and test equating. Full Unit 1 materials (lots) below
And yes colleges actually do cheat to appear better on USNews rankings: Emory University Acknowledges Incorrect Admissions Data, NYT Aug 2012.
In the news, October 2012
What is the Best University in America? Comprehensive rundown of the various University rankings (cf. Wainer Ch. 10)
Want to Ruin Teaching? Give Ratings, by Deborah Kenny, NYT 10/16
Value-added analysis (Ch9) comes to Florida (last year NYC). VAM, the new teacher evaluation system, stirs concern, confusion. 2012 Tampa Bay Times
10/22 NY Times (from Ben) Part 1: Answers to Readers' Questions About the SAT and ACT By TANYA ABRAMS
10/26 NY Times For Asians, School Tests Are Vital Steppingstones
Week 4 10/15
Continue Unit 1. Wainer vs NACAC (and friends such as Dick Atkinson). More on empirical studies of the usefulness of the SAT (some student presentations).
Start Chap 2/5 topics. Use of achievement tests instead of SAT.
Test Equating handout. California K-12 testing comparability saga. Comparability calculation (CA AB265) relevant to Wainer Ch.2/5.
Test equating, from R MiscPsycho package, the SL function for the Stocking Lord Equating Procedure
SAT coaching (re NACAC and Wainer Ch2), Stat209, week 8, Ben Hansen analyses. Hansen, Ben. Full matching in an observational study of coaching for the SAT (Scholastic Assessment Test). Journal of the American Statistical Association, 9/1/2004.
Week 5 10/22
Continue Unit 1 journey and cognate topics.
Statistical topic: Multiple regression and confounding variables, with examples.
Empirical studies for SAT: Wainer vs NACAC (and friends such as Dick Atkinson). More on empirical studies of the usefulness of the SAT.
Week 6 10/29
Chap 2/5 technical topics (Use of achievement tests instead of SAT).
Test Equating handout. California K-12 testing comparability saga. Comparability calculation (CA AB265) relevant to Wainer Ch.2/5.
Test equating, from R MiscPsycho package, the SL function for the Stocking Lord Equating Procedure. IRT equating procedures: Weeks, J. P. (2009). plink: IRT separate calibration linking methods
Non-IRT alternatives:
R-package equate vignette, Statistical Equating Methods, Anthony Albano, January 7, 2011
Kernel and Traditional Equipercentile Equating With Degrees of Presmoothing, Paul Holland, April 2007
Equating Test Scores (without IRT), Samuel A. Livingston, 2004
A very good review of equating error in observed-score test equating: Evaluating Equating Error in Observed-Score Equating, Wim J. van der Linden, University of Twente, Enschede, The Netherlands. Law School Admission Council Computerized Testing Report 04-03, July 2006
Ch 3 technical topic: Cut scores and diagnostic testing. Stat141 handout false-positive, false negative Medical Diagnosis (SW text section ex3.17)
2011 in the news and PSAT background
Washington Post: The problems with the PSAT and National Merit program
NYU Exiting National Merit Scholarship Citing Test Process
Background: ETS PSAT page
Technical Resources:
Bayes Thm relevant for cut-scores and scholarships, Wainer Ch.3.
Bayes Thm and diagnosis, Jim Berger, Objective Bayes Berger Talk 2005;
Univ Chicago lectures 2011 2006 publication
For IRT intro see ltm: An R Package for Latent Variable Modeling and Item Response Theory Analyses
Dimitris Rizopoulos Journal of Statistical Software November 2006, Volume 17, Issue 5.
The Psychometrics Task View provides an annotated listing of more than you really want for R-packages relevant to educational testing.
One good place to focus is the psych package by Revelle who also has
a draft text which covers standard statistics plus specialized measurement topics. Ch 7 is test reliability and Chap 8 IRT
The 'MiscPsycho' package by Harold Doran has a number of functions useful for standard psychometrics and some of the topics in the Wainer text.
Unit 1 Materials Core and background
Wainer Chapters 1-3 (and Chap 5, continuation of Chap 2)-- Rebuttal to Report of the Commission on the Use of Standardized Tests in Undergraduate Admission September 2008
In the words of Wainer:
a report, commissioned by the National Association for College Admission Counseling, that was critical of the current college admission exams, the SAT and the ACT.
The commission was chaired by William R. Fitzsimmons, the dean of admissions and financial aid at Harvard.
The report was reasonably wide-ranging and drew many conclusions while offering alternatives.
Although well-meaning, many of the suggestions only make sense if you say them fast.
Among their conclusions were:
Schools should consider making their admissions "SAT optional," that is allowing their applicants to submit their SAT/ACT scores if they wish,
but they should not be mandatory. The commission cites the success that pioneering schools with this policy have had in the past as proof of concept.
Schools should consider eliminating the SAT/ACT altogether and substituting instead achievement tests. They cite the unfair effect of coaching as the
motivation for this -- they weren't naive enough to suggest that because there was no coaching for achievement tests now that, if they became more high stakes
coaching for them would not be offered. Rather, they argued that such coaching would be related to schooling and hence more beneficial to education than is
coaching that focuses on test-taking skills.
That the use of the PSAT with a rigid qualification cut-score for such scholarship programs as the Merit Scholarships be immediately halted.
Wainer posting of early form (Feb 2009) of Chap 1 materials
Comparing the Incomparable: early version (Dec 1999) of Chap 5
Source Material: "Report of the Commission on the Use of Standardized Tests in Undergraduate Admission"
September 2008, National Association for College Admission Counseling.
More NACAC: Preparation for College Admission Exams, National Association for College Admission Counseling; Foundations of Standardized Admission Testing, National Association for College Admission Counseling
Some commentary on the NACAC report:
Dramatic Challenge to SAT and ACT
In Defense of the SAT, Columbia U
Standardized Tests: Fair or Unfair?
Dick Atkinson on College Admissions testing:
Reflections on a Century of College Admissions Tests
Educational Researcher, Vol. 38, No. 9, pp. 665-676
cited Univ of Calif report Validity Of High-School Grades In Predicting Student Success Beyond The Freshman Year: High-School Record vs. Standardized Tests as Indicators of Four-Year College Outcomes
The New SAT: A Test at War with Itself invited presidential address at the annual meeting of the American Educational Research Association
held in San Diego, California on April 15, 2009
A more substantial regression exercise with SAT and GPA: SAT Scores, High Schools, and Collegiate Performance Predictions Jesse Rothstein Princeton University
Most recent comprehensive item on SAT, SES etc
Psychological Science 2012 23: 1000 originally published online 2 August 2012.
Paul R. Sackett, Nathan R. Kuncel, Adam S. Beatty, Jana L. Rigdon, Winny Shen and Thomas B. Kiger. The Role of Socioeconomic Status in SAT-Grade Relationships and in College Admissions Decisions
This has cites to the Rothstein paper and to earlier Geiser et al UC Presidents Office studies
12/7/11 for Chap 1,2. PACE report: State Standards, the SAT, and Admission to the University of California
For better or worse, these analyses are taken seriously by the University of California administration and Regents: ADMISSIONS TESTS AND UC PRINCIPLES FOR ADMISSIONS TESTING: A Report from the Board of Admissions and Relations with Schools (BOARS)
The following seems to be the source data analysis document: Agronow, S., and Studley, R., 2007, Prediction of college GPA from new SAT test scores - a first look. Annual meeting of the California Association for Institutional Research (CAIR), Nov 16, 2007
Week 7 11/5
1. Technical topics: IRT intro
see ltm: An R Package for Latent Variable Modeling and Item Response Theory Analyses
Dimitris Rizopoulos Journal of Statistical Software November 2006, Volume 17, Issue 5. Riz ltm talk at useR! 2008 Manuals: ltm mirt
LSAT basics data analysis handout
Revelle who has a draft text which covers standard statistics plus specialized measurement topics. Ch 7 is test reliability and Chap 8 is IRT
Revelle also did the R-package, psych: psych package
The Psychometrics Task View provides an annotated listing of more than you really want for R-packages relevant to educational testing.
2. Wainer Chap 4.
Chapter 4 resources
Earlier version of Chap 4: Educational Psychology Review Volume 12, Number 2 (2000), 201-228, The Aptitude-Achievement Function: An Aid for Allocating Educational Resources, with an Advanced Placement Example William Lichten and Howard Wainer
Using the PSAT/NMSQT and Course Grades in Predicting Success in the Advanced Placement Program, Wayne Camara and Roger Millsap College Board Report No. 98-4
College Board AP resources supplements
Changes in Advanced Placement Test Taking in California High Schools 1998-2003 Richard S. Brown 01-01-2005
November In the news
More on Value-added: A better way to grade teachers By Linda Darling-Hammond and Edward Haertel. NY unions ad
Proficiency as a function of race: Firestorm Erupts Over Virginia's Education Goals; Florida Passes Plan For Racially-Based Academic Goals
Teachers also cheat (cf Chap 8). (11/26) Feds: Teachers embroiled in test-taking fraud Test cheating probe nets former educator TN, other teachers embroiled in test-taking fraud, feds say
Week 8 11/12
Wainer content: Examinee Choice (Chap 6, 7)
Continue IRT examples
Chapter 6-7 resources
On Examinee Choice in Educational Testing. Howard Wainer. Educational Testing Service. David Thissen. University of North Carolina at Chapel Hill REVIEW OF EDUCATIONAL RESEARCH 1994 64: 159
Item Response Theory Models Applied to Data Allowing Examinee Choice JOURNAL OF EDUCATIONAL AND BEHAVIORAL STATISTICS 1998 23: 236, Eric T. Bradlow and Neal Thomas
Problem Choice by Test Takers, RL Linn, 1998, CRESST Tech Report 485
Quick review of performance assessments
also from Laura Hamilton, SUSE Ph.D
An Investigation of Students' Affective Responses to Alternative Assessment Formats
Construct validity of constructed-response assessments: Male and female high school science performance
Laura S. Hamilton (1999): Detecting Gender-Based Differential Item Functioning on a Constructed-Response Science Test, Applied Measurement in Education, 12:3, 211-235
The Search for Value-Added: Assessing and Validating Selected Higher Education Outcomes
Week 9 11/26
Wainer Chapter 9, Value-added analyses
Other versions of the Chap 9 materials Value-Added Models to Evaluate Teachers: A Cry For Help H Wainer, Chance, 2011.
More Value-added analysis.
Journal of Educational and Behavioral Statistics Vol. 29, No. 1, Spring, 2004 Value-Added Assessment Special Issue
Value-Added Measures of Education Performance: Clearing Away the Smoke and Mirrors, PACE
LA Times Teacher Ratings, summer 2010 NEPC vs LATimes
J.R. Lockwood, Harold Doran, and Daniel F. McCaffrey. Using R for estimating longitudinal student achievement models. R News, 3(3):17-23, December 2003.
Fitting Value-Added Models in R Harold C. Doran and J.R. Lockwood
Andrew Gelman on Value-added arithmetic: It's no fun being graded on a curve more NY Principals rebel against 'value-added' evaluation (from Ben) Some VAM results for NYC
The Don't do VAM letter to NY from Stanford and friends
Missing Data and Chap 9 stories
A. Wald from the Boeing Math Group (good pictures, pp.20-24)
R packages and resources. 1. Missing data Stat222 class handout, imputation and analysis using mice
R resources.
Multivariate Analysis Task View, Missing data section, esp packages mice and mi
van Buuren S and Groothuis-Oudshoorn K (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1-67. see also multiple imputation online
Week 10 12/3
Wainer Chap 10, Chap 11
Chapter 10 resources
Shopping for Colleges When What We Know Ain't Journal of Consumer Research Vol. 32, No. 2, Sept 2005
Avery et al., A Revealed Preference Ranking of U.S. Colleges and Universities, NBER WP10803
Bradley-Terry methods for rankings in R-package BradleyTerry2
Background on forming composites, the classic by H Wainer-- Estimating Coefficients in Linear Models: It Don't Make No Nevermind
Firth, D. (2005) Bradley-Terry models in R. Journal of Statistical Software, 12(1), 1-12.
Turner, H. and Firth, D. (2012) Bradley-Terry models in R: The BradleyTerry2 package. Journal of Statistical Software, 48(9), 1-21.
Chapter 11 resources
Collection of resources at The International Association for Computerized and Adaptive Testing (IACAT) esp linked Rudner tutorial original David Weiss CAT site
Computerized Adaptive Testing: A Primer [Hardcover] Howard Wainer
Exercises, Ed351B 2012
1. Anna conjecture #2 (SES eliminates relation between SAT and GPA)
We have a number of published summaries (ETS for national data, UC presidents office and friends for California data, with various subpopulations). Use those sources to calculate
2 or 3 versions of the partial correlation (SAT,GPA|SES) using standard formulas presented in class. Give a conclusion (what would you report?)
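A minimal Python sketch of the standard first-order partial correlation formula, for checking hand calculations. The function name and the three input correlations are placeholders made up for illustration; they are not values from the ETS or UC summaries.

```python
import math

def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Placeholder inputs (illustrative only): r(SAT,GPA), r(SAT,SES), r(GPA,SES)
print(round(partial_corr(0.5, 0.3, 0.4), 3))
```

Substituting the published correlations for each subpopulation gives the 2 or 3 versions asked for.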
2. Partial correlation, artificial data
Artificial data in file http://www-stat.stanford.edu/~rag/ed351B/ex2F.dat 200 cases 3 variables SAT GPA ses.
Obtain correlations and pairwise scatterplots for the 3 vars. The "regression way" of asking whether SES matters in the discussions
we've had would be to add SES to a prediction of GPA from SAT. Carry that out. Similarly, a partial correlation (SAT,GPA|SES) could be computed
as in problem 1 or better obtain a scatterplot and the partial correlation from the adjusted variable approach discussed in class.
Do it both ways and verify.
Instead of those approaches just ask the simple (stratification) question: for High SES kids what is the GPA,SAT correlation? and
for Low SES kids what is the GPA,SAT correlation? (SES here is 1 to 5, with median value 3, i.e. top 100 bottom 100 split at 3.)
Get these correlations and look at the relevant scatterplots. Does the stratification give you a clearer picture? Does it agree
with the first set of analyses?
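The "do it both ways and verify" step might look like the following Python sketch (the course materials use R; Python with NumPy is used here only for illustration). The data are simulated stand-ins, not the contents of ex2F.dat, and the high/low split at SES above 3 is an assumption, since the exercise leaves the handling of SES equal to 3 open.

```python
import numpy as np

rng = np.random.default_rng(2012)
n = 200
# Simulated stand-in data (NOT ex2F.dat): SES on a 1-5 scale.
ses = rng.integers(1, 6, size=n).astype(float)
sat = 400 + 30 * ses + rng.normal(0, 80, size=n)
gpa = 2.0 + 0.002 * sat + 0.05 * ses + rng.normal(0, 0.4, size=n)

def residuals(y, z):
    """Residuals from the least-squares regression of y on z (with intercept)."""
    slope, intercept = np.polyfit(z, y, 1)
    return y - (intercept + slope * z)

def corr(a, b):
    """Pearson correlation of two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

# Way 1: formula from the three pairwise correlations.
r_sg, r_ss, r_gs = corr(sat, gpa), corr(sat, ses), corr(gpa, ses)
by_formula = (r_sg - r_ss * r_gs) / np.sqrt((1 - r_ss**2) * (1 - r_gs**2))

# Way 2: adjusted variables, i.e. correlate SES-residualized SAT and GPA.
by_residuals = corr(residuals(sat, ses), residuals(gpa, ses))

# Stratification: SAT-GPA correlation within SES groups.
# Assumption: "high" means SES above 3.
high = ses > 3
r_high = corr(sat[high], gpa[high])
r_low = corr(sat[~high], gpa[~high])
print(by_formula, by_residuals, r_high, r_low)
```

The two partial correlations agree up to floating-point error; the within-stratum correlations answer a different question and need not match them.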
3. Adding SES to a regression tells you what?
The Coleman report data that we looked at in class is at http://www-stat.stanford.edu/~rag/stat209/coleman.dat
outcome is vach; use momed (we discussed this in class) to predict vach. Examine the regression. Now add ses to
the prediction. What do you conclude about momed? Examine the multiple regression by using an adjusted variables
plot discussed in class.
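A hedged Python sketch of the adjusted-variables idea: the momed coefficient in the multiple regression equals the slope from regressing ses-adjusted vach on ses-adjusted momed. The data below are simulated stand-ins that only borrow the exercise's variable names; they are not coleman.dat, and the helper names `ols` and `resid` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(209)
n = 100
# Simulated stand-ins (NOT coleman.dat); names mirror the exercise's variables.
ses = rng.normal(0, 1, size=n)
momed = 0.8 * ses + rng.normal(0, 0.6, size=n)
vach = 35 + 2.0 * ses + 0.5 * momed + rng.normal(0, 2, size=n)

def ols(y, *predictors):
    """Least-squares coefficients [intercept, b1, b2, ...]."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def resid(y, *predictors):
    """Residuals of y after regressing on the predictors (with intercept)."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    return y - X @ ols(y, *predictors)

b_simple = ols(vach, momed)[1]         # momed alone
b_multiple = ols(vach, momed, ses)[1]  # momed with ses in the model

# Adjusted variables: remove ses from both vach and momed, then regress.
vach_adj, momed_adj = resid(vach, ses), resid(momed, ses)
b_adjusted = ols(vach_adj, momed_adj)[1]
print(b_simple, b_multiple, b_adjusted)
```

Plotting vach_adj against momed_adj gives the adjusted variables plot; its least-squares slope is b_adjusted.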
4. Anna conjecture #1 for SES
Our week 1 in the news The SAT Report on College & Career Readiness: 2012 had a number of interesting analyses and displays.
page 22 of doc and pdf showed a simple logistic regression: PROBABILITY OF ACHIEVING A FIRST-YEAR COLLEGE GPA OF A B- OR HIGHER -- BY SAT PERFORMANCE.
The query is whether the displayed relation is really an artifact of SES influences or is the SAT actually useful. And the "standard" approach would be to add in SES
as a second predictor (with SAT). Does that tell you anything? [I'm gonna make a new data set for this, the one I have is too complicated]
5. Test Equating. Consider Form X with scores found to be distributed N(520, 75) and Form Y with scores N(450, 60) for students in population P
(following the setup in Braun-Holland chapter and handout). Compare equipercentile equating (inverting the cdfs) with linear
equating (scale-translation to mean 0, sd 1 and back) for a sample of 1000 students taking X and 1000 taking Y.
(note: I'll add an easier similar problem with data supplied)
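One possible setup for this comparison, sketched in Python. It assumes N(520, 75) and N(450, 60) give (mean, SD) rather than (mean, variance); the function names and random seed are hypothetical.

```python
from statistics import NormalDist
import numpy as np

# Assumption: second parameter is the standard deviation.
FORM_X = NormalDist(mu=520, sigma=75)
FORM_Y = NormalDist(mu=450, sigma=60)

def linear_equate(x, fx=FORM_X, fy=FORM_Y):
    """Map an X score to the Y scale by matching z-scores."""
    return fy.mean + fy.stdev * (x - fx.mean) / fx.stdev

def equipercentile_equate(x, fx=FORM_X, fy=FORM_Y):
    """Map an X score to the Y score with the same percentile rank."""
    return fy.inv_cdf(fx.cdf(x))

def sample_equipercentile(x, x_scores, y_scores):
    """Empirical version: percentile rank among X takers, then the Y quantile."""
    p = np.mean(np.asarray(x_scores) <= x)
    return float(np.quantile(y_scores, p))

# Simulated samples of 1000 students per form.
rng = np.random.default_rng(351)
x_sample = rng.normal(520, 75, size=1000)
y_sample = rng.normal(450, 60, size=1000)

for x in (445, 520, 595):
    print(x, linear_equate(x), equipercentile_equate(x),
          sample_equipercentile(x, x_sample, y_sample))
```

With both population distributions normal, the two population functions agree, so any differences seen in the sample version reflect sampling error.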
6. More test equating (with ACT data)
The R-package equate contains an equating dataset ACTmath "comes from two
administrations of the ACT mathematics test. The test scores are based on a random
groups design and are contained in a three-column matrix where column one is the 40-point
score scale and columns two and three the number of examinees for forms x and y obtaining
each score point." See p.10 onward of the vignette for equate linked in the Week 6 materials. Compare the results of linear and percentile equating for these two forms; try out some of the methods built into the equate function.
7. Cut-scores, diagnosis. Consider a test with scores having error-variance 64. For a student whose true score is 2pts below the cut-off
what is the probability of success for that student? For a test with reliability .9, what proportion of students who succeed did not "deserve such"?
What additional specifications/assumptions did you make to do the calculation?
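A Python sketch of one way to set up the calculation, which makes the needed assumptions explicit. Everything beyond the error variance and the reliability is assumed here: normal measurement error, normally distributed true scores, the same error variance for the second question, and a cut-off placed at the population mean.

```python
from statistics import NormalDist
import numpy as np

SEM = 64 ** 0.5  # error variance 64 gives standard error of measurement 8

# Assumption: observed = true + error, with error ~ N(0, SEM).
# A student whose true score is 2 points below the cut passes if error >= 2.
p_pass = 1 - NormalDist(0, SEM).cdf(2)

# Second question under further (hypothetical) assumptions:
# true scores ~ N(0, sd_true), reliability 0.9, same error variance, cut at the mean.
reliability = 0.9
var_true = reliability * SEM**2 / (1 - reliability)
cut = 0.0
rng = np.random.default_rng(7)
true = rng.normal(0, np.sqrt(var_true), size=200_000)
observed = true + rng.normal(0, SEM, size=true.size)
passed = observed >= cut
undeserved = float(np.mean(true[passed] < cut))
print(p_pass, undeserved)
```

Moving the cut-off away from the mean, or changing the true-score distribution, changes the second answer, which is the point of the final question.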
8. Errors in variables and reliability.
a. For a score with reliability .9, plot a scatterplot of observed vs true scores.
For a true score at the 50th percentile, what is the range of observed scores? Repeat for true score at the 75th percentile?
What is the conditional distribution of observed scores at these values?
b. Reliability versus precision demonstration. Consider a population with true scores distributed Uniform [99,101] and measurement error Uniform [-1, 1].
If you used discrete Uniform in this construction then you could say measurement of change is accurate to 1 part in a hundred.
Calculate the reliability of the score. Also try error Uniform [-2,2], accuracy one part in 50.
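For part b, a short Python sketch of the reliability calculation as true-score variance over observed-score variance. It assumes continuous uniform distributions and errors independent of true scores; the discrete version mentioned above would use different variances.

```python
def uniform_variance(low, high):
    """Variance of a continuous Uniform[low, high] variable."""
    return (high - low) ** 2 / 12

def reliability(var_true, var_error):
    """Classical reliability: true variance / (true variance + error variance)."""
    return var_true / (var_true + var_error)

var_true = uniform_variance(99, 101)
rel_1 = reliability(var_true, uniform_variance(-1, 1))  # error Uniform[-1, 1]
rel_2 = reliability(var_true, uniform_variance(-2, 2))  # error Uniform[-2, 2]
print(rel_1, rel_2)
```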
9. IRT, Dichotomous items
Often item response data is presented in grouped form--frequency counts of patterns of 0,1 (correct/incorrect) responses. An example for a set of the LSAT data used in class
is in the documentation for the BILOG/MULTILOG programs from SSI at http://www.ssicentral.com/irt/example1.html. You can use the expand.table command in the R-package mirt (Multidimensional Item Response Theory)
(see mirt manual) "The expand.table function expands a summary table of unique response patterns to a full sized
data-set. The response frequencies must be on the rightmost column of the input data."
Use R-package ltm (or equivalent) to get descriptive summaries for these items, fit a 1-parameter and 2-parameter IRT model (as in class handout), compare the models and the item parameters.
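To illustrate what expanding a grouped table means, here is a small Python analogue of the idea. It is not mirt's expand.table, and the 3-item table is a made-up toy, not the LSAT data.

```python
import numpy as np

def expand_table(summary):
    """Expand rows of (item responses..., frequency) into one row per examinee."""
    summary = np.asarray(summary)
    patterns, freqs = summary[:, :-1], summary[:, -1]
    return np.repeat(patterns, freqs, axis=0)

# Hypothetical 3-item summary table (NOT the LSAT data); last column is the count.
toy = [
    [0, 0, 0, 2],
    [1, 0, 0, 3],
    [1, 1, 0, 4],
    [1, 1, 1, 1],
]
full = expand_table(toy)
print(full.shape)         # rows = examinees, columns = items
print(full.mean(axis=0))  # proportion correct per item
```

The expanded 0/1 matrix is the form that model-fitting functions for dichotomous items generally expect.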