pXtXr data: continuous formulation:
Task-wobble, Rater Smear
---------------------------------
Task-wobble, rater-smear models.
The basic spirit of this follows the discrete-category
misclassification formulation examples.
The topic here is models that tell a
story about the processes that generate the data, as opposed to a
statistical statement of the analysis (the anova "model").
These stories will adopt the CLBH "continuous" perspective,
(which doesn't produce results much different
from the discrete framework); that is, for purposes
here I'm happy to grant the CLBH wish that judges can give a
score of 3.11 to a paper (however impractical that may be).
I used what I call "Occam's chain saw" to produce the following
simplest possible formulation relevant to the basic
personsXtasksXjudges design that CLBH also use as the initial
example.
BASIC STORY
The substantive setting from which the data come has the following
parts:
we draw a kid from his group or population who has a true level or
true ability "theta"
we present that kid with one or more tasks; the result
of this kid doing a task is, at least in theory, a score:
i.e. the score this paper would receive from a perfect rater/judge.
"task wobble" (which may be large or small depending on how "good"
the task is) is the difference between theta and the score
the kid would get on this particular task (perfectly scored).
(Linn "task specificity" ?)
we present the kid's completed task(s) to one or more raters, who produce
fallible (maybe pretty accurate, maybe awful) scores because raters
introduce "rater smear".
So here's one realization of the task-wobble,
rater-smear formulation for the pXtXr design.
DATA GENERATION
Data will be scores continuous on some range (e.g. 1-6 scores being
continuous on .5-to-6.5); i.e., judges can give a score of 3.11 to a
paper.
Data generation for task-wobble, rater-smear goes like this.
Underlying formulation: continuous data, following the development of CLBH.
Observed data (from a performance assessment) have
discrete scoring on a 1-6 scale with category boundaries 1.5 2.5 3.5 4.5 5.5
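
The step from the continuous formulation to the observed 1-6 scores can be sketched as follows. This is only an illustration: the function name `discretize` is hypothetical, and sending a score that falls exactly on a boundary to the upper category is an assumption, since the text does not say how ties are handled.

```python
import numpy as np

# Category boundaries for the 1-6 scale, as listed in the text.
BOUNDARIES = np.array([1.5, 2.5, 3.5, 4.5, 5.5])


def discretize(scores):
    """Map continuous scores on [.5, 6.5] to integer categories 1..6."""
    scores = np.asarray(scores, dtype=float)
    # np.digitize counts how many boundaries lie at or below each score
    # (0..5); adding 1 gives the category. A score exactly on a boundary
    # goes to the upper category (an assumption of this sketch).
    return np.digitize(scores, BOUNDARIES) + 1
```

With this convention a continuous score of 3.11 is observed as a 3.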
1. For each individual, start with a true ability (like CLBH
"true level" top p.8) which I'll call theta.
For a population (or a school) theta can have various
distributions: a good variety is provided by
a. Continuous Uniform on .5, 6.5 (mean 3.5 ; var 3.0)
b. Triangular on .5, 6.5 (mean 3.5 ; var 1.5)
or on 1 to 6 (mean 3.5, var 25/24)
c. Skew distribution on [.5, 6.5] yielding true category
proportions .25 .25 .25 .10 .10 .05, constructed with a
continuous Uniform within each category (score +/- .50);
mean 2.7, variance 2.1933.
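
A minimal sketch of drawing theta from the three distributions listed above. The function name `draw_theta` and the string labels are hypothetical. The triangular case assumes a symmetric triangle on [.5, 6.5] with mode 3.5, which is what the stated mean of 3.5 implies; the skew case picks a category with the stated proportions and then spreads theta uniformly over that category's +/- .50 interval.

```python
import numpy as np

# True category proportions for the skew distribution (categories 1..6).
SKEW_PROPS = [0.25, 0.25, 0.25, 0.10, 0.10, 0.05]


def draw_theta(kind, n, rng):
    """Draw n true abilities (theta) from one of the listed distributions."""
    if kind == "uniform":
        # Continuous Uniform on [.5, 6.5]: mean 3.5, variance 36/12 = 3.0
        return rng.uniform(0.5, 6.5, size=n)
    if kind == "triangular":
        # Symmetric triangular on [.5, 6.5]: mean 3.5, variance 36/24 = 1.5
        return rng.triangular(0.5, 3.5, 6.5, size=n)
    if kind == "skew":
        # Pick a category, then add Uniform(-.5, .5) within that category.
        cats = rng.choice(np.arange(1, 7), size=n, p=SKEW_PROPS)
        return cats + rng.uniform(-0.5, 0.5, size=n)
    raise ValueError(f"unknown distribution: {kind}")
```

The stated moments can be checked against a large sample, e.g. `draw_theta("skew", 200_000, np.random.default_rng(0))` followed by `.mean()` and `.var()`.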
2. Once an individual's theta is drawn, we need to construct
that individual's score on a specified task, where the
student's paper is assumed to be judged/rated with *perfect*
accuracy.
To get the score from a perfectly judged paper, simply add task
wobble to theta. Task wobble, to keep close to the vision
of G-theory, is represented by a Gaussian random variable with mean
0 (choice of task introduces no bias in the score) and variance T,
i.e. N(0, T). Start also by assuming tasks are interchangeable
(introduce the same wobble) to come close to G-theory.
This is deliberately comic-book but it leads to interesting
results.
3. To an individual's score on a perfectly judged paper, add rater
smear.
Again this is represented unrealistically, but to match G-theory, by a
Gaussian random variable with mean 0 (no bias) and variance R (raters
interchangeable to start). Thus add the N(0, R) random variable to the
score from part 2 for a person taking a task which is perfectly rated.
For example, data from a pXtXj design with, say, two tasks and two raters
would have 4 scores (tasks*raters) for each individual, with each rater
scoring the two tasks each person takes.
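
Steps 1-3 can be sketched in a few lines. The function name `generate_ptr` is hypothetical, and two readings of the text are assumed: task wobble is drawn independently for each person-task pair, and rater smear independently for each person-task-rater score. T and R are variances, so the Gaussian scale is their square root. The sketch does not force scores to stay inside [.5, 6.5], because the text does not say how that is handled.

```python
import numpy as np


def generate_ptr(theta, n_tasks, n_raters, T, R, rng):
    """Return scores with shape (persons, tasks, raters)."""
    theta = np.asarray(theta, dtype=float)
    n_persons = theta.shape[0]
    # Step 2: perfectly judged score = theta + task wobble, N(0, T).
    wobble = rng.normal(0.0, np.sqrt(T), size=(n_persons, n_tasks))
    perfect = theta[:, None] + wobble
    # Step 3: observed score = perfectly judged score + rater smear, N(0, R).
    smear = rng.normal(0.0, np.sqrt(R), size=(n_persons, n_tasks, n_raters))
    return perfect[:, :, None] + smear


# Two tasks, two raters: 4 scores per individual.
# The values T=1.0 and R=0.25 are illustrative only.
rng = np.random.default_rng(1)
theta = rng.uniform(0.5, 6.5, size=5)
scores = generate_ptr(theta, n_tasks=2, n_raters=2, T=1.0, R=0.25, rng=rng)
```

Here `scores[i]` is the 2-by-2 block of tasks by raters for individual `i`.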
DATA EXAMPLE RESULTS
The results of running the G-theory pXtXj anova for pure task-wobble,
rater-smear are given in pXtXr results.
More complex versions of task-wobble, rater-smear are given in the
quest to reproduce CLBH variance components.
And it really doesn't much matter whether scores are continuous
over [.5, 6.5] (i.e. allowing 3.11 scores) or we take the
discretized (1 2 3 4 5 6) scores.
Among the main results are that the simple task-wobble, rater-smear
data generation produces the pattern of results that is most often seen
in empirical studies (at least the ones I know about): big variance
components for p, pXt, and confounded error; nothing for r, pXr, tXr.
Also, the component for p is badly understated.
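
For readers who want to compute the seven components named above from their own data, here is a sketch of the textbook expected-mean-squares estimates for a fully crossed random pXtXr design with one score per cell. It is not necessarily the software or procedure behind the reported results; the function name `variance_components` and the dictionary keys are hypothetical. Negative estimates are returned as they come rather than set to zero.

```python
import numpy as np


def variance_components(x):
    """ANOVA variance-component estimates for x[person, task, rater]."""
    x = np.asarray(x, dtype=float)
    n_p, n_t, n_r = x.shape  # each must be at least 2
    m = x.mean()
    m_p, m_t, m_r = x.mean(axis=(1, 2)), x.mean(axis=(0, 2)), x.mean(axis=(0, 1))
    m_pt, m_pr, m_tr = x.mean(axis=2), x.mean(axis=1), x.mean(axis=0)

    # Mean squares for main effects.
    ms_p = n_t * n_r * np.sum((m_p - m) ** 2) / (n_p - 1)
    ms_t = n_p * n_r * np.sum((m_t - m) ** 2) / (n_t - 1)
    ms_r = n_p * n_t * np.sum((m_r - m) ** 2) / (n_r - 1)

    # Mean squares for two-way interactions.
    e_pt = m_pt - m_p[:, None] - m_t[None, :] + m
    e_pr = m_pr - m_p[:, None] - m_r[None, :] + m
    e_tr = m_tr - m_t[:, None] - m_r[None, :] + m
    ms_pt = n_r * np.sum(e_pt ** 2) / ((n_p - 1) * (n_t - 1))
    ms_pr = n_t * np.sum(e_pr ** 2) / ((n_p - 1) * (n_r - 1))
    ms_tr = n_p * np.sum(e_tr ** 2) / ((n_t - 1) * (n_r - 1))

    # Residual: three-way interaction confounded with error.
    resid = (x - m_pt[:, :, None] - m_pr[:, None, :] - m_tr[None, :, :]
             + m_p[:, None, None] + m_t[None, :, None] + m_r[None, None, :] - m)
    ms_ptr = np.sum(resid ** 2) / ((n_p - 1) * (n_t - 1) * (n_r - 1))

    # Solve the expected-mean-square equations for the components.
    return {
        "p": (ms_p - ms_pt - ms_pr + ms_ptr) / (n_t * n_r),
        "t": (ms_t - ms_pt - ms_tr + ms_ptr) / (n_p * n_r),
        "r": (ms_r - ms_pr - ms_tr + ms_ptr) / (n_p * n_t),
        "pt": (ms_pt - ms_ptr) / n_r,
        "pr": (ms_pr - ms_ptr) / n_t,
        "tr": (ms_tr - ms_ptr) / n_p,
        "ptr,e": ms_ptr,
    }
```

The input is any array of scores indexed as (person, task, rater), continuous or discretized.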
Among the demonstrations herein (from the primordial set) is that no
matter how bad raters become, nothing in the G-theory gospel prior to
CLBH would indicate bad raters. And that raters are not the problem
is the one thing the literature claims to have established from the
empirical studies. CLBH seem to run away from prior G-theory
interpretations of variance components, for very good reason.