pXtXr data: continuous formulation:
Task-wobble, Rater Smear


---------------------------------
             Task-wobble, rater-smear models.

The basic spirit of this follows along with the discrete category 
misclassification formulation examples. 

The topic here is models that tell a 
story about the processes that generate the data, not a 
statistical statement of the analysis (the anova "model").  
These stories will adopt the CLBH "continuous" perspective, 
(which doesn't produce results much different
from the discrete framework); that is, for purposes 
here I'm happy to grant the CLBH wish that judges can give a 
score of 3.11 to a paper (however impractical that may be).
I used what I call "Occam's chain saw" to produce the following
simplest possible formulation relevant to the basic 
personsXtasksXjudges design that CLBH also use as the initial 
example.

           BASIC STORY

The substantive setting from which the data come has the following 
parts:
we draw a kid from his group or population who has a true level or
  true ability "theta"
we present that kid with one or more tasks; and the result
  of this kid doing a task produces at least in theory a score:
  i.e. the score this paper would receive from a perfect rater/judge.
  "task wobble" (which may be large or small depending on how "good"
   the task is) is the difference between theta and the score
   the kid would get on this particular task (perfectly scored).
   (Linn "task specificity" ?)
we present the kid's completed task(s) to one or more raters, who produce
  fallible (maybe pretty accurate, maybe awful) scores because raters
  introduce "rater smear". 

So here's one realization of the task-wobble, 
rater-smear formulation for the pXtXr design.

              DATA GENERATION

Data will be scores continuous on some range (e.g. 1-6 scores being 
continuous on .5-to-6.5); i.e., judges can give a score of 3.11 to a 
paper.

Data generation for task-wobble, rater-smear goes like this.
Underlying formulation: continuous data, following the development of CLBH.
Observed data (from a performance assessment) have
discrete scoring on a 1-6 scale with category boundaries 1.5 2.5 3.5 4.5 5.5
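
A minimal sketch of the discretization just described, assuming NumPy.
The function name `discretize` is hypothetical. Two choices here are
assumptions not stated in the text: a score exactly on a boundary goes
to the higher category, and scores falling outside [.5, 6.5] are lumped
into categories 1 and 6.

```python
import numpy as np

# Category boundaries from the text: 1.5 2.5 3.5 4.5 5.5
BOUNDARIES = np.array([1.5, 2.5, 3.5, 4.5, 5.5])

def discretize(scores):
    """Map continuous scores to integer categories 1..6.

    Assumption: a score exactly on a boundary goes to the higher
    category, and out-of-range scores fall into category 1 or 6.
    """
    # np.digitize returns 0..5 for five boundaries; add 1 to get 1..6.
    return np.digitize(np.asarray(scores, dtype=float), BOUNDARIES) + 1

# A continuous score of 3.11 lies between 2.5 and 3.5.
print(discretize([3.11, 0.7, 5.9]))
```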


1.  For each individual, start with a true ability (like CLBH
    "true level" top p.8) which I'll call theta.  
     For a population (or a school) theta can have various 
     distributions:  a good variety is provided by
     a.  Continuous Uniform on .5, 6.5  (mean 3.5 ; var 3.0)
     b.  Triangular on .5, 6.5  (mean 3.5 ; var 1.5) 
         or on 1 to 6 (mean 3.5, var 25/24)
     c.  Skew distribution on [.5, 6.5] yielding true category
          proportions .25 .25 .25 .10 .10 .05 constructed with
          continuous Uniform between score boundaries +/- .50.
          mean 2.7; variance 2.1933.

2. Once an individual's theta is drawn, then we need to construct
   that individual's score on a specified task, where the
   student's paper is assumed to be judged/rated with *perfect*
   accuracy.
   To get the score from a perfectly judged paper, simply add task 
   wobble to theta.  Task wobble--to keep close to the vision 
   of G-theory--is represented by a Gaussian random variable with mean 
   0 (choice of task introduces no bias in the score) and variance T, 
   i.e. N(0, T).  Start also by assuming tasks are interchangeable 
   (introduce the same wobble) to come close to G-theory.
   This is deliberately comic-book but it leads to interesting 
   results.

3.  For an individual's score on a perfectly judged paper, add rater 
    smear.
    Again represent this unrealistically, but to match G-theory, by a 
    Gaussian random variable with mean 0 (no bias) and variance R 
    (raters interchangeable to start).  Thus add the N(0, R) random 
    variable to the score from part 2 for a person taking a task which 
    is perfectly rated.
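
The three theta distributions in step 1 could be sampled along the
following lines. This is only a sketch, assuming NumPy; the function name
`draw_theta` and the `kind` labels are hypothetical. The triangular case
is taken to be symmetric with mode 3.5, which is an assumption inferred
from the stated mean 3.5 and variance 1.5.

```python
import numpy as np

def draw_theta(rng, n, kind="uniform"):
    """Draw n true abilities (theta) from one of the step-1 distributions."""
    if kind == "uniform":
        # a. Continuous Uniform on [.5, 6.5]: mean 3.5, var 36/12 = 3.0
        return rng.uniform(0.5, 6.5, size=n)
    if kind == "triangular":
        # b. Symmetric triangular on [.5, 6.5], mode 3.5 (assumed):
        #    mean 3.5, var 36/24 = 1.5
        return rng.triangular(0.5, 3.5, 6.5, size=n)
    if kind == "skew":
        # c. Pick a true category with the stated proportions, then
        #    spread uniformly +/- .50 around it.
        props = [0.25, 0.25, 0.25, 0.10, 0.10, 0.05]
        cats = rng.choice(np.arange(1, 7), size=n, p=props)
        return cats + rng.uniform(-0.5, 0.5, size=n)
    raise ValueError(f"unknown kind: {kind}")

rng = np.random.default_rng(0)
for kind in ("uniform", "triangular", "skew"):
    theta = draw_theta(rng, 100_000, kind)
    print(kind, theta.mean(), theta.var())
```

The printed sample means and variances can be compared by eye with the
values quoted in step 1.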

For example, data from a pXtXj design with, say, two tasks and two raters
would have 4 scores (tasks*raters) for each individual, with each rater
scoring both tasks each person takes.
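
Steps 2 and 3 for such a design can be sketched as below, assuming NumPy.
The function name `generate_ptr`, the default values of T and R, and the
example thetas are hypothetical. Scores are left unclipped here; the text
does not say how values outside [.5, 6.5] should be handled.

```python
import numpy as np

def generate_ptr(rng, theta, n_tasks=2, n_raters=2, T=0.5, R=0.5):
    """Return scores of shape (persons, tasks, raters).

    T and R are the task-wobble and rater-smear variances; the default
    values are placeholders, not values taken from the text.
    """
    theta = np.asarray(theta, dtype=float)
    n = theta.shape[0]
    # Step 2: perfectly rated score = theta + N(0, T) task wobble,
    # one draw per person-task pair.
    wobble = rng.normal(0.0, np.sqrt(T), size=(n, n_tasks))
    perfect = theta[:, None] + wobble
    # Step 3: observed score = perfect score + N(0, R) rater smear,
    # one draw per person-task-rater cell.
    smear = rng.normal(0.0, np.sqrt(R), size=(n, n_tasks, n_raters))
    return perfect[:, :, None] + smear

rng = np.random.default_rng(0)
theta = np.array([2.0, 3.5, 5.0])   # three illustrative true abilities
scores = generate_ptr(rng, theta)
print(scores.shape)                 # 4 scores (2 tasks * 2 raters) per person
```

Because the wobble is drawn once per person-task pair, both raters see the
same perfectly rated paper and differ only by their own smear.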


         DATA EXAMPLE RESULTS

The results of running the G-theory pXtXj anova for pure task-wobble, 
rater-smear are given in "pXtXr results". 
More complex versions of task-wobble, rater-smear are given in the 
"quest to reproduce CLBH variance components". 
And it really doesn't much matter whether scores are continuous
over [.5, 6.5] (i.e. allowing 3.11 scores) or we take the
discretized (1 2 3 4 5 6) scores.

Among the main results are that the simple task-wobble, rater-smear
data generation produces the pattern of results that is most often seen
in empirical studies (at least the ones I know about)--big variance
components for p, pXt, confounded error; nothing for r, pXr, tXr.
Also, the component for p is badly understated.
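
For readers who want to compute the seven variance components themselves,
here is a sketch of the usual expected-mean-squares estimators for a fully
crossed random-effects pXtXr design with one score per cell. It assumes
NumPy; the function name `g_study_ptr` and the values of T and R in the
demo are hypothetical. It is not the code behind the results above and is
not claimed to reproduce them.

```python
import numpy as np

def g_study_ptr(x):
    """ANOVA variance-component estimates for a (persons, tasks, raters)
    array with one score per cell. Needs at least 2 levels per facet.
    The key "ptr" is the confounded pXtXr-plus-error component."""
    x = np.asarray(x, dtype=float)
    n_p, n_t, n_r = x.shape
    g = x.mean()
    m_p, m_t, m_r = x.mean(axis=(1, 2)), x.mean(axis=(0, 2)), x.mean(axis=(0, 1))
    m_pt, m_pr, m_tr = x.mean(axis=2), x.mean(axis=1), x.mean(axis=0)

    # Sums of squares for main effects and two-way interactions.
    ss = {
        "p": n_t * n_r * np.sum((m_p - g) ** 2),
        "t": n_p * n_r * np.sum((m_t - g) ** 2),
        "r": n_p * n_t * np.sum((m_r - g) ** 2),
        "pt": n_r * np.sum((m_pt - m_p[:, None] - m_t[None, :] + g) ** 2),
        "pr": n_t * np.sum((m_pr - m_p[:, None] - m_r[None, :] + g) ** 2),
        "tr": n_p * np.sum((m_tr - m_t[:, None] - m_r[None, :] + g) ** 2),
    }
    # Residual (pXtXr confounded with error) by subtraction from the total.
    ss["ptr"] = np.sum((x - g) ** 2) - sum(ss.values())

    df = {"p": n_p - 1, "t": n_t - 1, "r": n_r - 1}
    df.update(pt=df["p"] * df["t"], pr=df["p"] * df["r"], tr=df["t"] * df["r"])
    df["ptr"] = df["p"] * df["t"] * df["r"]
    ms = {k: ss[k] / df[k] for k in ss}

    # Solve the expected-mean-square equations for each component.
    return {
        "p": (ms["p"] - ms["pt"] - ms["pr"] + ms["ptr"]) / (n_t * n_r),
        "t": (ms["t"] - ms["pt"] - ms["tr"] + ms["ptr"]) / (n_p * n_r),
        "r": (ms["r"] - ms["pr"] - ms["tr"] + ms["ptr"]) / (n_p * n_t),
        "pt": (ms["pt"] - ms["ptr"]) / n_r,
        "pr": (ms["pr"] - ms["ptr"]) / n_t,
        "tr": (ms["tr"] - ms["ptr"]) / n_p,
        "ptr": ms["ptr"],
    }

# Demo on simulated scores: theta + N(0, T) wobble + N(0, R) smear.
rng = np.random.default_rng(0)
T, R = 0.6, 0.3                                  # hypothetical variances
theta = rng.uniform(0.5, 6.5, size=500)
perfect = theta[:, None] + rng.normal(0, np.sqrt(T), size=(500, 2))
scores = perfect[:, :, None] + rng.normal(0, np.sqrt(R), size=(500, 2, 2))
for name, value in g_study_ptr(scores).items():
    print(f"{name:>3}: {value: .3f}")
```

Estimates from this method can come out slightly negative for components
whose true value is zero; that is a known property of the ANOVA
estimators, not a bug in the sketch.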

Among the demonstrations herein (from the primordial set) is that no 
matter how bad raters become, nothing in the G-theory gospel prior to 
CLBH would indicate bad raters.  And that raters are not the problem 
is the one thing the literature claims to have established from the 
empirical studies. CLBH seems to run away from prior G-theory 
interpretations of variance components, for very good reason.