Computational Statistics:
Stat 227


May 25, 2005

Contents

1 Logistics
 1.1 Prerequisites
 1.2 Text
 1.3 Assignments
 1.4 Projects
 1.5 Office Hours
 1.6 Course Web Site
 1.7 Startup computer labs Sequoia 201
 1.8 Outline
 1.9 Statistical Methods
 1.10 Algorithms
 1.11 Software
2 Homeworks
3 Numerical Analysis for Statisticians
 3.1 If you missed some lectures:
 3.2 Real numbers: Floating Point Arithmetic
 3.3 Matrices, and their decompositions
 3.4 The linear equation solving problem
  3.4.1 Square systems
  3.4.2 Rectangular system: QR decomposition and least squares
  3.4.3 Examples
  3.4.4 Givens Rotations
  3.4.5 R examples handout
  3.4.6 Backsubstitution
 3.5 Householder Functions
  3.5.1 Householder Multiplication-Row
 3.6 QR decomposition in R
 3.7 Singular value decompostion and pc regression
  3.7.1 Ridge Regression
  3.7.2 Principal Components Regression
  3.7.3 Principal Components
 3.8 More than I said on the Singular Value Decomposition
  3.8.1 Generalized Inverses
 3.9 Principal Components
  3.9.1 Eigenvalue Analysis
4 EM algorithm
  4.0.2 Implementation
  4.0.3 Estimating Mixture Proportions
  4.0.4 EM for exponential families
5 Monte Carlo Simulations
 5.1 Monte Carlo methods
  5.1.1 Antithetic Resampling
  5.1.2 Importance Sampling
 5.2 Random Number Generation- Uniform[0, 1]
  5.2.1 Congruential Methods
  5.2.2 Other `Better?' Methods
  5.2.3 Random Number Generation- other than Uniform
  5.2.4 Inversion Method
  5.2.5 Rejection Methods
  5.2.6 Table Lookup
 5.3 Specialized Methods
  5.3.1 Polar methods for the Normal
 5.4 Importance Sampling
6 The Bootstrap, Permutation Tests, Simulation
 6.1 Motivation
 6.2 The bootstrap : the univariate context
 6.3 In practice
 6.4 S examples for simple bootstraps
 6.5 Parametric bootstrap
 6.6 Smoothed Bootstrap
 6.7 Dierent Uses for bootstrapping
  6.7.1 Quality of Estimates
 6.8 Enhancements
 6.9 Bootstrap-t : Studentizing
 6.10 R functions
 6.11 Preliminary Transformations of the Estimate
 6.12 Pre-pivoting or Double Bootstrap
 6.13 Balanced Bootstrap : saving computations
 6.14 Antithetic Sampling : more on saving computations
 6.15 Bootstrap and Randomization Tests
  6.15.1 An example : Infants walking
  6.15.2 More generally
 6.16 Finding out how good a parametric bootstrap procedure is through a simulation experience :
 6.17 Maximum Likelihood Estimation
  6.17.1 Maximum Likelihood of Multinomial Cell Probabilities
  6.17.2 The Hardy Weinberg Example
7 Density Estimation
8 References

Chapter 1
Logistics

Time and Place: MW 11.-12.15, Hewlett 101, please do not arrive late and distract your colleagues

1.1 Prerequisites

Computer Science 106(Introd. to Computer Programming) or equivalent
Stat 116 (Probability) or equivalent
Math 113 (Matrix Algebra) or equivalent

1.2 Text

Computational Statistics, Geof Givens and Jennifer Hoeting. Wiley 2005

1.3 Assignments

There will be four homeworks/labs to hand in 40 % and one big project ( final) 60 %

1.4 Projects

Projects

  1. Pick a paper from the projects below or on the website on the course webpage at
    http://www-stat.stanford.edu/~susan/courses/s227/.
  2. Write a scientific introduction to the paper you have chosen: context and state of the art at the time of the paper (eventually if the paper is old, what has changed in the area since the paper was written).
  3. Write an algorithm explaining what the method in the paper does.
  4. Simulate data from a known distribution appropriate for the analysis in the paper.
  5. Replicate the method in the paper by writing an R program that implements one of the methods implemented in the paper and show it does what the authors say it does on your simulated data.
  6. Analyze some real data with your R program.
  7. Provide a complete bibliography.
    Bonus: Make a R package with the simulated data, example data and documented programs.

    Hint: the more original the work the higher the grade.

Papers for projects

Gibbs Sampler
Estimating the number of essential genes in a genome by random transposon mutagenesis Natalie J. Blades and Karl W. Broman, technical report, Johns Hopkins.
Coupling from the Past
Exact Sampling with Coupled Markov Chains and Applications to Statistical Mechanics, (1996) Jim Propp and David Wilson
http://www.math.wisc.edu/~propp/sample-short.ps.gz
Sequential Inportance Sampling for Tables
Chen, Y., Diaconis, P., Holmes, S. and Liu, J.S. (2005). Sequential Monte Carlo Methods for Statistical Analysis of Tables. Journal of the American Statistical Association, 100, 109-120.
http://www-stat.stanford.edu/~susan/papers/chenJASA05.pdf
Chen, Y., Dinwoodie, I. H., and Sullivant, S. (2004).
http://www.stat.duke.edu/~yuguo/paper/sis-4.pdf
Mixtures
Bayesian analysis of Poisson mixtures (Peter Green and Valerie Viallefont and Sylvia Richardson)
http://www.stats.bris.ac.uk/~peter/papers/poisjns2.ps
Bayesian Modelling and Inference on Mixtures of Distributions Jean-Michel Marin, Kerrie Mengersen and Christian P. Robert Handbook of Statistics 25, Elsevier
http://eprints.qut.edu.au/archive/00000901/02/Bayesian_Modelling.pdf
Reversible jump Monte Carlo
Reversible jump Markov chain Monte Carlo computation and Bayesian model determination, Biometrika, 82, 711-732 (1995).
http://www.stats.bris.ac.uk/~peter/papers/RJMCMCBka.pdf
Exhaustive Methods
http://www-stat.stanford.edu/~susan/courses/s208/node12.html
P. Diaconis and S. Holmes. Gray codes for randomization procedures. Statistics and Computing, 4:287-302, 1994.
Particle Filtering
Doucet, A., de Freitas, J.F.G. and Gordon, N.J. (2001). Sequential Monte Carlo Methods in Practice. New York: Springer-Verlag.
Doucet, A., Godsill, S.J. and Andrieu, C. (2000). On sequential Monte Carlo sampling methods for Bayesian filtering. Statist. Comput. 10 197-208.
Kitagawa, G. (1996). Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. J. Comput. Graph. Statist. 5 1-25.
Liu, J.S. and Chen, R. (1998). Sequential Monte Carlo methods for dynamic systems. J. Amer. Statist. Assoc. 93 1032-1044.
Liu, J.S. (2001). Monte Carlo Strategies in Scientific Computing. New York: Springer-Verlag.
Pitt, M.K. and Shephard, N. (1999). Filtering via simulation: auxiliary particle filter. J. Amer. Statist. Assoc. 94 590-599.
Data Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html

1.5 Office Hours

Professor Susan Holmes: Wednesday at 1.45, in Sequoia 102.
By email appointment to susan at stat.stanford.edu.
TA: Patrick Perry (patperry at stanford.edu), room 108.
Office hours: Monday 2.30-4.00pm.

1.6 Course Web Site

This will contain a bulletin board, homeworks, course summary, project description list, reading list, links to useful sites with in particular tutorials, software information, etc...

Address:http://www-stat.edu/~susan/courses/s227/

Weekly consultation of the web site will be necessary and expected of all students. Homework and lab. assignments and solutions will be on the course's coursework site.

1.7 Startup computer labs Sequoia 201

There will be 3 sets of labs with exercices on accessing the web and usin R.

Times: TBA

1.8 Outline

The course will contain 3 parts, here is a list containing both subjects covered in class and potential project matter:

1.9 Statistical Methods

Maximum Likelihood, EM algorithm, Missing Data,
Bootstrap, Cross-Validation
Step by step multivariate regression, robust regression,
Non-parametric Regression, Alternate Conditional Expectation, Projection Pursuit,
(Neural Nets)
Principal Components, Correspondence Analysis,
Classification and Regression Trees, Clustering,
Nonparametric Confidence Regions, (convex hulls),
Approximate Counting, Exhaustive Enumeration, Non parametric Tests.

1.10 Algorithms

Matrix manipulations, decompositions, (cholesky, QR,...),
Singular Value Decompositions, Eigenanalysis,
Iterative methods, Gauss-Seidel, Conjugate Gradient,
Downhill Simplex minimization,
Smoothing,
Random Number Generation ,
(Uniform, Normal, Beta, Inverse CDF, Acceptance Rejection, Metropolis),
Markov Chain approach to Combinatorial Optimization (Simulated Annealing), Updating,
Markov Chain Monte Carlo, Gray Codes
Sorting.

1.11 Software

Most methods exist in high level programming environments such as R.

We will learn about several additional R functions/packages such as utils (Sweave), mva,ade4, boot, MCMCpack, Rggobi, rpart, rrcov, modreg,....

Symbolic software will be introduced, if not used much. (maple)

Chapter 2
Homeworks

Hw 1 to hand in before Tuesday, April 19th at 4.30pm Rlab1.pdf
See coursework for solutions.

Hw 2 to hand in before Thursday, April 28th at 4.15pm hw2.pdf

Hw 3 to hand in before Monday, May 9th at 4.15pm Chapter 6, problems 6.4, 6.5 and 6.6.

Hint: If you are having trouble with the problem 6.4, there is a dierent approach using Gibbs sampling that is given in detail here:
http://psblade.ucdavis.edu/BMSBSA/chapter09.ps, this might help you somewhat with a simplification of your calculations.

Hw 4 to hand in before Monday, May 23th at 4.15pm

  1. Pick a paper from the projects below or on the website on the cours e webpage at
    http://www-stat.stanford.edu/~susan/courses/s227/.
  2. Write a summary of the paper you have chosen in 2 paragraphs.
  3. Explain what you plan to do in your final project.
  4. Chapter 9, do problem 9.6.
  5. Chapter 10, number 10.1 except for (e).
    Hint: Look at the packages quantreg, locfit and lokern for adaptive bandwidth methods.
    Very nice set of notes that will help you (in particular page 11):
    http://www.stat.berkeley.edu/~stark/Teach/S240-s03/Notes/ch8.pdf

Chapter 3
Numerical Analysis for Statisticians

3.1 If you missed some lectures:

Don’t make a habit of it :75 minutes gained will probably cost you the triple to catch up!

I won't be explicitly writing all my lectures as notes on the web, but for the benefit of those who missed a lecture, here is a little of what I said, so you get an idea about at what level the course will be:

3.2 Real numbers: Floating Point Arithmetic

                        e
x = ± [.d1d2d3 ...dt]× b
x = (- 1)s[.d d d d  ...d   ]× bE -e
            0 1 2 3     t- 1

These are normalized so that the first digit of the mantissa is non zero, in the binary mode, (b = 2), this means we don't have to code the first one, sometimes it is left out (phantom of the digit).

If x < bEmin-1 (smallest number representable in machine)
we have underflow,
if x > bEmax(1 - b-t) overflow.

Otherwise x is in the representable range, maybe we won't be able to represent it exactly. Macine Accuracy:
Smallest fp number added to 1 that gives an answer dierent from 1. (eps in matlab)

Now if we can't represent x without representation error, we'll call x*

Taylor series show how much we are going to be o.

When error growth is linear, the sensitivity is small: f is said to be well-conditionned. k relative to the data x and the problem S: is the constant that bounds the relative error.

||dS ||     ||dx||
------< k -----
||S||     ||x||

Relative to the norm, the problem and that particular data x.

Worse still, we usually don't have f, but an approximation of it. this induces a second type of error: generated error, also called truncation error because we stop computing before we have the exact right answer(for an integral computed by trapezoids, or any iterative method for computing functions by series, ...) for example.

f (x)-  f * (x*) = [f(x) - f(x*)] + [f(x*) - f *(x*)]
Stability:
The precision with which f* approximates f is called the stability of f. If for an input x there is another ~xclose to x such that f * (x) = f(~x), then the f* algorithm is said to be stable.

An unstable algorithm is one where the roundo error gets mixed into the calculation at an early stage and is successively magnified until it swamps the true answer.

Example: Want to compute the powers of the golden mean:

      V~ --
       5-  1 .
f =  ---2--- = .61803398
Also we have the recurrence relation:
 n+1     n-1    n
f    = f    - f
But it has another solution.
       V~ --
    ----5---1
f =     2

Error analysis:

Moment computations, (see Pages 612-613 of Numerical Recipes)

A little example of how mistakes tote up on a total.

Vectors These are the basic building blocks of the new generation of languages, unlike C or fortran whose are fp numbers, the gains are important in writing all computations vectorially.

3.3 Matrices, and their decompositions

Matrices Notations, operations, partitionning, careful with the defaults from one system to another.

The Botany of Matrices Diagonal, Symmetric, Banded, Hessenberg, Circulant, Hankel, Triangular, Sparse, Orthogonal, semi-positive definite,

My notations for: transpose, inner product,norm, matrix norms,

3.4 The linear equation solving problem

Suppose we have a linear system of equtions, typified by the matrix of coefficients A, unknowns, x and the right hand sides, the vector b.

3.4.1 Square systems

Singularity, Degenerate Systems
If the rows are linear combinations of one another, or the columns linear combinations or one another then even if there seems to be the same number of unknowns as equations, the system is in fact equivalent to a singular set of equations.

Close singularity is just as bad because the roundo could take us there, we will have to measure this.

Single precision is usually enough for 30-50 equations.

Double precision makes it possible to solve for a few hundred.

This does NOT hold for pathological systems we will look into these and how to recognize them when we talk about conditionning.

A full square system
:

Suppose the matrix of coefficients is a square matrix, (of full rank also, as we saw, non-degenerate) then the equation looks like this:

Ax  = b

Formally you would say : invert it, in fact a computational mistake, except if you have multiple right hand sides.

Doing what one would first think of doing if you had to solve a system by hand.

The easiest system: triangular.

Bringing to triangular shape.

Pivotting. This is NOT to solve ill-conditionning, but unlucky diagonal elements that are very small make the algorithm unstable.

LU decomposition, with a permutation matrix (prepivotting) is what is best to do.

A = LU,  L lower  triang.U  upper triang.

Flops: Of an order n3
2- (the same as the cost of the multiplication of two matrices, not so bad).

Condition number for the inversion problem

Try a small perturbation of b to b* = b + db, call the solution of the perturbed system: x* = x + dx,

A(x  + dx)  =   b + db
    x + dx  =   A- 1(b + db)
                 - 1
        dx  =   A   db
     ||dx|| <   ||A -1||||db||
       ||b|| <   ||A ||||x ||

     ||dx-||  <   ||A ||||A -1||||db||
      ||x||                 ||b||
The condition number for the data A and the linear solving problem is thus:
k =  ||A ||||A -1||
(depends on the norm, not on the algorithm). For the two norm we have a better explanation of how to compute this, we need the following to do it:
Singular Value Decomposition
This is the most important decomposition in statistics.

We will study it several times, here is my first introduction: a little made-up example with matlab to start with:

>> u=[3 1 -1 2]'  
u =  
     3  
     1  
    -1  
     2  
>> v=(1:4)'  
v =  
     1  
     2  
     3  
     4  
>> X=u*v'  
X =  
     3     6     9    12  
     1     2     3     4  
    -1    -2    -3    -4  
     2     4     6     8  
>>svd(X)  
ans =  
   21.2132  
    0.0000  
         0  
         0  
 
>> E=10^(-3)*randn(4);  
>> XE=X+E  
XE =  
 
    3.0012    5.9993    9.0003   12.0012  
    1.0006    2.0017    3.0009    3.9994  
   -0.9999   -1.9999   -3.0014   -3.9994  
    2.0004    4.0018    5.9993    7.9996  
 
>> svd(XE)  
ans =  
   21.2143  
    0.0028  
    0.0019  
    0.0005  
>> cond(X)  
Condition is infinite  
ans =  
   Inf  
>> cond(XE)  
ans =  
 
   4.2567e+04  
%An example you can't see with your bare eyes:  
>> X2=u'*v  
X2 =  
    0.2468    0.2499    0.0166    0.2487  
    0.3225    0.3266    0.0218    0.3250  
    0.1495    0.1514    0.0101    0.1506  
    0.4367    0.4423    0.0295    0.4401  
 
>> svd(X2)  
ans =  
 
    1.0730  
    0.0000  
         0  
         0  
>> [U,S,V]=svd(X2);  
>> U*S*V'  
ans =  
    0.2468    0.2499    0.0166    0.2487  
    0.3225    0.3266    0.0218    0.3250  
    0.1495    0.1514    0.0101    0.1506  
    0.4367    0.4423    0.0295    0.4401  
>> 10000*(X2-U*S*V')  
ans =  
    1.0e-11 *  
   -0.0278    0.0833   -0.0035    0.0555  
   -0.0555    0.0555         0    0.0555  
         0    0.0555    0.0017    0.0278  
   -0.0555    0.1665         0    0.1110

>> A=rand(4)  
A =  
    0.5045    0.4940    0.0737    0.9138  
    0.5163    0.2661    0.5007    0.5297  
    0.3190    0.0907    0.3841    0.4644  
    0.9866    0.9478    0.2771    0.9410  
>> flops(0)  
>> [L,U,P]=lu(A)  
L =  
     1.0000         0         0         0  
    0.5233    1.0000         0         0  
    0.5114   -0.0406    1.0000         0  
    0.3234    0.9388    0.7363    1.0000  
U =  
     0.9866    0.9478    0.2771    0.9410  
         0   -0.2298    0.3557    0.0373  
         0         0   -0.0535    0.4342  
         0         0         0   -0.1945  
>> flops  
ans =  
    34  
>> P*A  
ans =  
    0.9866    0.9478    0.2771    0.9410  
    0.5163    0.2661    0.5007    0.5297  
    0.5045    0.4940    0.0737    0.9138  
    0.3190    0.0907    0.3841    0.4644  
>> L*U  
ans =  
     0.9866    0.9478    0.2771    0.9410  
    0.5163    0.2661    0.5007    0.5297  
    0.5045    0.4940    0.0737    0.9138  
    0.3190    0.0907    0.3841    0.4644  
>> P*A-L*U  
ans =    1.0e-15 *  
         0         0         0         0  
   -0.1110         0         0         0  
         0         0         0         0  
   -0.0555   -0.0139         0         0  
 

Here is what we need to remember:

X  =  U SV  ',V 'V  =  I,U 'U  =  I,S  diagonal   s
                                                   i
Actually the singular values are the square roots of the eigenvalues of X'X.

The condition number in l2 norm has the simple interpretation as a function of the singular values, ration of the largest to the smallest:

     s1-
k =  sn

Read 2.6 of NR (pages 61-71).

Tridiagonal

Of an order of N steps are sufficient, see NR 51.

Iterative Improvement

This enables, a the price of storing A, to improve the solution by using the residual. Do the following in double precision:

           *
r = b - Ax  ,solve Ly   =  P r f or y
              and  U z  =  y for z
                 x      =  x* + z
                   new
                Axnew   =  A(x* +  z) = (b- r) + Az =  b
Cholesky Decomposition
For a symmetric positive definite matrix A the LU decomposition can be rewritten:
        '       1/2   1/2 '
A =  GG  =  (LD    )(D    L )
with D a diagonal matrix with only positive entries.

3.4.2 Rectangular system: QR decomposition and least squares

When there are more equations than variables, one can't hope for an exact solution, the system is overdetermined.

We try to minimize the error which is usually measured as the norm

||Ax  - b||, for the equation Ax =  b

Important remark: For any orthogonal Q, this norm is unchanged by trnasformation by Q, so

||Ax - b||=  ||Q'Ax  - Q'b||
we try and find a Q that makes the minimization particularly simple, that is so that Q'A has a particularly simple form.

3.4.3 Examples

Least Squares in Two dierent ways:

>> t = (1900:10:1990)';  
>> y = [75.995 91.972 105.711 123.203 131.669 ...  
     150.697 179.323 203.212 226.505 249.633]';  
>> X=[t ones(10,1)]  
X =     1900           1  
        1910           1  
        1920           1  
        1930           1  
        1940           1  
        1950           1  
        1960           1  
        1970           1  
        1980           1  
        1990           1  
>> bhat=X\p  
bhat =    1.93  
      -3594.01  
>>yhat=polyval(bhat,t);  
>>[polyval(bhat,t) y]  
ans =    67.08         76.00  
         86.35         91.97  
        105.62        105.71  
        124.89        123.20  
        144.16        131.67  
        163.43        150.70  
        182.70        179.32  
        201.96        203.21  
        221.23        226.50  
        240.50        249.63  
 
>> norm(yhat-y)  
ans =  
   23.5794

>> [Q,RC]=qr([X y]);  
RC =  
      -6151.30         -3.16       -488.86  
             0         -0.05        167.82  
             0             0         23.58  
             0             0             0  
             0             0             0  
             0             0             0  
             0             0             0  
             0             0             0  
             0             0             0  
             0             0             0  
>>  R1=RC(1:2,1:2)  
R1 =    -6151.30         -3.16  
              0          -0.05  
>> C1=RC(1:2,3)  
C1 = -488.8643  
      167.8181  
>> R1\C1  
ans =     1.93  
      -3594.01  
>> C2=RC(3,3)  
C2 = 23.5794  
>>[Q*[C1;zeros(8,1)] yhat]  
ans      67.08         67.08  
         86.35         86.35  
        105.62        105.62  
        124.89        124.89  
        144.16        144.16  
        163.43        163.43  
        182.70        182.70  
        201.96        201.96  
        221.23        221.23  
        240.50        240.50

3.4.4 Givens Rotations

What are Given's rotations good for?

You can use them to zero out individual isolated elements in any matrix, without changing any of the norms of the vectors, these transformations are orthogonal.

They are based on simple 2 × 2 ones of the form:

[        ]
   c   s

  - s  c

You can choose c and s so that a certain vector is orthogonally transformed (actually, rotated) into a vector y = (y1,y2) whose second coordinate is zero.

This is of course easy by solving the equations

           2   2
          c + s  = 1                                   (3.1)
y2 = - sx1 + cx2 = 0                                   (3.2)

                                                       (3.3)

The examples that follow show two ways of doing this:

%%%%Van Loan's Function, Chapter 7%%%%%%%%  
  function [c,s] = Rotate(x1,x2);  
% Pre:  
%   x1,x2   scalars  
% Post:  
%   c,s     c^2+s^2=1 so -s*x1 + c*x2 = 0.  
%  
   if x2==0  
      c = 1;  
          s = 0;  
   else  
      if abs(x2)>=abs(x1)  
             cotangent = x1/x2;  
                 s = 1/sqrt(1+cotangent^2);  
                 c = s*cotangent;  
          else  
             tangent = x2/x1;  
                 c = 1/sqrt(1+tangent^2);  
                 s = c*tangent;  
          end  
   end  
%%%%%%%%%%%%%%%The resident function:%%%%%%%  
function [G,x] = planerot(x)  
%PLANEROT Generate a Givens plane rotation.  
%       [G,Y] = PLANEROT(X), where X is a 2-component column vector,  
%       returns a 2-by-2 orthogonal matrix G so that Y = G*X has Y(2) = 0.  
%       See also QRINSERT, QRDELETE.  
if x(2) ~= 0  
   r = norm(x);  
   G = [x'; -x(2) x(1)]/r;  
   x = [r; 0];  
else  
   G = eye(2);  
end  
 
 
>> x=[2 3]';  
>> [G,y]=planerot(x)  
 
G =  
          0.55          0.83  
         -0.83          0.55  
y =  
          3.61  
             0

Now in order to extend this to a general context you just notice that if you fill out the rest of the matrix with 1's on the diagonal, it will not change the other coordinates thus:

 |_                   _| 
    c   0  s  0  0
    0   1  0  0  0

   - s  0  c  0  0

 |_   0   0  0  1  0  _| 
    0   0  0  0  1
is a rotation in the (1,3) plane in 5 dimensions. In order to reduce to upper triangular form, one cycles through all the elements that you want to zero out one by one.
 |_                _|     |_                 _|     |_                _| 
   ×  ×   ×   ×         ×   ×  ×   ×         ×   ×   ×  ×

   ×  ×   ×   ×         ×   ×  ×   ×         ×   ×   ×  ×

   ×  ×   ×   ×   -->     ×   ×  ×   ×    -->    ×   ×   ×  ×
 |_  ×  ×   ×   ×  _|     |_  ×   ×  ×   ×   _|     |_  0  ×   ×  ×   _| 

   ×  ×   ×   ×         0   ×  ×   ×          0  ×   ×  ×

3.4.5 R examples handout

3.4.6 Backsubstitution

backsub=function(X,y){  
#Backsubstitution to find solution in a triangular case  
l=dim(X)  
p=l[2]  
b=matrix(0,p,1)  
b[p,1]=y[p,1]/X[pp]  
for (j in seq(p-1,1,-1){  
b[j,1]=(y[j,1]-X[j,(j+1):n]*b[(j+1):n,1])/X[j,j]  
}  
return(b)}

3.5 Householder Functions

Algorithm  (Householder Vector) in R  
 
house= function(x){  
#Householder vector calculation  
p = length(x);  
mu = sqrt(x%*%x)  
v = x  
if (mu!= 0)  
{  
beta = x[1] + sign(x[1])*mu  
 
v[2 : p,1] = v[2 : p,1]/beta  
}  
v[1] = 1  
return(v)  
}

Algorithm  (Householder Vector) in matlab  
 
function v = house(x);  
%Compute a Householder vector  
m = length(x);  
mu = norm(x,2);  
v = x;  
if mu ~= 0  
       beta = x(1) + sign(x(1))*mu;  
      v(2 : m,1) = v(2 : m,1)/beta;  
end  
v(1) = 1

3.5.1 Householder Multiplication-Row

rowhouse=function(X,v){  
beta=-2/(t(v)%*%v)  
w=beta*t(X)%*%v  
X=X+v%*%t(w)  
return(X)}

3.6 QR decomposition in R

qr                   package:base                   R Documentation  
The QR Decomposition of a Matrix  
Description:  
     'qr' computes the QR decomposition of a matrix.  It provides an  
     interface to the techniques used in the LINPACK routine DQRDC or  
     the LAPACK routines DGEQP3 and (for complex matrices) ZGEQP3.  
Usage:  
     qr(x, tol = 1e-07 , LAPACK = FALSE)  
     qr.coef(qr, y)  
     qr.qy(qr, y)  
     qr.qty(qr, y)  
     qr.resid(qr, y)  
     qr.fitted(qr, y, k = qr$rank)  
     qr.solve(a, b, tol = 1e-7)  
     ## S3 method for class 'qr':  
     solve(a, b, ...)  
     is.qr(x)  
     as.qr(x)  
 
Arguments:  
       x: a matrix whose QR decomposition is to be computed.  
     tol: the tolerance for detecting linear dependencies in the  
          columns of 'x'. Only used if 'LAPACK' is false and 'x' is  
          real.  
      qr: a QR decomposition of the type computed by 'qr'.  
    y, b: a vector or matrix of right-hand sides of equations.  
       a: A QR decomposition or ('qr.solve' only) a rectangular matrix.  
       k: effective rank.  
  LAPACK: logical. For real 'x', if true use LAPACK otherwise use  
          LINPACK.  
     ...: further arguments passed to or from other methods  
Details:  
     The QR decomposition plays an important role in many statistical  
     techniques.  In particular it can be used to solve the equation  
     *Ax* = *b* for given matrix *A*, and vector *b*.  It is useful for  
     computing regression coefficients and in applying the  
     Newton-Raphson algorithm.  
 
     The functions 'qr.coef', 'qr.resid', and 'qr.fitted' return the  
     coefficients, residuals and fitted values obtained when fitting  
     'y' to the matrix with QR decomposition 'qr'. 'qr.qy' and 'qr.qty'  
     return 'Q %*% y' and 't(Q) %*% y', where 'Q' is the (complete) *Q*  
     matrix.  
     All the above functions keep 'dimnames' (and 'names') of 'x' and  
     'y' if there are.  
     'solve.qr' is the method for 'solve' for 'qr' objects. 'qr.solve'  
     solves systems of equations via the QR decomposition: if 'a' is a  
     QR decomposition it is the same as 'solve.qr', but if 'a' is a  
     rectangular matrix the QR decomposition is computed first.  Either  
     will handle over- and under-determined systems, providing a  
     minimal-length solution or a least-squares fit if appropriate.  
     'is.qr' returns 'TRUE' if 'x' is a 'list' with components named  
     'qr', 'rank' and 'qraux' and 'FALSE' otherwise.  
 
     It is not possible to coerce objects to mode '"qr"'.  Objects  
     either are QR decompositions or they are not.  
Value:  
     The QR decomposition of the matrix as computed by LINPACK or  
     LAPACK. The components in the returned value correspond directly  
     to the values returned by DQRDC/DGEQP3/ZGEQP3.  
 
      qr: a matrix with the same dimensions as 'x'. The upper triangle  
          contains the *R* of the decomposition and the lower triangle  
          contains information on the *Q* of the decomposition (stored  
          in compact form).  Note that the storage used by DQRDC and  
          DGEQP3 differs.  
   qraux: a vector of length 'ncol(x)' which contains additional  
          information on *Q*.  
 
    rank: the rank of 'x' as computed by the decomposition: always full  
          rank in the LAPACK case.  
 
   pivot: information on the pivoting strategy used during the  
          decomposition.  
     Non-complex QR objects computed by LAPACK have the attribute  
     '"useLAPACK"' with value 'TRUE'.  
Note:  
     To compute the determinant of a matrix (do you _really_ need it?),  
     the QR decomposition is much more efficient than using Eigen  
     values ('eigen').  See 'det'.  
     Using LAPACK (including in the complex case) uses column pivoting  
     and does not attempt to detect rank-deficient matrices.  
References:  
     Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S  
     Language_. Wadsworth & Brooks/Cole.  
     Dongarra, J. J., Bunch, J. R., Moler, C. B. and Stewart, G. W.  
     (1978) _LINPACK Users Guide._  Philadelphia: SIAM Publications.  
     Anderson. E. and ten others (1999) _LAPACK Users' Guide_. Third  
     Edition. SIAM.  
      Available on-line at <URL:  
     http://www.netlib.org/lapack/lug/lapack_lug.html>.  
See Also:  
     'qr.Q',  'qr.R',  'qr.X' for reconstruction of the matrices.  
     'lm.fit',  'lsfit', 'eigen', 'svd'.  
     'det' (using 'qr') to compute the determinant of a matrix.

Examples:  
hilbert <- function(n) { i <- 1:n; 1 / outer(i - 1, i, "+") }  
 h9 <- hilbert(9); h9  
     qr(h9)$rank           #--> only 7  
     qrh9 <- qr(h9, tol = 1e-10)  
     qrh9$rank             #--> 9  
     ##-- Solve linear equation system  H %*% x = y :  
    y <- 1:9/10  
     x <- qr.solve(h9, y, tol = 1e-10) # or equivalently :  
     x <- qr.coef(qrh9, y) #-- is == but much better than  
                           #-- solve(h9) %*% y  
     h9 %*% x              # = y

h9 %*% (solve(h9) %*% y)  
            [,1]  
 [1,] 0.09999996  
 [2,] 0.20000011  
 [3,] 0.30000012  
 [4,] 0.40000010  
 [5,] 0.50000008  
 [6,] 0.60000007  
 [7,] 0.70000005  
 [8,] 0.80000004  
 [9,] 0.90000003  
> h9 %*% qr.coef(qrh9, y)  
      [,1]  
 [1,]  0.1  
 [2,]  0.2  
 [3,]  0.3  
 [4,]  0.4  
 [5,]  0.5  
 [6,]  0.6  
 [7,]  0.7  
 [8,]  0.8  
 [9,]  0.9

 > round(h9,4)  
        [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]   [,9]  
 [1,] 1.0000 0.5000 0.3333 0.2500 0.2000 0.1667 0.1429 0.1250 0.1111  
 [2,] 0.5000 0.3333 0.2500 0.2000 0.1667 0.1429 0.1250 0.1111 0.1000  
 [3,] 0.3333 0.2500 0.2000 0.1667 0.1429 0.1250 0.1111 0.1000 0.0909  
 [4,] 0.2500 0.2000 0.1667 0.1429 0.1250 0.1111 0.1000 0.0909 0.0833  
 [5,] 0.2000 0.1667 0.1429 0.1250 0.1111 0.1000 0.0909 0.0833 0.0769  
 [6,] 0.1667 0.1429 0.1250 0.1111 0.1000 0.0909 0.0833 0.0769 0.0714  
 [7,] 0.1429 0.1250 0.1111 0.1000 0.0909 0.0833 0.0769 0.0714 0.0667  
 [8,] 0.1250 0.1111 0.1000 0.0909 0.0833 0.0769 0.0714 0.0667 0.0625  
 [9,] 0.1111 0.1000 0.0909 0.0833 0.0769 0.0714 0.0667 0.0625 0.0588

3.7 Singular value decompostion and pc regression

3.7.1 Ridge Regression

Ridge regression in R

>library(MASS)  
>help(lm.ridge)

(see Faraways chapter 9 )

Example of a matlab ridge regression function:

function bks=ridge(Z,Y,kvalues)  
% Ridge Function of Z (centered, explanatory)  
% Y is the response,  
% kvalues are the values where to compute  
[n,p]=size(Z);  
ZpY=Z'*Y;  
ZpZ=Z'*Z;  
m=length(kvalues);  
bks=ones(p,m);  
for k =1:m  
 bks(:,k)=(ZpZ+diag(kvalues(k)))\ZpY;  
end  
>> kvalues=(0:.05:.5)  
kvalues =  
  Columns 1 through 7  
         0    0.0500    0.1000    0.1500    0.2000    0.2500    0.3000  
  Columns 8 through 11  
    0.3500    0.4000    0.4500    0.5000  
>> ridge(Z,Y,(0:.05:.5))  
ans =  
  Columns 1 through 7  
    1.5511    1.5176    1.4882    1.4622    1.4390    1.4183    1.3996  
    0.5102    0.4775    0.4488    0.4234    0.4009    0.3806    0.3624  
    0.1019    0.0678    0.0378    0.0113   -0.0122   -0.0334   -0.0524  
   -0.1441   -0.1762   -0.2043   -0.2292   -0.2514   -0.2713   -0.2892  
  Columns 8 through 11  
    1.3827    1.3673    1.3532    1.3403  
    0.3459    0.3309    0.3172    0.3046  
   -0.0696   -0.0853   -0.0997   -0.1128  
   -0.3054   -0.3201   -0.3336   -0.3460  
%Formula gives for choice of k:  
>> norm(Yhat-Yc)^2  
ans =  
   47.8636  
>> 47.8636/(13-5)  
ans =  
    5.9829   % estimates the variance sigma^2  
>> bk0'*bk0  
ans =  
    2.6973  
>> k=(4*5.9829)/2.6973  
k =  
    8.8724     % This is a suggested value for k

3.7.2 Principal Components Regression

If we decompose the centered and rescaled matrix

Z = U SV ', with U 'U = I , V 'V  = I
                         n          p
Call the new variables the components: C = US, their norms arethe singular values: C'C = S2.
Y =  1nb0 + Zb +  e

3.7.3 Principal Components

A quick summary, all is detailed below:

Maximization of u'Au with u'Mu = 1

1) X centered, all points (observations) same weight.

                       sum r
1X  =  0  and   xij =    xitstvjt,
                      t=1
p variables can be replaced by the r columns of v.
        k
 (k)   sum 
xij =     uitstvjt
       t=1
is the best (optimal) approximation of rank k for X.

The columns of V are the directions along which the variances are maximal.

Definition: Principal components are the coordinates of the observations on the basis of the new variables (namely the columns of V ) and they are the rows of G = XV = US. The components are orthogonal and their lengths are the singular values C'C = SU'US = S2.

In the same way the principal axes are defined as the rows of the matrix Z = U'X = SV '. Coordinates of the variables on the basis of columns of U.

Z =  S- 1C'X   and   G  = XZ'S  -1.
However, this decomposition will be highly dependent upon the unity of measurement (scale) on which the variables are measured. It is only used, in fact, when the X.k are all of the same \order". Usually what is done is that a weight is assigned to each variable that takes into account its overall scale. This weight is very often inversely proportional to the variable's variance. So we define a dierent metric between observations instead of
                         '
d (xi.,xj.)  =  (xi.- xj.) ,(xi.-  xj.)
d (xi.,xj.)  =  (xi.- xj.),Q  (xi.-  xj.)
               (  -1          )
                  s21
        Q   =         ...       .
                           -1
                           s2p
The same can be said of the observations; some may be more \important" than others (resulting from a mean of a group which is larger).

The new model will be:

                                       Y  =   1nb0 + Cg  + e
                     Zb  = ZV  V'b = Cg

The best approximation  to rank  k of b :
                                      b+  =   V(k)^g(k)
                                       k
Taking V (k) to be the first k columns of V . The bias of ^b will be: V (p-k)V (p-k)b

Example with R
First rescale to have unit variance:

> apply(Z,2,sd)  
    5.8824   15.5609    6.4051   16.7382  
sweep(Z,2,sd)

Then do the singular value decompositions:

> svd(Zt)$d  
    5.1796    4.3489   1.4964    0.1396

> U=svd(Zt)$u;S=diag(svd(Zt)$d)  
> C=U%*%S;

3.8 More than I said on the Singular Value Decomposition

Singular Value Decomposition

Of numerical and statistical interest is the decomposition of X known as the s.v.d.

Up to now we have shown that X could be written X = UR, U orthogonal n × n R = [     ]
  R1
   0p × p upper trianglar. The transformation of y by U' is c = U'y + e, and the solution to the U's problem is provided by bs such that

E(ci) = R1 ^b.
Could we find a transformation of b such that it could be even simpler (diagonal):
    h  =  V 'b
  '         '               '
U  ^y1  =  E  (c1) = Sh  = SV  ^b
    ^y  =  X b^=  U SV 'b^
               '
   X   =  U SV  .
Theorem:
Let X be an arbitrary n × p matrix, n > p; then there exists Un×n orthogonal and V p × p orthogonal such that:
                                   (  s1  0   0  ...  0  )

        X   =   UnnSnpV p'p    S  =    0   s2  0  ...  0
                                      0   0   0  ...  sp

s1 > s2 >  ...  > sp > 0 ...0.

                (Induction see Golub  & Van  Loan, chapt  1).
The si, for i = 1,...,p are called the singular values of X.

Columns of U are left-singular vectors, columns of V right-singular vectors.

Remark: When X is symmetric positive semi-definite V = V and the singular values of X are its eigenvalues. On the other hand, X'X = V S2V '. The eigenvalues of X'X are the square of the singular values of X (and XX' --> U).

Statistically: E(Y ) = X^
b = USV '^
b = US^
h = XV ^
h . ^
h is the vector of regression coefficients on a new collection of p orthogonal regressors and these coefficients are rotated versions of the original p regression coefficients.

SE(^hi) = s/si  because   ^hi = U 'yi/si,
si small: then SE(^h i) large, which mean that the data provides little information about the regression coefficient on Xvi.

When is Si small?

       2   '  '        '  '  '    '      ' 2     2
||Xvi  || = viX  Xvi = viV S U U SV  vi = eiS  ei = si;
|si| is small i the squared length of Xvi is small; that is, if a partiuclar combination of columns of X is nearly zero, that is when there is near collinearity.

Courant-Fischer Theorem:
Xn×p, n > p and singular values s1 > s2...> sp > 0.

                                 ||u'X||-
sk  =   maxdim_O_=k   minu( - _O_     ||u||
    =   maxdim_O_=k  minu( - _O_   || u'X  ||      _O_ < Rn
                        ||u||=1                      p
    =   maxdimT =k   min v (- T   ||Xv  ||       T <  R
                        ||v||=1
If the si's are distinct and nonzero then u in 2) is uk = U.k and v = vk = V .k.

sk is then the largest possible value for the length of X.

V i is the vector for which the corresponding linear combination of the columns of X has greatest sum of squares. (d1 is its square root).

If the column means have been removed from X then this corresponds to the linear combination that has the largest variance.

max I(v'b : u'y)a s 12

h = (h1,...,hp) parameter of interest
log likelihood l(h) achieves its maximum value for l˙ (h) = @l(h)
 @h = 0p, where

           (                 )
  ˙l(h)  =    @l(h)-...@l(h)--  ,
              @h1       @hp
            @2l(h)
¨l(h)ij  =  ------- where ¨l(h) is a p × p matrix
           @hi@hj
for h to be a minimum in the vector case ¨l (h) must also be negative definite.

The Fisher information or information matrix for h is defined to be

I(h : x) = [- Eh(l¨x( )]h = Eh [˙lx( )][ ˙lx( )]'|h
under general cnditions when the true value of h is h0, the asymptotic variance matrix of the MLE ^h is given by the inverse of the information matrix.

e = Y - X^
b r = rank(X) k2(X) = d1/dr.

TSS = SSM + RSS
Total = Model + Residual
R2 = SSM/TSS = 1 - RSS/TSS.

Theorem: e = max(            )
  ||DX||- ||DY-||
   ||X || , ||Y|| < --1-
k(X).

^g estimate for the perturbed problem of reg. coe.

ê estimate of residually perturbed problem of reg. coe.

For R2 < 1

              (            V~ -------       )
|^b---^g|         2k1(X)--  --1---R2-      2        2
  |^b|    <   e    R     +    R     k2(X)    +  O(e ),

|^ell---l                                2
  |Y |    <   e(1 + 2k(X)(n -  p)) + O(e )
* function of square of the condition number of X.

3.8.1 Generalized Inverses

What if X is not full rank?
This will often happen when coding for analysis of variance.

X'X singular, the vector of coefficients b is not estimable although some linear combinations of b's are estimable. (Those for which I(c'b : Y ) = c'X'Xc are nonzero.

c'b estimable if there exists a vector l such that

E(l'Y )       =        c'b
   '                   '       '     '     *' '    *'   '
E(l Y )       =        lXb  = l V SV  b = l V  b  l  =  lV X
    c'b  estimable  i    c'=  l*'V '
(a) equivalent to c' of the form a'X.

(b) equivalent to c'(X'X)-X'c = cc'.

                        (                      )
                           -1
                           s1  .
                                ..
                                    1-
X+  =  U S+V '    S+  =             sr            .
                                        0  .
                                           ..
                                             0

                                        0
This solution is the one with the minimum squared length, is the same as setting hl+1 = hr+2 = ... = hp = 0. (Otherwise  oo )

X+ is the Moore-Penrose pseudo-inverse of X and if X n×p then Xp×n+ is defined as the unique matrix for which:

The solution ^b = X+Y is equivalent to ^b = (X'X)+X'Y .

3.9 Principal Components

3.9.1 Eigenvalue Analysis

Summary:

Example of the Power method with Matlab
function [vec,value]=power(start,A,toler)  
%  
%Power method for computing eigenvalues  
%  
dd=1;  
x=start;  
n=10;  
while dd> toler  
y=A*x  
dd=abs(norm(x)-n);  
n=norm(x)  
x=y/n  
pause  
end  
vec=x;  
value=n;

>> v1  
v1 =  
 
     1  
     1  
     1  
>> A  
A =  1     1     1  
     1     2     3  
     1     3     6

>> power(v1,A,1/10)  
y =  
     3  
     6  
    10  
dd =  
     8.2679  
x =  
    1.7321  
    3.4641  
    5.7735  
y =10.9697  
   25.9808  
   46.7654  
n = 6.9522  
x =  
    1.5779  
    3.7370  
    6.7267  
y =  
   12.0416  
   29.2320  
   53.1491  
n =  
    7.8552  
x =  
    1.5330  
    3.7214  
    6.7661  
y =  
   12.0205  
   29.2741  
   53.2940  
n =  
    7.8727  
x =  
    1.5269  
    3.7184  
    6.7695  

[v3     e3] = power(v1, A, 1/10)
v3 =  
 
    0.1939  
    0.4723  
    0.8598  
e3 =  
    7.8727  
>> [ve,ei]=eig(A)  
ve =  
    0.5438   -0.8165    0.1938  
   -0.7812   -0.4082    0.4722  
    0.3065    0.4082    0.8599  
ei =  
 
    0.1270         0         0  
         0    1.0000         0  
         0         0    7.8730  
 
 

Chapter 4
EM algorithm

Motivation:

A method for finding maximum likelihood estimates,
-either in presence of missing data.
-or when the model can be simplified by adding `latent parameters'.

Some references:

Hartley, 1958, Biometrics, 174-194, is a good starting place because he does simple examples.

Apparently the method was discovered by Oscar Rothaus while he was at IDA, with Baum, who then published several papers of which an example is Baum, Petrie, Soules and Weiss, (1970), Ann. Math. Stat., 164-171 who use the idea for speech recognition.

The main reference that people cite is Dempster, Laird et al. (1977), JRSS B - they coined the name EM.

It is hard to motivate with a `trivial example' because the the likelihood is easy to dierentiate.

Most of the elementary examples come from the multinomial likelihoods.

A very recent reference is the book called: The EM algorithm and Extensions, by G.J. McLachlan and T. Krishnan,(1997), Wiley.

Here is a first motivating example

Example 1:
Multinomial counts of animals into 4 categories

y = (y ,y ,y ,y ) = (125,18,20, 34)
      1  2  3  4
the underlying genetic model is believed to be:
 1    (1-  h)2 h(2-  h) h(2 - h)  (1 - h)2
(--+  --------,--------,--------, -------)
 2       4        4         4        4

Then we can consider that this is in fact a five-category problem with first two categories being split.

Call y = (1 - h)2, then the density is proportional to

        y1       y2       y3  y4
(2 + y)  (1 - y)  (1 - y)  y
So that the log likelihood is
l(y) =  logL(y)  =  y1 ln(2 + y) + y2 ln(1 - y) + y3 ln(1 - y) + y4 ln(y)
Dierentiating it gives:
@l      y        y        y     y
--- = ---1-- - ---2---  ---3--+ --4
@y    2 + y    1 - y    1-  y    y
        ---y1----  ---y2----  ---y3----   y4-
I(y) =  (2 + y)2 + (1 - y)2 + (1 - y)2 +  y2
A little symbolic verication:
 maple  
    |\^/|     Maple V Release 4 (WMI Campus WIde License)  
._|\|   |/|_. Copyright (c) 1981-1996 by Waterloo Maple Inc. All rights  
 \  MAPLE  /  reserved. Maple and Maple V are registered trademarks of  
 <____ ____>  Waterloo Maple Inc.  
      |       Type ? for help.  
> L:=psi->((2+psi)^y1)*((1-psi)^y2)*((1-psi)^y3)*psi^y4;  
                                y1          y2          y3    y4  
      psi -> L(psi) = (2 + psi)   (1 - psi)   (1 - psi)   psi  
 
> ell:=x->log(L(x));  
> ell(psi);  
                         y1          y2          y3    y4  
             ln((2 + psi)   (1 - psi)   (1 - psi)   psi  )  
 
> expand(");  
    y1 ln(2 + psi) + y2 ln(1 - psi) + y3 ln(1 - psi) + y4 ln(psi)  
> diff("",psi);  
                     y1        y2        y3      y4  
                   ------- - ------- - ------- + ---  
                   2 + psi   1 - psi   1 - psi   psi  
> diff(",psi);  
                   y1           y2           y3        y4  
             - ---------- - ---------- - ---------- - ----  
                       2            2            2      2  
               (2 + psi)    (1 - psi)    (1 - psi)    psi  
 
> Ifish(psi):=-";  
                          y1           y2           y3        y4  
        Ifish(psi) := ---------- + ---------- + ---------- + ----  
                              2            2            2      2  
                     (2 + psi)    (1 - psi)    (1 - psi)    psi  
 
> lprime:=diff(ell(psi),psi);  
> sols:= solve(lprime=0,psi);  
                                2                                     2  
 sols:=1/2 (y1 - y4 - 2 y3 - 2 y2 + (y1  + 6 y1 y4 - 4 y1 y3 - 4 y1 y2 + 9 y4  
 
                                 2                 2 1/2  
     + 12 y3 y4 + 12 y2 y4 + 4 y3  + 8 y3 y2 + 4 y2 )   )/(y1 + y2 + y3 + y4),  
 
                                    2                                     2  
    1/2 (y1 - y4 - 2 y3 - 2 y2 - (y1  + 6 y1 y4 - 4 y1 y3 - 4 y1 y2 + 9 y4  
 
                                 2                 2 1/2  
     + 12 y3 y4 + 12 y2 y4 + 4 y3  + 8 y3 y2 + 4 y2 )   )/(y1 + y2 + y3 + y4)

> y1:=125;y2:=18;y3:=20;y4:=34;  
> evalf(sols);                     .6268214980, -.5506793660  
> y1:=1997;y2:=906;y3:=904;y4:=32;  
> plot(ell(psi),psi=0..0.5);  
  +    AA  
  + AAA  AAAAAAA  
  +             AAAA  
1200                AAAA  
  +                     AAA  
  +                        AAA  
  +                           AAA  
1100                             AAA  
  +                                AAA  
  +                                   AAA  
  +                                     AAA  
  +                                        AA  
1000                                         AA  
  +                                            AA  
  +                                              AA  
  +                                                AA  
900                                                 AAA  
  +                                                   AA  
  +                                                     AA  
  +                                                       AA  
800                                                         AA  
  +                                                          AA  
  +                                                            AA  
  +                                                             AA  
  +                                                               AA  
700                                                                 A  
  +                                                                  A  
  +                                                                   AA  
  +                                                                    AA  
600                                                                     AA  
  +                                                                       A  
  +                                                                        AA  
  +--+--+--+--+--+--+--+--+--+--+--+--+-+--+--+--+--+--+--+--+--+--+--+--+--*  
  0    ^         0.1            0.2           0.3            0.4            0.5  
       |  
> evalf(sols);                           .0357123023, -.4668141517  
> y1:='y1';y2:='y2';y3:='y3';y4:='y4';

The solutions are:

                         V~ ---------------------------------------------------------------------------
    y1- y4- 2 y3- 2 y2 +  y12 + 6y1y4 - 4y1y3 - 4y1 y2 + 9 y42 + 12y3 y4 + 12 y2y4 + 4 y32 + 8y3 y2 + 4 y22
1/2 -------------------------------------------------------------------------------------------------
                                            y1 + y2 + y3 + y4
                         V~ ---------------------------------------------------------------------------
    y1- y4- 2 y3- 2 y2-   y12 + 6y1y4 - 4y1y3 - 4y1 y2 + 9 y42 + 12y3 y4 + 12 y2y4 + 4 y32 + 8y3 y2 + 4 y22
1/2 ----------------------------------------y--+-y-+-y--+-y------------------------------------------
                                             1    2   3    4
This is the negative solution that we won't be wanting.

Now, suppose we didn't have the equation solver available and that we want to nd the solution using the EM algorithm.

Split the rst cell into two cells with respective counts y11 + y12 = y1, then the new likelihood is proportional to:

      y11 y12       y2       y3  y4
y '-->  2   y   (1-  y)  (1 - y)  y
whose derivative is:
- y12 + y12y + y2y + y3 y - y4 + y4y
-------------------------------------
             y (- 1 + y)
This is the same as a binomial case with a sample size of y12 + y2 + y3 + y4, k = y12 + y4. The maximum likelihood estimate for y is:
y12 +-yy122-++-yy43-+-y4
 L:=psi->((2^y11)*(psi)^y12)*((1-psi)^y2)*((1-psi)^y3)*psi^y4;  
                          y11    y12          y_{2}          y3    y4  
             L := psi -> 2    psi    (1 - psi)   (1 - psi)   psi  
 
> ell(psi);  
                  y11    y12          y2          y3    y4  
              ln(2    psi    (1 - psi)   (1 - psi)   psi  )  
 
> diff(",psi);  
> simplify(");  
                -y12 + y12 psi + y2 psi + y3 psi - y4 + y4 psi  
                ----------------------------------------------  
                                psi (-1 + psi)  
 
> solve("=0,psi);  
                                   y12 + y4  
                              ------------------  
                              y12 + y2 + y3 + y4  

logLc(y) =  (y12 + y4)logy + (y1 + y3)log(1 - y)
      (0)
Q(y; y   ) = Ey(0)[logLc(y)|y]
logLc(y)is a linear function of the unobservables y11 and y12, we replace them with their current conditional expectations given the observed y.
                1
Y11 ~ B(y1, 1---21--(0))
            2 + 4y
So that
                  12 y1
Ey(0)(Y11| y1) = -1---1-(0)
                2 + 4y
This is all that is needed at this step as y12(0) = y 1 - y11(0).

The maximisation step doesn't need any more computations, as now we have values for the unobsevrables that we just plug into

     y12 + y4
------------------
y12 + y2 + y3 + y4
giving:
  (1)   -----y(10)2-+-y4-----   y(102)+-y4-
y   =   (0)               =       (0)
       y12 + y2 + y3 + y4   n -  y11

4.0.2 Implementation

Here is the little R function that I used to check the computations:

emalgo=function(y1,y2,y3,y4,tol=0.0005,start0=0.2)  
{  
#EM algorithm for Rao's multinomial model  
#  
#y1,y2,y_3,y_4 are the observed frequencies  
#  
#Initializations  
psilast=0;  
n=y1+y2+y3+y4;  
psicurrent=start0;  
psi=psicurrent;  
while (abs(psilast-psi)>tol ){  
   v=estep(psicurrent,y1)  
   y11=v[1]  
   y12=v[2]  
   psi=mstep(y12,y11,y4,n)  
   psilast=psicurrent  
   psicurrent=psi  
   cat('psicurrent=',psicurrent,'\n')  
}}  
 
 
mstep=function(y12,y11,y4,n){  
return((y12+y4)/(n-y11))}  
 
estep=function(psicurrent,y1){  
 y11=(.5*y1)/(.5+psicurrent/4);  
 y12=y1-y11  
return(c(y11,y12))}  

Results:

 tol=10^(-7);  
 start0=.05;  
emalgo(y1,y2,y3,y4,tol,start0)  
a=emalgo(y1,y2,y3,y4,tol=10^(-7),start0)  
psicurrent= 0.4936627  
psicurrent= 0.6072183  
psicurrent= 0.6241805  
psicurrent= 0.6264701  
psicurrent= 0.6267748  
psicurrent= 0.6268153  
psicurrent= 0.6268207  
psicurrent= 0.6268214  
psicurrent= 0.6268215

4.0.3 Estimating Mixture Proportions

Suppose the density is of the form:

           sum g
f(w, y) =     pifi(w)
          i=1
where only the mixing proportions are unknown, the actual densities fi are supposed to be completely specied. The unknown parameter is g - 1 dimensional: y = (p1,p2,...,pg-1) y = (w1,...wn) is observed from the mixture. The loglikelihood from the observed is:
            n                 n    (   g         )
            sum                 sum         sum 
logL(y) =     logf (wj;y) =     log      pifi(wj)
           j=1               j=1       i=1
We dierentiate logL(y) with regards to pi and then we have to nd the likelihood equations:
 sum n
   {-fi(wj)---  fg(wj)--}=  0
j=1 f (wj; y)    f(wj;y)
We can't give a closed form solution. We introduce the dummy variables: z = (z1,...zn) , where each zi is a binary vector of length g taking on the value 1 at the coordinate of the group it belongs to.

If we observed the z's, the mle of pi would be ^pi = zi.
 n.

Take the new, complete data vector to be x = (y,z).

                sum   sum              sum   sum 
If logL (y) =         z  logp +         z logf (w )
        c              ij               ij   i   j
                g   j            g   j
the second term on the right does not contain pi, we ignore it.

The E-step:

Ey(k)(Zij|y)  =  proby(k)(Zij = 1| y)

             =   proby(k)(Zij =-1-&--y)-
                      proby(k)(y)
                 prob (k)(y|Z  =  1)prob (k)(Z  =  1)
             =   ----y------ij---------y----ij------
                             proby(k)(y)
                  (k)
             =   pi--fi(wj)--= z(ikj)
                 f(wj;y(k))
The M-step:
Just act as if we knew the zij to be zij(k), then we have the maximum of the likelihood at:
      (k)
     zi.-
^pi =  n

4.0.4 EM for exponential families

Suppose the complete density is of the form:

gc(x;y) = b(x)e(y)'t(x)/a(y)
We will suppose the model of full rank, ie the sucient statistic t(x) to be of the same dimension as y.
            @loga(y)
Ey[t(X)] =  ---------
               @y
Q(y, y(k)) = y'Ey(k)[t(X)|y]-  loga(y)
Dierentiate with regards to y and you nd that to maximise Q you must take:
Ey[t(X)]  = t(k) = Ey(k)[t(X) |y]
so we need to solve:
@loga(y)@y  = Ey(k)[t(X) |y]

As an application, you can consider any multinomial as an exponential family, (see Lehmann, Theory of Point Estimation, page 28), so that they all enter this case.

A particular example was the one I did the rst day where the new parameter space had a natural interpretation, that of the gene counting example, where the rst mutinomial is the observed one of the frequencies of the observed phenotypes. y = (nA,nB,nAB,nO), the parameter of interest is the probability of each of the genotypes A,B,O = (p,q,r), we can usethe augmented model: x = (nAA,nBA,nBB,nOO)

Here is a link to some very serious papers about tomography: http://www.stats.bris.ac.uk/pub/reports/TOMO.

And here is a link to a nice application of the EM to Bayesian self-organizing map simulations: http://www.aist.go.jp/NIBH/ b0616/Lab/BSOM1/

Chapter 5
Monte Carlo Simulations

5.1 Monte Carlo methods

There is not necessarily a random component in the original problem that one wants to solve,usually a problem for which there is no analytical solution. An unknown parameter (deterministic) is expressed as a parameter of some random distribution, that is then simulated. The oldest well-known example is that of the stimation of p by Buon, in his needle on the oorboards expeiment, where supposing a needle of the same length as the width between cracks we have:

p(needle crosses crack) = 2- implying  ^p =  2#tries-
                          p                 #hits
In physics and statistics many of the problems Monte Carlo is used on is under the form of the estimate of an integral unkown in closed form:
     integral 
       1
h =     f(u)du  which can be seen as the evaluation of Ef (u), where u ~  U(0,1)
      0
  1. The crude, or mean-value Monte Carlo method thus proposes to generate B numbers uniformly from (0, 1) and take their average: to estimate h,
            B
^   -1  sum 
h = B      f(ub)
       b=1
  2. The hit-or-miss Monte Carlo method generates random points in a bounded rectangle and counts the number of 'hits' or points that are in the region whose area we want to evaluate.
    h^=  #hits--
     #total

Which estimate is better?
This is similar to comparing statistical estimators in general. There are certain desirable properties that we want estimators to have, consistency which ensures that as the sample size increases the estimates converges to the right answer is ensured here by the properties of Riemann sums. Other properties of interest are:

Both the above methods are unbiased, that is when repeated many times their average values are centred in the actual value h.

E(^h) = h
So the choice between them lies in nding the one which has the less variance.

More precisely, the variance of crude Monte Carlo is

          integral  1                           2                  2
  2    1-             2     E(f-(u)---h)-    1-      2    h--
s M =  n  0 (f(u) - h) du =       n       =  nE(f (u) )-  n
and that of hit and miss Monte Carlo, which is just a Binomial(n,h) is:
s2  = h(1---h)
  H       n
The dierence between these two variances is always positive:
               integral  1
 2     2    1-
sH -  sM =  n  0  f(u)(1-  f(u))du > 0
Most improvements to Monte Carlo methods are variance-reduction techniques.

5.1.1 Antithetic Resampling

Suppose we have two random variables that provide estimators for h, X and Y , that they have the same variance but that they are negatively correlated, then 12(X + Y ) will provide a better estimate for h because it's variance will be smaller.

This the idea in antithetic resampling (see Hall, 1989). Suppose we are interested with a real-valued parameter, and that we have ordered our original samplex1 < x2...xn, for each resample X* = {x j1,xj2,...,xjn} and statistic s(X*) we associate X** by taking a special permutation of the ji's that will make cov(X**),s(X*)) < 0, and as small as possible. If the statistic is a smooth function of mean for instance, then the 'reversal permutation' that maps 1 into n, 2 into n - 1, etc... is the best, the small sample values are transformed into the larger observations, and the average of these two estimates will give an estimate with smaller variance.

5.1.2 Importance Sampling

This is often used in simulation, and is a method to work around the small area problem. If we want to evaluate tail probabilities, or very small areas, we may have very few hits of our random number generator in that area. However we can modify the random number generator, make that area more likely as long as we take that into account we we do the summation. Importance sampling is based on the equalities:

 integral  1           integral  1               integral  1
                  f-                f-
  0 f(u)du =   0  g(u)g(u)du =   0  g(u)dG(u)

5.2 Random Number Generation- Uniform[0, 1]

Try the following web site pages: http://www.maa.org/mathland/mathland_4_22.html http://stat.fsu.edu/~geo/diehard.html

And read chapter 7 of Numerical Recipes.

Motivation:
Random numbers are used:
1- To solve problems that are too complex (stochastic algorithms for NP complete problems)

2- For optimization (simulated annealing, genetic algorithms).

3- System Simulation (physics and biology, econometrics...)

4- Non-parametric tests, bootstrap, permutation tests.

5- Quadrature methods using Monte Carlo.

Solving a deterministic problem by rewriting it as equivalent to a problem with a stochastic component.

Little Example as a reminder of what Monte-Carlo is:

     integral  1
h =    f (t)dt
     0
If this exists it can be seen as E(f(X)) where X has a uniform distribution on [0, 1]. So we approximate it with
^h =  1- sum N f(Xi)
     N  i=1
for a sample of size N of uniform random variables.

Most methods, even if they need non-uniform random variates, as in simulation, still use uniform random variables as their starting points.

So we will spend some time explaining the basics of random numbers.

The machine being deterministic, there is denitely a problem, we will never get purely random numbers, they will only look so.

(We already saw that there are some pretty bad ones, when we did the multidimensional visualization of the IBM randu generated numbers with xgobi).

Also beware of srand, the basic unix one is bad.

5.2.1 Congruential Methods

xi  =_  (axi-1 + c) (mod m)
Lehmer (1951) They will be integers in [0,m], so we have to retransform them back by dividing by m

The way most of the algorithms work is they use an initial seed and a simple recursion to feed out the next in sequence, which will also serve as a seed .. etc.

As soon as one hits a number that has already been used we will be having a sequence that was already seen. -Not so good-

Period of the sequence= minimal length of a subsequence that by repetition will form the whole sequence.

Period of the generator will be the largest period of any seed for a given generator.

We need to choose the parameters so as to have a maximal period.

Proposition:
The congruential generator has a period of m i:

  1. c is relatively prime to m.
  2. a mod 1(mod p) for any prime factor p of m
  3. a mod 1(mod 4) if 4 is a factor of m.

Actual choices:
They have to take into account both theoretical and practical aspects:

  1. Repeatability
  2. Portability
  3. Feasability
  4. Non-predictability

5.2.2 Other `Better?' Methods

Substract with Borrow

Instead of a single seed, the random number generator has a state: (35 words of memory), 32 fp numbers z between 0 and 1. The others make up an integer index : i, which varies between 1 and 32, and integer j and a 'borrow' ag.

j is used to initialize.

zi = zi+20- zi+5 - b
The indices are computed modulo 32.

zi = fp(zi) xor j b is the borrow, small or zero. Complexity Justified Methods
Random number generation is akin to cryptography, knowing that factorising a number into its prime factors is a very HARD problem (computationally) allows one to think up various `multiplicative schemes', for instance:

  1.            2
Yn = (Yn-1)  (mod pq), f or p,q primes
  2.                            32
Xn  = Xn -24 *Xn -55 (mod2   )

and back to tables but on CDs now (new Numerical Recipes).

5.2.3 Random Number Generation- other than Uniform

Luc Devroye, Non-Uniform Random Number Generation, Springer Verlag.
A very useful home page is Luc's with links to places where you can download random number generator packgages etc... http://jeff.cs.mcgill.ca/~luc/ Web pages with good links to papers on random number generation, and encryption packages:
http://www.cs.berkeley.edu/~daw/netscape-randomness.html

5.2.4 Inversion Method

Based on the inverse CDF property :
X = F-1(U) ~ F where U is uniform:

P (F -1(U) < c) = P (U <  F(c)) = F (c)

This means that when we have a closed form formula for the inverse of the cdf which is not too expensive, we can use a random uniform and do the transformation that makes it have the correct density.

Example 1: Exponential

f (x) = ce-cx,x >  0,F (x) =  1-  e-cx,F -1(U) = - 1-log(1 -  U)
                                                  c
One can simplify this by remarking that U and 1 - U have the same distribution. Generate -1
clog(U) Other distributions that have nice closed form expressions are the Cauchy, Rayleigh, Triangular, Pareto..

Example 2: Correlated Random Variables
F-1(U) and G-1(U) have maximum positive correlation and have the distributions F and G, for negative correlation just take G-1(-U).

Example 3: Maxima
(Fn)-1(U)

If closed forms are not available, but a good numerical solver is, then you can use that.

5.2.5 Rejection Methods

Theorem: Suppose f is a non zero density on [a,b] and zero outside. (a,b may be innite). Suppose we know a function M(x) which majorizes f: M(x) > f(x) on [a,b] and take:

        ---M-(x)---
m(x) =   integral  b
         a M (x)dx
so that m is a probability density. We will choose M to generate from, so it must be easier than f to generate from. For instance, for [a,b] bounded we can take the uniform on [a,b].

Algorithm:

  1. Generate T with density m.
  2. Generate U uniform on [0, 1], independently of T {
   If M (T )×  U <  f(T)accept T, ie Z =  T
   Otherwise  reject T, go back to step 1

This is why it works:

                          P(z < Z  < z + dz)  =   P (z <  T < z + dz |accept )
                                                  P-(z <-T-<--z +-dz-accept)
                                              =           P (accept)

                                              =   P-(accept| z-<--T-<-z-+-dz)P-(z-<-T-<--z +-dz)
                                                                   P (accept)
                                                          f(z)
             First P (accept| z <  T < z + dz)  =   P (U  <  -----)
                                                          M (z)
                                                  -f(z)-
                                              =   M (z)

P (accept|z < T <  z + dz)P (z <  T < z + dz)  =   m(z)dzf--(z)-=   integral -f(z)dz--
                                                     M  (z)        b M (x)dx
                                                   integral  b            a
             and P (accept) = P (U  < -f(T-))  =       f(t)-m(t)dt = -----1----
                                     M  (T)        a  M (t)          integral  bM (t)dt
                                                                      a

We need more than one uniform to generate one random variable with the right distribution.

How many are needed? This will depend on c =  integral abM(t)dt.

The number of necessary trials before accepting will be a geometric with p = 1
c.

Example 1:
Suppose f is a bounded density on a compact support, f(x) < M take c = M(b - a),
Repeat the folowing:

Until UM < f(X)
Return X.

This is usually very inecient, it is much better to take a bounding function that `ts'.

Example 2:
Simulating a normal using an exponential as a bounding function: we use the fact that the absolute value of a normal has density:

f(x) =   V~ 22p-e- x22
M(x) =  V~ -----
  2e/pe-x majorizes f, take c =  V~  -----
   2e/p, we can generate from the density m(x) easily because we know how to generate from an exponential of rate 1.

f (x)          1       2
------=  exp(- -(x - 1) )
M (x)          2

5.2.6 Table Lookup

If the probabilities are rational numbers and the variable discrete, or well approximated by a discrete distribution you can use table lookup, which is very expensive to set-up but can be useful if you a very a very large number of random variables to generate:

Algorithm:

Examples : histogram data.

5.3 Specialized Methods

5.3.1 Polar methods for the Normal

Box - Muller
Fact:
If X and Y are independent normals then their polar coordiantes R =  V~ --2----2-
  X  +  Y, Q = arctan(Y/X) are independent, R2 ~ expon(2), Q ~ U(0, 2p).

Modified to short-cut cosines and sines:
Generate a point uniformly in the disk(1): (V 1,V 2).
Call S = V 12 + V 22,

Return X =  V~ --2loSgS-V 1,Y =  V~  -2loSgS-V 2

5.4 Importance Sampling

See notes at:
http://www.ma.utexas.edu/documentation/nr/bookcpdf/c7-8.pdf (Chapter 7 of Numerical Recipes)
or a nice tutorial here: http://ib.berkeley.edu/labs/slatkin/eriq/classes/guest_lect/mc_lecture_notes.pdf

Chapter 6
The Bootstrap, Permutation Tests, Simulation

6.1 Motivation

We want to compute bias, variability, condence intervals for estimates for which the theory doesn't provide closed form solutions.

Examples :

  1. The parameters in non parametric regressions (smooths, ppreg,...)
  2. Correlation coecients for samples that don't necessarily come from a Normal distribution.
  3. Testing when we don't know the distribution of the test statistic under the null hypothesis.

Suppose we are interested in the estimation of an unknown parameter h (this doesn't mean that we are in a parametric context). The two main questions asked at that stage are :

The second question has to be answered through information on the distribution or at least the variance of the estimator.

Of course there are answers in very simple contexts: for instance when the parameter of interest is the mean m then the estimator X has a known standard deviation : the estimated standard error noted sometimes

    [         2  2]
^s =  (Xi - X)  /n
However no such estimator is available for the sample median for instance.

In maximum likelihood theory the question 1 is answered through using the mle and then the question 2 can be answered with an approximate standard error of

^h =  V~ F-ish1erInf-o

The bootstrap is a more general way to answer question 2 , with the following aspects:

If we had several samples from the unknown (true) distribution F then we could consider the variations of the estimator :

    Random   Sample      1   1    1          1  ^
F         - -->         (X 1,X 2...X n)   =   Xn   h1
    ...                 ...                ...   ...
    Random   Sample
F         - -->         (X21,X22...X2n)   =   Xn2  ^h2
    .                 .                .   .
    ..                 ..                ..   ..
    Random   Sample
F         - -->         (XB1 ,XB2 ...XBn ) =   XnB  ^hB
Such a situation is never the case, so we replace these new samples by a resampling procedure based on the only information we have about F, and that is an empirical F^n :this side is what is called bootstrapping.

6.2 The bootstrap : the univariate context

6.3 In practice

From an original sample

                    iid
Xn  = (X1, X2...Xn)  ~ F
draw a new sample of n observations among the original sample with replacement, each observation having the same probabilty of being drawn (= 1
n). The bootstrap sample is often denoted
  *     *   *    *iid
X n = X 1,X 2...X n~  Fn the empirical distribution

If we are interested in the behaviour of a random variable h= h(X,F), then we can consider the sequence of B new values obtained through computation of B new bootstrap samples.

Practically speaking this will need generatation of an integer between 1 and n, each of these integers having the same probability through a function denoted by RANINT(1,n) hereof.

Here is an example of a line of matlab that does just that:

T= floor((rand(1)*n)+1);

If we use S we won't need to generate the new observations one by one, the following command generates a n-vector with replacement in the vector of indices (1...n).

sample(n,n,replace=T)

An approximation of the distribution of the estimate h= h(Xn,F) is provided by the distribution of

 *b       *b
h  =  h(Xn ,Fn),  b = 1..B
Gn*(t) = P Fn( *    )
 h  < t denotes the bootstrap distribution of h *, often approximated by
 *          *
Gn(t) = #{h   < t}/B
The Bootstrap Algorithm

  1. For b=1 to B do : /* B is the number of bootstrap samples */
    1. For i=1 to n do : /* Generation of Sample Xb* */
      1. OBS=RANINT(1,n)
      2. Add observation number OBS to Sample Xb*
    2. Compute h b* = h(X b*)

An example data set is LS (or laws) : the law school data consisting in 15 bivariate data points, each corresponding to an American law school as analysed in Efron 1982:

x  = (LSAT   ,GP A  )
 i          i      i
LSATi is the class average score on a nationwide exam, GPAi is the class average undergraduate grades.

In order to repeat a large number of times the same instructions needed to construct B data sets one usually has to use a looping facility, actually matlab and S are easier to program in this circumstance because one can use ordinary loops and use owner-created functions, as well as trying to benet form the vectorized commands.

6.4 S examples for simple bootstraps

laws<-scan()  
576 3.39  
635 3.30  
558 2.81  
578 3.03  
666 3.44  
580 3.07  
555 3.00  
661 3.43  
651 3.36  
605 3.13  
653 3.12  
575 2.74  
545 2.76  
572 2.88  
594 2.96  
 
 
"boot.simple"<-  
 
function(data, fun, nboot, ...)  
{  
        n <- nrow(as.matrix(data))  
        res.boot <- vector("list", length = nboot)  
        indb <- matrix(sample(n, size = n * nboot, T), nrow = nboot)  
        for(b in (1:nboot)) {  
                res <- fun(indb[b,  ], ...)  
                plot(res, xlab = "", ylab = "", lab = c(0, 0, 0))  
                res.boot[[b]] <- res  
        }  
        return(res.boot)  
}  
 
"boot"<-  
 
function(ind, nboot, fun, ...)  
{  
# Tibshirani's efficient Bootstrap  
# If fun is function of a matrix or returns multiple values  
# it has to be transformed before  
 
        data <- matrix(sample(ind, size = length(ind) * nboot, replace = T),  
                nrow = nboot)  
        return(apply(data, 1, fun, ...))  
}  
 
#Elementary example of using boot  
 
corr<- function(x, xdata)  
{  
        cor(xdata[x, 1], xdata[x, 2])  
}  
j1000 <- boot(1:15, 1000, theta, laws)  
hist(j1000 - 0.777)  
q.bt <- quantile(junk1000, probs = c(0.01, 0.99))  
#With greek label  
hist(junk1000 - 0.777, nclass = 40, xlab = "q* - q", font = 13)  
 
#Example of Bootstrapping a Mann-Whitney statistic  
"mw"<-  
function(x, y)  
{  
#Function that computes a Mann-Whitney statistic for two samples  
        n <- length(x)  
        m <- length(y)  
        rks <- rank(c(x, y))  
        wx <- sum(rks[1:n])  
        ux <- n * m + (m * (m + 1))/2 - wx  
        mwx <- 1 - ux/(m * n)  
        return(mwx)  
}  
 
"mwbt"<-  
function(x, y, nboot = 100, fun = mw, ...)  
{  
        bts <- rep(0, nboot)  
        for(b in (1:nboot)) {  
                xbts <- sample(x, length(x), replace = T)  
                ybts <- sample(y, length(y), replace = T)  
                bts[b] <- fun(xbts, ybts, ...)  
        }  
        return(bts)  
}  
 
"mwxy"<-  
function(xy, n)  
{  
#       n <- length(x)  
        m <- length(xy) - n  
        rks <- rank(xy)  
        wx <- sum(rks[1:n])  
        ux <- n * m + (m * (m + 1))/2 - wx  
        mwx <- 1 - ux/(m * n)  
        return(mwx)  
}

6.5 Parametric bootstrap

If one has a parametric distribution Fc which is well dened except for the pamameter c, instead of drawing with replacement from the original sample, one can draw from the distribution Fc, where c is the estimation obtained from the original sample.

#Parametric Bootstrap  
"boot.par"<-  
function(model.sim = law.sim, n = 15, nboot = 66)  
{  
#Parametric bootstrap of correlation coefficient  
#Law sim contains bivariate normal data  
 
        theta <- rep(0, nboot)  
        for(b in (1:nboot))  
                theta[b] <- corr(((b - 1) * n + 1):(b * n), law.sim)  
        return(theta)  
}  
 
 
#Example of using the functions and plotting the results  
 
n1 <- rnorm(1000)  
n2 <- rnorm(1000)  
law.sim <- cbind(n1, 0.777 * n1 + sqrt(1 - (0.777^2)) * n2)  
rbootpar <- boot.par()  
hist(rbootpar)  
par(new=T)  
plot(density(rbootpar, width = 0.15), xlab = "", ylab = "",  
 lab = c(0, 0, 0),type = "l")  
title("Parametric Bootstrap - Law School corr.", cex = 1.9)

6.6 Smoothed Bootstrap

An intermediate solution between parametric and nonparametric bootsrapping is the smoothed bootstrap.

Instead of resampling directly from the empirical distribution Fn, one smoothes it out rst then the smoothed version is used to generate the new samples.

If a kernel smoother is used, for instance a Gaussian kernel, then generation from the smoothed distribution Fn is easy since it is sucient to resample with replacement and then perturb the sampled point by adding a small gaussian random variable (Silverman B.W. & Young G.A.)

#Smoothed Bootstrap  
 
"boots"=function(x,nboot=100,fun,kernel="norm",h=1)  
{  
#Function doing a smoothed bootstrap with either  
# a normal or an epaneichikov kernel  
#h is the window width (can be tricky to choose)  
 
n=nrow(as.matrix(x))  
theta=rep(0,nboot)  
for (k in (1:nboot))  
{  
if(kernel=="norm") eps=rnorm(n,mean=0,sd=1)  
else eps=epan(n)  
xb=x[sample(n,n,T)]+eps*h  
theta[k]=fun(xb)  
}  
return(theta)  
}  
 
 
"epan"=function(n)  
{  
eps=rep(0,n)  
v=matrix(runif(3*n,-1,1),nrow=n)  
for(i in(1:n))  
{  
if ((abs(v[i,3])>=abs(v[i,2])) &&  
(abs(v[i,3])>=abs(v[i,1])))  
                          eps[i]=v[i,2]  
                     else eps[i]=v[i,3]  
}  
return(eps)  
}

6.7 Dierent Uses for bootstrapping

6.7.1 Quality of Estimates

Bias :

BIAS   = EF (h(Fn)) - h(F )
If we take the statistic R(X,F) = h(Fn) - h(F), (where Fn denotes the empirical distribution function) and apply the bootstrap algorithm to its expectation then we obtain the estimate :
                     B
BIAS  =  E R*  = -1  sum  (h*b- h) = h*• - h
          *      B
                    b=1
where the dot indicates averaging :h *• =  sum b=1Bh *b/B

Variance :

                               1
       {        sum B           } 2
SD   =   --1---    [h*b-  h*•]2
         B -  1 b=1

Confidence Intervals : The percentile method As before we denote by Gn*(t) = P Fn(h* < t) the bootstrap distribution of h *, approximated by

 *          *
Gn(t) = #{h   < t}/B

The percentile method consists in taking the 1 - 2a condence interval for h as being

[  *-1      *- 1       ]
 G n  (a),G n  (1 - a)
Theoretically speaking this is equivalent to replacement of the unknown distribution G(x,F) = PF (h< x) by the estimate G(x,Fn).
The Percentile Algorithm

:

  1. Input the level = 2a of the confidence interval
  2. For b=1 to B do /* B is the number of bootstrap samples*/
  3. For i=1 to n do /* Generation of Sample Xb */
    1. OBS=runif(1,n)
    2. Add observation number OBS to the Sample Xb
  4. Compute h b* = h(X b)
  5. Combine the h b* = h(X b) into a new data set THETA
  6. Compute the a and (1 - a)’th quantiles for THETA.

6.8 Enhancements

As can be predicted, the bigger B, the number of bootstrap resamples, the better the approximations are, thus much recent work has been devoted to making the most of the computations :

Preliminaries : Pivotal quantities
Construction of condence intervals is based (ideally) on a pivotal quantity Rn, that is a function of the sample and the parameter h whose distribution is independant of the parameter h, the sample, or any other unknown parameter. Thus for instance no knowledge is needed about the parameters m and s2 of a normal variable X ~ N(m,s2) to construct condence intervals for both of these parameters, because one has two pivotal quantities available :

X-- V~ --m-~ t
S/  n     n-1
(n - 1)S2
-----2----~ x2n-1
   s

Let cn(a) denote the largest (1 - a)th quantile of the distribution of the pivot Rn, and Q the parameter space, set of all possible values for h then

{t  (-  Q : Rn(t) < cn(a)}
is a condence set whose level is 1 - a. Thus if t(1-a/2), xa/22 and x 1-a/22 denote the relevant quantiles for the distributions cited above the corresponding 1 - a condence intervals will be
[              V~  -                 V~ -  ]
 x - t(1-a/2)s/   (n),x + t(1-a/2)s/  (n)
and
[        2  2             2  2  ]
 (n -  1)s/x 1-a/2,(n - 1)s /xa/2

Such an ideal situation is of course rare and one has to do with approximate pivots, (sometimes called roots). Therefore errors will be made on the level (1 -a) of the condence sets. Level error is sensitive to how far Rn is from being a pivot.

We will present several approaches for nding roots One approach is to studentize the root that is divide it by an estimate of its standard deviation. Another is to use a cdf transform constructed from a rst set of bootstrap simulations as explained below.

6.9 Bootstrap-t : Studentizing

Suppose that we are interested in a parameter h, and that h is an estimate based on the n-sample Xn.

We will suppose that the asymptotic variance of h is s2/n with a corresponding estimate s2/n.

The studentized bootstrap or bootstrap-t method (Efron 1982, Hall 1988) is based on approximating the distribution function HF (x) of n1
2 ( h- h)/s by HFn(x), distribution of n1
2 (h*-h )/s*.

Denote by cn,a* the a th quantile of the distribution of n12 (h*-h ),

c*   = H -1(a)
 n,a     Fn
Thus the equitailed bootstrap-t condence interval of coverage 2a is :
[                              ]
  h- n -12cn,(1-a)s,h - n -12cn,as
One way of nding an estimate s* is to use a second level bootstrap, this is rather expensive compuation-wise : one could prefer as an intermediate solution a jackknife estimate.
Bootstrap-t Algorithm
  1. Compute the original estimate of h : THETA(0)
  2. For b=1 to B do :
    1. Generate a sample Xb*
    2. Compute the estimate of h based on Xb*: THETA(b)
    3. Use Xb* as original data and compute an estimate of the standard deviation of h : SIGMA(b).
    4. Compute R(Xb*)=(THETA(b)-THETA(0))/ SIGMA(b)
  3. Compute CNA and CNCA the a and (1 - a)’th quantiles for the sample formed by all the R’s.
  4. LO=THETA(0)-SQRT(N)*CNCA*SIGMA(0)
  5. HI=THETA(0)-SQRT(N)*CNA*SIGMA(0)

6.10 R functions

bootsd =function(x, theta, nbootsd = 25)  
{  
        n <- length(x)  
        boo <- boot(x, fun = theta, nboot = nbootsd)  
        return(sqrt(((n - 1)/n) * var(boo)))  
}  
 
bootst= function(x, fun = median, nboot = 100, nbootsd = 25)  
{  
# Studentized Bootstrap  
        boo <- rep(0, nboot)  
        for(b in (1:nboot)) {  
                boo[b] <- bootsd(x, theta = fun, nbootsd = nbootsd)  
        }  
        thetab <- boot(x, fun = fun, nboo = nboot)  
        theta0 <- fun(x)  
        thetasd <- ((thetab - theta0)/boo)  
        return(thetasd)  
}  
 

6.11 Preliminary Transformations of the Estimate

One of the drawbacks of the above-dened method is that the variance estimate can tend to be quite erratic. In the particular case of bootstrapping a correlation coecient r one sometimes ends up with condence intervals that contain [-1,1], which is rather disappointing.

However experiences have shown that preliminary transformation of r such as Fishers z transform (=arctanh r), followed by the Bootstrap-t and then retransformation of the condence interval back to the original scale leads to much more dependable condence intervals. Unfortunately such a handy transform is not always available but R. Tibshirani, 1988 has suggested an algorithm for creating automatically dened variance stabilizing transformations.

6.12 Pre-pivoting or Double Bootstrap

Based on the inverse CDF property :

i.e. that if X is distributed according to F then F-1(X) is uniform.

Denote by Hn = Hn(.,F) the c.d.f. of our root Rn(h) which is supposed to converge weakly to the c.d.f. H = H(.,F) as n increases.

As we have seen the idea behind the bootstrap is to replace the unknown distribution function F by its empirical counterpart Fn, thus using Hn = Hn(.,Fn) as an estimate for the unknown cumulative distribution. Once validation of the procedure has been insured through verifying that

H   L==>  H
  n
Then
Hn(Rn(h))  ==>L U (0, 1)
 ---- ----
   Rn,1
In fact it has been shown that the distribution of Rn,1 is less dependent on F than is that of Rn(h), thus providing a better pivot.
The prepivoting/double bootstrap algorithm

:

  1. For b=1 to B do
    1. Generate sample Yb* a bootstrap sample of size n from F n
    2. Compute R(b)=Rn(Yb*,h )
    3. Initialize Z(b)=0
    4. For k=1..K do
      1. Generate sample Xk* a bootstrap sample of size n from F n,b*
      2. Compute Rn(Xk*,h*b)
      3. If Rn(Xk*,h*b) < Rn(Yb*,h ) add one to Z(b)
    5. Divide Z(b) by K.
  2. Compute CN1 :the (1 - a)’th quantile of the Z’s
  3. Compute the CN1’th quantile of the R’s.

6.13 Balanced Bootstrap : saving computations

Davison, Hinkley and Schechtmann, 1986 introduced the idea that one could reduce the amount of simulations (=B) necessary to attain a given precision by using each of the sample observations exactly equally often.

The algorithm proposed by the authors was however rather expensive timewise ( oc nB log nB). We have a more economical one implemented proposed by Gleason .

The Balanced Bootstrap Algorithm

:

  1. Form the list L = {I, I,I..I}
  2. For i = n * B to 2 do
    1. j = runif(1,i)
    2. Exchange Lj and Li

6.14 Antithetic Sampling : more on saving computations

The idea comes from current monte-carlo techniques, and is detailed in unpublished work by Hall. Like other enhancements presented here, its objective is to reduce the estimate's variance.

Let h 1 and h 2 be two dierent estimates of an unknown parameter h, if h 1 and h 2 can be chosen so as to be negatively correlated, then h = 12(h 1 + h 2) has half the variance of h 1 and h2.

The bootstrap version of this is based on an antithetic permutation of the sample. B ordinary resampling operations then supplies the equivalent of 2B resampling operations.

6.15 Bootstrap and Randomization Tests

6.15.1 An example : Infants walking

(from Daniel W.W., p.216) :
Researchers wished to know if they could conclude that two populations of infants dier with respect to the mean age at which they walked alone. The following data (ages in months) were collected:
Sample from population A:
9.5, 10.5, 9.0, 9.75, 10.0, 13.0, 10.0, 13.5, 10.0, 9.5, 10.0, 9.75
Sample from population B:
12.5, 9.5, 13.5, 13.75, 12.5, 9.5, 12.0, 13.5, 12.0, 12.0
What should we conclude ? (a = 0.05)
The classical approach : a t-test

The software output from script in the appendix shows :
The mean for population A is : 10.375
The mean for population B is : 12.208
The dierence is : 1.833
As the variances can be considered as equal, a classical test under the normal assumption gives a t-value of -3.1561 indicating a signicant dierence because the P-value is 0.0046.

Assumptions :

  1. Random sampling of observations from the population
  2. Equal population standard deviations for males and females
  3. Normal distributions for test scores within groups

A permutation test approach Based on the fact that if there is no dierence between the two populations then the result will be compatible to allocation at random of each observation to one of two groups (shuing). Any statistical measure of the dierence could be used here : mean dierence, studentised (t) dierence, or even a median dierence.

Randomization test algorithm

  1. Compute the mean difference between samples A and B: MAB(0).
  2. For p=1 to P do
    1. Randomly allocate 12 observations to sample A and 10 to sample B
    2. Compute the mean difference MAB(p)
  3. Is MAB(0) a typical value from the randomization distribution MAB(p),p=1..P?
    (Compute #{MAB(p)<MAB(0)}/P and #{MAB(p)<MAB(0)}/P

A bootstrap approach Following the previous denitions of the bootstrap a possible bootstrap test could run as follows:

  1. Compute MAB=mean difference between the two original samples
  2. For b=1 to B=500 do :
    1. Draw a bootstrap sample from population A.
    2. Draw a bootstrap sample from population B.
    3. Compute a bootstrap estimate of the mean difference :MD(b).
  3. Construct the bootstrap distribution of MD, eventually smoothed and recentred at 0.
    Situate MAB(0) with respect to this distribution.

6.15.2 More generally

Approximate randomisation tests are frequently used to test independance of two sets of variables (which is the null hypothesis). Estimation of the distribution under this hypothesis is obtained by shuing one variable (or set of variables) with respect to the other variable(s). (see Manly, 1990, for an easy overview).

However it is important to note that there is no assumption on the fact that the observations form a random sample from some population. There is no inference that can be carried automatically over to a hypothetic population.

On the contrary the assumption underlying the bootstrap is that the observations form a random sample whose distribution approximates that of the population. Thus the conclusions drawn from a bootstrap test can be used to infer on the population.

Noreen 1989 suggests using a combination of these two tests when inference is needed for testing whether two variables are stochastically independant in the population. However one must keep in mind that at the same time one tests whether the marginal distributions of the variables in the sample satisfactorially estimate those of the variables in the population.

6.16 Finding out how good a parametric bootstrap procedure is through a simulation experience :

Confidence Intervals for the expectation of an exponential distribution

  1. Choose n
  2. Choose an expectation for X :m
  3. For b=1 to B do :
    1. Create a sample Xb* = X 1b,...,X nb having an exponential distribution
      (P{X  < d} = 1 - e-md)
      (In fact we use U1,U2,...,Un uniformly distributed on [0,1]
      and then compute Xi = 1m log (1 - Ui))
    2. Compute a bootstrap 0.95 confidence interval for m
    3. Count cb = 1 if m is in the interval and cb = 0 otherwise.
  4. P = 1-
S  sum 1Sc i

For B large enough, P provides the probability that m is in the bootstrap condence interval.

In order for the experience to be really useful, one ought to retry with dierent sample sizes and dierent expectations, one could also retry with all sorts of dierent distributions.

Using the Bootstrap to build Condence Intervals

We call the unknown parameter h1 and our estimate ^h . Suppose that we had an ideal (unrealistic) situation in which we knew the distribution of ^h -h1 , we will be interested especially in its quantiles : denote the a-
2 quantile by d and the 1 -a-
2 quantile by d . By denition we have: P(^
h - h1 < d) = a-
2 P(^h - h1 <d ) = 1 -a2

It is by writing this that you can build the correct intervals.

Now I am going to compare an answer to an estimation probalem both through maximum likelihood and through bootstrapping. First let us remind ourselves of notations:

6.17 Maximum Likelihood Estimation

X1,X2,X3,...Xn have joint density denoted

fh(x1,x2, ...,xn) = f((x1,x2,...,xn |h)

Given observed values X1 = x1,X2 = x2,...,Xn = xn, the liklihood of his the function

lik(h) = f(x1,x2,...,xn |h)
considered as a function of h.

If the distribution is discrete, f will be the frequency distribution function.

In words: lik(h)=probability of observing the given data as a function of h. Denition:
The maximum likelihood estimate (mle) of his that value of hthat maximises lik(h): it is the value that makes the observed data the \most probable".

If the Xi are iid, then the likelihood simplies to

          prod n
lik(h) =     f(xi|h)
         i=1

Rather than maximising this product which can be quite tedious, we often use the fact that the logarithm is an increasing function so it will be equivalent to maximise the log likelihood:

        sum n
l(h) =     log(f (xi| h))
       i=1

6.17.1 Maximum Likelihood of Multinomial Cell Probabilities

X1,X2,...,Xm are counts in cells/ boxes 1 up to m, each box has a dierent probability (think of the boxes being bigger or smaller) and we x the number of balls that fall to be n:X1 + X2 + ... + Xm = n. The probability of each box is pi, with also a constraint: p1 + p2 + ... + pm = 1, this is a case in which the Xi's are NOT independent, the joint probability of a vector x1,x2,...,xm is called the multinomial ,and has the form:

Each box taken separately against all the other boxes is a binomial, this is an extension thereof.

We study the log-likelihood of this :

                          sum m           sum m
l(p1,p2,...,pm) = logn! -    i=1logxi! +   i=1xilogpi
However we can't just go ahead and maximise this we have to take the constraint into account so we have to use the Lagrange multipliers again.

We use

                                            sum m
L(p1,p2,...,pm, c) = l(p1,p2,...,pm) + c(1 -   i=1pi)

6.17.2 The Hardy Weinberg Example

This is a trinomial with three boxes: the probabilities are parametrized by:

(1 - h)2     2h(1-  h)    h2
l'(h) = - 2X1-+-X2--+ 2X3--+-X2-
           1 - h          h
          2X1 +  X2    2X3 + X2
l''(h) = - --------2-+  ----2-----
           (1-  h)        h

Each of the counts is binomially distributed with probabilities as described above so that:

E(X1)    = n(1 - h)2
E(X  )  = 2nh(1  - h)
    2           2
E(X3)       = nh

We can either use asymptotic maximum likelihood or the bootstrap, we are going to compare the two using Splus.

After some algebra we nd:

   '' ^       ----2n----
E[l (h )] = -     ^  ^
             (1 - h )h

Of course as usual we don't know h we have to plug in the estimate:

            ^ ^
s   =  (1---h-)h--
 h^       2n
Example
n = 1029, X1 = 342,X2 = 500,X3 = 187 Using the MLE : ^h = 2X3+X2-
  2n = .4247, and the MLE estimate for the standard deviation is: s^h = .011, we will now compare to what a parametric bootstrap estimate gives:

The bootstrap procedure we are going to follow generates multinomials with the estimated probabilites and then computes for these new counts the bootstrap estimate for which we then compute the standard deviation.

Here is the R code I used:

multin =   function(pvec, num)  
{  
#Function that generates the output of  
#putting num balls into length(pvec) boxes,  
#with probabilities each given by the vector pvec  
        run <- runif(num)  
        gener <- cut(run, cumsum(c(0, pvec)))  
        return(gener)  
}  
 
> theta=function(xvec)  
{  
        return((2 * xvec[3] + xvec[2])/(2 * sum(xvec)))  
}  
 
bts.ml.mw=function(thetahat = 0.4247, nboot = 1000)  
{  
        res.bt <- rep(0, nboot)  
        pv <- c((1 - thetahat)^2, 2 * thetahat * (1 - thetahat), thetahat^2)  
        for(b in (1:nboot)) {  
                xvec <- multin(pv, 1029)  
                res.bt[b] <- theta(xvec)  
        }  
        return(res.bt)  
}  
 
 bt1 <- bts.ml.mw()  
quantile(bt1, c(0.025, 0.975))  
     2.5%     97.5%  
 0.403292 0.4446186  
 
quantile(bt1 - 0.4247, c(0.025, 0.975))  
        2.5%      97.5%  
 -0.02140797 0.01991856  
 
> sqrt( var(bt1))  
[1] 0.01074981

Chapter 7
Density Estimation

See the very detailed notes in the book, chapter 10.

Chapter 8
References

CRAN
Numerical Recipes http://www.library.cornell.edu/nr/cbookcpdf.html
Anuj Srivastava's Notes: http://calais.stat.fsu.edu/anuj/PDF-les/STA5107-s-05/BOOK.pdf
Julian Faraways Anova and Regression with R: http://www.stat.lsa.umich.edu/~faraway/book/
http://www.quantlet.com/mdstat/scripts/csa/html/node45.html
http://www.neurosci.aist.go.jp/ akaho/MixtureEM.html

http://www.riskglossary.com/articles/monte_carlo_method.htm

MCMC notes:
The Gibbs Sampler Motif Page
Gibbs sampler for a Normal
Peter Green's Reversible Jump

Bootstrap Course:
Susan Holmes Stat 208 course

Density Estimation:
http://www.stat.berkeley.edu/~stark/Teach/S240-s03/Notes/ch8.pdf

Smoothing:
John Fox introduction to nonparametric regression

MAchine Learning Data Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html