Computational Statistics:
Stat 227


May 25, 2005

Contents

1 Logistics
 1.1 Prerequisites
 1.2 Text
 1.3 Assignments
 1.4 Projects
 1.5 Office Hours
 1.6 Course Web Site
 1.7 Startup computer labs Sequoia 201
 1.8 Outline
 1.9 Statistical Methods
 1.10 Algorithms
 1.11 Software
2 Homeworks
3 Numerical Analysis for Statisticians
 3.1 If you missed some lectures:
 3.2 Real numbers: Floating Point Arithmetic
 3.3 Matrices, and their decompositions
 3.4 The linear equation solving problem
  3.4.1 Square systems
  3.4.2 Rectangular system: QR decomposition and least squares
  3.4.3 Examples
  3.4.4 Givens Rotations
  3.4.5 R examples handout
  3.4.6 Backsubstitution
 3.5 Householder Functions
  3.5.1 Householder Multiplication-Row
 3.6 QR decomposition in R
 3.7 Singular value decompostion and pc regression
  3.7.1 Ridge Regression
  3.7.2 Principal Components Regression
  3.7.3 Principal Components
 3.8 More than I said on the Singular Value Decomposition
  3.8.1 Generalized Inverses
 3.9 Principal Components
  3.9.1 Eigenvalue Analysis
4 EM algorithm
  4.0.2 Implementation
  4.0.3 Estimating Mixture Proportions
  4.0.4 EM for exponential families
5 Monte Carlo Simulations
 5.1 Monte Carlo methods
  5.1.1 Antithetic Resampling
  5.1.2 Importance Sampling
 5.2 Random Number Generation- Uniform[0, 1]
  5.2.1 Congruential Methods
  5.2.2 Other `Better?' Methods
  5.2.3 Random Number Generation- other than Uniform
  5.2.4 Inversion Method
  5.2.5 Rejection Methods
  5.2.6 Table Lookup
 5.3 Specialized Methods
  5.3.1 Polar methods for the Normal
 5.4 Importance Sampling
6 The Bootstrap, Permutation Tests, Simulation
 6.1 Motivation
 6.2 The bootstrap : the univariate context
 6.3 In practice
 6.4 S examples for simple bootstraps
 6.5 Parametric bootstrap
 6.6 Smoothed Bootstrap
 6.7 Dierent Uses for bootstrapping
  6.7.1 Quality of Estimates
 6.8 Enhancements
 6.9 Bootstrap-t : Studentizing
 6.10 R functions
 6.11 Preliminary Transformations of the Estimate
 6.12 Pre-pivoting or Double Bootstrap
 6.13 Balanced Bootstrap : saving computations
 6.14 Antithetic Sampling : more on saving computations
 6.15 Bootstrap and Randomization Tests
  6.15.1 An example : Infants walking
  6.15.2 More generally
 6.16 Finding out how good a parametric bootstrap procedure is through a simulation experience :
 6.17 Maximum Likelihood Estimation
  6.17.1 Maximum Likelihood of Multinomial Cell Probabilities
  6.17.2 The Hardy Weinberg Example
7 Density Estimation
8 References

Chapter 1
Logistics

Time and Place: MW 11.-12.15, Hewlett 101, please do not arrive late and distract your colleagues

1.1 Prerequisites

Computer Science 106(Introd. to Computer Programming) or equivalent
Stat 116 (Probability) or equivalent
Math 113 (Matrix Algebra) or equivalent

1.2 Text

Computational Statistics, Geof Givens and Jennifer Hoeting. Wiley 2005

1.3 Assignments

There will be four homeworks/labs to hand in 40 % and one big project ( final) 60 %

1.4 Projects

Projects

  1. Pick a paper from the projects below or on the website on the course webpage at
    http://www-stat.stanford.edu/~susan/courses/s227/.
  2. Write a scientific introduction to the paper you have chosen: context and state of the art at the time of the paper (eventually if the paper is old, what has changed in the area since the paper was written).
  3. Write an algorithm explaining what the method in the paper does.
  4. Simulate data from a known distribution appropriate for the analysis in the paper.
  5. Replicate the method in the paper by writing an R program that implements one of the methods implemented in the paper and show it does what the authors say it does on your simulated data.
  6. Analyze some real data with your R program.
  7. Provide a complete bibliography.
    Bonus: Make a R package with the simulated data, example data and documented programs.

    Hint: the more original the work the higher the grade.

Papers for projects

Gibbs Sampler
Estimating the number of essential genes in a genome by random transposon mutagenesis Natalie J. Blades and Karl W. Broman, technical report, Johns Hopkins.
Coupling from the Past
Exact Sampling with Coupled Markov Chains and Applications to Statistical Mechanics, (1996) Jim Propp and David Wilson
http://www.math.wisc.edu/~propp/sample-short.ps.gz
Sequential Inportance Sampling for Tables
Chen, Y., Diaconis, P., Holmes, S. and Liu, J.S. (2005). Sequential Monte Carlo Methods for Statistical Analysis of Tables. Journal of the American Statistical Association, 100, 109-120.
http://www-stat.stanford.edu/~susan/papers/chenJASA05.pdf
Chen, Y., Dinwoodie, I. H., and Sullivant, S. (2004).
http://www.stat.duke.edu/~yuguo/paper/sis-4.pdf
Mixtures
Bayesian analysis of Poisson mixtures (Peter Green and Valerie Viallefont and Sylvia Richardson)
http://www.stats.bris.ac.uk/~peter/papers/poisjns2.ps
Bayesian Modelling and Inference on Mixtures of Distributions Jean-Michel Marin, Kerrie Mengersen and Christian P. Robert Handbook of Statistics 25, Elsevier
http://eprints.qut.edu.au/archive/00000901/02/Bayesian_Modelling.pdf
Reversible jump Monte Carlo
Reversible jump Markov chain Monte Carlo computation and Bayesian model determination, Biometrika, 82, 711-732 (1995).
http://www.stats.bris.ac.uk/~peter/papers/RJMCMCBka.pdf
Exhaustive Methods
http://www-stat.stanford.edu/~susan/courses/s208/node12.html
P. Diaconis and S. Holmes. Gray codes for randomization procedures. Statistics and Computing, 4:287-302, 1994.
Particle Filtering
Doucet, A., de Freitas, J.F.G. and Gordon, N.J. (2001). Sequential Monte Carlo Methods in Practice. New York: Springer-Verlag.
Doucet, A., Godsill, S.J. and Andrieu, C. (2000). On sequential Monte Carlo sampling methods for Bayesian filtering. Statist. Comput. 10 197-208.
Kitagawa, G. (1996). Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. J. Comput. Graph. Statist. 5 1-25.
Liu, J.S. and Chen, R. (1998). Sequential Monte Carlo methods for dynamic systems. J. Amer. Statist. Assoc. 93 1032-1044.
Liu, J.S. (2001). Monte Carlo Strategies in Scientific Computing. New York: Springer-Verlag.
Pitt, M.K. and Shephard, N. (1999). Filtering via simulation: auxiliary particle filter. J. Amer. Statist. Assoc. 94 590-599.
Data Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html

1.5 Office Hours

Professor Susan Holmes: Wednesday at 1.45, in Sequoia 102.
By email appointment to susan at stat.stanford.edu.
TA: Patrick Perry (patperry at stanford.edu), room 108.
Office hours: Monday 2.30-4.00pm.

1.6 Course Web Site

This will contain a bulletin board, homeworks, course summary, project description list, reading list, links to useful sites with in particular tutorials, software information, etc...

Address:http://www-stat.edu/~susan/courses/s227/

Weekly consultation of the web site will be necessary and expected of all students. Homework and lab. assignments and solutions will be on the course's coursework site.

1.7 Startup computer labs Sequoia 201

There will be 3 sets of labs with exercices on accessing the web and usin R.

Times: TBA

1.8 Outline

The course will contain 3 parts, here is a list containing both subjects covered in class and potential project matter:

1.9 Statistical Methods

Maximum Likelihood, EM algorithm, Missing Data,
Bootstrap, Cross-Validation
Step by step multivariate regression, robust regression,
Non-parametric Regression, Alternate Conditional Expectation, Projection Pursuit,
(Neural Nets)
Principal Components, Correspondence Analysis,
Classification and Regression Trees, Clustering,
Nonparametric Confidence Regions, (convex hulls),
Approximate Counting, Exhaustive Enumeration, Non parametric Tests.

1.10 Algorithms

Matrix manipulations, decompositions, (cholesky, QR,...),
Singular Value Decompositions, Eigenanalysis,
Iterative methods, Gauss-Seidel, Conjugate Gradient,
Downhill Simplex minimization,
Smoothing,
Random Number Generation ,
(Uniform, Normal, Beta, Inverse CDF, Acceptance Rejection, Metropolis),
Markov Chain approach to Combinatorial Optimization (Simulated Annealing), Updating,
Markov Chain Monte Carlo, Gray Codes
Sorting.

1.11 Software

Most methods exist in high level programming environments such as R.

We will learn about several additional R functions/packages such as utils (Sweave), mva,ade4, boot, MCMCpack, Rggobi, rpart, rrcov, modreg,....

Symbolic software will be introduced, if not used much. (maple)

Chapter 2
Homeworks

Hw 1 to hand in before Tuesday, April 19th at 4.30pm Rlab1.pdf
See coursework for solutions.

Hw 2 to hand in before Thursday, April 28th at 4.15pm hw2.pdf

Hw 3 to hand in before Monday, May 9th at 4.15pm Chapter 6, problems 6.4, 6.5 and 6.6.

Hint: If you are having trouble with the problem 6.4, there is a dierent approach using Gibbs sampling that is given in detail here:
http://psblade.ucdavis.edu/BMSBSA/chapter09.ps, this might help you somewhat with a simplification of your calculations.

Hw 4 to hand in before Monday, May 23th at 4.15pm

  1. Pick a paper from the projects below or on the website on the cours e webpage at
    http://www-stat.stanford.edu/~susan/courses/s227/.
  2. Write a summary of the paper you have chosen in 2 paragraphs.
  3. Explain what you plan to do in your final project.
  4. Chapter 9, do problem 9.6.
  5. Chapter 10, number 10.1 except for (e).
    Hint: Look at the packages quantreg, locfit and lokern for adaptive bandwidth methods.
    Very nice set of notes that will help you (in particular page 11):
    http://www.stat.berkeley.edu/~stark/Teach/S240-s03/Notes/ch8.pdf

Chapter 3
Numerical Analysis for Statisticians

3.1 If you missed some lectures:

Don’t make a habit of it :75 minutes gained will probably cost you the triple to catch up!

I won't be explicitly writing all my lectures as notes on the web, but for the benefit of those who missed a lecture, here is a little of what I said, so you get an idea about at what level the course will be:

3.2 Real numbers: Floating Point Arithmetic

                        e
x = ± [.d1d2d3 ...dt]× b
x = (- 1)s[.d d d d  ...d   ]× bE -e
            0 1 2 3     t- 1

These are normalized so that the first digit of the mantissa is non zero, in the binary mode, (b = 2), this means we don't have to code the first one, sometimes it is left out (phantom of the digit).

If x < bEmin-1 (smallest number representable in machine)
we have underflow,
if x > bEmax(1 - b-t) overflow.

Otherwise x is in the representable range, maybe we won't be able to represent it exactly. Macine Accuracy:
Smallest fp number added to 1 that gives an answer dierent from 1. (eps in matlab)

Now if we can't represent x without representation error, we'll call x*

Taylor series show how much we are going to be o.

When error growth is linear, the sensitivity is small: f is said to be well-conditionned. k relative to the data x and the problem S: is the constant that bounds the relative error.

||dS ||     ||dx||
------< k -----
||S||     ||x||

Relative to the norm, the problem and that particular data x.

Worse still, we usually don't have f, but an approximation of it. this induces a second type of error: generated error, also called truncation error because we stop computing before we have the exact right answer(for an integral computed by trapezoids, or any iterative method for computing functions by series, ...) for example.

f (x)-  f * (x*) = [f(x) - f(x*)] + [f(x*) - f *(x*)]
Stability:
The precision with which f* approximates f is called the stability of f. If for an input x there is another ~xclose to x such that f * (x) = f(~x), then the f* algorithm is said to be stable.

An unstable algorithm is one where the roundo error gets mixed into the calculation at an early stage and is successively magnified until it swamps the true answer.

Example: Want to compute the powers of the golden mean:

      V~ --
       5-  1 .
f =  ---2--- = .61803398
Also we have the recurrence relation:
 n+1     n-1    n
f    = f    - f
But it has another solution.
       V~ --
    ----5---1
f =     2

Error analysis:

Moment computations, (see Pages 612-613 of Numerical Recipes)

A little example of how mistakes tote up on a total.

Vectors These are the basic building blocks of the new generation of languages, unlike C or fortran whose are fp numbers, the gains are important in writing all computations vectorially.

3.3 Matrices, and their decompositions

Matrices Notations, operations, partitionning, careful with the defaults from one system to another.

The Botany of Matrices Diagonal, Symmetric, Banded, Hessenberg, Circulant, Hankel, Triangular, Sparse, Orthogonal, semi-positive definite,

My notations for: transpose, inner product,norm, matrix norms,

3.4 The linear equation solving problem

Suppose we have a linear system of equtions, typified by the matrix of coefficients A, unknowns, x and the right hand sides, the vector b.

3.4.1 Square systems

Singularity, Degenerate Systems
If the rows are linear combinations of one another, or the columns linear combinations or one another then even if there seems to be the same number of unknowns as equations, the system is in fact equivalent to a singular set of equations.

Close singularity is just as bad because the roundo could take us there, we will have to measure this.

Single precision is usually enough for 30-50 equations.

Double precision makes it possible to solve for a few hundred.

This does NOT hold for pathological systems we will look into these and how to recognize them when we talk about conditionning.

A full square system
:

Suppose the matrix of coefficients is a square matrix, (of full rank also, as we saw, non-degenerate) then the equation looks like this:

Ax  = b

Formally you would say : invert it, in fact a computational mistake, except if you have multiple right hand sides.

Doing what one would first think of doing if you had to solve a system by hand.

The easiest system: triangular.

Bringing to triangular shape.

Pivotting. This is NOT to solve ill-conditionning, but unlucky diagonal elements that are very small make the algorithm unstable.

LU decomposition, with a permutation matrix (prepivotting) is what is best to do.

A = LU,  L lower  triang.U  upper triang.

Flops: Of an order n3
2- (the same as the cost of the multiplication of two matrices, not so bad).

Condition number for the inversion problem

Try a small perturbation of b to b* = b + db, call the solution of the perturbed system: x* = x + dx,

A(x  + dx)  =   b + db
    x + dx  =   A- 1(b + db)
                 - 1
        dx  =   A   db
     ||dx|| <   ||A -1||||db||
       ||b|| <   ||A ||||x ||

     ||dx-||  <   ||A ||||A -1||||db||
      ||x||                 ||b||
The condition number for the data A and the linear solving problem is thus:
k =  ||A ||||A -1||
(depends on the norm, not on the algorithm). For the two norm we have a better explanation of how to compute this, we need the following to do it:
Singular Value Decomposition
This is the most important decomposition in statistics.

We will study it several times, here is my first introduction: a little made-up example with matlab to start with:

>> u=[3 1 -1 2]'  
u =  
     3  
     1  
    -1  
     2  
>> v=(1:4)'  
v =  
     1  
     2  
     3  
     4  
>> X=u*v'  
X =  
     3     6     9    12  
     1     2     3     4  
    -1    -2    -3    -4  
     2     4     6     8  
>>svd(X)  
ans =  
   21.2132  
    0.0000  
         0  
         0  
 
>> E=10^(-3)*randn(4);  
>> XE=X+E  
XE =  
 
    3.0012    5.9993    9.0003   12.0012  
    1.0006    2.0017    3.0009    3.9994  
   -0.9999   -1.9999   -3.0014   -3.9994  
    2.0004    4.0018    5.9993    7.9996  
 
>> svd(XE)  
ans =  
   21.2143  
    0.0028  
    0.0019  
    0.0005  
>> cond(X)  
Condition is infinite  
ans =  
   Inf  
>> cond(XE)  
ans =  
 
   4.2567e+04  
%An example you can't see with your bare eyes:  
>> X2=u'*v  
X2 =  
    0.2468    0.2499    0.0166    0.2487  
    0.3225    0.3266    0.0218    0.3250  
    0.1495    0.1514    0.0101    0.1506  
    0.4367    0.4423    0.0295    0.4401  
 
>> svd(X2)  
ans =  
 
    1.0730  
    0.0000  
         0  
         0  
>> [U,S,V]=svd(X2);  
>> U*S*V'  
ans =  
    0.2468    0.2499    0.0166    0.2487  
    0.3225    0.3266    0.0218    0.3250  
    0.1495    0.1514    0.0101    0.1506  
    0.4367    0.4423    0.0295    0.4401  
>> 10000*(X2-U*S*V')  
ans =  
    1.0e-11 *  
   -0.0278    0.0833   -0.0035    0.0555  
   -0.0555    0.0555         0    0.0555  
         0    0.0555    0.0017    0.0278  
   -0.0555    0.1665         0    0.1110

>> A=rand(4)  
A =  
    0.5045    0.4940    0.0737    0.9138  
    0.5163    0.2661    0.5007    0.5297  
    0.3190    0.0907    0.3841    0.4644  
    0.9866    0.9478    0.2771    0.9410  
>> flops(0)  
>> [L,U,P]=lu(A)  
L =  
     1.0000         0         0         0  
    0.5233    1.0000         0         0  
    0.5114   -0.0406    1.0000         0  
    0.3234    0.9388    0.7363    1.0000  
U =  
     0.9866    0.9478    0.2771    0.9410  
         0   -0.2298    0.3557    0.0373  
         0         0   -0.0535    0.4342  
         0         0         0   -0.1945  
>> flops  
ans =  
    34  
>> P*A  
ans =  
    0.9866    0.9478    0.2771    0.9410  
    0.5163    0.2661    0.5007    0.5297  
    0.5045    0.4940    0.0737    0.9138  
    0.3190    0.0907    0.3841    0.4644  
>> L*U  
ans =  
     0.9866    0.9478    0.2771    0.9410  
    0.5163    0.2661    0.5007    0.5297  
    0.5045    0.4940    0.0737    0.9138  
    0.3190    0.0907    0.3841    0.4644  
>> P*A-L*U  
ans =    1.0e-15 *  
         0         0         0         0  
   -0.1110         0         0         0  
         0         0         0         0  
   -0.0555   -0.0139         0         0  
 

Here is what we need to remember:

X  =  U SV  ',V 'V  =  I,U 'U  =  I,S  diagonal   s
                                                   i
Actually the singular values are the square roots of the eigenvalues of X'X.

The condition number in l2 norm has the simple interpretation as a function of the singular values, ration of the largest to the smallest:

     s1-
k =  sn

Read 2.6 of NR (pages 61-71).

Tridiagonal

Of an order of N steps are sufficient, see NR 51.

Iterative Improvement

This enables, a the price of storing A, to improve the solution by using the residual. Do the following in double precision:

           *
r = b - Ax  ,solve Ly   =  P r f or y
              and  U z  =  y for z
                 x      =  x* + z
                   new
                Axnew   =  A(x* +  z) = (b- r) + Az =  b
Cholesky Decomposition
For a symmetric positive definite matrix A the LU decomposition can be rewritten:
        '       1/2   1/2 '
A =  GG  =  (LD    )(D    L )
with D a diagonal matrix with only positive entries.

3.4.2 Rectangular system: QR decomposition and least squares

When there are more equations than variables, one can't hope for an exact solution, the system is overdetermined.

We try to minimize the error which is usually measured as the norm

||Ax  - b||, for the equation Ax =  b

Important remark: For any orthogonal Q, this norm is unchanged by trnasformation by Q, so

||Ax - b||=  ||Q'Ax  - Q'b||
we try and find a Q that makes the minimization particularly simple, that is so that Q'A has a particularly simple form.

3.4.3 Examples

Least Squares in Two dierent ways:

>> t = (1900:10:1990)';  
>> y = [75.995 91.972 105.711 123.203 131.669 ...  
     150.697 179.323 203.212 226.505 249.633]';  
>> X=[t ones(10,1)]  
X =     1900           1  
        1910           1  
        1920           1  
        1930           1  
        1940           1  
        1950           1  
        1960           1  
        1970           1  
        1980           1  
        1990           1  
>> bhat=X\p  
bhat =    1.93  
      -3594.01  
>>yhat=polyval(bhat,t);  
>>[polyval(bhat,t) y]  
ans =    67.08         76.00  
         86.35         91.97  
        105.62        105.71  
        124.89        123.20  
        144.16        131.67  
        163.43        150.70  
        182.70        179.32  
        201.96        203.21  
        221.23        226.50  
        240.50        249.63  
 
>> norm(yhat-y)  
ans =  
   23.5794

>> [Q,RC]=qr([X y]);  
RC =  
      -6151.30         -3.16       -488.86  
             0         -0.05        167.82  
             0             0         23.58  
             0             0             0  
             0             0             0  
             0             0             0  
             0             0             0  
             0             0             0  
             0             0             0  
             0             0             0  
>>  R1=RC(1:2,1:2)  
R1 =    -6151.30         -3.16  
              0          -0.05  
>> C1=RC(1:2,3)  
C1 = -488.8643  
      167.8181  
>> R1\C1  
ans =     1.93  
      -3594.01  
>> C2=RC(3,3)  
C2 = 23.5794  
>>[Q*[C1;zeros(8,1)] yhat]  
ans      67.08         67.08  
         86.35         86.35  
        105.62        105.62  
        124.89        124.89  
        144.16        144.16  
        163.43        163.43  
        182.70        182.70  
        201.96        201.96  
        221.23        221.23  
        240.50        240.50

3.4.4 Givens Rotations

What are Given's rotations good for?

You can use them to zero out individual isolated elements in any matrix, without changing any of the norms of the vectors, these transformations are orthogonal.

They are based on simple 2 × 2 ones of the form:

[        ]
   c   s

  - s  c

You can choose c and s so that a certain vector is orthogonally transformed (actually, rotated) into a vector y = (y1,y2) whose second coordinate is zero.

This is of course easy by solving the equations

           2   2
          c + s  = 1                                   (3.1)
y2 = - sx1 + cx2 = 0                                   (3.2)

                                                       (3.3)

The examples that follow show two ways of doing this:

%%%%Van Loan's Function, Chapter 7%%%%%%%%  
  function [c,s] = Rotate(x1,x2);  
% Pre:  
%   x1,x2   scalars  
% Post:  
%   c,s     c^2+s^2=1 so -s*x1 + c*x2 = 0.  
%  
   if x2==0  
      c = 1;  
          s = 0;  
   else  
      if abs(x2)>=abs(x1)  
             cotangent = x1/x2;  
                 s = 1/sqrt(1+cotangent^2);  
                 c = s*cotangent;  
          else  
             tangent = x2/x1;  
                 c = 1/sqrt(1+tangent^2);  
                 s = c*tangent;  
          end  
   end  
%%%%%%%%%%%%%%%The resident function:%%%%%%%  
function [G,x] = planerot(x)  
%PLANEROT Generate a Givens plane rotation.  
%       [G,Y] = PLANEROT(X), where X is a 2-component column vector,  
%       returns a 2-by-2 orthogonal matrix G so that Y = G*X has Y(2) = 0.  
%       See also QRINSERT, QRDELETE.  
if x(2) ~= 0  
   r = norm(x);  
   G = [x'; -x(2) x(1)]/r;  
   x = [r; 0];  
else  
   G = eye(2);  
end  
 
 
>> x=[2 3]';  
>> [G,y]=planerot(x)  
 
G =  
          0.55          0.83  
         -0.83          0.55  
y =  
          3.61  
             0

Now in order to extend this to a general context you just notice that if you fill out the rest of the matrix with 1's on the diagonal, it will not change the other coordinates thus:

 |_                   _| 
    c   0  s  0  0
    0   1  0  0  0

   - s  0  c  0  0

 |_   0   0  0  1  0  _| 
    0   0  0  0  1
is a rotation in the (1,3) plane in 5 dimensions. In order to reduce to upper triangular form, one cycles through all the elements that you want to zero out one by one.
 |_                _|     |_                 _|     |_                _| 
   ×  ×   ×   ×         ×   ×  ×   ×         ×   ×   ×  ×

   ×  ×   ×   ×         ×   ×  ×   ×         ×   ×   ×  ×

   ×  ×   ×   ×   -->     ×   ×  ×   ×    -->    ×   ×   ×  ×
 |_  ×  ×   ×   ×  _|     |_  ×   ×  ×   ×   _|     |_  0  ×   ×  ×   _| 

   ×  ×   ×   ×         0   ×  ×   ×          0  ×   ×  ×

3.4.5 R examples handout

3.4.6 Backsubstitution

backsub=function(X,y){  
#Backsubstitution to find solution in a triangular case  
l=dim(X)  
p=l[2]  
b=matrix(0,p,1)  
b[p,1]=y[p,1]/X[pp]  
for (j in seq(p-1,1,-1){  
b[j,1]=(y[j,1]-X[j,(j+1):n]*b[(j+1):n,1])/X[j,j]  
}  
return(b)}

3.5 Householder Functions

Algorithm  (Householder Vector) in R  
 
house= function(x){  
#Householder vector calculation  
p = length(x);  
mu = sqrt(x%*%x)  
v = x  
if (mu!= 0)  
{  
beta = x[1] + sign(x[1])*mu  
 
v[2 : p,1] = v[2 : p,1]/beta  
}  
v[1] = 1  
return(v)  
}

Algorithm  (Householder Vector) in matlab  
 
function v = house(x);  
%Compute a Householder vector  
m = length(x);  
mu = norm(x,2);  
v = x;  
if mu ~= 0  
       beta = x(1) + sign(x(1))*mu;  
      v(2 : m,1) = v(2 : m,1)/beta;  
end  
v(1) = 1

3.5.1 Householder Multiplication-Row

rowhouse=function(X,v){  
beta=-2/(t(v)%*%v)  
w=beta*t(X)%*%v  
X=X+v%*%t(w)  
return(X)}

3.6 QR decomposition in R

qr                   package:base                   R Documentation  
The QR Decomposition of a Matrix  
Description:  
     'qr' computes the QR decomposition of a matrix.  It provides an  
     interface to the techniques used in the LINPACK routine DQRDC or  
     the LAPACK routines DGEQP3 and (for complex matrices) ZGEQP3.  
Usage:  
     qr(x, tol = 1e-07 , LAPACK = FALSE)  
     qr.coef(qr, y)  
     qr.qy(qr, y)  
     qr.qty(qr, y)  
     qr.resid(qr, y)  
     qr.fitted(qr, y, k = qr$rank)  
     qr.solve(a, b, tol = 1e-7)  
     ## S3 method for class 'qr':  
     solve(a, b, ...)  
     is.qr(x)  
     as.qr(x)  
 
Arguments:  
       x: a matrix whose QR decomposition is to be computed.  
     tol: the tolerance for detecting linear dependencies in the  
          columns of 'x'. Only used if 'LAPACK' is false and 'x' is  
          real.  
      qr: a QR decomposition of the type computed by 'qr'.  
    y, b: a vector or matrix of right-hand sides of equations.  
       a: A QR decomposition or ('qr.solve' only) a rectangular matrix.  
       k: effective rank.  
  LAPACK: logical. For real 'x', if true use LAPACK otherwise use  
          LINPACK.  
     ...: further arguments passed to or from other methods  
Details:  
     The QR decomposition plays an important role in many statistical  
     techniques.  In particular it can be used to solve the equation  
     *Ax* = *b* for given matrix *A*, and vector *b*.  It is useful for  
     computing regression coefficients and in applying the  
     Newton-Raphson algorithm.  
 
     The functions 'qr.coef', 'qr.resid', and 'qr.fitted' return the  
     coefficients, residuals and fitted values obtained when fitting  
     'y' to the matrix with QR decomposition 'qr'. 'qr.qy' and 'qr.qty'  
     return 'Q %*% y' and 't(Q) %*% y', where 'Q' is the (complete) *Q*  
     matrix.  
     All the above functions keep 'dimnames' (and 'names') of 'x' and  
     'y' if there are.  
     'solve.qr' is the method for 'solve' for 'qr' objects. 'qr.solve'  
     solves systems of equations via the QR decomposition: if 'a' is a  
     QR decomposition it is the same as 'solve.qr', but if 'a' is a  
     rectangular matrix the QR decomposition is computed first.  Either  
     will handle over- and under-determined systems, providing a  
     minimal-length solution or a least-squares fit if appropriate.  
     'is.qr' returns 'TRUE' if 'x' is a 'list' with components named  
     'qr', 'rank' and 'qraux' and 'FALSE' otherwise.  
 
     It is not possible to coerce objects to mode '"qr"'.  Objects  
     either are QR decompositions or they are not.  
Value:  
     The QR decomposition of the matrix as computed by LINPACK or  
     LAPACK. The components in the returned value correspond directly  
     to the values returned by DQRDC/DGEQP3/ZGEQP3.  
 
      qr: a matrix with the same dimensions as 'x'. The upper triangle  
          contains the *R* of the decomposition and the lower triangle  
          contains information on the *Q* of the decomposition (stored  
          in compact form).  Note that the storage used by DQRDC and  
          DGEQP3 differs.  
   qraux: a vector of length 'ncol(x)' which contains additional  
          information on *Q*.  
 
    rank: the rank of 'x' as computed by the decomposition: always full  
          rank in the LAPACK case.  
 
   pivot: information on the pivoting strategy used during the  
          decomposition.  
     Non-complex QR objects computed by LAPACK have the attribute  
     '"useLAPACK"' with value 'TRUE'.  
Note:  
     To compute the determinant of a matrix (do you _really_ need it?),  
     the QR decomposition is much more efficient than using Eigen  
     values ('eigen').  See 'det'.  
     Using LAPACK (including in the complex case) uses column pivoting  
     and does not attempt to detect rank-deficient matrices.  
References:  
     Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S  
     Language_. Wadsworth & Brooks/Cole.  
     Dongarra, J. J., Bunch, J. R., Moler, C. B. and Stewart, G. W.  
     (1978) _LINPACK Users Guide._  Philadelphia: SIAM Publications.  
     Anderson. E. and ten others (1999) _LAPACK Users' Guide_. Third  
     Edition. SIAM.  
      Available on-line at <URL:  
     http://www.netlib.org/lapack/lug/lapack_lug.html>.  
See Also:  
     'qr.Q',  'qr.R',  'qr.X' for reconstruction of the matrices.  
     'lm.fit',  'lsfit', 'eigen', 'svd'.  
     'det' (using 'qr') to compute the determinant of a matrix.

Examples:  
hilbert <- function(n) { i <- 1:n; 1 / outer(i - 1, i, "+") }  
 h9 <- hilbert(9); h9  
     qr(h9)$rank           #--> only 7  
     qrh9 <- qr(h9, tol = 1e-10)  
     qrh9$rank             #--> 9  
     ##-- Solve linear equation system  H %*% x = y :  
    y <- 1:9/10  
     x <- qr.solve(h9, y, tol = 1e-10) # or equivalently :  
     x <- qr.coef(qrh9, y) #-- is == but much better than  
                           #-- solve(h9) %*% y  
     h9 %*% x              # = y

h9 %*% (solve(h9) %*% y)  
            [,1]  
 [1,] 0.09999996  
 [2,] 0.20000011  
 [3,] 0.30000012  
 [4,] 0.40000010  
 [5,] 0.50000008  
 [6,] 0.60000007  
 [7,] 0.70000005  
 [8,] 0.80000004  
 [9,] 0.90000003  
> h9 %*% qr.coef(qrh9, y)  
      [,1]  
 [1,]  0.1  
 [2,]  0.2  
 [3,]  0.3  
 [4,]  0.4  
 [5,]  0.5  
 [6,]  0.6  
 [7,]  0.7  
 [8,]  0.8  
 [9,]  0.9

 > round(h9,4)  
        [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]   [,9]  
 [1,] 1.0000 0.5000 0.3333 0.2500 0.2000 0.1667 0.1429 0.1250 0.1111  
 [2,] 0.5000 0.3333 0.2500 0.2000 0.1667 0.1429 0.1250 0.1111 0.1000  
 [3,] 0.3333 0.2500 0.2000 0.1667 0.1429 0.1250 0.1111 0.1000 0.0909  
 [4,] 0.2500 0.2000 0.1667 0.1429 0.1250 0.1111 0.1000 0.0909 0.0833  
 [5,] 0.2000 0.1667 0.1429 0.1250 0.1111 0.1000 0.0909 0.0833 0.0769  
 [6,] 0.1667 0.1429 0.1250 0.1111 0.1000 0.0909 0.0833 0.0769 0.0714  
 [7,] 0.1429 0.1250 0.1111 0.1000 0.0909 0.0833 0.0769 0.0714 0.0667  
 [8,] 0.1250 0.1111 0.1000 0.0909 0.0833 0.0769 0.0714 0.0667 0.0625  
 [9,] 0.1111 0.1000 0.0909 0.0833 0.0769 0.0714 0.0667 0.0625 0.0588

3.7 Singular value decompostion and pc regression

3.7.1 Ridge Regression

Ridge regression in R

>library(MASS)  
>help(lm.ridge)

(see Faraways chapter 9 )

Example of a matlab ridge regression function:

function bks=ridge(Z,Y,kvalues)  
% Ridge Function of Z (centered, explanatory)  
% Y is the response,  
% kvalues are the values where to compute  
[n,p]=size(Z);  
ZpY=Z'*Y;  
ZpZ=Z'*Z;  
m=length(kvalues);  
bks=ones(p,m);  
for k =1:m  
 bks(:,k)=(ZpZ+diag(kvalues(k)))\ZpY;  
end  
>> kvalues=(0:.05:.5)  
kvalues =  
  Columns 1 through 7  
         0    0.0500    0.1000    0.1500    0.2000    0.2500    0.3000  
  Columns 8 through 11  
    0.3500    0.4000    0.4500    0.5000  
>> ridge(Z,Y,(0:.05:.5))  
ans =  
  Columns 1 through 7  
    1.5511    1.5176    1.4882    1.4622    1.4390    1.4183    1.3996  
    0.5102    0.4775    0.4488    0.4234    0.4009    0.3806    0.3624  
    0.1019    0.0678    0.0378    0.0113   -0.0122   -0.0334   -0.0524  
   -0.1441   -0.1762   -0.2043   -0.2292   -0.2514   -0.2713   -0.2892  
  Columns 8 through 11  
    1.3827    1.3673    1.3532    1.3403  
    0.3459    0.3309    0.3172    0.3046  
   -0.0696   -0.0853   -0.0997   -0.1128  
   -0.3054   -0.3201   -0.3336   -0.3460  
%Formula gives for choice of k:  
>> norm(Yhat-Yc)^2  
ans =  
   47.8636  
>> 47.8636/(13-5)  
ans =  
    5.9829   % estimates the variance sigma^2  
>> bk0'*bk0  
ans =  
    2.6973  
>> k=(4*5.9829)/2.6973  
k =  
    8.8724     % This is a suggested value for k

3.7.2 Principal Components Regression

If we decompose the centered and rescaled matrix

Z = U SV ', with U 'U = I , V 'V  = I
                         n          p
Call the new variables the components: C = US, their norms arethe singular values: C'C = S2.
Y =  1nb0 + Zb +  e

3.7.3 Principal Components

A quick summary, all is detailed below:

Maximization of u'Au with u'Mu = 1

1) X centered, all points (observations) same weight.

                       sum r
1X  =  0  and   xij =    xitstvjt,
                      t=1
p variables can be replaced by the r columns of v.
        k
 (k)   sum 
xij =     uitstvjt
       t=1
is the best (optimal) approximation of rank k for X.

The columns of V are the directions along which the variances are maximal.

Definition: Principal components are the coordinates of the observations on the basis of the new variables (namely the columns of V ) and they are the rows of G = XV = US. The components are orthogonal and their lengths are the singular values C'C = SU'US = S2.

In the same way the principal axes are defined as the rows of the matrix Z = U'X = SV '. Coordinates of the variables on the basis of columns of U.

Z =  S- 1C'X   and   G  = XZ'S  -1.
However, this decomposition will be highly dependent upon the unity of measurement (scale) on which the variables are measured. It is only used, in fact, when the X.k are all of the same \order". Usually what is done is that a weight is assigned to each variable that takes into account its overall scale. This weight is very often inversely proportional to the variable's variance. So we define a dierent metric between observations instead of
                         '
d (xi.,xj.)  =  (xi.- xj.) ,(xi.-  xj.)
d (xi.,xj.)  =  (xi.- xj.),Q  (xi.-  xj.)
               (  -1          )
                  s21
        Q   =         ...       .
                           -1
                           s2p
The same can be said of the observations; some may be more \important" than others (resulting from a mean of a group which is larger).

The new model will be:

                                       Y  =   1nb0 + Cg  + e
                     Zb  = ZV  V'b = Cg

The best approximation  to rank  k of b :
                                      b+  =   V(k)^g(k)
                                       k
Taking V (k) to be the first k columns of V . The bias of ^b will be: V (p-k)V (p-k)b

Example with R
First rescale to have unit variance:

> apply(Z,2,sd)  
    5.8824   15.5609    6.4051   16.7382  
sweep(Z,2,sd)

Then do the singular value decompositions:

> svd(Zt)$d  
    5.1796    4.3489   1.4964    0.1396

> U=svd(Zt)$u;S=diag(svd(Zt)$d)  
> C=U%*%S;

3.8 More than I said on the Singular Value Decomposition

Singular Value Decomposition

Of numerical and statistical interest is the decomposition of X known as the s.v.d.

Up to now we have shown that X could be written X = UR, U orthogonal n × n R = [     ]
  R1
   0p × p upper trianglar. The transformation of y by U' is c = U'y + e, and the solution to the U's problem is provided by bs such that

E(ci) = R1 ^b.
Could we find a transformation of b such that it could be even simpler (diagonal):
    h  =  V 'b
  '         '               '
U  ^y1  =  E  (c1) = Sh  = SV  ^b
    ^y  =  X b^=  U SV 'b^
               '
   X   =  U SV  .
Theorem:
Let X be an arbitrary n × p matrix, n > p; then there exists Un×n orthogonal and V p × p orthogonal such that:
                                   (  s1  0   0  ...  0  )

        X   =   UnnSnpV p'p    S  =    0   s2  0  ...  0
                                      0   0   0  ...  sp

s1 > s2 >  ...  > sp > 0 ...0.

                (Induction see Golub  & Van  Loan, chapt  1).
The si, for i = 1,...,p are called the singular values of X.

Columns of U are left-singular vectors, columns of V right-singular vectors.

Remark: When X is symmetric positive semi-definite V = V and the singular values of X are its eigenvalues. On the other hand, X'X = V S2V '. The eigenvalues of X'X are the square of the singular values of X (and XX' --> U).

Statistically: E(Y ) = X^
b = USV '^
b = US^
h = XV ^
h . ^
h is the vector of regression coefficients on a new collection of p orthogonal regressors and these coefficients are rotated versions of the original p regression coefficients.

SE(^hi) = s/si  because   ^hi = U 'yi/si,
si small: then SE(^h i) large, which mean that the data provides little information about the regression coefficient on Xvi.

When is Si small?

       2   '  '        '  '  '    '      ' 2     2
||Xvi  || = viX  Xvi = viV S U U SV  vi = eiS  ei = si;
|si| is small i the squared length of Xvi is small; that is, if a partiuclar combination of columns of X is nearly zero, that is when there is near collinearity.

Courant-Fischer Theorem:
Xn×p, n > p and singular values s1 > s2...> sp > 0.

                                 ||u'X||-
sk  =   maxdim_O_=k   minu( - _O_     ||u||
    =   maxdim_O_=k  minu( - _O_   || u'X  ||      _O_ < Rn
                        ||u||=1                      p
    =   maxdimT =k   min v (- T   ||Xv  ||       T <  R
                        ||v||=1
If the si's are distinct and nonzero then u in 2) is uk = U.k and v = vk = V .k.

sk is then the largest possible value for the length of X.

V i is the vector for which the corresponding linear combination of the columns of X has greatest sum of squares. (d1 is its square root).

If the column means have been removed from X then this corresponds to the linear combination that has the largest variance.

max I(v'b : u'y)a s 12

h = (h1,...,hp) parameter of interest
log likelihood l(h) achieves its maximum value for l˙ (h) = @l(h)
 @h = 0p, where

           (                 )
  ˙l(h)  =    @l(h)-...@l(h)--  ,
              @h1       @hp
            @2l(h)
¨l(h)ij  =  ------- where ¨l(h) is a p × p matrix
           @hi@hj
for h to be a minimum in the vector case ¨l (h) must also be negative definite.

The Fisher information or information matrix for h is defined to be

I(h : x) = [- Eh(l¨x( )]h = Eh [˙lx( )][ ˙lx( )]'|h
under general cnditions when the true value of h is h0, the asymptotic variance matrix of the MLE ^h is given by the inverse of the information matrix.

e = Y - X^
b r = rank(X) k2(X) = d1/dr.

TSS = SSM + RSS
Total = Model + Residual
R2 = SSM/TSS = 1 - RSS/TSS.

Theorem: e = max(            )
  ||DX||- ||DY-||
   ||X || , ||Y|| < --1-
k(X).

^g estimate for the perturbed problem of reg. coe.

ê estimate of residually perturbed problem of reg. coe.

For R2 < 1

              (            V~ -------       )
|^b---^g|         2k1(X)--  --1---R2-      2        2
  |^b|    <   e    R     +    R     k2(X)    +  O(e ),

|^ell---l                                2
  |Y |    <   e(1 + 2k(X)(n -  p)) + O(e )
* function of square of the condition number of X.

3.8.1 Generalized Inverses

What if X is not full rank?
This will often happen when coding for analysis of variance.

X'X singular, the vector of coefficients b is not estimable although some linear combinations of b's are estimable. (Those for which I(c'b : Y ) = c'X'Xc are nonzero.

c'b estimable if there exists a vector l such that

E(l'Y )       =        c'b
   '                   '       '     '     *' '    *'   '
E(l Y )       =        lXb  = l V SV  b = l V  b  l  =  lV X
    c'b  estimable  i    c'=  l*'V '
(a) equivalent to c' of the form a'X.

(b) equivalent to c'(X'X)-X'c = cc'.

                        (                      )
                           -1
                           s1  .
                                ..
                                    1-
X+  =  U S+V '    S+  =             sr            .
                                        0  .
                                           ..
                                             0

                                        0
This solution is the one with the minimum squared length, is the same as setting hl+1 = hr+2 = ... = hp = 0. (Otherwise  oo )

X+ is the Moore-Penrose pseudo-inverse of X and if X n×p then Xp×n+ is defined as the unique matrix for which: