Computer Science 106(Introd. to Computer Programming) or equivalent
Stat 116 (Probability) or equivalent
Math 113 (Matrix Algebra) or equivalent
Computational Statistics, Geof Givens and Jennifer Hoeting. Wiley 2005
There will be four homeworks/labs to hand in 40 % and one big project ( final) 60 %
Hint: the more original the work the higher the grade.
Professor Susan Holmes: Wednesday at 1.45, in Sequoia 102.
By email appointment to susan at stat.stanford.edu.
TA: Patrick Perry (patperry at stanford.edu), room 108.
Office hours: Monday 2.30-4.00pm.
This will contain a bulletin board, homeworks, course summary, project description list, reading list, links to useful sites with in particular tutorials, software information, etc...
Address:http://www-stat.edu/~susan/courses/s227/
Weekly consultation of the web site will be necessary and expected of all students. Homework and lab. assignments and solutions will be on the course's coursework site.
There will be 3 sets of labs with exercices on accessing the web and usin R.
Times: TBA
The course will contain 3 parts, here is a list containing both subjects covered in class and potential project matter:
Maximum Likelihood, EM algorithm, Missing Data,
Bootstrap, Cross-Validation
Step by step multivariate regression, robust regression,
Non-parametric Regression, Alternate Conditional Expectation, Projection Pursuit,
(Neural Nets)
Principal Components, Correspondence Analysis,
Classification and Regression Trees, Clustering,
Nonparametric Confidence Regions, (convex hulls),
Approximate Counting, Exhaustive Enumeration, Non parametric Tests.
Matrix manipulations, decompositions, (cholesky, QR,...),
Singular Value Decompositions, Eigenanalysis,
Iterative methods, Gauss-Seidel, Conjugate Gradient,
Downhill Simplex minimization,
Smoothing,
Random Number Generation ,
(Uniform, Normal, Beta, Inverse CDF, Acceptance Rejection, Metropolis),
Markov Chain approach to Combinatorial Optimization (Simulated Annealing), Updating,
Markov Chain Monte Carlo, Gray Codes
Sorting.
Most methods exist in high level programming environments such as R.
We will learn about several additional R functions/packages such as utils (Sweave), mva,ade4, boot,
MCMCpack, Rggobi, rpart, rrcov, modreg,....
Symbolic software will be introduced, if not used much. (maple)
Hw 1 to hand in before Tuesday, April 19th at 4.30pm Rlab1.pdf
See coursework for solutions.
Hw 2 to hand in before Thursday, April 28th at 4.15pm hw2.pdf
Hw 3 to hand in before Monday, May 9th at 4.15pm Chapter 6, problems 6.4, 6.5 and
6.6.
Hint: If you are having trouble with the problem 6.4, there is a di erent approach using Gibbs
sampling that is given in detail here:
http://psblade.ucdavis.edu/BMSBSA/chapter09.ps, this might help you somewhat with a
simplification of your calculations.
Hw 4 to hand in before Monday, May 23th at 4.15pm
Don’t make a habit of it :75 minutes gained will probably cost you the triple to catch up!
I won't be explicitly writing all my lectures as notes on the web, but for the benefit of those who missed a lecture, here is a little of what I said, so you get an idea about at what level the course will be:
![e
x = ± [.d1d2d3 ...dt]× b](t050x.png)
![x = (- 1)s[.d d d d ...d ]× bE -e
0 1 2 3 t- 1](t051x.png)
These are normalized so that the first digit of the mantissa is non zero, in the binary mode, (
= 2),
this means we don't have to code the first one, sometimes it is left out (phantom of the
digit).
If x <
Emin-1 (smallest number representable in machine)
we have underflow,
if x >
Emax(1 -
-t) overflow.
Otherwise x is in the representable range, maybe we won't be able to represent it exactly. Macine
Accuracy:
Smallest fp number added to 1 that gives an answer di erent from 1. (eps in matlab)
Now if we can't represent x without representation error, we'll call x*
Taylor series show how much we are going to be o .
When error growth is linear, the sensitivity is small: f is said to be well-conditionned.
relative to
the data x and the problem S: is the constant that bounds the relative error.

Relative to the norm, the problem and that particular data x.
Worse still, we usually don't have f, but an approximation of it. this induces a second type of error: generated error, also called truncation error because we stop computing before we have the exact right answer(for an integral computed by trapezoids, or any iterative method for computing functions by series, ...) for example.
![f (x)- f * (x*) = [f(x) - f(x*)] + [f(x*) - f *(x*)]](t053x.png)
close to x such that f * (x) = f(
), then the f* algorithm is said to be
stable.
An unstable algorithm is one where the roundo error gets mixed into the calculation at an early stage and is successively magnified until it swamps the true answer.
Example: Want to compute the powers of the golden mean:



Error analysis:
and keep track of all the fp operations and how they make the
errors evolve.
Moment computations, (see Pages 612-613 of Numerical Recipes)
A little example of how mistakes tote up on a total.
Vectors These are the basic building blocks of the new generation of languages, unlike C or fortran whose are fp numbers, the gains are important in writing all computations vectorially.
Matrices Notations, operations, partitionning, careful with the defaults from one system to another.
The Botany of Matrices Diagonal, Symmetric, Banded, Hessenberg, Circulant, Hankel, Triangular,
Sparse, Orthogonal, semi-positive definite,
My notations for: transpose, inner product,norm, matrix norms,
Suppose we have a linear system of equtions, typified by the matrix of coefficients A, unknowns, x and the right hand sides, the vector b.
Close singularity is just as bad because the roundo could take us there, we will have to measure this.
Single precision is usually enough for 30-50 equations.
Double precision makes it possible to solve for a few hundred.
This does NOT hold for pathological systems we will look into these and how to recognize them when we talk about conditionning.
Suppose the matrix of coefficients is a square matrix, (of full rank also, as we saw, non-degenerate) then the equation looks like this:

Formally you would say : invert it, in fact a computational mistake, except if you have multiple right hand sides.
Doing what one would first think of doing if you had to solve a system by hand.
The easiest system: triangular.
Bringing to triangular shape.
Pivotting. This is NOT to solve ill-conditionning, but unlucky diagonal elements that are very small make the algorithm unstable.
LU decomposition, with a permutation matrix (prepivotting) is what is best to do.

Flops: Of an order
(the same as the cost of the multiplication of two matrices, not so
bad).
Try a small perturbation of b to b* = b +
b, call the solution of the perturbed system:
x* = x +
x,


We will study it several times, here is my first introduction: a little made-up example with matlab to start with:
>> u=[3 1 -1 2]'
u = 3 1 -1 2 >> v=(1:4)' v = 1 2 3 4 >> X=u*v' X = 3 6 9 12 1 2 3 4 -1 -2 -3 -4 2 4 6 8 >>svd(X) ans = 21.2132 0.0000 0 0 >> E=10^(-3)*randn(4); >> XE=X+E XE = 3.0012 5.9993 9.0003 12.0012 1.0006 2.0017 3.0009 3.9994 -0.9999 -1.9999 -3.0014 -3.9994 2.0004 4.0018 5.9993 7.9996 >> svd(XE) ans = 21.2143 0.0028 0.0019 0.0005 >> cond(X) Condition is infinite ans = Inf >> cond(XE) ans = 4.2567e+04 %An example you can't see with your bare eyes: >> X2=u'*v X2 = 0.2468 0.2499 0.0166 0.2487 0.3225 0.3266 0.0218 0.3250 0.1495 0.1514 0.0101 0.1506 0.4367 0.4423 0.0295 0.4401 >> svd(X2) ans = 1.0730 0.0000 0 0 >> [U,S,V]=svd(X2); >> U*S*V' ans = 0.2468 0.2499 0.0166 0.2487 0.3225 0.3266 0.0218 0.3250 0.1495 0.1514 0.0101 0.1506 0.4367 0.4423 0.0295 0.4401 >> 10000*(X2-U*S*V') ans = 1.0e-11 * -0.0278 0.0833 -0.0035 0.0555 -0.0555 0.0555 0 0.0555 0 0.0555 0.0017 0.0278 -0.0555 0.1665 0 0.1110 |
>> A=rand(4)
A = 0.5045 0.4940 0.0737 0.9138 0.5163 0.2661 0.5007 0.5297 0.3190 0.0907 0.3841 0.4644 0.9866 0.9478 0.2771 0.9410 >> flops(0) >> [L,U,P]=lu(A) L = 1.0000 0 0 0 0.5233 1.0000 0 0 0.5114 -0.0406 1.0000 0 0.3234 0.9388 0.7363 1.0000 U = 0.9866 0.9478 0.2771 0.9410 0 -0.2298 0.3557 0.0373 0 0 -0.0535 0.4342 0 0 0 -0.1945 >> flops ans = 34 >> P*A ans = 0.9866 0.9478 0.2771 0.9410 0.5163 0.2661 0.5007 0.5297 0.5045 0.4940 0.0737 0.9138 0.3190 0.0907 0.3841 0.4644 >> L*U ans = 0.9866 0.9478 0.2771 0.9410 0.5163 0.2661 0.5007 0.5297 0.5045 0.4940 0.0737 0.9138 0.3190 0.0907 0.3841 0.4644 >> P*A-L*U ans = 1.0e-15 * 0 0 0 0 -0.1110 0 0 0 0 0 0 0 -0.0555 -0.0139 0 0 |
Here is what we need to remember:

The condition number in
2 norm has the simple interpretation as a function of the singular
values, ration of the largest to the smallest:

Read 2.6 of NR (pages 61-71).
Of an order of N steps are sufficient, see NR 51.
This enables, a the price of storing A, to improve the solution by using the residual. Do the following in double precision:


When there are more equations than variables, one can't hope for an exact solution, the system is overdetermined.
We try to minimize the error which is usually measured as the norm

Important remark: For any orthogonal Q, this norm is unchanged by trnasformation by Q, so

Least Squares in Two di erent ways:
>> t = (1900:10:1990)';
>> y = [75.995 91.972 105.711 123.203 131.669 ... 150.697 179.323 203.212 226.505 249.633]'; >> X=[t ones(10,1)] X = 1900 1 1910 1 1920 1 1930 1 1940 1 1950 1 1960 1 1970 1 1980 1 1990 1 >> bhat=X\p bhat = 1.93 -3594.01 >>yhat=polyval(bhat,t); >>[polyval(bhat,t) y] ans = 67.08 76.00 86.35 91.97 105.62 105.71 124.89 123.20 144.16 131.67 163.43 150.70 182.70 179.32 201.96 203.21 221.23 226.50 240.50 249.63 >> norm(yhat-y) ans = 23.5794 |
>> [Q,RC]=qr([X y]);
RC = -6151.30 -3.16 -488.86 0 -0.05 167.82 0 0 23.58 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >> R1=RC(1:2,1:2) R1 = -6151.30 -3.16 0 -0.05 >> C1=RC(1:2,3) C1 = -488.8643 167.8181 >> R1\C1 ans = 1.93 -3594.01 >> C2=RC(3,3) C2 = 23.5794 >>[Q*[C1;zeros(8,1)] yhat] ans 67.08 67.08 86.35 86.35 105.62 105.62 124.89 124.89 144.16 144.16 163.43 163.43 182.70 182.70 201.96 201.96 221.23 221.23 240.50 240.50 |
What are Given's rotations good for?
You can use them to zero out individual isolated elements in any matrix, without changing any of the norms of the vectors, these transformations are orthogonal.
They are based on simple 2 × 2 ones of the form:
![[ ]
c s
- s c](t0520x.png)
You can choose c and s so that a certain vector is orthogonally transformed (actually, rotated) into a vector y = (y1,y2) whose second coordinate is zero.
This is of course easy by solving the equations

The examples that follow show two ways of doing this:
%%%%Van Loan's Function, Chapter 7%%%%%%%%
function [c,s] = Rotate(x1,x2); % Pre: % x1,x2 scalars % Post: % c,s c^2+s^2=1 so -s*x1 + c*x2 = 0. % if x2==0 c = 1; s = 0; else if abs(x2)>=abs(x1) cotangent = x1/x2; s = 1/sqrt(1+cotangent^2); c = s*cotangent; else tangent = x2/x1; c = 1/sqrt(1+tangent^2); s = c*tangent; end end %%%%%%%%%%%%%%%The resident function:%%%%%%% function [G,x] = planerot(x) %PLANEROT Generate a Givens plane rotation. % [G,Y] = PLANEROT(X), where X is a 2-component column vector, % returns a 2-by-2 orthogonal matrix G so that Y = G*X has Y(2) = 0. % See also QRINSERT, QRDELETE. if x(2) ~= 0 r = norm(x); G = [x'; -x(2) x(1)]/r; x = [r; 0]; else G = eye(2); end >> x=[2 3]'; >> [G,y]=planerot(x) G = 0.55 0.83 -0.83 0.55 y = 3.61 0 |
Now in order to extend this to a general context you just notice that if you fill out the rest of the matrix with 1's on the diagonal, it will not change the other coordinates thus:


backsub=function(X,y){
#Backsubstitution to find solution in a triangular case l=dim(X) p=l[2] b=matrix(0,p,1) b[p,1]=y[p,1]/X[pp] for (j in seq(p-1,1,-1){ b[j,1]=(y[j,1]-X[j,(j+1):n]*b[(j+1):n,1])/X[j,j] } return(b)} |
Algorithm (Householder Vector) in R
house= function(x){ #Householder vector calculation p = length(x); mu = sqrt(x%*%x) v = x if (mu!= 0) { beta = x[1] + sign(x[1])*mu v[2 : p,1] = v[2 : p,1]/beta } v[1] = 1 return(v) } |
Algorithm (Householder Vector) in matlab
function v = house(x); %Compute a Householder vector m = length(x); mu = norm(x,2); v = x; if mu ~= 0 beta = x(1) + sign(x(1))*mu; v(2 : m,1) = v(2 : m,1)/beta; end v(1) = 1 |
rowhouse=function(X,v){
beta=-2/(t(v)%*%v) w=beta*t(X)%*%v X=X+v%*%t(w) return(X)} |
qr package:base R Documentation
The QR Decomposition of a Matrix Description: 'qr' computes the QR decomposition of a matrix. It provides an interface to the techniques used in the LINPACK routine DQRDC or the LAPACK routines DGEQP3 and (for complex matrices) ZGEQP3. Usage: qr(x, tol = 1e-07 , LAPACK = FALSE) qr.coef(qr, y) qr.qy(qr, y) qr.qty(qr, y) qr.resid(qr, y) qr.fitted(qr, y, k = qr$rank) qr.solve(a, b, tol = 1e-7) ## S3 method for class 'qr': solve(a, b, ...) is.qr(x) as.qr(x) Arguments: x: a matrix whose QR decomposition is to be computed. tol: the tolerance for detecting linear dependencies in the columns of 'x'. Only used if 'LAPACK' is false and 'x' is real. qr: a QR decomposition of the type computed by 'qr'. y, b: a vector or matrix of right-hand sides of equations. a: A QR decomposition or ('qr.solve' only) a rectangular matrix. k: effective rank. LAPACK: logical. For real 'x', if true use LAPACK otherwise use LINPACK. ...: further arguments passed to or from other methods Details: The QR decomposition plays an important role in many statistical techniques. In particular it can be used to solve the equation *Ax* = *b* for given matrix *A*, and vector *b*. It is useful for computing regression coefficients and in applying the Newton-Raphson algorithm. The functions 'qr.coef', 'qr.resid', and 'qr.fitted' return the coefficients, residuals and fitted values obtained when fitting 'y' to the matrix with QR decomposition 'qr'. 'qr.qy' and 'qr.qty' return 'Q %*% y' and 't(Q) %*% y', where 'Q' is the (complete) *Q* matrix. All the above functions keep 'dimnames' (and 'names') of 'x' and 'y' if there are. 'solve.qr' is the method for 'solve' for 'qr' objects. 'qr.solve' solves systems of equations via the QR decomposition: if 'a' is a QR decomposition it is the same as 'solve.qr', but if 'a' is a rectangular matrix the QR decomposition is computed first. Either will handle over- and under-determined systems, providing a minimal-length solution or a least-squares fit if appropriate. 'is.qr' returns 'TRUE' if 'x' is a 'list' with components named 'qr', 'rank' and 'qraux' and 'FALSE' otherwise. It is not possible to coerce objects to mode '"qr"'. Objects either are QR decompositions or they are not. Value: The QR decomposition of the matrix as computed by LINPACK or LAPACK. The components in the returned value correspond directly to the values returned by DQRDC/DGEQP3/ZGEQP3. qr: a matrix with the same dimensions as 'x'. The upper triangle contains the *R* of the decomposition and the lower triangle contains information on the *Q* of the decomposition (stored in compact form). Note that the storage used by DQRDC and DGEQP3 differs. qraux: a vector of length 'ncol(x)' which contains additional information on *Q*. rank: the rank of 'x' as computed by the decomposition: always full rank in the LAPACK case. pivot: information on the pivoting strategy used during the decomposition. Non-complex QR objects computed by LAPACK have the attribute '"useLAPACK"' with value 'TRUE'. Note: To compute the determinant of a matrix (do you _really_ need it?), the QR decomposition is much more efficient than using Eigen values ('eigen'). See 'det'. Using LAPACK (including in the complex case) uses column pivoting and does not attempt to detect rank-deficient matrices. References: Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S Language_. Wadsworth & Brooks/Cole. Dongarra, J. J., Bunch, J. R., Moler, C. B. and Stewart, G. W. (1978) _LINPACK Users Guide._ Philadelphia: SIAM Publications. Anderson. E. and ten others (1999) _LAPACK Users' Guide_. Third Edition. SIAM. Available on-line at <URL: http://www.netlib.org/lapack/lug/lapack_lug.html>. See Also: 'qr.Q', 'qr.R', 'qr.X' for reconstruction of the matrices. 'lm.fit', 'lsfit', 'eigen', 'svd'. 'det' (using 'qr') to compute the determinant of a matrix. |
Examples:
hilbert <- function(n) { i <- 1:n; 1 / outer(i - 1, i, "+") } h9 <- hilbert(9); h9 qr(h9)$rank #--> only 7 qrh9 <- qr(h9, tol = 1e-10) qrh9$rank #--> 9 ##-- Solve linear equation system H %*% x = y : y <- 1:9/10 x <- qr.solve(h9, y, tol = 1e-10) # or equivalently : x <- qr.coef(qrh9, y) #-- is == but much better than #-- solve(h9) %*% y h9 %*% x # = y |
h9 %*% (solve(h9) %*% y)
[,1] [1,] 0.09999996 [2,] 0.20000011 [3,] 0.30000012 [4,] 0.40000010 [5,] 0.50000008 [6,] 0.60000007 [7,] 0.70000005 [8,] 0.80000004 [9,] 0.90000003 > h9 %*% qr.coef(qrh9, y) [,1] [1,] 0.1 [2,] 0.2 [3,] 0.3 [4,] 0.4 [5,] 0.5 [6,] 0.6 [7,] 0.7 [8,] 0.8 [9,] 0.9 |
> round(h9,4)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [1,] 1.0000 0.5000 0.3333 0.2500 0.2000 0.1667 0.1429 0.1250 0.1111 [2,] 0.5000 0.3333 0.2500 0.2000 0.1667 0.1429 0.1250 0.1111 0.1000 [3,] 0.3333 0.2500 0.2000 0.1667 0.1429 0.1250 0.1111 0.1000 0.0909 [4,] 0.2500 0.2000 0.1667 0.1429 0.1250 0.1111 0.1000 0.0909 0.0833 [5,] 0.2000 0.1667 0.1429 0.1250 0.1111 0.1000 0.0909 0.0833 0.0769 [6,] 0.1667 0.1429 0.1250 0.1111 0.1000 0.0909 0.0833 0.0769 0.0714 [7,] 0.1429 0.1250 0.1111 0.1000 0.0909 0.0833 0.0769 0.0714 0.0667 [8,] 0.1250 0.1111 0.1000 0.0909 0.0833 0.0769 0.0714 0.0667 0.0625 [9,] 0.1111 0.1000 0.0909 0.0833 0.0769 0.0714 0.0667 0.0625 0.0588 |
Ridge regression in R
>library(MASS)
>help(lm.ridge) |
(see Faraways chapter 9 )
Example of a matlab ridge regression function:
function bks=ridge(Z,Y,kvalues)
% Ridge Function of Z (centered, explanatory) % Y is the response, % kvalues are the values where to compute [n,p]=size(Z); ZpY=Z'*Y; ZpZ=Z'*Z; m=length(kvalues); bks=ones(p,m); for k =1:m bks(:,k)=(ZpZ+diag(kvalues(k)))\ZpY; end >> kvalues=(0:.05:.5) kvalues = Columns 1 through 7 0 0.0500 0.1000 0.1500 0.2000 0.2500 0.3000 Columns 8 through 11 0.3500 0.4000 0.4500 0.5000 >> ridge(Z,Y,(0:.05:.5)) ans = Columns 1 through 7 1.5511 1.5176 1.4882 1.4622 1.4390 1.4183 1.3996 0.5102 0.4775 0.4488 0.4234 0.4009 0.3806 0.3624 0.1019 0.0678 0.0378 0.0113 -0.0122 -0.0334 -0.0524 -0.1441 -0.1762 -0.2043 -0.2292 -0.2514 -0.2713 -0.2892 Columns 8 through 11 1.3827 1.3673 1.3532 1.3403 0.3459 0.3309 0.3172 0.3046 -0.0696 -0.0853 -0.0997 -0.1128 -0.3054 -0.3201 -0.3336 -0.3460 %Formula gives for choice of k: >> norm(Yhat-Yc)^2 ans = 47.8636 >> 47.8636/(13-5) ans = 5.9829 % estimates the variance sigma^2 >> bk0'*bk0 ans = 2.6973 >> k=(4*5.9829)/2.6973 k = 8.8724 % This is a suggested value for k |
If we decompose the centered and rescaled matrix


A quick summary, all is detailed below:
new variables,

X Q
, U = D-
E, V = D-
A,
, r =
N . 1p, c =
N'. 1
n, X = rc',
(-)
,
n
p
2.Maximization of u'Au with u'Mu = 1
1) X centered, all points (observations) same weight.


The columns of V are the directions along which the variances are maximal.
Definition: Principal components are the coordinates of the observations on the basis of the new variables (namely the columns of V ) and they are the rows of G = XV = US. The components are orthogonal and their lengths are the singular values C'C = SU'US = S2.
In the same way the principal axes are defined as the rows of the matrix Z = U'X = SV '. Coordinates of the variables on the basis of columns of U.


The new model will be:

will be: V (p-k)V (p-k)
Example with R
First rescale to have unit variance:
> apply(Z,2,sd)
5.8824 15.5609 6.4051 16.7382 sweep(Z,2,sd) |
Then do the singular value decompositions:
> svd(Zt)$d
5.1796 4.3489 1.4964 0.1396 |
> U=svd(Zt)$u;S=diag(svd(Zt)$d)
> C=U%*%S; |
Singular Value Decomposition
Of numerical and statistical interest is the decomposition of X known as the s.v.d.
Up to now we have shown that X could be written X = UR, U orthogonal n × n R =
p × p
upper trianglar. The transformation of y by U' is c = U'y +
, and the solution to the U's problem is
provided by
s such that

such that it could be even simpler (diagonal): 

,p are called the singular values of X.
Columns of U are left-singular vectors, columns of V right-singular vectors.
Remark: When X is symmetric positive semi-definite V = V and the singular values of X are its
eigenvalues. On the other hand, X'X = V S2V '. The eigenvalues of X'X are the square of the singular
values of X (and XX'
U).
Statistically: E(Y ) = X
= USV '
= US
= XV
.
is the vector of regression coefficients on a new
collection of p orthogonal regressors and these coefficients are rotated versions of the original p
regression coefficients.

i) large, which mean that the data provides little information about the regression
coefficient on Xvi.
When is Si small?

Courant-Fischer Theorem:
Xn×p, n > p and singular values s1 > s2
> sp > 0.

sk is then the largest possible value for the length of X.
V i is the vector for which the corresponding linear combination of the columns of X has greatest sum of squares. (d1 is its square root).
If the column means have been removed from X then this corresponds to the linear combination that has the largest variance.
max I(v'
: u'y)
s
12
= (
1,
,
p) parameter of interest
log likelihood
(
) achieves its maximum value for
(
) =
= 0p, where

to be a minimum in the vector case
(
) must also be negative definite.
The Fisher information or information matrix for
is defined to be
![I(h : x) = [- Eh(l¨x( )]h = Eh [˙lx( )][ ˙lx( )]'|h](t0563x.png)
is
0, the asymptotic variance matrix of the MLE
is
given by the inverse of the information matrix.
e = Y - X
r = rank(X) k2(X) = d1/dr.
TSS = SSM + RSS
Total = Model + Residual
R2 = SSM/TSS = 1 - RSS/TSS.
Theorem:
= max
<
.
estimate for the perturbed problem of reg. coe .
ê estimate of residually perturbed problem of reg. coe .
For R2 < 1

What if X is not full rank?
This will often happen when coding for analysis of variance.
X'X singular, the vector of coefficients
is not estimable although some linear combinations of
's are
estimable. (Those for which I(c'
: Y ) = c'X'Xc are nonzero.
c'
estimable if there exists a vector
such that

(b) equivalent to c'(X'X)-X'
=
c'.


+1 =
r+2 =
=
p = 0.
(Otherwise
)
X+ is the Moore-Penrose pseudo-inverse of X and if X n×p then Xp×n+ is defined as the unique matrix for which: