Computer Science 106(Introd. to Computer Programming) or equivalent
Stat 116 (Probability) or equivalent
Math 113 (Matrix Algebra) or equivalent
Computational Statistics, Geof Givens and Jennifer Hoeting. Wiley 2005
There will be four homeworks/labs to hand in 40 % and one big project ( final) 60 %
Hint: the more original the work the higher the grade.
Professor Susan Holmes: Wednesday at 1.45, in Sequoia 102.
By email appointment to susan at stat.stanford.edu.
TA: Patrick Perry (patperry at stanford.edu), room 108.
Office hours: Monday 2.30-4.00pm.
This will contain a bulletin board, homeworks, course summary, project description list, reading list, links to useful sites with in particular tutorials, software information, etc...
Address:http://www-stat.edu/~susan/courses/s227/
Weekly consultation of the web site will be necessary and expected of all students. Homework and lab. assignments and solutions will be on the course's coursework site.
There will be 3 sets of labs with exercices on accessing the web and usin R.
Times: TBA
The course will contain 3 parts, here is a list containing both subjects covered in class and potential project matter:
Maximum Likelihood, EM algorithm, Missing Data,
Bootstrap, Cross-Validation
Step by step multivariate regression, robust regression,
Non-parametric Regression, Alternate Conditional Expectation, Projection Pursuit,
(Neural Nets)
Principal Components, Correspondence Analysis,
Classification and Regression Trees, Clustering,
Nonparametric Confidence Regions, (convex hulls),
Approximate Counting, Exhaustive Enumeration, Non parametric Tests.
Matrix manipulations, decompositions, (cholesky, QR,...),
Singular Value Decompositions, Eigenanalysis,
Iterative methods, Gauss-Seidel, Conjugate Gradient,
Downhill Simplex minimization,
Smoothing,
Random Number Generation ,
(Uniform, Normal, Beta, Inverse CDF, Acceptance Rejection, Metropolis),
Markov Chain approach to Combinatorial Optimization (Simulated Annealing), Updating,
Markov Chain Monte Carlo, Gray Codes
Sorting.
Most methods exist in high level programming environments such as R.
We will learn about several additional R functions/packages such as utils (Sweave), mva,ade4, boot,
MCMCpack, Rggobi, rpart, rrcov, modreg,....
Symbolic software will be introduced, if not used much. (maple)
Hw 1 to hand in before Tuesday, April 19th at 4.30pm Rlab1.pdf
See coursework for solutions.
Hw 2 to hand in before Thursday, April 28th at 4.15pm hw2.pdf
Hw 3 to hand in before Monday, May 9th at 4.15pm Chapter 6, problems 6.4, 6.5 and
6.6.
Hint: If you are having trouble with the problem 6.4, there is a di erent approach using Gibbs
sampling that is given in detail here:
http://psblade.ucdavis.edu/BMSBSA/chapter09.ps, this might help you somewhat with a
simplification of your calculations.
Hw 4 to hand in before Monday, May 23th at 4.15pm
Don’t make a habit of it :75 minutes gained will probably cost you the triple to catch up!
I won't be explicitly writing all my lectures as notes on the web, but for the benefit of those who missed a lecture, here is a little of what I said, so you get an idea about at what level the course will be:
![e
x = ± [.d1d2d3 ...dt]× b](t050x.png)
![x = (- 1)s[.d d d d ...d ]× bE -e
0 1 2 3 t- 1](t051x.png)
These are normalized so that the first digit of the mantissa is non zero, in the binary mode, (
= 2),
this means we don't have to code the first one, sometimes it is left out (phantom of the
digit).
If x <
Emin-1 (smallest number representable in machine)
we have underflow,
if x >
Emax(1 -
-t) overflow.
Otherwise x is in the representable range, maybe we won't be able to represent it exactly. Macine
Accuracy:
Smallest fp number added to 1 that gives an answer di erent from 1. (eps in matlab)
Now if we can't represent x without representation error, we'll call x*
Taylor series show how much we are going to be o .
When error growth is linear, the sensitivity is small: f is said to be well-conditionned.
relative to
the data x and the problem S: is the constant that bounds the relative error.

Relative to the norm, the problem and that particular data x.
Worse still, we usually don't have f, but an approximation of it. this induces a second type of error: generated error, also called truncation error because we stop computing before we have the exact right answer(for an integral computed by trapezoids, or any iterative method for computing functions by series, ...) for example.
![f (x)- f * (x*) = [f(x) - f(x*)] + [f(x*) - f *(x*)]](t053x.png)
close to x such that f * (x) = f(
), then the f* algorithm is said to be
stable.
An unstable algorithm is one where the roundo error gets mixed into the calculation at an early stage and is successively magnified until it swamps the true answer.
Example: Want to compute the powers of the golden mean:



Error analysis:
and keep track of all the fp operations and how they make the
errors evolve.
Moment computations, (see Pages 612-613 of Numerical Recipes)
A little example of how mistakes tote up on a total.
Vectors These are the basic building blocks of the new generation of languages, unlike C or fortran whose are fp numbers, the gains are important in writing all computations vectorially.
Matrices Notations, operations, partitionning, careful with the defaults from one system to another.
The Botany of Matrices Diagonal, Symmetric, Banded, Hessenberg, Circulant, Hankel, Triangular,
Sparse, Orthogonal, semi-positive definite,
My notations for: transpose, inner product,norm, matrix norms,
Suppose we have a linear system of equtions, typified by the matrix of coefficients A, unknowns, x and the right hand sides, the vector b.
Close singularity is just as bad because the roundo could take us there, we will have to measure this.
Single precision is usually enough for 30-50 equations.
Double precision makes it possible to solve for a few hundred.
This does NOT hold for pathological systems we will look into these and how to recognize them when we talk about conditionning.
Suppose the matrix of coefficients is a square matrix, (of full rank also, as we saw, non-degenerate) then the equation looks like this:

Formally you would say : invert it, in fact a computational mistake, except if you have multiple right hand sides.
Doing what one would first think of doing if you had to solve a system by hand.
The easiest system: triangular.
Bringing to triangular shape.
Pivotting. This is NOT to solve ill-conditionning, but unlucky diagonal elements that are very small make the algorithm unstable.
LU decomposition, with a permutation matrix (prepivotting) is what is best to do.

Flops: Of an order
(the same as the cost of the multiplication of two matrices, not so
bad).
Try a small perturbation of b to b* = b +
b, call the solution of the perturbed system:
x* = x +
x,


We will study it several times, here is my first introduction: a little made-up example with matlab to start with:
>> u=[3 1 -1 2]'
u = 3 1 -1 2 >> v=(1:4)' v = 1 2 3 4 >> X=u*v' X = 3 6 9 12 1 2 3 4 -1 -2 -3 -4 2 4 6 8 >>svd(X) ans = 21.2132 0.0000 0 0 >> E=10^(-3)*randn(4); >> XE=X+E XE = 3.0012 5.9993 9.0003 12.0012 1.0006 2.0017 3.0009 3.9994 -0.9999 -1.9999 -3.0014 -3.9994 2.0004 4.0018 5.9993 7.9996 >> svd(XE) ans = 21.2143 0.0028 0.0019 0.0005 >> cond(X) Condition is infinite ans = Inf >> cond(XE) ans = 4.2567e+04 %An example you can't see with your bare eyes: >> X2=u'*v X2 = 0.2468 0.2499 0.0166 0.2487 0.3225 0.3266 0.0218 0.3250 0.1495 0.1514 0.0101 0.1506 0.4367 0.4423 0.0295 0.4401 >> svd(X2) ans = 1.0730 0.0000 0 0 >> [U,S,V]=svd(X2); >> U*S*V' ans = 0.2468 0.2499 0.0166 0.2487 0.3225 0.3266 0.0218 0.3250 0.1495 0.1514 0.0101 0.1506 0.4367 0.4423 0.0295 0.4401 >> 10000*(X2-U*S*V') ans = 1.0e-11 * -0.0278 0.0833 -0.0035 0.0555 -0.0555 0.0555 0 0.0555 0 0.0555 0.0017 0.0278 -0.0555 0.1665 0 0.1110 |
>> A=rand(4)
A = 0.5045 0.4940 0.0737 0.9138 0.5163 0.2661 0.5007 0.5297 0.3190 0.0907 0.3841 0.4644 0.9866 0.9478 0.2771 0.9410 >> flops(0) >> [L,U,P]=lu(A) L = 1.0000 0 0 0 0.5233 1.0000 0 0 0.5114 -0.0406 1.0000 0 0.3234 0.9388 0.7363 1.0000 U = 0.9866 0.9478 0.2771 0.9410 0 -0.2298 0.3557 0.0373 0 0 -0.0535 0.4342 0 0 0 -0.1945 >> flops ans = 34 >> P*A ans = 0.9866 0.9478 0.2771 0.9410 0.5163 0.2661 0.5007 0.5297 0.5045 0.4940 0.0737 0.9138 0.3190 0.0907 0.3841 0.4644 >> L*U ans = 0.9866 0.9478 0.2771 0.9410 0.5163 0.2661 0.5007 0.5297 0.5045 0.4940 0.0737 0.9138 0.3190 0.0907 0.3841 0.4644 >> P*A-L*U ans = 1.0e-15 * 0 0 0 0 -0.1110 0 0 0 0 0 0 0 -0.0555 -0.0139 0 0 |
Here is what we need to remember:

The condition number in
2 norm has the simple interpretation as a function of the singular
values, ration of the largest to the smallest:

Read 2.6 of NR (pages 61-71).
Of an order of N steps are sufficient, see NR 51.
This enables, a the price of storing A, to improve the solution by using the residual. Do the following in double precision:


When there are more equations than variables, one can't hope for an exact solution, the system is overdetermined.
We try to minimize the error which is usually measured as the norm

Important remark: For any orthogonal Q, this norm is unchanged by trnasformation by Q, so

Least Squares in Two di erent ways:
>> t = (1900:10:1990)';
>> y = [75.995 91.972 105.711 123.203 131.669 ... 150.697 179.323 203.212 226.505 249.633]'; >> X=[t ones(10,1)] X = 1900 1 1910 1 1920 1 1930 1 1940 1 1950 1 1960 1 1970 1 1980 1 1990 1 >> bhat=X\p bhat = 1.93 -3594.01 >>yhat=polyval(bhat,t); >>[polyval(bhat,t) y] ans = 67.08 76.00 86.35 91.97 105.62 105.71 124.89 123.20 144.16 131.67 163.43 150.70 182.70 179.32 201.96 203.21 221.23 226.50 240.50 249.63 >> norm(yhat-y) ans = 23.5794 |
>> [Q,RC]=qr([X y]);
RC = -6151.30 -3.16 -488.86 0 -0.05 167.82 0 0 23.58 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >> R1=RC(1:2,1:2) R1 = -6151.30 -3.16 0 -0.05 >> C1=RC(1:2,3) C1 = -488.8643 167.8181 >> R1\C1 ans = 1.93 -3594.01 >> C2=RC(3,3) C2 = 23.5794 >>[Q*[C1;zeros(8,1)] yhat] ans 67.08 67.08 86.35 86.35 105.62 105.62 124.89 124.89 144.16 144.16 163.43 163.43 182.70 182.70 201.96 201.96 221.23 221.23 240.50 240.50 |
What are Given's rotations good for?
You can use them to zero out individual isolated elements in any matrix, without changing any of the norms of the vectors, these transformations are orthogonal.
They are based on simple 2 × 2 ones of the form:
![[ ]
c s
- s c](t0520x.png)
You can choose c and s so that a certain vector is orthogonally transformed (actually, rotated) into a vector y = (y1,y2) whose second coordinate is zero.
This is of course easy by solving the equations

The examples that follow show two ways of doing this:
%%%%Van Loan's Function, Chapter 7%%%%%%%%
function [c,s] = Rotate(x1,x2); % Pre: % x1,x2 scalars % Post: % c,s c^2+s^2=1 so -s*x1 + c*x2 = 0. % if x2==0 c = 1; s = 0; else if abs(x2)>=abs(x1) cotangent = x1/x2; s = 1/sqrt(1+cotangent^2); c = s*cotangent; else tangent = x2/x1; c = 1/sqrt(1+tangent^2); s = c*tangent; end end %%%%%%%%%%%%%%%The resident function:%%%%%%% function [G,x] = planerot(x) %PLANEROT Generate a Givens plane rotation. % [G,Y] = PLANEROT(X), where X is a 2-component column vector, % returns a 2-by-2 orthogonal matrix G so that Y = G*X has Y(2) = 0. % See also QRINSERT, QRDELETE. if x(2) ~= 0 r = norm(x); G = [x'; -x(2) x(1)]/r; x = [r; 0]; else G = eye(2); end >> x=[2 3]'; >> [G,y]=planerot(x) G = 0.55 0.83 -0.83 0.55 y = 3.61 0 |
Now in order to extend this to a general context you just notice that if you fill out the rest of the matrix with 1's on the diagonal, it will not change the other coordinates thus:


backsub=function(X,y){
#Backsubstitution to find solution in a triangular case l=dim(X) p=l[2] b=matrix(0,p,1) b[p,1]=y[p,1]/X[pp] for (j in seq(p-1,1,-1){ b[j,1]=(y[j,1]-X[j,(j+1):n]*b[(j+1):n,1])/X[j,j] } return(b)} |
Algorithm (Householder Vector) in R
house= function(x){ #Householder vector calculation p = length(x); mu = sqrt(x%*%x) v = x if (mu!= 0) { beta = x[1] + sign(x[1])*mu v[2 : p,1] = v[2 : p,1]/beta } v[1] = 1 return(v) } |
Algorithm (Householder Vector) in matlab
function v = house(x); %Compute a Householder vector m = length(x); mu = norm(x,2); v = x; if mu ~= 0 beta = x(1) + sign(x(1))*mu; v(2 : m,1) = v(2 : m,1)/beta; end v(1) = 1 |
rowhouse=function(X,v){
beta=-2/(t(v)%*%v) w=beta*t(X)%*%v X=X+v%*%t(w) return(X)} |
qr package:base R Documentation
The QR Decomposition of a Matrix Description: 'qr' computes the QR decomposition of a matrix. It provides an interface to the techniques used in the LINPACK routine DQRDC or the LAPACK routines DGEQP3 and (for complex matrices) ZGEQP3. Usage: qr(x, tol = 1e-07 , LAPACK = FALSE) qr.coef(qr, y) qr.qy(qr, y) qr.qty(qr, y) qr.resid(qr, y) qr.fitted(qr, y, k = qr$rank) qr.solve(a, b, tol = 1e-7) ## S3 method for class 'qr': solve(a, b, ...) is.qr(x) as.qr(x) Arguments: x: a matrix whose QR decomposition is to be computed. tol: the tolerance for detecting linear dependencies in the columns of 'x'. Only used if 'LAPACK' is false and 'x' is real. qr: a QR decomposition of the type computed by 'qr'. y, b: a vector or matrix of right-hand sides of equations. a: A QR decomposition or ('qr.solve' only) a rectangular matrix. k: effective rank. LAPACK: logical. For real 'x', if true use LAPACK otherwise use LINPACK. ...: further arguments passed to or from other methods Details: The QR decomposition plays an important role in many statistical techniques. In particular it can be used to solve the equation *Ax* = *b* for given matrix *A*, and vector *b*. It is useful for computing regression coefficients and in applying the Newton-Raphson algorithm. The functions 'qr.coef', 'qr.resid', and 'qr.fitted' return the coefficients, residuals and fitted values obtained when fitting 'y' to the matrix with QR decomposition 'qr'. 'qr.qy' and 'qr.qty' return 'Q %*% y' and 't(Q) %*% y', where 'Q' is the (complete) *Q* matrix. All the above functions keep 'dimnames' (and 'names') of 'x' and 'y' if there are. 'solve.qr' is the method for 'solve' for 'qr' objects. 'qr.solve' solves systems of equations via the QR decomposition: if 'a' is a QR decomposition it is the same as 'solve.qr', but if 'a' is a rectangular matrix the QR decomposition is computed first. Either will handle over- and under-determined systems, providing a minimal-length solution or a least-squares fit if appropriate. 'is.qr' returns 'TRUE' if 'x' is a 'list' with components named 'qr', 'rank' and 'qraux' and 'FALSE' otherwise. It is not possible to coerce objects to mode '"qr"'. Objects either are QR decompositions or they are not. Value: The QR decomposition of the matrix as computed by LINPACK or LAPACK. The components in the returned value correspond directly to the values returned by DQRDC/DGEQP3/ZGEQP3. qr: a matrix with the same dimensions as 'x'. The upper triangle contains the *R* of the decomposition and the lower triangle contains information on the *Q* of the decomposition (stored in compact form). Note that the storage used by DQRDC and DGEQP3 differs. qraux: a vector of length 'ncol(x)' which contains additional information on *Q*. rank: the rank of 'x' as computed by the decomposition: always full rank in the LAPACK case. pivot: information on the pivoting strategy used during the decomposition. Non-complex QR objects computed by LAPACK have the attribute '"useLAPACK"' with value 'TRUE'. Note: To compute the determinant of a matrix (do you _really_ need it?), the QR decomposition is much more efficient than using Eigen values ('eigen'). See 'det'. Using LAPACK (including in the complex case) uses column pivoting and does not attempt to detect rank-deficient matrices. References: Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S Language_. Wadsworth & Brooks/Cole. Dongarra, J. J., Bunch, J. R., Moler, C. B. and Stewart, G. W. (1978) _LINPACK Users Guide._ Philadelphia: SIAM Publications. Anderson. E. and ten others (1999) _LAPACK Users' Guide_. Third Edition. SIAM. Available on-line at <URL: http://www.netlib.org/lapack/lug/lapack_lug.html>. See Also: 'qr.Q', 'qr.R', 'qr.X' for reconstruction of the matrices. 'lm.fit', 'lsfit', 'eigen', 'svd'. 'det' (using 'qr') to compute the determinant of a matrix. |
Examples:
hilbert <- function(n) { i <- 1:n; 1 / outer(i - 1, i, "+") } h9 <- hilbert(9); h9 qr(h9)$rank #--> only 7 qrh9 <- qr(h9, tol = 1e-10) qrh9$rank #--> 9 ##-- Solve linear equation system H %*% x = y : y <- 1:9/10 x <- qr.solve(h9, y, tol = 1e-10) # or equivalently : x <- qr.coef(qrh9, y) #-- is == but much better than #-- solve(h9) %*% y h9 %*% x # = y |
h9 %*% (solve(h9) %*% y)
[,1] [1,] 0.09999996 [2,] 0.20000011 [3,] 0.30000012 [4,] 0.40000010 [5,] 0.50000008 [6,] 0.60000007 [7,] 0.70000005 [8,] 0.80000004 [9,] 0.90000003 > h9 %*% qr.coef(qrh9, y) [,1] [1,] 0.1 [2,] 0.2 [3,] 0.3 [4,] 0.4 [5,] 0.5 [6,] 0.6 [7,] 0.7 [8,] 0.8 [9,] 0.9 |
> round(h9,4)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [1,] 1.0000 0.5000 0.3333 0.2500 0.2000 0.1667 0.1429 0.1250 0.1111 [2,] 0.5000 0.3333 0.2500 0.2000 0.1667 0.1429 0.1250 0.1111 0.1000 [3,] 0.3333 0.2500 0.2000 0.1667 0.1429 0.1250 0.1111 0.1000 0.0909 [4,] 0.2500 0.2000 0.1667 0.1429 0.1250 0.1111 0.1000 0.0909 0.0833 [5,] 0.2000 0.1667 0.1429 0.1250 0.1111 0.1000 0.0909 0.0833 0.0769 [6,] 0.1667 0.1429 0.1250 0.1111 0.1000 0.0909 0.0833 0.0769 0.0714 [7,] 0.1429 0.1250 0.1111 0.1000 0.0909 0.0833 0.0769 0.0714 0.0667 [8,] 0.1250 0.1111 0.1000 0.0909 0.0833 0.0769 0.0714 0.0667 0.0625 [9,] 0.1111 0.1000 0.0909 0.0833 0.0769 0.0714 0.0667 0.0625 0.0588 |
Ridge regression in R
>library(MASS)
>help(lm.ridge) |
(see Faraways chapter 9 )
Example of a matlab ridge regression function:
function bks=ridge(Z,Y,kvalues)
% Ridge Function of Z (centered, explanatory) % Y is the response, % kvalues are the values where to compute [n,p]=size(Z); ZpY=Z'*Y; ZpZ=Z'*Z; m=length(kvalues); bks=ones(p,m); for k =1:m bks(:,k)=(ZpZ+diag(kvalues(k)))\ZpY; end >> kvalues=(0:.05:.5) kvalues = Columns 1 through 7 0 0.0500 0.1000 0.1500 0.2000 0.2500 0.3000 Columns 8 through 11 0.3500 0.4000 0.4500 0.5000 >> ridge(Z,Y,(0:.05:.5)) ans = Columns 1 through 7 1.5511 1.5176 1.4882 1.4622 1.4390 1.4183 1.3996 0.5102 0.4775 0.4488 0.4234 0.4009 0.3806 0.3624 0.1019 0.0678 0.0378 0.0113 -0.0122 -0.0334 -0.0524 -0.1441 -0.1762 -0.2043 -0.2292 -0.2514 -0.2713 -0.2892 Columns 8 through 11 1.3827 1.3673 1.3532 1.3403 0.3459 0.3309 0.3172 0.3046 -0.0696 -0.0853 -0.0997 -0.1128 -0.3054 -0.3201 -0.3336 -0.3460 %Formula gives for choice of k: >> norm(Yhat-Yc)^2 ans = 47.8636 >> 47.8636/(13-5) ans = 5.9829 % estimates the variance sigma^2 >> bk0'*bk0 ans = 2.6973 >> k=(4*5.9829)/2.6973 k = 8.8724 % This is a suggested value for k |
If we decompose the centered and rescaled matrix


A quick summary, all is detailed below:
new variables,

X Q
, U = D-
E, V = D-
A,
, r =
N . 1p, c =
N'. 1
n, X = rc',
(-)
,
n
p
2.Maximization of u'Au with u'Mu = 1
1) X centered, all points (observations) same weight.


The columns of V are the directions along which the variances are maximal.
Definition: Principal components are the coordinates of the observations on the basis of the new variables (namely the columns of V ) and they are the rows of G = XV = US. The components are orthogonal and their lengths are the singular values C'C = SU'US = S2.
In the same way the principal axes are defined as the rows of the matrix Z = U'X = SV '. Coordinates of the variables on the basis of columns of U.


The new model will be:

will be: V (p-k)V (p-k)
Example with R
First rescale to have unit variance:
> apply(Z,2,sd)
5.8824 15.5609 6.4051 16.7382 sweep(Z,2,sd) |
Then do the singular value decompositions:
> svd(Zt)$d
5.1796 4.3489 1.4964 0.1396 |
> U=svd(Zt)$u;S=diag(svd(Zt)$d)
> C=U%*%S; |
Singular Value Decomposition
Of numerical and statistical interest is the decomposition of X known as the s.v.d.
Up to now we have shown that X could be written X = UR, U orthogonal n × n R =
p × p
upper trianglar. The transformation of y by U' is c = U'y +
, and the solution to the U's problem is
provided by
s such that

such that it could be even simpler (diagonal): 

,p are called the singular values of X.
Columns of U are left-singular vectors, columns of V right-singular vectors.
Remark: When X is symmetric positive semi-definite V = V and the singular values of X are its
eigenvalues. On the other hand, X'X = V S2V '. The eigenvalues of X'X are the square of the singular
values of X (and XX'
U).
Statistically: E(Y ) = X
= USV '
= US
= XV
.
is the vector of regression coefficients on a new
collection of p orthogonal regressors and these coefficients are rotated versions of the original p
regression coefficients.

i) large, which mean that the data provides little information about the regression
coefficient on Xvi.
When is Si small?

Courant-Fischer Theorem:
Xn×p, n > p and singular values s1 > s2
> sp > 0.

sk is then the largest possible value for the length of X.
V i is the vector for which the corresponding linear combination of the columns of X has greatest sum of squares. (d1 is its square root).
If the column means have been removed from X then this corresponds to the linear combination that has the largest variance.
max I(v'
: u'y)
s
12
= (
1,
,
p) parameter of interest
log likelihood
(
) achieves its maximum value for
(
) =
= 0p, where

to be a minimum in the vector case
(
) must also be negative definite.
The Fisher information or information matrix for
is defined to be
![I(h : x) = [- Eh(l¨x( )]h = Eh [˙lx( )][ ˙lx( )]'|h](t0563x.png)
is
0, the asymptotic variance matrix of the MLE
is
given by the inverse of the information matrix.
e = Y - X
r = rank(X) k2(X) = d1/dr.
TSS = SSM + RSS
Total = Model + Residual
R2 = SSM/TSS = 1 - RSS/TSS.
Theorem:
= max
<
.
estimate for the perturbed problem of reg. coe .
ê estimate of residually perturbed problem of reg. coe .
For R2 < 1

What if X is not full rank?
This will often happen when coding for analysis of variance.
X'X singular, the vector of coefficients
is not estimable although some linear combinations of
's are
estimable. (Those for which I(c'
: Y ) = c'X'Xc are nonzero.
c'
estimable if there exists a vector
such that

(b) equivalent to c'(X'X)-X'
=
c'.


+1 =
r+2 =
=
p = 0.
(Otherwise
)
X+ is the Moore-Penrose pseudo-inverse of X and if X n×p then Xp×n+ is defined as the unique matrix for which:
The solution
= X+Y is equivalent to
= (X'X)+X'Y .
Summary:
B) AB-1 B-
A B-
.
symmetrical eigenvalue problems.
Z tranforms A's structure.
1 simple v1,x0 non
v2 Xi = A
.
nA?nx
0 = wi
in
, || xn ||
1.
1.
[v1
vk],Ri
[
1
k].
k+1/
k.
p+1/
p. A -
I and A same eigenvectors and
eigenvalues di er by
.
so that (
k+1/
k) small (
close to
k+1).
k+1 not known, Qpp good or eigenvalue of lower block of A.
I.
, B positive o -diagonal elements
then Q and B are determined as soon as the first col. of Q is.
iI = Q'(A
2 -
iI)Qi +
iI = Qi'A
iQi.
Jp,Jk = Gk+1k.
function [vec,value]=power(start,A,toler)
% %Power method for computing eigenvalues % dd=1; x=start; n=10; while dd> toler y=A*x dd=abs(norm(x)-n); n=norm(x) x=y/n pause end vec=x; value=n; |
>> v1
v1 = 1 1 1 >> A A = 1 1 1 1 2 3 1 3 6 |
>> power(v1,A,1/10)
y = 3 6 10 dd = 8.2679 x = 1.7321 3.4641 5.7735 y =10.9697 25.9808 46.7654 n = 6.9522 x = 1.5779 3.7370 6.7267 y = 12.0416 29.2320 53.1491 n = 7.8552 x = 1.5330 3.7214 6.7661 y = 12.0205 29.2741 53.2940 n = 7.8727 x = 1.5269 3.7184 6.7695 |
![[v3 e3] = power(v1, A, 1/10)](t0582x.png)
v3 =
0.1939 0.4723 0.8598 e3 = 7.8727 >> [ve,ei]=eig(A) ve = 0.5438 -0.8165 0.1938 -0.7812 -0.4082 0.4722 0.3065 0.4082 0.8599 ei = 0.1270 0 0 0 1.0000 0 0 0 7.8730 |
Motivation:
A method for finding maximum likelihood estimates,
-either in presence of missing data.
-or when the model can be simplified by adding `latent parameters'.
Some references:
Hartley, 1958, Biometrics, 174-194, is a good starting place because he does simple examples.
Apparently the method was discovered by Oscar Rothaus while he was at IDA, with Baum, who then published several papers of which an example is Baum, Petrie, Soules and Weiss, (1970), Ann. Math. Stat., 164-171 who use the idea for speech recognition.
The main reference that people cite is Dempster, Laird et al. (1977), JRSS B - they coined the name EM.
It is hard to motivate with a `trivial example' because the the likelihood is easy to di erentiate.
Most of the elementary examples come from the multinomial likelihoods.
A very recent reference is the book called: The EM algorithm and Extensions, by G.J. McLachlan and T. Krishnan,(1997), Wiley.
Here is a first motivating example
Example 1:
Multinomial counts of animals into 4 categories


Then we can consider that this is in fact a five-category problem with first two categories being split.
Call
= (1 -
)2, then the density is proportional to




maple
|\^/| Maple V Release 4 (WMI Campus WIde License) ._|\| |/|_. Copyright (c) 1981-1996 by Waterloo Maple Inc. All rights \ MAPLE / reserved. Maple and Maple V are registered trademarks of <____ ____> Waterloo Maple Inc. | Type ? for help. > L:=psi->((2+psi)^y1)*((1-psi)^y2)*((1-psi)^y3)*psi^y4; y1 y2 y3 y4 psi -> L(psi) = (2 + psi) (1 - psi) (1 - psi) psi > ell:=x->log(L(x)); > ell(psi); y1 y2 y3 y4 ln((2 + psi) (1 - psi) (1 - psi) psi ) > expand("); y1 ln(2 + psi) + y2 ln(1 - psi) + y3 ln(1 - psi) + y4 ln(psi) > diff("",psi); y1 y2 y3 y4 ------- - ------- - ------- + --- 2 + psi 1 - psi 1 - psi psi > diff(",psi); y1 y2 y3 y4 - ---------- - ---------- - ---------- - ---- 2 2 2 2 (2 + psi) (1 - psi) (1 - psi) psi > Ifish(psi):=-"; y1 y2 y3 y4 Ifish(psi) := ---------- + ---------- + ---------- + ---- 2 2 2 2 (2 + psi) (1 - psi) (1 - psi) psi > lprime:=diff(ell(psi),psi); > sols:= solve(lprime=0,psi); 2 2 sols:=1/2 (y1 - y4 - 2 y3 - 2 y2 + (y1 + 6 y1 y4 - 4 y1 y3 - 4 y1 y2 + 9 y4 2 2 1/2 + 12 y3 y4 + 12 y2 y4 + 4 y3 + 8 y3 y2 + 4 y2 ) )/(y1 + y2 + y3 + y4), 2 2 1/2 (y1 - y4 - 2 y3 - 2 y2 - (y1 + 6 y1 y4 - 4 y1 y3 - 4 y1 y2 + 9 y4 2 2 1/2 + 12 y3 y4 + 12 y2 y4 + 4 y3 + 8 y3 y2 + 4 y2 ) )/(y1 + y2 + y3 + y4) |
> y1:=125;y2:=18;y3:=20;y4:=34;
> evalf(sols); .6268214980, -.5506793660 > y1:=1997;y2:=906;y3:=904;y4:=32; > plot(ell(psi),psi=0..0.5); + AA + AAA AAAAAAA + AAAA 1200 AAAA + AAA + AAA + AAA 1100 AAA + AAA + AAA + AAA + AA 1000 AA + AA + AA + AA 900 AAA + AA + AA + AA 800 AA + AA + AA + AA + AA 700 A + A + AA + AA 600 AA + A + AA +--+--+--+--+--+--+--+--+--+--+--+--+-+--+--+--+--+--+--+--+--+--+--+--+--* 0 ^ 0.1 0.2 0.3 0.4 0.5 | > evalf(sols); .0357123023, -.4668141517 > y1:='y1';y2:='y2';y3:='y3';y4:='y4'; |
The solutions are:


Now, suppose we didn't have the equation solver available and that we want to nd the solution using the EM algorithm.
Split the rst cell into two cells with respective counts y11 + y12 = y1, then the new likelihood is proportional to:


is:

L:=psi->((2^y11)*(psi)^y12)*((1-psi)^y2)*((1-psi)^y3)*psi^y4;
y11 y12 y_{2} y3 y4 L := psi -> 2 psi (1 - psi) (1 - psi) psi > ell(psi); y11 y12 y2 y3 y4 ln(2 psi (1 - psi) (1 - psi) psi ) > diff(",psi); > simplify("); -y12 + y12 psi + y2 psi + y3 psi - y4 + y4 psi ---------------------------------------------- psi (-1 + psi) > solve("=0,psi); y12 + y4 ------------------ y12 + y2 + y3 + y4 |

![(0)
Q(y; y ) = Ey(0)[logLc(y)|y]](t0595x.png)
)is a linear function of the unobservables y11 and y12, we replace them with their current
conditional expectations given the observed y.


The maximisation step doesn't need any more computations, as now we have values for the unobsevrables that we just plug into


Here is the little R function that I used to check the computations:
emalgo=function(y1,y2,y3,y4,tol=0.0005,start0=0.2)
{ #EM algorithm for Rao's multinomial model # #y1,y2,y_3,y_4 are the observed frequencies # #Initializations psilast=0; n=y1+y2+y3+y4; psicurrent=start0; psi=psicurrent; while (abs(psilast-psi)>tol ){ v=estep(psicurrent,y1) y11=v[1] y12=v[2] psi=mstep(y12,y11,y4,n) psilast=psicurrent psicurrent=psi cat('psicurrent=',psicurrent,'\n') }} mstep=function(y12,y11,y4,n){ return((y12+y4)/(n-y11))} estep=function(psicurrent,y1){ y11=(.5*y1)/(.5+psicurrent/4); y12=y1-y11 return(c(y11,y12))} |
Results:
tol=10^(-7);
start0=.05; emalgo(y1,y2,y3,y4,tol,start0) a=emalgo(y1,y2,y3,y4,tol=10^(-7),start0) psicurrent= 0.4936627 psicurrent= 0.6072183 psicurrent= 0.6241805 psicurrent= 0.6264701 psicurrent= 0.6267748 psicurrent= 0.6268153 psicurrent= 0.6268207 psicurrent= 0.6268214 psicurrent= 0.6268215 |
Suppose the density is of the form:

= (
1,
2,...,
g-1) y = (w1,...wn) is observed
from the mixture. The loglikelihood from the observed is:

) with regards to
i and then we have to nd the likelihood equations:

If we observed the z's, the mle of
i would be
i =
.
Take the new, complete data vector to be x = (y,z).

i, we ignore it.
The E-step:


Suppose the complete density is of the form:

.
![@loga(y)
Ey[t(X)] = ---------
@y](t05109x.png)
![Q(y, y(k)) = y'Ey(k)[t(X)|y]- loga(y)](t05110x.png)
and you nd that to maximise Q you must take:
![Ey[t(X)] = t(k) = Ey(k)[t(X) |y]](t05111x.png)
![@loga(y)@y = Ey(k)[t(X) |y]](t05112x.png)
As an application, you can consider any multinomial as an exponential family, (see Lehmann, Theory of Point Estimation, page 28), so that they all enter this case.
A particular example was the one I did the rst day where the new parameter space had a natural interpretation, that of the gene counting example, where the rst mutinomial is the observed one of the frequencies of the observed phenotypes. y = (nA,nB,nAB,nO), the parameter of interest is the probability of each of the genotypes A,B,O = (p,q,r), we can usethe augmented model: x = (nAA,nBA,nBB,nOO)
Here is a link to some very serious papers about tomography: http://www.stats.bris.ac.uk/pub/reports/TOMO.
And here is a link to a nice application of the EM to Bayesian self-organizing map simulations: http://www.aist.go.jp/NIBH/ b0616/Lab/BSOM1/
There is not necessarily a random component in the original problem that one wants to solve,usually a
problem for which there is no analytical solution. An unknown parameter (deterministic) is
expressed as a parameter of some random distribution, that is then simulated. The oldest
well-known example is that of the stimation of
by Bu on, in his needle on the oorboards
expeiment, where supposing a needle of the same length as the width between cracks we
have:


,


Which estimate is better?
This is similar to comparing statistical estimators in general. There are certain desirable properties
that we want estimators to have, consistency which ensures that as the sample size increases the
estimates converges to the right answer is ensured here by the properties of Riemann sums. Other
properties of interest are:
Both the above methods are unbiased, that is when repeated many times their average values are
centred in the actual value
.

More precisely, the variance of crude Monte Carlo is

) is:


Suppose we have two random variables that provide estimators for
, X and Y , that they have the
same variance but that they are negatively correlated, then
(X + Y ) will provide a better estimate for
because it's variance will be smaller.
This the idea in antithetic resampling (see Hall, 1989). Suppose we are interested with a real-valued
parameter, and that we have ordered our original samplex1 < x2
xn, for each resample
X* = {x
j1,xj2,...,xjn} and statistic s(X*) we associate X** by taking a special permutation of
the ji's that will make cov(X**),s(X*)) < 0, and as small as possible. If the statistic is a
smooth function of mean for instance, then the 'reversal permutation' that maps 1 into n,
2 into n - 1, etc... is the best, the small sample values are transformed into the larger
observations, and the average of these two estimates will give an estimate with smaller
variance.
This is often used in simulation, and is a method to work around the small area problem. If we want to evaluate tail probabilities, or very small areas, we may have very few hits of our random number generator in that area. However we can modify the random number generator, make that area more likely as long as we take that into account we we do the summation. Importance sampling is based on the equalities:

Try the following web site pages: http://www.maa.org/mathland/mathland_4_22.html http://stat.fsu.edu/~geo/diehard.html
And read chapter 7 of Numerical Recipes.
Motivation:
Random numbers are used:
1- To solve problems that are too complex (stochastic algorithms for NP complete problems)
2- For optimization (simulated annealing, genetic algorithms).
3- System Simulation (physics and biology, econometrics...)
4- Non-parametric tests, bootstrap, permutation tests.
5- Quadrature methods using Monte Carlo.
Solving a deterministic problem by rewriting it as equivalent to a problem with a stochastic component.
Little Example as a reminder of what Monte-Carlo is:


Most methods, even if they need non-uniform random variates, as in simulation, still use uniform random variables as their starting points.
So we will spend some time explaining the basics of random numbers.
The machine being deterministic, there is de nitely a problem, we will never get purely random numbers, they will only look so.
(We already saw that there are some pretty bad ones, when we did the multidimensional visualization of the IBM randu generated numbers with xgobi).
Also beware of srand, the basic unix one is bad.

The way most of the algorithms work is they use an initial seed and a simple recursion to feed out the next in sequence, which will also serve as a seed .. etc.
As soon as one hits a number that has already been used we will be having a sequence that was already seen. -Not so good-
Period of the sequence= minimal length of a subsequence that by repetition will form the whole sequence.
Period of the generator will be the largest period of any seed for a given generator.
We need to choose the parameters so as to have a maximal period.
Proposition:
The congruential generator has a period of m i :
mod 1(mod p) for any prime factor p of m
mod 1(mod 4) if 4 is a factor of m.Actual choices:
They have to take into account both theoretical and practical aspects:
Substract with Borrow
Instead of a single seed, the random number generator has a state: (35 words of memory), 32 fp numbers z between 0 and 1. The others make up an integer index : i, which varies between 1 and 32, and integer j and a 'borrow' ag.
j is used to initialize.

zi = fp(zi) xor j b is the borrow, small or zero. Complexity Justified Methods
Random number generation is akin to cryptography, knowing that factorising a number into its prime
factors is a very HARD problem (computationally) allows one to think up various `multiplicative
schemes', for instance:
and back to tables but on CDs now (new Numerical Recipes).
Luc Devroye, Non-Uniform Random Number Generation, Springer Verlag.
A very useful home page is Luc's with links to places where you can download random number
generator packgages etc... http://jeff.cs.mcgill.ca/~luc/ Web pages with good links to papers on
random number generation, and encryption packages:
http://www.cs.berkeley.edu/~daw/netscape-randomness.html
Based on the inverse CDF property :
X = F-1(U) ~ F where U is uniform:

This means that when we have a closed form formula for the inverse of the cdf which is not too expensive, we can use a random uniform and do the transformation that makes it have the correct density.
Example 1: Exponential

log(U)
Other distributions that have nice closed form expressions are the Cauchy, Rayleigh, Triangular,
Pareto..
Example 2: Correlated Random Variables
F-1(U) and G-1(U) have maximum positive correlation and have the distributions F and G, for
negative correlation just take G-1(-U).
Example 3: Maxima
(Fn)-1(U)
If closed forms are not available, but a good numerical solver is, then you can use that.
Theorem: Suppose f is a non zero density on [a,b] and zero outside. (a,b may be in nite). Suppose we know a function M(x) which majorizes f: M(x) > f(x) on [a,b] and take:

Algorithm:
This is why it works:

We need more than one uniform to generate one random variable with the right distribution.
How many are needed? This will depend on c =
abM(t)dt.
The number of necessary trials before accepting will be a geometric with p =
.
Example 1:
Suppose f is a bounded density on a compact support, f(x) < M take c = M(b - a),
Repeat the folowing:
u + (b - a)V Until UM < f(X)
Return X.
This is usually very ine cient, it is much better to take a bounding function that ` ts'.
Example 2:
Simulating a normal using an exponential as a bounding function: we use the fact that the absolute
value of a normal has density:

e-x majorizes f, take c =
, we can generate from the density m(x) easily because
we know how to generate from an exponential of rate 1.

If the probabilities are rational numbers and the variable discrete, or well approximated by a discrete distribution you can use table lookup, which is very expensive to set-up but can be useful if you a very a very large number of random variables to generate:
Algorithm:
, generate a table of length M with nk entries equal to k.
Examples : histogram data.
Box - Muller
Fact:
If X and Y are independent normals then their polar coordiantes R =
,
= arctan(Y/X)
are independent, R2 ~ expon(2),
~ U(0, 2
).
Modified to short-cut cosines and sines:
Generate a point uniformly in the disk(1): (V 1,V 2).
Call S = V 12 + V
22,
Return X =
V 1,Y =
V 2
See notes at:
http://www.ma.utexas.edu/documentation/nr/bookcpdf/c7-8.pdf (Chapter 7 of Numerical
Recipes)
or a nice tutorial here: http://ib.berkeley.edu/labs/slatkin/eriq/classes/guest_lect/mc_lecture_notes.pdf
We want to compute bias, variability, con dence intervals for estimates for which the theory doesn't provide closed form solutions.
Examples :
Suppose we are interested in the estimation of an unknown parameter
(this doesn't mean that we
are in a parametric context). The two main questions asked at that stage are :
should be used ?
The second question has to be answered through information on the distribution or at least the variance of the estimator.
Of course there are answers in very simple contexts: for instance when the parameter of interest is the
mean
then the estimator
has a known standard deviation : the estimated standard error noted
sometimes
![[ 2 2]
^s = (Xi - X) /n](t05146x.png)
In maximum likelihood theory the question 1 is answered through using the mle and then the question 2 can be answered with an approximate standard error of

The bootstrap is a more general way to answer question 2 , with the following aspects:
If we had several samples from the unknown (true) distribution F then we could consider the variations of the estimator :

n :this side is what is called
bootstrapping.
From an original sample

). The bootstrap sample is often
denoted

If we are interested in the behaviour of a random variable
=
(X,F), then we can consider the
sequence of B new values obtained through computation of B new bootstrap samples.
Practically speaking this will need generatation of an integer between 1 and n, each of these integers having the same probability through a function denoted by RANINT(1,n) hereof.
Here is an example of a line of matlab that does just that:
T= floor((rand(1)*n)+1);
|
If we use S we won't need to generate the new observations one by one, the following command generates a n-vector with replacement in the vector of indices (1...n).
sample(n,n,replace=T)
|
An approximation of the distribution of the estimate
=
(Xn,F) is provided by the distribution
of

denotes the bootstrap distribution of
*, often approximated by

An example data set is LS (or laws) : the law school data consisting in 15 bivariate data points, each corresponding to an American law school as analysed in Efron 1982:

In order to repeat a large number of times the same instructions needed to construct B data sets one usually has to use a looping facility, actually matlab and S are easier to program in this circumstance because one can use ordinary loops and use owner-created functions, as well as trying to bene t form the vectorized commands.
laws<-scan()
576 3.39 635 3.30 558 2.81 578 3.03 666 3.44 580 3.07 555 3.00 661 3.43 651 3.36 605 3.13 653 3.12 575 2.74 545 2.76 572 2.88 594 2.96 "boot.simple"<- function(data, fun, nboot, ...) { n <- nrow(as.matrix(data)) res.boot <- vector("list", length = nboot) indb <- matrix(sample(n, size = n * nboot, T), nrow = nboot) for(b in (1:nboot)) { res <- fun(indb[b, ], ...) plot(res, xlab = "", ylab = "", lab = c(0, 0, 0)) res.boot[[b]] <- res } return(res.boot) } "boot"<- function(ind, nboot, fun, ...) { # Tibshirani's efficient Bootstrap # If fun is function of a matrix or returns multiple values # it has to be transformed before data <- matrix(sample(ind, size = length(ind) * nboot, replace = T), nrow = nboot) return(apply(data, 1, fun, ...)) } #Elementary example of using boot corr<- function(x, xdata) { cor(xdata[x, 1], xdata[x, 2]) } j1000 <- boot(1:15, 1000, theta, laws) hist(j1000 - 0.777) q.bt <- quantile(junk1000, probs = c(0.01, 0.99)) #With greek label hist(junk1000 - 0.777, nclass = 40, xlab = "q* - q", font = 13) #Example of Bootstrapping a Mann-Whitney statistic "mw"<- function(x, y) { #Function that computes a Mann-Whitney statistic for two samples n <- length(x) m <- length(y) rks <- rank(c(x, y)) wx <- sum(rks[1:n]) ux <- n * m + (m * (m + 1))/2 - wx mwx <- 1 - ux/(m * n) return(mwx) } "mwbt"<- function(x, y, nboot = 100, fun = mw, ...) { bts <- rep(0, nboot) for(b in (1:nboot)) { xbts <- sample(x, length(x), replace = T) ybts <- sample(y, length(y), replace = T) bts[b] <- fun(xbts, ybts, ...) } return(bts) } "mwxy"<- function(xy, n) { # n <- length(x) m <- length(xy) - n rks <- rank(xy) wx <- sum(rks[1:n]) ux <- n * m + (m * (m + 1))/2 - wx mwx <- 1 - ux/(m * n) return(mwx) } |
If one has a parametric distribution F
which is well de ned except for the pamameter
, instead of
drawing with replacement from the original sample, one can draw from the distribution F
, where
is
the estimation obtained from the original sample.
#Parametric Bootstrap
"boot.par"<- function(model.sim = law.sim, n = 15, nboot = 66) { #Parametric bootstrap of correlation coefficient #Law sim contains bivariate normal data theta <- rep(0, nboot) for(b in (1:nboot)) theta[b] <- corr(((b - 1) * n + 1):(b * n), law.sim) return(theta) } #Example of using the functions and plotting the results n1 <- rnorm(1000) n2 <- rnorm(1000) law.sim <- cbind(n1, 0.777 * n1 + sqrt(1 - (0.777^2)) * n2) rbootpar <- boot.par() hist(rbootpar) par(new=T) plot(density(rbootpar, width = 0.15), xlab = "", ylab = "", lab = c(0, 0, 0),type = "l") title("Parametric Bootstrap - Law School corr.", cex = 1.9) |
An intermediate solution between parametric and nonparametric bootsrapping is the smoothed bootstrap.
Instead of resampling directly from the empirical distribution Fn, one smoothes it out rst then the smoothed version is used to generate the new samples.
If a kernel smoother is used, for instance a Gaussian kernel, then generation from the smoothed
distribution
n is easy since it is su cient to resample with replacement and then perturb the
sampled point by adding a small gaussian random variable (Silverman B.W. & Young
G.A.)
#Smoothed Bootstrap
"boots"=function(x,nboot=100,fun,kernel="norm",h=1) { #Function doing a smoothed bootstrap with either # a normal or an epaneichikov kernel #h is the window width (can be tricky to choose) n=nrow(as.matrix(x)) theta=rep(0,nboot) for (k in (1:nboot)) { if(kernel=="norm") eps=rnorm(n,mean=0,sd=1) else eps=epan(n) xb=x[sample(n,n,T)]+eps*h theta[k]=fun(xb) } return(theta) } "epan"=function(n) { eps=rep(0,n) v=matrix(runif(3*n,-1,1),nrow=n) for(i in(1:n)) { if ((abs(v[i,3])>=abs(v[i,2])) && (abs(v[i,3])>=abs(v[i,1]))) eps[i]=v[i,2] else eps[i]=v[i,3] } return(eps) } |

(Fn) -
(F), (where Fn denotes the empirical distribution
function) and apply the bootstrap algorithm to its expectation then we obtain the estimate
:

*• =
b=1B
*b/B
![1
{ sum B } 2
SD = --1--- [h*b- h*•]2
B - 1 b=1](t05168x.png)
Confidence Intervals : The percentile method As before we denote by Gn*(t) = P
Fn
the
bootstrap distribution of
*, approximated by

The percentile method consists in taking the 1 - 2
con dence interval for
as being
![[ *-1 *- 1 ]
G n (a),G n (1 - a)](t05172x.png)
< x)
by the estimate G(x,Fn).
:
of the confidence interval
b* =
(X
b)
b* =
(X
b) into a new data set THETA
and (1 -
)’th quantiles for THETA.
As can be predicted, the bigger B, the number of bootstrap resamples, the better the approximations are, thus much recent work has been devoted to making the most of the computations :
Preliminaries : Pivotal quantities
Construction of con dence intervals is based (ideally) on a pivotal quantity Rn, that is a function of
the sample and the parameter
whose distribution is independant of the parameter
,
the sample, or any other unknown parameter. Thus for instance no knowledge is needed
about the parameters
and
2 of a normal variable X ~ N(
,
2) to construct con dence
intervals for both of these parameters, because one has two pivotal quantities available
:


Let cn(
) denote the largest (1 -
)th quantile of the distribution of the pivot Rn, and
the
parameter space, set of all possible values for
then

. Thus if t(1-
/2), 
/22 and
1-
/22 denote the relevant
quantiles for the distributions cited above the corresponding 1 -
con dence intervals will
be
![[ V~ - V~ - ]
x - t(1-a/2)s/ (n),x + t(1-a/2)s/ (n)](t05179x.png)
![[ 2 2 2 2 ]
(n - 1)s/x 1-a/2,(n - 1)s /xa/2](t05180x.png)
Such an ideal situation is of course rare and one has to do with approximate pivots, (sometimes called
roots). Therefore errors will be made on the level (1 -
) of the con dence sets. Level error is sensitive
to how far Rn is from being a pivot.
We will present several approaches for nding roots One approach is to studentize the root that is divide it by an estimate of its standard deviation. Another is to use a cdf transform constructed from a rst set of bootstrap simulations as explained below.
Suppose that we are interested in a parameter
, and that
is an estimate based on the n-sample
Xn.
We will suppose that the asymptotic variance of
is
2/n with a corresponding estimate
2/n.
The studentized bootstrap or bootstrap-t method (Efron 1982, Hall 1988) is based on
approximating the distribution function HF (x) of n
(
-
)/
by HFn(x), distribution of
n
(
-
)/
*.
Denote by cn,
* the
th quantile of the distribution of n
(
-
),

is :
![[ ]
h- n -12cn,(1-a)s,h - n -12cn,as](t05195x.png)
* is to use a second level bootstrap, this is rather expensive
compuation-wise : one could prefer as an intermediate solution a jackknife estimate.
: THETA(0)
and (1 -
)’th quantiles for the sample formed by all the
R’s.
bootsd =function(x, theta, nbootsd = 25)
{ n <- length(x) boo <- boot(x, fun = theta, nboot = nbootsd) return(sqrt(((n - 1)/n) * var(boo))) } bootst= function(x, fun = median, nboot = 100, nbootsd = 25) { # Studentized Bootstrap boo <- rep(0, nboot) for(b in (1:nboot)) { boo[b] <- bootsd(x, theta = fun, nbootsd = nbootsd) } thetab <- boot(x, fun = fun, nboo = nboot) theta0 <- fun(x) thetasd <- ((thetab - theta0)/boo) return(thetasd) } |
One of the drawbacks of the above-de ned method is that the variance estimate can tend to be quite
erratic. In the particular case of bootstrapping a correlation coe cient
one sometimes ends up with
con dence intervals that contain [-1,1], which is rather disappointing.
However experiences have shown that preliminary transformation of
such as Fishers z
transform (=arctanh
), followed by the Bootstrap-t and then retransformation of the
con dence interval back to the original scale leads to much more dependable con dence
intervals. Unfortunately such a handy transform is not always available but R. Tibshirani,
1988 has suggested an algorithm for creating automatically de ned variance stabilizing
transformations.
Based on the inverse CDF property :
i.e. that if X is distributed according to F then F-1(X) is uniform.
Denote by Hn = Hn(.,F) the c.d.f. of our root Rn(
) which is supposed to converge weakly to the
c.d.f. H = H(.,F) as n increases.
As we have seen the idea behind the bootstrap is to replace the unknown distribution function F by
its empirical counterpart Fn, thus using
n =
n(.,Fn) as an estimate for the unknown
cumulative distribution. Once validation of the procedure has been insured through verifying
that


), thus
providing a better pivot.
:
)’th quantile of the Z’s
Davison, Hinkley and Schechtmann, 1986 introduced the idea that one could reduce the amount of simulations (=B) necessary to attain a given precision by using each of the sample observations exactly equally often.
The algorithm proposed by the authors was however rather expensive timewise (
nB log nB). We
have a more economical one implemented proposed by Gleason .
:
The idea comes from current monte-carlo techniques, and is detailed in unpublished work by Hall. Like other enhancements presented here, its objective is to reduce the estimate's variance.
Let
1 and
2 be two di erent estimates of an unknown parameter
, if
1 and
2 can be
chosen so as to be negatively correlated, then
=
(
1 +
2) has half the variance of
1 and
2.
The bootstrap version of this is based on an antithetic permutation of the sample. B ordinary resampling operations then supplies the equivalent of 2B resampling operations.
(from Daniel W.W., p.216) :
Researchers wished to know if they could conclude that two populations of infants di er with respect
to the mean age at which they walked alone. The following data (ages in months) were
collected:
Sample from population A:
9.5, 10.5, 9.0, 9.75, 10.0, 13.0, 10.0, 13.5, 10.0, 9.5, 10.0, 9.75
Sample from population B:
12.5, 9.5, 13.5, 13.75, 12.5, 9.5, 12.0, 13.5, 12.0, 12.0
What should we conclude ? (
= 0.05)
The classical approach : a t-test
The software output from script in the appendix shows :
The mean for population A is : 10.375
The mean for population B is : 12.208
The di erence is : 1.833
As the variances can be considered as equal, a classical test under the normal assumption gives a
t-value of -3.1561 indicating a signi cant di erence because the P-value is 0.0046.
Assumptions :
A permutation test approach Based on the fact that if there is no di erence between the two
populations then the result will be compatible to allocation at random of each observation to one of
two groups (shu ing). Any statistical measure of the di erence could be used here : mean di erence,
studentised (t) di erence, or even a median di erence.
A bootstrap approach Following the previous de nitions of the bootstrap a possible bootstrap test could run as follows:
Approximate randomisation tests are frequently used to test independance of two sets of variables (which is the null hypothesis). Estimation of the distribution under this hypothesis is obtained by shu ing one variable (or set of variables) with respect to the other variable(s). (see Manly, 1990, for an easy overview).
However it is important to note that there is no assumption on the fact that the observations form a random sample from some population. There is no inference that can be carried automatically over to a hypothetic population.
On the contrary the assumption underlying the bootstrap is that the observations form a random sample whose distribution approximates that of the population. Thus the conclusions drawn from a bootstrap test can be used to infer on the population.
Noreen 1989 suggests using a combination of these two tests when inference is needed for testing whether two variables are stochastically independant in the population. However one must keep in mind that at the same time one tests whether the marginal distributions of the variables in the sample satisfactorially estimate those of the variables in the population.
Confidence Intervals for the expectation of an exponential distribution
For B large enough, P provides the probability that
is in the bootstrap con dence interval.
In order for the experience to be really useful, one ought to retry with di erent sample sizes and di erent expectations, one could also retry with all sorts of di erent distributions.
We call the unknown parameter
1 and our estimate
. Suppose that we had an ideal (unrealistic)
situation in which we knew the distribution of
-
1 , we will be interested especially in its quantiles
: denote the
quantile by
and the 1 -
quantile by
. By de nition we have: P(
-
1 <
) =
P(
-
1 <
) = 1 -
It is by writing this that you can build the correct intervals.
Now I am going to compare an answer to an estimation probalem both through maximum likelihood and through bootstrapping. First let us remind ourselves of notations:
X1,X2,X3,...Xn have joint density denoted

Given observed values X1 = x1,X2 = x2,...,Xn = xn, the liklihood of
is the function

.
If the distribution is discrete, f will be the frequency distribution function.
In words: lik(
)=probability of observing the given data as a function of
. De nition:
The maximum likelihood estimate (mle) of
is that value of
that maximises lik(
): it is the value that
makes the observed data the \most probable".
If the Xi are iid, then the likelihood simpli es to

Rather than maximising this product which can be quite tedious, we often use the fact that the logarithm is an increasing function so it will be equivalent to maximise the log likelihood:

X1,X2,...,Xm are counts in cells/ boxes 1 up to m, each box has a di erent probability
(think of the boxes being bigger or smaller) and we x the number of balls that fall to
be n:X1 + X2 +
+ Xm = n. The probability of each box is pi, with also a constraint:
p1 + p2 +
+ pm = 1, this is a case in which the Xi's are NOT independent, the joint probability of a
vector x1,x2,...,xm is called the multinomial ,and has the form:
Each box taken separately against all the other boxes is a binomial, this is an extension thereof.
We study the log-likelihood of this :

We use

This is a trinomial with three boxes: the probabilities are parametrized by:



Each of the counts is binomially distributed with probabilities as described above so that:

We can either use asymptotic maximum likelihood or the bootstrap, we are going to compare the two using Splus.
After some algebra we nd:
!['' ^ ----2n----
E[l (h )] = - ^ ^
(1 - h )h](t05243x.png)
Of course as usual we don't know
we have to plug in the estimate:

=
= .4247, and the MLE estimate
for the standard deviation is: s
= .011, we will now compare to what a parametric bootstrap
estimate gives:
The bootstrap procedure we are going to follow generates multinomials with the estimated probabilites and then computes for these new counts the bootstrap estimate for which we then compute the standard deviation.
Here is the R code I used:
multin = function(pvec, num)
{ #Function that generates the output of #putting num balls into length(pvec) boxes, #with probabilities each given by the vector pvec run <- runif(num) gener <- cut(run, cumsum(c(0, pvec))) return(gener) } > theta=function(xvec) { return((2 * xvec[3] + xvec[2])/(2 * sum(xvec))) } bts.ml.mw=function(thetahat = 0.4247, nboot = 1000) { res.bt <- rep(0, nboot) pv <- c((1 - thetahat)^2, 2 * thetahat * (1 - thetahat), thetahat^2) for(b in (1:nboot)) { xvec <- multin(pv, 1029) res.bt[b] <- theta(xvec) } return(res.bt) } bt1 <- bts.ml.mw() quantile(bt1, c(0.025, 0.975)) 2.5% 97.5% 0.403292 0.4446186 quantile(bt1 - 0.4247, c(0.025, 0.975)) 2.5% 97.5% -0.02140797 0.01991856 > sqrt( var(bt1)) [1] 0.01074981 |
CRAN
Numerical Recipes http://www.library.cornell.edu/nr/cbookcpdf.html
Anuj Srivastava's Notes: http://calais.stat.fsu.edu/anuj/PDF- les/STA5107-s-05/BOOK.pdf
Julian Faraways Anova and Regression with R: http://www.stat.lsa.umich.edu/~faraway/book/
http://www.quantlet.com/mdstat/scripts/csa/html/node45.html
http://www.neurosci.aist.go.jp/ akaho/MixtureEM.html
http://www.riskglossary.com/articles/monte_carlo_method.htm
MCMC notes:
The Gibbs Sampler Motif Page
Gibbs sampler for a Normal
Peter Green's Reversible Jump
Bootstrap Course:
Susan Holmes Stat 208 course
Density Estimation:
http://www.stat.berkeley.edu/~stark/Teach/S240-s03/Notes/ch8.pdf
Smoothing:
John Fox introduction to nonparametric regression
MAchine Learning Data Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html