Examples :
Suppose we are interested in the estimation of an unknown
parameter
(this doesn't mean that we are in a parametric context).
The two main questions asked at that stage are :
The second question has to be answered through information on the distribution or at least the variance of the estimator.
Of course there are answers in very simple contexts:
for instance when the parameter of interest is
the mean
then the estimator
has a known standard
deviation : the estimated standard error
noted sometimes
In maximum likelihood theory the question
1 is answered through using the mle and then the
question 2 can be answered with an approximate
standard error of
The bootstrap is a more general way to answer question 2 , with the following aspects:
If we had several samples from the unknown
(true) distribution
then we could consider the variations
of the estimator :
,
and that is an empirical
If we are interested in the
behaviour of a random variable
,
then we can consider the sequence
of
new values
obtained through
computation of
new bootstrap samples.
Practically speaking this will need generatation of an integer between 1 and n, each of these integers having the same probability through a function denoted by RANINT(1,n) hereof.
Here is an example of a line of matlab that does just that:
T= floor((rand(1)*n)+1);
If we use S we won't need to generate the new observations one by one, the following command generates a n-vector with replacement in the vector of indices (1...n).
sample(n,n,replace=T)
An approximation of the distribution of the
estimate
is provided by the distribution
of
An example data set is LS (or laws) : the law
school data consisting in 15 bivariate data points,
each corresponding to an American law school
as analysed in EFRON 1982:
In order to repeat a large number of
times the same instructions needed to
construct
data sets one usually has to use
a looping facility, actually matlab and
S are easier to program in
this circumstance because one
can use ordinary loops and
use owner-created functions, as well as trying to
benefit form the vectorized commands.
laws<-scan()
576 3.39
635 3.30
558 2.81
578 3.03
666 3.44
580 3.07
555 3.00
661 3.43
651 3.36
605 3.13
653 3.12
575 2.74
545 2.76
572 2.88
594 2.96
"boot.simple"<-
function(data, fun, nboot, ...)
{
n <- nrow(as.matrix(data))
res.boot <- vector("list", length = nboot)
indb <- matrix(sample(n, size = n * nboot, T), nrow = nboot)
for(b in (1:nboot)) {
res <- fun(indb[b, ], ...)
plot(res, xlab = "", ylab = "", lab = c(0, 0, 0))
res.boot[[b]] <- res
}
return(res.boot)
}
"boot"<-
function(ind, nboot, fun, ...)
{
# Tibshirani's efficient Bootstrap
# If fun is function of a matrix or returns multiple values
# it has to be transformed before
data <- matrix(sample(ind, size = length(ind) * nboot, replace = T),
nrow = nboot)
return(apply(data, 1, fun, ...))
}
#Elementary example of using boot
corr<- function(x, xdata)
{
cor(xdata[x, 1], xdata[x, 2])
}
j1000 <- boot(1:15, 1000, theta, laws)
hist(j1000 - 0.777)
q.bt <- quantile(junk1000, probs = c(0.01, 0.99))
#With greek label
hist(junk1000 - 0.777, nclass = 40, xlab = "q* - q", font = 13)
#Example of Bootstrapping a Mann-Whitney statistic
"mw"<-
function(x, y)
{
#Function that computes a Mann-Whitney statistic for two samples
n <- length(x)
m <- length(y)
rks <- rank(c(x, y))
wx <- sum(rks[1:n])
ux <- n * m + (m * (m + 1))/2 - wx
mwx <- 1 - ux/(m * n)
return(mwx)
}
"mwbt"<-
function(x, y, nboot = 100, fun = mw, ...)
{
bts <- rep(0, nboot)
for(b in (1:nboot)) {
xbts <- sample(x, length(x), replace = T)
ybts <- sample(y, length(y), replace = T)
bts[b] <- fun(xbts, ybts, ...)
}
return(bts)
}
"mwxy"<-
function(xy, n)
{
# n <- length(x)
m <- length(xy) - n
rks <- rank(xy)
wx <- sum(rks[1:n])
ux <- n * m + (m * (m + 1))/2 - wx
mwx <- 1 - ux/(m * n)
return(mwx)
}
#Parametric Bootstrap
"boot.par"<-
function(model.sim = law.sim, n = 15, nboot = 66)
{
#Parametric bootstrap of correlation coefficient
#Law sim contains bivariate normal data
theta <- rep(0, nboot)
for(b in (1:nboot))
theta[b] <- corr(((b - 1) * n + 1):(b * n), law.sim)
return(theta)
}
#Example of using the functions and plotting the results
n1 <- rnorm(1000)
n2 <- rnorm(1000)
law.sim <- cbind(n1, 0.777 * n1 + sqrt(1 - (0.777^2)) * n2)
rbootpar <- boot.par()
hist(rbootpar)
par(new=T)
plot(density(rbootpar, width = 0.15), xlab = "", ylab = "",
lab = c(0, 0, 0),type = "l")
title("Parametric Bootstrap - Law School corr.", cex = 1.9)
Instead of resampling directly from
the empirical distribution
,
one smoothes it out first then the
smoothed version
is used
to generate the new samples.
If a kernel smoother is used, for instance
a Gaussian kernel, then generation from
the smoothed distribution
is easy since it is sufficient to resample
with replacement and then perturb the sampled
point by adding a small gaussian random variable
(SILVERMAN B.W. & YOUNG G.A.)
#Smoothed Bootstrap
"boots"_function(x,nboot=100,fun,kernel="norm",h=1)
{
#Function doing a smoothed bootstrap with either
# a normal or an epaneichikov kernel
#h is the window width (can be tricky to choose)
n_nrow(as.matrix(x))
theta_rep(0,nboot)
for (k in (1:nboot))
{
if(kernel=="norm") eps_rnorm(n,mean=0,sd=1)
else eps_epan(n)
xb_x[sample(n,n,T)]+eps*h
theta[k]_fun(xb)
}
return(theta)
}
"epan"_function(n)
{
eps_rep(0,n)
v_matrix(runif(3*n,-1,1),nrow=n)
for(i in(1:n))
{
if ((abs(v[i,3])>=abs(v[i,2])) &&
(abs(v[i,3])>=abs(v[i,1])))
eps[i]_v[i,2]
else eps[i]_v[i,3]
}
return(eps)
}
, (where
The percentile method consists
in taking the
confidence interval
for
as being
Preliminaries : Pivotal quantities
Construction of confidence intervals
is based (ideally) on a pivotal quantity
, that is a function of the sample
and the parameter
whose distribution
is independant of the parameter
, the sample, or any other unknown
parameter.
Thus for instance no knowledge
is needed about the parameters
and
of a normal variable
to construct
confidence intervals for
both of these parameters,
because one has two pivotal quantities available :
Let
denote the largest
quantile of the distribution of the
pivot
, and
the parameter space, set of all possible values for
then
Such an ideal situation is of course rare
and one has to do with approximate pivots,
(sometimes called roots).
Therefore errors will be made on
the level (
)
of the confidence sets.
Level error is sensitive to how far
is from being a pivot.
We will present several approaches for finding roots One approach is to studentize the root that is divide it by an estimate of its standard deviation. Another is to use a cdf transform constructed from a first set of bootstrap simulations as explained in section 2.3.
Suppose that we are interested in a parameter
, and that
is an estimate
based on the n-sample
.
We will suppose that the asymptotic variance of
is
with a corresponding estimate
.
The studentized bootstrap or bootstrap-t method
(EFRON 1982, HALL 1988)
is based on approximating the distribution function
of
by
, distribution
of
.
Denote by
the
th quantile
of the distribution of
,
"bootsd" <-
function(x, theta, nbootsd = 25)
{
n <- length(x)
boo <- boot(x, fun = theta, nboot = nbootsd)
return(sqrt(((n - 1)/n) * var(boo)))
}
"bootst" <- function(x, fun = median, nboot = 100, nbootsd = 25)
{
# Studentized Bootstrap
boo <- rep(0, nboot)
for(b in (1:nboot)) {
boo[b] <- bootsd(x, theta = fun, nbootsd = nbootsd)
}
thetab <- boot(x, fun = fun, nboo = nboot)
theta0 <- fun(x)
thetasd <- ((thetab - theta0)/boo)
return(thetasd)
}
However experiences have shown that
preliminary transformation of
such
as Fishers
transform (=arctanh
),
followed by the Bootstrap-t and then retransformation
of the confidence interval back to the original
scale leads to much more dependable
confidence intervals. Unfortunately such a handy
transform is not always available but
R. TIBSHIRANI, 1988
has suggested an algorithm for
creating automatically defined variance stabilizing
transformations.
Based on the inverse CDF property :
i.e. that if
is distributed according
to
then
is uniform.
Denote by
the c.d.f. of our root
which is supposed to
converge weakly
to the c.d.f.
as
increases.
As we have seen the idea behind the
bootstrap is to replace the unknown
distribution function
by its empirical
counterpart
, thus using
as
an estimate for the unknown
cumulative distribution.
Once validation of the procedure has been insured through
verifying that
than is that of
The algorithm proposed by the authors was
however rather expensive timewise
(
).
We have a more economical one implemented
proposed by Gleason .
:
Let
and
be two
different estimates of an unknown
parameter
, if
and
can
be chosen so as to be negatively correlated,
then
has half the variance of
and
.
The bootstrap version of this is based on an antithetic permutation of the sample. B ordinary resampling operations then supplies the equivalent of 2B resampling operations.
The software output from script in the appendix shows :
The mean for population A is : 10.375
The mean for population B is : 12.208
The difference is : 1.833
As the variances can be considered as equal,
a classical test under the normal assumption gives
a t-value of -3.1561 indicating a significant difference
because the P-value is 0.0046.
Assumptions :
A bootstrap approach Following the previous definitions of the bootstrap a possible bootstrap test could run as follows:
However it is important to note that there is no assumption on the fact that the observations form a random sample from some population. There is no inference that can be carried automatically over to a hypothetic population.
On the contrary the assumption underlying the bootstrap is that the observations form a random sample whose distribution approximates that of the population. Thus the conclusions drawn from a bootstrap test can be used to infer on the population.
NOREEN 1989 suggests using a combination of these two tests when inference is needed for testing whether two variables are stochastically independant in the population. However one must keep in mind that at the same time one tests whether the marginal distributions of the variables in the sample satisfactorially estimate those of the variables in the population.
For B large enough,
provides the probability that
is in the bootstrap confidence interval.
In order for the experience to be really useful, one ought to retry with different sample sizes and different expectations, one could also retry with all sorts of different distributions.
It is by writing this that you can build the correct intervals.
Now I am going to compare an answer to an estimation probalem both through maximum likelihood and through bootstrapping. First let us remind ourselves of notations:
have joint density
denoted
Given observed values
, the liklihood
of is the function
If the distribution is discrete,
will be the frequency distribution
function.
In words:
Definition:
The maximum likelihood estimate (mle)
of is that value of that maximises
: it is the value that makes
the observed data the ``most probable''.
If the
are iid, then the likelihood simplifies to
Rather than maximising this product which can be
quite tedious, we often use the fact that
the logarithm is an increasing function so it
will be equivalent to maximise the log likelihood:
are counts in cells/ boxes 1 up to m,
each box has a different probability (think of the boxes being
bigger or smaller) and we fix the number of balls that fall to
be
:
.
The probability of each box is
, with also a constraint:
, this is a case in which
the
are NOT independent, the joint probability of a vector
is called the multinomial ,and has the form:
Each box taken separately against all the other boxes is a binomial, this is an extension thereof.
We study the log-likelihood of this :
We use
This is a trinomial with three boxes:
the probabilities are parametrized by:
Each of the counts is binomially distributed with probabilities as described above so that:
We can either use asymptotic maximum likelihood or the bootstrap, we are going to compare the two using Splus.
After some algebra we find:
Of course as usual we don't know
we have to plug in the estimate:
The bootstrap procedure we are going to follow generates multinomials with the estimated probabilites and then computes for these new counts the bootstrap estimate for which we then compute the standard deviation.
Here is the Splus code I used:
multin <-
function(pvec, num)
{
#Function that generates the output of
#putting num balls into length(pvec) boxes,
#with probabilities each given by the vector pvec
run <- runif(num)
gener <- cut(run, cumsum(c(0, pvec)))
return(gener)
}
> theta<-
function(xvec)
{
return((2 * xvec[3] + xvec[2])/(2 * sum(xvec)))
}
bts.ml.mw<-
function(thetahat = 0.4247, nboot = 1000)
{
res.bt <- rep(0, nboot)
pv <- c((1 - thetahat)^2, 2 * thetahat * (1 - thetahat), thetahat^2)
for(b in (1:nboot)) {
xvec <- multin(pv, 1029)
res.bt[b] <- theta(xvec)
}
return(res.bt)
}
bt1 <- bts.ml.mw()
quantile(bt1, c(0.025, 0.975))
2.5% 97.5%
0.403292 0.4446186
quantile(bt1 - 0.4247, c(0.025, 0.975))
2.5% 97.5%
-0.02140797 0.01991856
> sqrt( var(bt1))
[1] 0.01074981
EVAL Execute string with MATLAB expression.
EVAL(s), where s is a string, causes MATLAB to execute
the string as an expression or statement.
EVAL(s1,s2) provides the ability to catch errors. It
executes string s1 and returns if the operation was
successful. If the operation generates an error,
string s2 is evaluated before returning. Think of this
as EVAL('try','catch'). The error string produced by the
failed 'try' can be obtained with LASTERR.
[X,Y,Z,...] = EVAL(s) returns output arguments from the
expression in string s.
The input strings to EVAL are often created by
concatenating substrings and variables inside square
brackets. For example:
Generate a sequence of matrices named M1 through M12:
for n = 1:12
eval(['M' num2str(n) ' = magic(n)'])
end
Run a selected M-file script. The strings making up
the rows of matrix D must all have the same length.
D = ['odedemo '
'quaddemo'
'fitdemo '];
n = input('Select a demo number: ');
eval(D(n,:))
See also FEVAL, EVALIN, ASSIGNIN, LASTERR.
%BOOTSTRP Bootstrap statistics.
% BOOTSTRP(NBOOT,BOOTFUN,D1,...) draws NBOOT bootstrap data samples and
% analyzes them using the function, BOOTFUN. NBOOT must be a
% positive integer.
% BOOTSTRAP passes the (data) D1, D2, etc. to BOOTFUN.
%
% [BOOTSTAT,BOOTSAM] = BOOTSTRP(...) Each row of BOOTSTAT contains
% the results of BOOTFUN on one bootstrap sample. If BOOTFUN returns a matrix
% then this output is converted to a long vector for storage in BOOTSTAT.
% BOOTSAM is a matrix of indices into the row
% Initialize matrix to identify scalar arguments to bootfun.
scalard = zeros(nargin-2,1);
lp = '('; % Left parenthesis string variable.
rp = ')'; % Right parenthesis string variable.
c = ','; % Comma string variable.
ds = 'd'; % 'd' as a string variable.
% Construct argument list for bootfun
for k = 3:nargin
dk = [ds,num2str(k-2)];
[row,col] = size(eval(dk));
if max(row,col) == 1
scalard(k-2) = 1;
end
if row == 1 & scalard(k-2) == 0
eval([dk,'=',dk,'(:);']);
row = col;
end
if k == 3
pstring = [lp,dk];
n = row;
if nargin == 3
pstring = [pstring,rp];
end
elseif k == nargin & nargin > 3
pstring = [pstring,c,dk,rp];
else
pstring = [pstring,c,dk];
end
end
% Create index matrix of bootstrap samples.
bootsam = unidrnd(n,n,nboot);
% Get result of bootfun on actual data and find its size.
thetafit = eval([bootfun,pstring]);
[ntheta ptheta] = size(thetafit);
% Initialize a matrix to contain the results of all the
%bootstrap calculations.
bootstat = zeros(nboot,ntheta*ptheta);
dbs = 'db';
% Do bootfun - nboot times.
for bootiter = 1:nboot
for k = 3:nargin
dk = [ds,num2str(k-2)];
dbk = [dbs,num2str(k-2)];
if scalard(k-2) == 0
eval([dbk,'=',dk,'(bootsam(:,',num2str(bootiter),'),:);']);
else
eval([dbk,'=',dk,';']);
end
if k == 3
pstring = [lp,dbk];
n = row;
if nargin == 3
pstring = [pstring,rp];
end
elseif k == nargin & nargin > 3
pstring = [pstring,c,dbk,rp];
else
pstring = [pstring,c,dbk];
end
end
evalstr = [bootfun,pstring];
tmp = eval(evalstr);
bootstat(bootiter,:) = (tmp(:))';
end
Suppose we wanted to estimate the overall risk of a CART tree through the bootstrap:
We will call
the rule built from
the sample
,
is the response
of size
and
the explanatory variable, we call
the
resubstitution estimate
The true risk for
is:
The difference
Now we are going to show that under certian conditions the bootstrap is going to behave badly.
Suppose that this data is not well adapted to
a prediction of
by
and that the variables
are independent.
Say we are trying to classify the observations, that is
that the repsonse variable is a categorical variable,
with say,
clases, the prior probabilities
for each class are unknown so taken to be
Every classification rule will in fact have risk
.
Suppose we construct a tree that goes all the way down to
individual leaves, that is the rule will give exact
classification for an observation it has already seen
and will classify correctly an observation it
is seeing for the first time with probability
.
Thus
and
,
the expected value of the bias
computed for the empirical distribution
is
( any member
of the original sample does not appear at all in the sample), thus
Several papers by Freedman and Peters, (Jasa 1984, Journ Bus. Econ. Stat. 1985) document the fact that the bootstrap does not give the correct answer for multivariate regression situations where the number of variables is of a similar order as the number of observations.
These two examples seem to suggest that much care must be used in applying the bootstrap in nonparametric situations!
zea_scan()
1: 188 96 168 176 153 172 177 163 146 173 186 168 177 184 96
16: 139 163 160 160 147 149 149 122 132 144 130 144 102 124 144
zea_matrix(zea,ncol=2)
zea
[,1] [,2]
[1,] 188 139
[2,] 96 163
[3,] 168 160
[4,] 176 160
[5,] 153 147
[6,] 172 149
[7,] 177 149
[8,] 163 122
[9,] 146 132
[10,] 173 144
[11,] 186 130
[12,] 168 144
[13,] 177 102
[14,] 184 124
[15,] 96 144
diff_zea[,2]-zea[,1]
> diff
[1] -49 67 -8 -16 -6 -23 -28 -41 -14 -29 -56 -24 -75 -60 48
> rank(abs(diff))
[1] 11 14 2 4 1 5 7 9 3 8 12 6 15 13 10
sign(diff)
[1] -1 1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 1
> sign(diff)*rank(abs(diff))
[1] -11 14 -2 -4 -1 -5 -7 -9 -3 -8 -12 -6 -15 -13 10
W_sum(wvec[wvec>0])
> W
[1] 24
args(wilcox.test)
function(x, y = NULL, alternative = "two.sided", mu = 0,
paired =F, exact =T,correct=T)
sum(wvec[wvec<0])
[1] -96
> wilcox.test(zea[,1],zea[,2],paired=T)
Exact Wilcoxon signed-rank test
data: zea[, 1] and zea[, 2]
signed-rank statistic V = 96, n = 15, p-value = 0.0413
alternative hypothesis: true mu is not equal to 0
> wilcox.test(zea[,1],zea[,2],paired=T,exact=T,alternative="greater")
Exact Wilcoxon signed-rank test
data: zea[, 1] and zea[, 2]
signed-rank statistic V = 96, n = 15, p-value = 0.0206
alternative hypothesis: true mu is greater than 0
> 15*16*31/24
[1] 310
> sqrt(310)
[1] 17.60682
(96-120)/17.6
[1] -1.363636
2. The original Gray code, Fisher s test
A. The original Gray code.
Let
be the set of binary
-tuples. This may be identified with
the vertices of the usual
-cube or with the set of all subsets of an
element set. The original Gray code gives an ordered list of
with
the property that successive values differ only in a single place. For
example, when
such a list is
Gray codes were invented by F. Gray (1939) for sending sequences of bits using a frequency transmitting device. If the ordinary integer indexing of the bit sequence is used then a small change in reception, between 15 and 16, for instance, has a large impact on the bit string understood. Gray codes enable a coding that minimizes the effect of such an error. A careful description and literature review can be found in Wilf (1989). One crucial feature: there are non-recursive algorithms for providing the successor to a vector in the sequence in a simple way. This is implemented through keeping track of the divisibility by 2 and of the step number.
One way to express this is as follows : let
be the
binary representation of the integer
, and let
be the string of rank
in the Gray code list.
Then
and
.
For example, when
, the list above shows the string of
rank
is
; now
.
So
. Thus from a given string in the Gray code and the rank one can
compute the successor.
There is a parsimonious implementation of this in the algorithm given
in the appendix. Proofs of these results can be found in Wilf (1989).
We study two uses for such Gray codes here: they provide a way of complete
enumeration for Fisher's randomization procedure and for Hartigan's subset
procedure. Similar procedures can be used for signed rank statistics and
the usual two sample (
out of
) tests.
B. Fisher's randomization test.
Fisher (1935) gave a permutation justification for the usual test for
paired observations
. In his example (Darwin's
Zea data)
and
were real numbers representing plant height for
treated and untreated plants. Darwin had calculated
the mean difference. Fisher gave a way of calibrating this by calculating
and considering all
possible sums
with
.
We observe that a Gray code allows us to systematically run through all
patterns. One need only update the present value of the sum by
changing a single value. Furthermore the symmetry of the problem saves a
factor 2 of the computations. Figure 2.1 shows a histogram of the
randomization distribution based on this procedure applied to the Zea data.
The original value of
the sum
is 39.25 inches.
This clearly falls in the tail of the distribution. In fact the exact
proportion of sign patterns with a larger absolute sum is
(as Fisher also calculated).
Figure 2.1 Randomization Distribution of
for Zea Data
For the remainder of this section we will use the exact distribution to compare two methods of approximation for the significance level of Fisher's randomization test.
Given
let
. Let
and let
be the number of choices of
sign
values such that
is strictly larger than
.
For Fisher's example
giving a p-value
of
.
The first approximation to
uses the fact that
is
a sum of independant (non identically distributed) random variables.
Thus with
,
Figure 2.2a shows a plot of
versus x for the Zea data.
The Normal approximation
is quite good. It gives a p-value of 0.0538.
Fisher himself suggested a second approximation :
using the studentized statistic
This
is a one to one increasing function
of
,
if
.
So the number of sign patterns such that
is strictly larger than the observed
equals
. Thus Fisher's t-approximation is based on :
Remarks : The Gray code can be useful with any statistic and savings increase if efficient updating for single changes are available.
Fisher's test and this data set have been extensively
discussed in the statistics literature. Fisher himself avoided complete
computation by looking at the tails and coming in carefully. Usual
approximations proceed by Monte Carlo, however Pagano and Tritchler (1983)
give a fast Fourier algorithm for complete enumeration.
We give a different Fourier implementation and further discussion in
Section 2D. Lehmann (1975) and Tritchler (1984)
discuss exact confidence intervals derived from Fisher's randomization
test. Manly (1991) carries out the exact computation without using an
updating procedure. Basu (1980) and his discussants debate the logic of
such ``permutation'' tests. Maritz (1981) gives a clear presentation
of more general two-sample location problems.
The Gray code approach can be used for
. It also adapts to the
vector valued
statistic studied by Eaton and Efron (1970).
Recently the permutation test based on thsi statistic has been renamed `Mantel's test' by the ecological community (Manly 1992).
The statistic we look at is
Which gives
an indication of their correlation.
In fact we remark that if each distance matrix were vectorized, this is the equivalent of an uncentered covariance between the vectors.
If the two distances are small in the same places, and big in the same places the statistic will be large indicating a strong link between the distances.
In order to actually test the significance of such a link,
a permutation test is performed:
the null distribution
is obtained by permuting the elements
of
to obtain what is denoted by
, keeping the matrix symmetric, and
recomputing
the statistic:
The Splus progam enabling the implementation of this algorithm can be found below.
What will be useful to us here is the idea of a vectorial covariance between distance matrices, which we will apply to the special case of correlation matrices which are in fact similarities.
We define as in (Escoufier, 1973), the vector
covariance between two matrices as their natural
inner product:
daniel<-
function(mata, matb, nperm)
{
if(ncol(mata) != ncol(matb))
stop("matrices should have same dimension")
if(nrow(mata) != nrow(matb))
stop("matrices should have same dimension")
r0 <- stat(mata, matb)
r <- matrix(0, nperm, 1)
for(i in (1:nperm)) {
matp <- permute(mata)
r[i] <- stat(matp, matb)
}
p <- as.numeric(r < r0)/(nperm + 1)
return(p, r0, r)
}
> permute
function(mata)
{
ap <- matrix(0, nrow(mata), nrow(mata))
ni <- sample(nrow(mata))
for(i in 1:nrow(mata)) {
for(j in 1:(i - 1)) {
ap[i, j] <- mata[ni[i], ni[j]]
}
}
return(ap)
}
> stat<-
function(mata, matb)
{
r0 <- 0
for(i in 2:nrow(mata))
for(j in (1:(i - 1))) {
r0 <- r0 + mata[i, j] * matb[i, j]
}
return(r0)
}
More precisely, suppose we have two sample of observations,
we will call them the
's and the
's.
Null hypothesis
.
Test statistic is the number of runs
, that is the
number of consecutive sequences of identical labels.
The null distribution of the test statistic
can be found by a combinatorial argument:
The main reference is Friedman and Rafsky 1979, Ann stat.,pp. 697-717.
One may either use multidimensional data or first make the data bidimensionnal by principal components. (or another dimension reducing technique), in order to be able to represent the tree in a graphic.
Algorithm:
The output to create will be a vector of numbers
that correspond to where the
th point should be connected
to.
Here are some examples with Splus:
mstree(vincr[1:10, ])
> mstree(vincr[1:10, ])
$x:
[1] 4.20959759 -3.70445704 4.54166603 7.22903347 4.89603710 2.28360796
[7] 0.00000000 -2.00589728 1.03156424 0.04926162
$y:
[1] -7.0463328 -6.4673190 0.4050449 -0.3888339 -2.2247181 0.0000000
[7] 0.0000000 -3.2911484 -0.8079203 -1.5704904
$mst:
[1] 10 8 6 3 3 7 10 10 10
mst: vector of length nrow(x)-1 describing the edges in the
minimal spanning tree. The ith value in this vector is an
observation number, indicating that this observation and
the ith observation should be linked in the minimal span-
ning tree.
$order:
[,1] [,2]
[1,] 4 7
[2,] 3 10
[3,] 5 6
[4,] 6 9
[5,] 7 8
[6,] 10 3
[7,] 1 1
[8,] 9 4
[9,] 8 5
[10,] 2 2
order: matrix, of size nrow(x) by 2, giving two types of ord-
ering: The first column presents the standard ordering
from one extreme of the minimal spanning tree to the oth-
er. This starts on one end of a diameter and numbers the
points in a certain order so that points close together in
Euclidean space tend to be close in the sequence. The
second column presents the radial ordering, based on dis-
tance from the center of the minimal spanning tree. These
can be used to detect clustering. See below for graph
theory definitions.
plot(x, y) # plot original data
mst <- mstree(cbind(x, y), plane=F) # minimal spanning tree
# show tree on plot
segments(x[seq(mst)], y[seq(mst)], x[mst], y[mst])
i <- rbind(iris[,,1], iris[,,2], iris[,,3])
tree <- mstree(i) # multivariate planing
plot(tree, type="n") # plot data in plane
text(tree, label=c(rep(1, 50), rep(2, 50), rep(3, 50))) # identify points
# get the absolute value stress e
distp <- dist(i)
dist2 <- dist(cbind(tree$x, tree$y))
e <- sum(abs(distp - dist2))/sum(distp)
0.1464094
This extends to the case of more than two levels of treatment by doing the following:
Example: Wine data As you may remember, this data was 14 physico-chemical composition varaibles (all continuous), that we are trying to relate to several categorical variables.
acp.vin <- acp.us(vin[, 1:14], q.choix = 2)
#Two components was the choice here
plot(acp.vin$ind.cords, type = "n")
text(acp.vin$ind.cords[as.numeric(vin[, 15]) == 1, ], "1")
text(acp.vin$ind.cords[as.numeric(vin[, 15]) == 2, ], "2")
text(acp.vin$ind.cords[as.numeric(vin[, 15]) == 3, ], "3")
mst.wine <- mstree(acp.vin$ind.cords, plane = F)
[1] 10 66 6 28 8 4 6 27 7 9 1 14 14 11 14 44 15 25 45 41 39 50 16 76 78
[26] 37 58 5 27 31 62 2 63 31 55 69 40 45 38 18 21 52 51 47 37 26 42 49 76 46
[51] 75 22 54 55 34 53 34 57 33 61 35 33 65 60 78 59 64 77 40 73 72 74 71 43 48
[76] 46 75
segments(acp.vin$ind.cords[seq(mst.wine), 1], acp.vin$ind.cords[seq(mst.wine),
2], acp.vin$ind.cords[mst.wine, 1], acp.vin$ind.cords[mst.wine, 2])
title("Wines Minimal Spanning Tree")
> sum(vin[mst.wine,15]==vin[-78,15])
[1] 44
randclass_function(S=1000,data=as.numeric(vin[-78,15]),
compar=as.numeric(vin[mst.wine,15])){
same_rep(0,S)
n_length(compar)
for (i in (1:S))
same[i]_sum(as.numeric(data==compar[sample(n,n)]))
return(same)
}
r1_randclass()
> max(r1)
[1] 39
> max(r2)
[1] 40
hist(r1,nclass=50)
Matlab MSTREE algorithm:
this doesn't work but should be
made to as part of the mstree project.
function out=mst(distm)
%Computes the minimum spanning tree
%form a matrix of distances between n objects
%
n=length(distm);
out=[-n*ones(1,n-1) 0];
distree=distm(n,:);
n1=(1:(n-1));
[a b vectd]=find(tril(distm));
[a b vecti]=find(tril(ones(n,1)* (1:n),-1));
for k =(1:(n-1))
%Find the nearest edge to the tree
nottree=find(out<0);
[distmin imin]=min(vectd(out<0));
%Find the index of the minimum
imin=n1(distm==distmin);
imin=imin(1);
%Adjoin a new edge
out(imin)=-out(imin);
%Update list of nearest vertices
for i = (1:(n-1))
if (out(i)<0 & distm(i,imin)<distm(i,-out(i)))
out(i)=-imin;
end;
end
end
This means that when we have a closed form formula for the inverse of the cdf which is not too expensive, we can use a random uniform and do the transformation that makes it have the correct density.
Example 1: Exponential
Example 2: Correlated Random Variables
and
have maximum
positive correlation
and have the distributions
and
,
for negative
correlation just take
.
Example 3: Maxima
If closed forms are not available, but a good numerical solver is, then you can use that.
on
is a probability density.
We will choose
to generate from, so it must be easier than
to generate from.
For instance, for
bounded we can take the uniform
on
.
Algorithm:
with density
.
, independently
of
This is why it works:
We need more than one uniform to generate one random variable with the right distribution.
How many are needed?
This will depend on
.
The number of necessary trials before accepting
will be a geometric with
.
Example 1:
Suppose
is a bounded density on a compact support,
take
,
Repeat the folowing:
.
This is usually very inefficient, it is much better to take a bounding function that `fits'.
Example 2:
Simulating a normal using an exponential as a bounding function:
we use the fact that the absolute value of a normal has density:
majorizes
easily because we know how to
generate from an exponential of rate 1.
Algorithm:
Modified to short-cut cosines and sines:
Generate a point uniformly in the disk(1):
.
Call
,
Return
,
Exercice for next time (Tuesday April 22nd):
With matlab, do the flop count for both methods: