If we are interested in the behaviour of a random variable $\hat{\theta} = s(\mathcal{X})$, then we can consider the sequence of $B$ new values $\hat{\theta}^{*1}, \ldots, \hat{\theta}^{*B}$ obtained through computation of $B$ new bootstrap samples.
In practice this requires generating integers between 1 and n, each of these integers having the same probability 1/n.
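Here is an example of a line of MATLAB that does just that:
indices=randint(1,n,n)+1;
Or, if you have the Statistics Toolbox, you can use:
indices=unidrnd(n,1,n);
randint comes from the Communications Toolbox; in recent MATLAB releases the built-in randi gives the same result:
indices=randi(n,1,n);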
If we use S we do not need to generate the new observations one by one; the following command draws a vector of n indices, with replacement, from (1...n).
sample(n,n,replace=T)
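The index generation described above can also be sketched in Python (used here only for illustration; the notes themselves use MATLAB and S). The function name `bootstrap_indices` and the seed are assumptions made for this example.

```python
import random

def bootstrap_indices(n, rng=random):
    """Draw n indices from 1..n, each with probability 1/n, with replacement."""
    # randint is inclusive at both ends, so every integer in 1..n is possible.
    return [rng.randint(1, n) for _ in range(n)]

rng = random.Random(0)            # fixed seed so the run is reproducible
indices = bootstrap_indices(7, rng)
print(indices)
```

Some indices will typically appear more than once and others not at all, which is exactly what sampling with replacement means.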
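An approximation of the distribution of the estimate $\hat{\theta} = s(\mathcal{X})$ is provided by the distribution of the bootstrap replicates $\hat{\theta}^{*b} = s(\mathcal{X}^{*b})$, $b = 1, \ldots, B$.
If we were given $B$ true samples, and their associated estimates $\hat{\theta}_1, \ldots, \hat{\theta}_B$, we could compute the usual variance estimate for this sample of $B$ values, namely:
$$\widehat{\mathrm{var}}(\hat{\theta}) = \frac{1}{B-1} \sum_{b=1}^{B} \left( \hat{\theta}_b - \bar{\theta} \right)^2, \qquad \bar{\theta} = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}_b .$$
The bootstrap applies the same formula to the replicates $\hat{\theta}^{*b}$.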
Treatment Group
treat=[94 38 23 197 99 16 141]'
treat =
94
38
23
197
99
16
141
>> median(treat)
ans = 94
>> mean(treat)
ans = 86.8571
>> var(treat)
ans = 4.4578e+03
>> var(treat)/7
ans =
636.8299
>> sqrt(637)
ans = 25.2389
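thetab=zeros(1,1000);                 % storage for the 1000 bootstrap medians
for b=1:1000
  thetab(b)=median(bsample(treat));   % bsample: resample of treat, drawn with replacement
end
hist(thetab)                          % histogram of the bootstrap medians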
>> sqrt(var(thetab))
ans =
37.7768
>> mean(thetab)
ans =
80.5110
[Figure: histogram of the 1000 bootstrap medians thetab for the treatment group.]
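The MATLAB loop above calls a helper, bsample, whose source is not shown in these notes. A minimal Python sketch of the same computation follows, under the assumption that bsample simply draws n values from the data with replacement. The seed is arbitrary, so the numbers will differ slightly from the MATLAB session.

```python
import random
import statistics

def bsample(data, rng):
    """Assumed behaviour of bsample: resample len(data) values with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

treat = [94, 38, 23, 197, 99, 16, 141]
rng = random.Random(0)            # arbitrary seed, for reproducibility only
B = 1000

# One median per bootstrap resample.
thetab = [statistics.median(bsample(treat, rng)) for _ in range(B)]

# statistics.variance divides by B - 1, like MATLAB's var.
se_boot = statistics.variance(thetab) ** 0.5
print("bootstrap s.e. of the median:", se_boot)
print("mean of the bootstrap medians:", statistics.mean(thetab))
```

Because the sample size is odd, every bootstrap median is one of the seven observed values, which is why the histogram of thetab is so lumpy.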
Control Group
control=[52 104 146 10 51 30 40 27 46]';
>> median(control)
ans = 46
>> mean(control)
ans = 56.2222
>> var(control)
ans = 1.8042e+03
>> var(control)/length(control)
ans = 200.4660
>> sqrt(200.4660)
ans = 14.1586
thetab=zeros(1,1000);
for b=1:1000
  thetab(b)=median(bsample(control));
end
hist(thetab)
>> sqrt(var(thetab))
ans = 11.9218
>> mean(thetab)
ans = 45.4370
[Figure: histogram of the 1000 bootstrap medians thetab for the control group.]
To compare the two medians, can we use the estimates of their standard errors to find out whether the difference between them is significant?
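One rough way to approach this question is sketched below, under two assumptions that are not stated in the notes: the two groups are independent, so their variances add, and the standardized difference can be read on a normal scale. The medians and bootstrap standard errors are copied from the two MATLAB sessions above.

```python
import math

# Values taken from the MATLAB output for the two groups.
median_treat, se_treat = 94, 37.7768
median_control, se_control = 46, 11.9218

diff = median_treat - median_control
# Independence assumption: the variance of the difference is the sum of variances.
se_diff = math.sqrt(se_treat ** 2 + se_control ** 2)
ratio = diff / se_diff

print("difference of medians:", diff)
print("s.e. of the difference:", round(se_diff, 2))
print("standardized difference:", round(ratio, 2))
```

The standardized difference comes out at about 1.2, well short of the value of roughly 2 that a normal approximation would require at the usual 5% level, so on this rough reading the difference would not be called significant.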
Suppose we condition on the sample of $n$ distinct observations $\mathcal{X} = (x_1, \ldots, x_n)$. There are as many different resamples as there are ways of choosing $n$ objects out of a set of $n$ possible contenders, repetitions being allowed, that is $\binom{2n-1}{n}$.
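At this point it is interesting to introduce a new notation for a bootstrap resample. Up to now we have written a possible resample as the list of its values, say $\mathcal{X}^{*} = (x^{*}_1, \ldots, x^{*}_n)$. Because of the exchangeability/symmetry property we can recode this as the $n$-vector counting the number of occurrences of each of the observations. In this recoding we have $k = (k_1, \ldots, k_n)$, where $k_i$ is the number of times $x_i$ appears in the resample, and the set of all bootstrap resamples is the $(n-1)$-dimensional simplex
$$C_n = \Big\{ (k_1, \ldots, k_n) : k_i \in \mathbb{N}, \ \sum_{i=1}^{n} k_i = n \Big\}.$$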
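As an illustration of this recoding, the following Python sketch turns a list of resampled indices into its count vector. The helper name `to_counts` and the example resample are made up for the illustration.

```python
def to_counts(indices, n):
    """Recode a resample, given as indices in 1..n, as its vector of counts."""
    counts = [0] * n
    for i in indices:
        counts[i - 1] += 1        # indices are 1-based, list positions 0-based
    return counts

# Hypothetical resample of size 5 in which x_1 and x_4 each appear twice.
indices = [1, 1, 3, 4, 4]
k = to_counts(indices, 5)
print(k)
```

Any reordering of the same indices gives the same count vector, and the counts always sum to n, which is the constraint that defines the simplex.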
The function file approxcom.m below approximates the number of resamples, $\binom{2n-1}{n} \approx (\pi n)^{-1/2}\, 2^{2n-1}$, and can be used to tabulate it for various n:
function out=approxcom(n)
out=round((pi*n)^(-.5)*2^(2*n-1));
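To see how quickly the number of resamples grows, and how close the approximation is, here is a Python sketch that prints the exact count $\binom{2n-1}{n}$ next to a port of approxcom. The choice of n values is arbitrary, the name `exact_count` is an assumption of this example, and the output is not the table from the original notes.

```python
import math

def exact_count(n):
    """Number of distinct resamples: n items chosen from n with repetition."""
    return math.comb(2 * n - 1, n)

def approxcom(n):
    """Python port of approxcom.m: the Stirling-type approximation."""
    return round((math.pi * n) ** -0.5 * 2 ** (2 * n - 1))

print(" n   exact   approx")
for n in (5, 10, 15, 20):
    print(n, exact_count(n), approxcom(n))
```

The approximation slightly overestimates the exact count, with a relative error that shrinks as n grows.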
Are all these samples equally likely? Thinking about the probability of drawing the sample of all $x_1$'s, by choosing the index $1$ $n$ times in the integer uniform generation, should persuade you that this sample appears only once in $n^n$ times. Whereas the sample with $x_1$ once and $x_2$ for all the other observations can appear in $n$ out of the $n^n$ ways.
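This argument can be checked by brute force for a small case. The Python sketch below, with n = 3 chosen only to keep the enumeration tiny, lists all $n^n$ equally likely index sequences, groups them by count vector, and compares each tally with the multinomial coefficient $n!/(k_1! \cdots k_n!)$.

```python
from collections import Counter
from itertools import product
from math import factorial

def multinomial_ways(k):
    """Number of index sequences that give the count vector k."""
    ways = factorial(sum(k))
    for ki in k:
        ways //= factorial(ki)    # every intermediate quotient is an integer
    return ways

n = 3
tally = Counter()
for seq in product(range(n), repeat=n):          # all n^n index sequences
    k = tuple(seq.count(i) for i in range(n))    # recode as a count vector
    tally[k] += 1

for k, ways in sorted(tally.items(), reverse=True):
    print(k, ways, "out of", n ** n)
```

The resample made of $x_1$ only, coded (3, 0, 0), arises from a single sequence, while (1, 2, 0) arises from three, so the distinct resamples are not equally likely.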