Robert Tibshirani
Stanford University
Thanks very much to Louis Staudt and George Wright for all of their help,
providing data and details on their analysis.
Thanks also to Michael LeBlanc (FHCRC) and a number of Stanford colleagues for helpful conversations.
I re-analyzed the data of Dave et al. and found that their result was not reproducible. In particular, when I applied their algorithm as they described it, I could not get the same results as they did on their training and test sets (although there are many models that have low p-values on both the training and test sets). Furthermore, when their equal-sized training and test sets are swapped and their model-building procedure is re-applied, virtually nothing is significant in the test set. Also, when a small change is made to their model-building recipe (changing the allowable cluster size range from [25,50] to [30,60]), with either the original or swapped datasets, again very little of significance emerges. These and other analyses suggest that their result occurred by chance and is not robust or reproducible.
Other analyses (e.g. application of SAM and supervised principal components) suggest that there is little or no correlation between gene expression and patient survival in this dataset.
My re-analysis was mentioned in a brief letter to the editor in NEJM (April 7, 2005), with full details given here.
The authors responded to my letter in the same issue.
My rebuttal appears at the bottom of this page.
Wan-Jen Hong and Gilbert Chu also published a letter on the Dave et al. paper:
see Hong and Chu's website
R script of the re-analysis
dave.gtr file used in re-analysis
dave.ctr file used in re-analysis
My motivation for carrying out this extensive re-analysis was a practical one. My medical collaborators wanted to know if the biological result in this paper was likely to be real and reproducible. After my analysis, I have concluded that it is probably not reproducible and occurred by chance. We will have to wait for follow-up studies to determine whether I'm right or wrong.
I also think that this exercise is informative with regard to microarray data analyses in general. In particular I think it is useful to determine the degree to which an analysis is fragile. Even if an analysis produces small test set p-values, a scientist should be concerned if small but reasonable changes in the analysis strategy cause large changes in the results. With microarray analyses, there are many choices that one has to make, and one hopes that the results are not too sensitive to these choices.
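One way to make such a fragility check routine is to wrap the whole model-building recipe in a function and re-run it under swapped splits and slightly different settings. The Python sketch below is only an illustration under stated assumptions: the data are synthetic noise, `build_model` is a hypothetical stand-in (it keeps the `k` features most correlated with a continuous outcome), and `k` plays the role of a tunable choice such as the allowable cluster size range. It is not the procedure of Dave et al., nor the R script used in the re-analysis.

```python
import numpy as np

def build_model(X, y, k):
    """Toy model-building step (hypothetical): keep the k features most
    correlated with the outcome on the fitting samples."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
    )
    idx = np.argsort(-np.abs(corr))[:k]
    return idx, np.sign(corr[idx])

def held_out_corr(model, X, y):
    """Correlation between the model's score and the outcome on given samples."""
    idx, signs = model
    score = (X[:, idx] * signs).sum(axis=1)
    return float(np.corrcoef(score, y)[0, 1])

def fragility_table(X, y, train, test, settings):
    """Rebuild the model from scratch for every setting, on both the
    original and the swapped split, and record held-out performance."""
    rows = []
    for k in settings:
        for label, fit, held in (("original", train, test),
                                 ("swapped", test, train)):
            model = build_model(X[fit], y[fit], k)
            rows.append((k, label, held_out_corr(model, X[held], y[held])))
    return rows

# Synthetic placeholder data: 120 samples, 500 features, outcome unrelated to X.
rng = np.random.default_rng(0)
X = rng.standard_normal((120, 500))
y = rng.standard_normal(120)
order = rng.permutation(120)
train, test = order[:60], order[60:]

table = fragility_table(X, y, train, test, settings=(10, 20, 40))
for k, label, r in table:
    print(f"k={k:3d}  split={label:8s}  held-out correlation={r:+.3f}")
```

A robust finding should give broadly similar held-out results across the rows of such a table; large swings between rows are the kind of fragility described above.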
Most readers do not have the time to spend weeks reconstructing
a data analysis (as I did here).
For this reason, it would be very helpful if authors in general provided
not only details of their analysis, but a software script that carries it
out. That way the reader can assess for himself the fragility of the results.
Dave et al. responded to my letter in the Apr. 7, 2005 issue of NEJM.
Their first main point: they used a standard approach for constructing a survival
predictor, and then found a small (significant) p-value on the test set.
Therefore, they argue, my analysis does nothing to dispute the strength of their finding.
My response: in judging the strength of evidence from a study,
a single p-value does not tell the whole story.
The reader should go over my re-analysis and judge this for him/herself.
Their second counter-argument is given in Figure 1 of their letter to the
Editor. They randomly selected new equal-sized training and test sets from the data,
and reapplied their original model to each new half-set. They found that
every resulting p-value was less than 0.011, with a median of 0.001.
My response: This is not surprising and tells us nothing. The original model was highly significant
(p < 10e-8) on the original training set, simply as a result of the fitting
process. And we already know that it is significant on the original
test set (p=.003). Therefore it must be significant on any half of the data
that we choose.
To learn about the robustness or fragility of their model,
one must go through the entire model building process from scratch
(as I have done above).
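The difference between re-evaluating a fixed model and rebuilding it from scratch can be seen in a small simulation. The Python sketch below is an illustration under assumptions of my choosing, not a reconstruction of either analysis: the data are pure noise, the outcome is continuous rather than survival, and the "model" is a hypothetical score built from the `k` features most correlated with the outcome on the training half. Because there is no true signal, any apparent association is an artifact of fitting.

```python
import numpy as np

def fit_score(X, y, k):
    """Select the k features most correlated with y on the fitting samples."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
    )
    idx = np.argsort(-np.abs(corr))[:k]
    return idx, np.sign(corr[idx])

def score_corr(model, X, y):
    """Correlation between the fitted score and the outcome."""
    idx, signs = model
    score = (X[:, idx] * signs).sum(axis=1)
    return float(np.corrcoef(score, y)[0, 1])

rng = np.random.default_rng(0)
n, p, k = 200, 2000, 20
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)          # independent of X: no real association

order = rng.permutation(n)
train = order[: n // 2]
original = fit_score(X[train], y[train], k)
train_r = score_corr(original, X[train], y[train])

fixed, rebuilt = [], []
for _ in range(50):
    order = rng.permutation(n)
    a, b = order[: n // 2], order[n // 2:]
    # (1) Original model, re-evaluated on a random half of all the data.
    fixed.append(score_corr(original, X[a], y[a]))
    # (2) Model rebuilt from scratch on one half, evaluated on the other.
    rebuilt.append(score_corr(fit_score(X[a], y[a], k), X[b], y[b]))

print(f"training-half correlation:        {train_r:+.2f}")
print(f"fixed model, random halves:       {np.median(fixed):+.2f} (median)")
print(f"rebuilt model, held-out halves:   {np.median(rebuilt):+.2f} (median)")
```

Each random half contains roughly half of the original training samples, so the fixed model keeps looking strong on it even though the data are noise. Only the rebuilt models, evaluated on samples they never saw, reflect the absence of signal.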
Finally, they say that it is not surprising that SAM couldn't find any association,
since the association they claim to have found depends on the synergy
between two clusters. They also say that "Tibshirani confuses the ability
of our method to discover a survival association with the fact that we actually found one that validated the association".
My response: It is true that SAM only finds marginal associations,
and hence could not find a synergistic association that exhibited no
marginal effects. However, such an association is rare and difficult
to find in a "sea" of 49,000 genes. The bar should be set high
for the strength of evidence of such an association, and in my view,
that bar has not been reached here. Re-application of their method
does not "confuse" things, but rather tells us about the robustness of their
findings.
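To make concrete what a synergistic association with no marginal effects looks like, here is a minimal Python sketch on synthetic data. The variables `g1` and `g2` are hypothetical expression values, the outcome is continuous rather than survival, and a simple per-variable correlation stands in for a marginal screen; it is not SAM itself.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
g1 = rng.standard_normal(n)          # hypothetical expression of signature 1
g2 = rng.standard_normal(n)          # hypothetical expression of signature 2

# Outcome driven purely by the interaction of the two, plus noise.
outcome = g1 * g2 + 0.5 * rng.standard_normal(n)

marginal_1 = float(np.corrcoef(g1, outcome)[0, 1])
marginal_2 = float(np.corrcoef(g2, outcome)[0, 1])
joint = float(np.corrcoef(g1 * g2, outcome)[0, 1])

print(f"marginal correlation, g1:   {marginal_1:+.3f}")
print(f"marginal correlation, g2:   {marginal_2:+.3f}")
print(f"correlation with g1 * g2:   {joint:+.3f}")
```

A one-variable-at-a-time screen sees almost nothing here, while the product term is strongly associated with the outcome. The catch is that with tens of thousands of genes the number of candidate pairs is enormous, which is why the evidence for any one such pair needs to be strong.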
My rebuttal to the authors' response