Re-analysis of Dave et al., NEJM Nov. 18, 2004

Robert Tibshirani
Stanford University

Thanks very much to Louis Staudt and George Wright for all of their help, providing data and details on their analysis.
Thanks also to Michael LeBlanc (FHCRC) and a number of Stanford colleagues for helpful conversations.

Summary

In an interesting and important paper:

Dave et al. "Prediction of survival in follicular lymphoma based on molecular features of tumor infiltrating cells". NEJM Nov. 18, 2004 vol 351:2159-2169,

the authors derive a model for predicting patient survival from gene expression data using two "immune response" clusters, IR1 (good prognosis) and IR2 (poor prognosis). A Cox model using expression averages from the IR1 and IR2 clusters was constructed, and this model had a highly significant p-value (0.003 or less) in an independent test set.

I re-analyzed their data and found that their result was not reproducible. In particular, when I applied their algorithm as they described it, I could not reproduce the results they reported on their training and test sets (although many models have low p-values on both the training and test sets). Furthermore, when their equal-sized training and test sets are swapped and their model-building procedure is re-applied, virtually nothing is significant in the test set. Also, when a small change is made to their model-building recipe (changing the allowable cluster size range from [25,50] to [30,60]), with either the original or swapped datasets, again very little of significance emerges. This and other analyses suggest that their result occurred by chance and is not robust or reproducible.
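The checks described above (swap the two halves, perturb one setting, and rebuild the model from scratch each time) can be organized as a small grid. The following is only a minimal sketch of that bookkeeping, not the R script used in the re-analysis: `fragility_grid`, `build_model` and `test_p_value` are hypothetical names, and the two toy functions are stand-ins for a real model-building procedure and a real test-set p-value.

```python
from itertools import product


def fragility_grid(build_model, test_p_value, half_a, half_b, size_ranges):
    """Refit from scratch for every (split, setting) pair; record test p-values."""
    splits = [("original", half_a, half_b), ("swapped", half_b, half_a)]
    results = {}
    for (label, train, test), size_range in product(splits, size_ranges):
        # The whole model-building procedure is re-run on the training half only.
        model = build_model(train, size_range)
        results[(label, size_range)] = test_p_value(model, test)
    return results


# Toy stand-ins (hypothetical): they only record what was asked of them.
calls = []


def toy_build(train, size_range):
    calls.append((train, size_range))
    return {"trained_on": train, "size_range": size_range}


def toy_p(model, test):
    return float("nan")  # placeholder, not a real p-value


grid = fragility_grid(toy_build, toy_p, "half_A", "half_B", [(25, 50), (30, 60)])
for key in grid:
    print(key)
```

With real functions plugged in, a robust finding would be expected to give small test-set p-values in most cells of the grid, not only in the original one.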

Other analyses (e.g. application of SAM and supervised principal components) suggest that there is little or no correlation between gene expression and patient survival in this dataset.

My re-analysis was mentioned in a brief letter to the editor in NEJM (April 7, 2005), with full details given here.

The authors responded to my letter in the same issue. My rebuttal appears at the bottom of this page.

Wan-Jen Hong and Gilbert Chu also published a letter on the Dave et al. paper: see Hong and Chu's website

Details:

A Stanford graduate student, Ray Lin, has done the most thorough re-analysis and reaches the same conclusions: Ray Lin's report

Details of my re-analysis

R script of the re-analysis
dave.gtr file used in re-analysis
dave.ctr file used in re-analysis



Editorial comment

My motivation for carrying out this extensive re-analysis was a practical one. My medical collaborators wanted to know if the biological result in this paper was likely to be real and reproducible. After my analysis, I have concluded that it is probably not reproducible and occurred by chance. We will have to wait for follow-up studies to determine whether I'm right or wrong.

I also think that this exercise is informative with regard to microarray data analyses in general. In particular I think it is useful to determine the degree to which an analysis is fragile. Even if an analysis produces small test set p-values, a scientist should be concerned if small but reasonable changes in the analysis strategy cause large changes in the results. With microarray analyses, there are many choices that one has to make, and one hopes that the results are not too sensitive to these choices.

Most readers do not have the time to spend weeks reconstructing a data analysis (as I did here). For this reason, it would be very helpful if authors in general provided not only details of their analysis, but also a software script that carries it out. That way the reader can assess for him/herself the fragility of the results. Finally, this points to the importance of "canned" packages that do model search and cross-validation automatically. These make re-analyses much simpler to do and also help the data analyst avoid overfitting. Links to a number of public domain packages, developed by my lab, can be found on my main homepage: Tibshirani homepage.



My rebuttal to the authors' response

Dave et al. responded to my letter in the Apr. 7, 2005 issue of NEJM.

Their first main point: they used a standard approach for constructing a survival predictor, and then found a small (significant) p-value on the test set. Therefore my analysis does nothing to dispute the strength of their finding.

My response: in judging the strength of evidence from a study, a single p-value does not tell the whole story. The reader should go over my re-analysis and judge this for him/herself.

Their second counter-argument is given in Figure 1 of their letter to the Editor. They randomly selected new equal-sized training and test sets from the data, and reapplied their original model to each new half-set. They found that every resulting p-value was less than 0.011, with a median of 0.001.

My response: This is not surprising and tells us nothing. The original model was highly significant (p < 10e-8) on the original training set, simply as a result of the fitting process. And we already know that it is significant on the original test set (p=.003). Therefore it must be significant on any half of the data that we choose. To learn about the robustness or fragility of their model, one must go through the entire model building process from scratch (as I have done above).
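The point can be illustrated with a toy simulation. This is a sketch under loud assumptions, not the analysis of Dave et al. and not my re-analysis: the data are pure noise, a Pearson correlation with a continuous outcome stands in for the Cox model, the "model" is a signed sum of the features most correlated with the outcome in the training half, and the sizes (200 samples, 1000 features, top 20) are arbitrary. The fixed model is then evaluated on random half-sets, without any refitting.

```python
import math
import random
import statistics


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def p_value(r, n):
    # Two-sided p-value from the Fisher z normal approximation.
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def subset(values, idx):
    return [values[i] for i in idx]


rng = random.Random(0)
n, n_features, k = 200, 1000, 20

# Pure noise: the outcome is unrelated to every feature by construction.
y = [rng.gauss(0, 1) for _ in range(n)]
X = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n_features)]
train, test = list(range(100)), list(range(100, 200))

# "Model building" uses the training half only: keep the k features most
# correlated with the outcome there, and sum them with matching signs.
y_train = subset(y, train)
train_r = [pearson(subset(X[j], train), y_train) for j in range(n_features)]
top = sorted(range(n_features), key=lambda j: abs(train_r[j]), reverse=True)[:k]
score = [sum(math.copysign(1.0, train_r[j]) * X[j][i] for j in top)
         for i in range(n)]


def score_p(idx):
    return p_value(pearson(subset(score, idx), subset(y, idx)), len(idx))


p_train = score_p(train)
p_test = score_p(test)
# Apply the fixed model to random half-sets (no refitting).
half_ps = [score_p(rng.sample(range(n), 100)) for _ in range(200)]
median_half_p = statistics.median(half_ps)
print(p_train, p_test, median_half_p)
```

Each random half-set is made up of roughly half training samples, so the fixed score tends to look significant on such half-sets even though the data contain no signal at all. Only a refit from scratch on each new training half speaks to robustness.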

Finally, they say that it is not surprising that SAM couldn't find any association, since the association they claim to have found depends on the synergy between two clusters. And they say that "Tibshirani confuses the ability of our method to discover a survival association with the fact that we actually found one that validated the association".

My response: It is true that SAM only finds marginal associations, and hence could not find a synergistic association that exhibited no marginal effects. However, such an association is rare and difficult to find in a "sea" of 49,000 genes. The bar should be set high for the strength of evidence of such an association, and in my view, that bar has not been reached here. Re-application of their method does not "confuse" things, but rather tells us about the robustness of their findings.
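A toy example shows what a purely synergistic association looks like and why a marginal, one-feature-at-a-time method cannot see it. It is a sketch under assumptions of my choosing: two hypothetical independent features `x1` and `x2`, an outcome equal to their product plus noise, and Pearson correlation in place of a survival analysis. Counting gene pairs is only one illustrative way to gauge the size of the search space among 49,000 genes; it is not the cluster-based search used by the authors.

```python
import math
import random


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(1)
n = 2000
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]
# The outcome depends only on the product: no marginal effect of x1 or x2.
y = [a * b + rng.gauss(0, 0.5) for a, b in zip(x1, x2)]

marginal_r1 = pearson(x1, y)  # what a one-feature-at-a-time screen sees
marginal_r2 = pearson(x2, y)
joint_r = pearson([a * b for a, b in zip(x1, x2)], y)

# Number of distinct pairs among 49,000 genes.
n_pairs = math.comb(49000, 2)
print(marginal_r1, marginal_r2, joint_r, n_pairs)
```

The marginal correlations stay near zero while the product term is strongly correlated with the outcome. With more than a billion candidate pairs, some will look strongly associated by chance alone, which is why the evidence required for such a finding should be strong.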