Re-analysis of Dave et al, NEJM Nov 18, 2004

Robert Tibshirani
Stanford University

Thanks very much to Louis Staudt and George Wright for all of their help, providing data and details on their analysis.
Thanks also to Michael LeBlanc (FHCRC) and a number of Stanford colleagues for helpful conversations.

Summary

In an interesting and important paper:

Dave et al. "Prediction of survival in follicular lymphoma based on molecular features of tumor infiltrating cells". NEJM Nov. 18, 2004 vol 351:2159-2169,

the authors derive a model for predicting patient survival from gene expression data using two "immune response" clusters, IR1 (good prognosis) and IR2 (poor prognosis). A Cox model using expression averages from the IR1 and IR2 clusters was constructed, and this model had a highly significant p-value (0.003 or less) in an independent test set.

I re-analyzed their data and found that their result is extremely fragile. In particular, when their equal-sized training and test sets are swapped, and their model-building procedure is re-applied, their finding disappears and virtually nothing is significant in the test set. Also, when a small change is made to their model-building recipe (changing the allowable cluster size range from [25,50] to [30,60]) with either the original or swapped datasets, again, their finding disappears and very little of significance emerges. This and other analyses suggest that their result occurred by chance and is not robust or reproducible.

Other analyses (e.g. application of SAM and supervised principal components) suggest that there is little or no correlation between gene expression and patient survival in this dataset.

My re-analysis was mentioned in a brief letter to the editor in NEJM, (April 7, 2005) with full details given here.

The authors responded to my letter in the same issue. My rebuttal appears at the bottom of this page

Wan-Jen Hong and Gilbert Chu also published a letter on the Dave et. al. paper: see Hong and Chu's website

A Stanford graduate student- Ray Lin- has recently does some further re-analysis that confirms my findings: Ray Lin's report

Details:


The steps in the authors' modelling procedure were:

0. Divide the data randomly into training and test sets of approximately equal numbers of patients. Apply the following recipe [steps 1--5] to the training set.

1. Choose all genes with univariate Cox score > 1.5 in absolute value. This reduced the number of genes from roughly 49,000 to roughly 3,200, with about a 50-50 split between good prognosis genes (negative scores) and poor prognosis genes (positive scores).

2. Do separate hierarchial clusterings of the good and poor prognosis genes.

3. Find all clusters in the dendrograms (clustering trees) containing between 25 and 50 genes, with internal correlation at least 0.5. Represent each cluster by the average expression of all genes in the cluster-- a "supergene".

4. Try every pair of supergenes as predictors in Cox models for predicting survival.

5. Choose the most significant pair from this process. The authors call the resulting pair of clusters IR1 (good prognosis) and IR2 (poor prognosis).

6. Finally use the model (IR1, IR2) in a Cox model to predict survival in the test set.





The following Figures show my re-analysis of their data and modelling procedure:



Fig 1: Training and test p-values for all cluster pairs. Each data point represents the training and test p-values from a Cox model containing two clusters from the set of all clusters that passed the filtering in the first part of the analysis. All possible two-cluster models are represented in the plot. The (IR1, IR2) cluster model corresponds to the island of points in the bottom left of the plot. All of these pairs use the IR2 cluster, and variations on the IR1 cluster. (I didn't rule out clusters that were proper subsets of other chosen clusters). The (IR1, IR2) model has the smallest training set p-value, so it is reasonable to choose this model and validate it on the test set, as done by the authors. But we also note that the total number of points (cluster pairs) with test set p-values less than 0.05 (239) is far fewer than we'd expect to see by chance (735), even if there was no correlation between gene expression and survival. This suggests that there may be no overall significance in this dataset.



Fig 2: Training and test p-values for all individual clusters. IR1 and IR2 clusters are shown in red. Notice that IR1 (good prognosis) is not significant on its own in the test set, and both IR1 and IR2 have only average significance in the training set.



Fig 3: Training and test p-values for all cluster pairs, with training and test sets swapped. The original training and test sets were of approximately equal size, and were chosen at random. Hence it seems reasonable to swap them, train on the original test set and test on the original training set. We see that the authors' finding does not appear, even approximately. Notice also that only 5 pairs out of 8930 have test set p-values less than 0.05. Even if there was no correlation between gene expression and survival, we'd expect 8930*.05= 447 significant pairs.



Fig 4: Training and test p-values for all individual clusters, swapped data. Notice that no individual clusters were significant in the test set at .05 level.



Fig 5: Training and test p-values for all cluster pairs (with original training and test datasets), using a cluster size range of [30,60] genes instead of [25,50]. In choosing this range, I have intentionally ruled out the strong IR2 cluster, which has 27 genes. But one would expect that if the authors' finding was robust, we would find some other cluster with significant overlap with the IR2 cluster. But the authors' finding does not appear, even approximately. And there are only 85 pairs out of 11628 that are significant in the test set at the 0.05 level, while we we would expect 11628*.05=581 pairs just by chance.


Fig 6: Training and test p-values for all cluster pairs, with training and test sets swapped. Here only maximal clusters are considered. Notice that no pairs are significant in the test set at the 0.05 level.



Full details of the re-analysis

R script of the re-analysis
dave.gtr file used in re-analysis
dave.ctr file used in re-analysis



Editorial comment

My motivation for carrying out this extensive re-analysis was a practical one. My medical collaborators wanted to know if the biological result in this paper was likely to be real and reproducible. After my analysis, I have concluded that it is probably not reproducible and occurred by chance. We will have to wait for followup studies to determine whether I'm right or wrong.

I also think that this exercise is informative with regard to microarray data analyses in general. In particular I think it is useful to determine the degree to which an analysis is fragile. Even if an analysis produces small test set p-values, a scientist should be concerned if small but reasonable changes in the analysis strategy cause large changes in the results. With microarray analyses, there are many choices that one has to make, and one hopes that the results are not too sensitive to these choices.

Most readers do not have the time to spend weeks reconstructing a data analysis (as I did here). For this reason, it would be very helpful if authors in general provided not only details of their analysis, but a software script that carries it out. That way the reader can assess for himself the fragility of the results. Finally, this points to the importance of "canned" packages that do model search and cross-validation automatically. These make re-analyses much simpler to do and also help the data analyst avoid overfitting. Links to a number of public domain packages, developed by my lab, can be found on my main homepage Tibshirani homepage.



My rebuttal to the authors' response

Dave et. al responded to my letter in the Apr. 7, 2005 issue of NEJM.

Their first main point: they used a standard approach for constructing a survival predictor, and then found a small (significant) p-value on the test set. Therefore my analysis does nothing to dispute the strength of their finding.

My response: in judging the strength of evidence from a study, a single p-value does not tell the whole story. The reader should go over my re-analysis and judge this for him/herself.

Their second counter-argument is given in Figure 1 of their letter to the Editor. They randomly selected new equal-sized training and test sets from the data, and reapplied their original model to each new half-set. They found that every resulting p-value was less than 0.011, with a median of 0.001.

My response: This is not surprising and tells us nothing. The original model was highly significant (p < 10e-8) on the original training set, simply as a result of the fitting process. And we already know that it is significant on the original test set (p=.003). Therefore it must be significant on any half of the data that we choose. To learn about the robustness or fragility of their model, one must go through the entire model building process from scratch (as I have done above).

Finally they say that it is not surprising that SAM couldn't find any association, since the association they claim to have found depends on the synergy between two clusters. And they say that "Tibshirani confuses the ability of our method to discover a survival association with the fact that we actually found one that validated the association".

My response: It is true that SAM only finds marginal associations, and hence could not find a synergistic association that exhibited no marginal effects. However such an association is rare and difficult to find in a "sea" of 49,000 genes. The bar should be set high for the strength of evidence of such an association, and in my view, that bar has not been reached here. Re-application of their method does not "confuse" things, but rather tells us about the robustness of their findings.