Description of the movie

There are 100 genes (rows) and 20 samples (columns). The expression data are shown in a heatmap, with blue=low, yellow=high.

The first principal component of the expression data is high for samples 11-20 and low for samples 1-10. The second principal component is high for samples 11-15 and 16-20, and low otherwise. There is a block of 50 genes with patterns like the first PC, and another block of 25 genes with patterns like the second PC. There is a third block of 25 genes that are just noise.

The outcome y is binary, indicated by the red circles in the bottom panel. It is aligned with the second principal component.

The gene scores, the correlations of each gene with y, are shown in the right panel.
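As a rough illustration of the gene scores, the sketch below computes the Pearson correlation of each gene (row) with y. The function name `gene_scores` and the small random data set are hypothetical and are not taken from the movie. The sketch assumes no gene is constant across samples.

```python
import numpy as np

def gene_scores(X, y):
    """Pearson correlation of each row of X (genes x samples) with y.

    Hypothetical helper for illustration; assumes no row of X is constant.
    """
    Xc = X - X.mean(axis=1, keepdims=True)   # center each gene across samples
    yc = y - y.mean()                        # center the outcome
    denom = np.sqrt((Xc ** 2).sum(axis=1) * (yc ** 2).sum())
    return Xc @ yc / denom

# Toy data (an assumption, not the movie's data): 5 genes, 20 samples.
rng = np.random.default_rng(0)
y = np.array([0.0, 1.0] * 10)        # binary outcome
X = rng.normal(size=(5, 20))
X[0] += 3 * y                        # gene 0 is made to track y
scores = gene_scores(X, y)
```

Here `scores[0]` should be large compared with the scores of the pure-noise genes.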

The black X's in the bottom panel show the predicted values from the supervised principal component method as the score threshold (broken red vertical lines in the right panel) is varied. At each stage, genes marked with a green triangle have scores exceeding the threshold and are the ones used in the PC analysis.
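The thresholding procedure can be sketched roughly as follows: keep the genes whose score exceeds the threshold, take the first principal component of the reduced matrix, and regress y on it. This is a minimal sketch under stated assumptions, not the code behind the movie. The function name `supervised_pc_fit`, the use of the absolute score, the simulated block patterns, and the signal strengths are all hypothetical.

```python
import numpy as np

def supervised_pc_fit(X, y, threshold):
    """Fitted values from regressing y on the first PC of the selected genes.

    X is genes x samples. Genes are kept when |correlation with y| exceeds
    threshold (using the absolute value is an assumption of this sketch).
    """
    Xc = X - X.mean(axis=1, keepdims=True)
    yc = y - y.mean()
    scores = Xc @ yc / np.sqrt((Xc ** 2).sum(axis=1) * (yc ** 2).sum())
    keep = np.abs(scores) > threshold
    if not keep.any():
        raise ValueError("no gene exceeds the threshold")
    # The leading right singular vector gives the first PC across samples.
    _, _, vt = np.linalg.svd(Xc[keep], full_matrices=False)
    pc1 = vt[0]
    # Rows of Xc are centered, so pc1 has mean zero and the least-squares
    # intercept is simply the mean of y.
    slope = pc1 @ yc / (pc1 @ pc1)
    return y.mean() + slope * pc1, keep

# Simulated data loosely following the description: 50 + 25 + 25 genes.
# The sample patterns below are hypothetical, not the movie's exact layout.
rng = np.random.default_rng(1)
a = np.repeat([-1.0, 1.0], 10)               # pattern for the 50-gene block
b = np.tile(np.repeat([-1.0, 1.0], 5), 2)    # pattern for the 25-gene block
y = (b + 1) / 2                              # binary outcome aligned with b
X = rng.normal(size=(100, 20))               # noise everywhere
X[:50] += 2 * a
X[50:75] += 2 * b

fit_all, keep_all = supervised_pc_fit(X, y, threshold=0.0)   # all genes
fit_sel, keep_sel = supervised_pc_fit(X, y, threshold=0.6)   # screened genes
r_all = np.corrcoef(fit_all, y)[0, 1]
r_sel = np.corrcoef(fit_sel, y)[0, 1]
```

In this simulation `r_sel` should be far higher than `r_all`, because with all genes the first component is dominated by the larger 50-gene block, which is unrelated to y.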

The initial prediction uses all genes, and hence is just a regression onto the first principal component. From the way this example is constructed, it is not surprising that this does a poor job of prediction.

The movie shows what happens as the threshold is varied. The predictions slowly improve. When the correct 25 genes are used, the predictions are nearly perfect. When fewer than 25 genes are used, the predictions degrade slightly.