Nancy R. Zhang

Assistant Professor of Statistics, Stanford University

Contact

Snail mail: 390 Serra Mall, Department of Statistics, Stanford University, Stanford, CA 94305.
Email: nzhang@stanford.edu
Fax: 650-725-8977


Click here for a pdf copy of my CV.

Current research interests

My research is data driven, with the data coming primarily from current biological applications.  Today's data sets are so rich in structure and high in dimension that, probing around, one is bound to discover interesting statistical problems, as well as new perspectives on classic statistical concepts.  I enjoy the modeling part of my work as much as the methodology part.  I am currently focusing on the analysis of DNA copy number data from high-density SNP chips and next-generation sequencing experiments.  These applications motivate new methods in change-point detection, scan statistics, and model and variable selection.

Publications

Li, F and Zhang, NR, 2009, Bayesian Variable Selection in Structured High-Dimensional Covariate Spaces with Applications in Genomics. JASA Theory and Methods, in press.

Download pdf, software.  

We study Bayesian variable selection procedures in regression problems when the number of variables is very large and when there is a priori structural information.  Certain computational and statistical problems arise that are unique to such high dimensional, structured settings, the most interesting being the phenomenon of phase transitions. This problem, however, can be mitigated by careful selection of hyper-parameters. The methods are illustrated on two different graph structures: the circular linear chain and the regular graph of degree k. They are also applied to study a specific application in genomics: the modeling of transcription factor binding sites in DNA sequences.  

Zhang, NR, Senbabaoglu, Y and Li, J, 2009, Joint Estimation of DNA Copy Number from Multiple Platforms.  Bioinformatics, in press.

Download pdf, software.  

In this paper, we propose a statistical model for pooling information across different platforms for copy number detection in a single biological sample.  The model yields an intuitive statistic and a fast scanning algorithm.  The software that we developed based on this model has been chosen by the Cancer Genome Atlas project to process all of their samples.  Also, using an existing paired-end sequencing data set, we conducted a systematic comparison of the detection accuracy of microarray-based platforms and showed the benefits of the cross-platform procedure.  

Zhang, NR, Siegmund, DO, Ji, H and Li, J, 2009, Detecting simultaneous change-points in multiple sequences. Accepted for publication in Biometrika.

Download pdf, supplementary, software.                       

The problem studied here is motivated by the biological problem of detecting recurrent copy number variants in multiple genome samples.  We propose simple scan and segmentation algorithms based on summing the chi-square statistics for each individual sample, a sum that arises out of the generalized likelihood ratio for a model where the errors in each sample are independent. The simple geometry of the statistic allows us to derive accurate analytic approximations to the significance level of such scans.  We show using replicates and parent-child comparisons that pooling data across samples results in more accurate detection of copy number variants, and that a cross-sample joint segmentation provides concise, interpretable summaries for downstream analysis.
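As a rough illustration of the summed chi-square scan described above (a simplified sketch, not the paper's implementation), the Python code below scans all intervals up to a maximum width and, for each interval, sums over samples the squared standardized interval sums.  It assumes that every sample has baseline mean 0 and independent unit-variance noise; the function name `sum_chisq_scan` and its parameters are hypothetical.

```python
import numpy as np

def sum_chisq_scan(x, max_width):
    """Return (statistic, start, end) for the interval [start, end) that
    maximizes the sum over samples of (interval sum)^2 / width.

    x is an (n_samples, n_positions) array. Simplifying assumption: each
    sample has baseline mean 0 and independent N(0, 1) noise, so each
    per-sample term is chi-square with 1 degree of freedom under the null.
    """
    x = np.asarray(x, dtype=float)
    n_samples, n_pos = x.shape
    # Cumulative sums with a leading zero column, so that the sum over
    # [s, t) equals csum[:, t] - csum[:, s].
    csum = np.concatenate(
        [np.zeros((n_samples, 1)), np.cumsum(x, axis=1)], axis=1
    )
    best = (-np.inf, 0, 0)
    for width in range(1, min(max_width, n_pos) + 1):
        # Interval sums for every start position at this width.
        sums = csum[:, width:] - csum[:, :-width]
        # Pool across samples by summing the per-sample chi-square terms.
        stat = (sums ** 2).sum(axis=0) / width
        start = int(np.argmax(stat))
        if stat[start] > best[0]:
            best = (float(stat[start]), start, start + width)
    return best

# Toy example: three samples sharing a shifted segment at positions 5..8.
x = np.zeros((3, 20))
x[:, 5:9] = 2.0
result = sum_chisq_scan(x, max_width=6)
print(result)
```

Summing the per-sample terms, rather than taking their maximum, is what lets a weak signal shared by many samples accumulate.  Thresholds for declaring significance would come from approximations such as those derived in the paper and are not sketched here.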

Chan, HP, Tu, IP and Zhang, NR, 2008, Boundary Crossing Probability Computations in the Analysis of Scan Statistics, in Scan Statistics:  Methods and Applications, ed. Glaz, J., Pozdnyakov, V. and Wallenstein, S., 89-105 (Boston: Birkhäuser).

Download pdf.  

We review methods for obtaining accurate significance level approximations for scan statistics, and show their application to problems in epidemiology and genomics.   

Lai, TL, Xing, H and Zhang, NR, 2008, Stochastic segmentation models for array-based comparative genomic hybridization data analysis. Biostatistics 9, 290-307. 

Download pdf, software.  

We adapted stochastic change-point models for the analysis of DNA copy number data.  The novelty in our model is that it does not restrict the number of underlying states of the hidden Markov model, and thus is better suited for modeling cancer data, where fractional DNA copy number changes are common.  Exact equations for inference can be computed, with fast approximation methods.    

Zhang, NR, Wildermuth, MC, and Speed, TP, 2008, Transcription factor binding site prediction with multivariate gene expression data. Annals of Applied Statistics 2, 332-365. 

Download pdf, software (begin by reading file analysis_README).  

We propose a regression model that relates multivariate gene expression data to promoter sequence data. Previous similar regression-based studies treated each sample in a multi-sample experiment separately, thus losing sensitivity. We also propose a change-point model for the position effect of sequence motifs on gene expression.  The methods are applied to the analysis of data sets from yeast and Arabidopsis.

The ENCODE Consortium, 2007, Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447, 799-816.

Download pdf, software.  

This is the seminal paper for the ENCyclopedia Of DNA Elements (website).  We proposed rigorous methods for inference on statistics that operate along the genome sequence.  The statistical version of this paper is still under review and can be accessed here.  

Chan, HP and Zhang, NR, 2007, Scan statistics with weighted observations. JASA Theory and Methods 102, 595-602. 

Download pdf, Matlab code for analysis in paper.  

Here we look at fixed-window scans of marked Poisson processes, for which we derive significance level approximations.  These kinds of weighted scan statistics are useful for motif search in biological sequences, an example of which we study in detail in the paper.  
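To make the object of study concrete, here is a minimal Python sketch of a fixed-window weighted scan statistic: the largest total weight (mark) of events falling in any window of a given length.  It only computes the statistic and does not attempt the significance level approximations derived in the paper.  The function name `weighted_window_scan` is hypothetical, and the sketch assumes nonnegative weights, in which case it suffices to consider windows that start at an event.

```python
from bisect import bisect_left

def weighted_window_scan(times, weights, window):
    """Return (max_total_weight, window_start) over half-open windows
    [start, start + window) of fixed length.

    Assumption: weights are nonnegative, so the maximum over all window
    positions is attained by a window whose left end is an event time.
    """
    pts = sorted(zip(times, weights))
    ts = [t for t, _ in pts]
    # prefix[k] is the total weight of the first k events in time order.
    prefix = [0.0]
    for _, w in pts:
        prefix.append(prefix[-1] + w)
    best, best_start = 0.0, None
    for t in ts:
        lo = bisect_left(ts, t)            # first event at time >= t
        hi = bisect_left(ts, t + window)   # first event at time >= t + window
        total = prefix[hi] - prefix[lo]
        if total > best:
            best, best_start = total, t
    return best, best_start

# Toy example: event positions along a sequence, with a weight per event.
times = [1.0, 1.5, 4.0, 4.2, 4.4]
weights = [1.0, 2.0, 1.0, 1.0, 1.5]
print(weighted_window_scan(times, weights, window=1.0))
```

With all weights equal to 1 this reduces to the ordinary scan statistic that counts events per window.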

Zhang, NR and Siegmund, DO, 2006, A Modified Bayes Information Criterion with Applications to the Analysis of Comparative Genomic Hybridization Data. Biometrics 63, 22-32.

Download pdf, software.  

We derive a more accurate asymptotic approximation to the Bayes factor for change-point models, which gives a new form for the Bayes Information Criterion (modified BIC) whose penalty for model dimension depends on the change-point estimates.  The modified BIC is easy to compute and is demonstrated in the analysis of DNA copy number data.  

 

Preprints

 

Non Parametric Methods for Genomic Inference. (with Peter Bickel, Nathan Boley, Ben Brown, and Haiyan Huang)

Download pdf, software.  

In this paper we propose a segmentation-based block bootstrap method for doing inference on statistics that operate along a piecewise stationary sequence.  We give conditions under which this strategy yields consistent inference.  In particular, this strategy is used for assessing the significance of overlaps between two genomic features that are defined along the genome sequence.  The methods in this paper are described in cookbook form in the appendix of The ENCODE Consortium (2007).  

Importance Sampling of Word Patterns in DNA and Protein Sequences.  (with Hock Peng Chan and Louis Chen)

Download pdf.  

We provide a general importance sampling algorithm for efficient Monte Carlo evaluation of small p-values of pattern counting test statistics and apply it to word patterns of biological interest, in particular palindromes and inverted repeats, and patterns arising from position-specific weight matrices.  

Estimation of parent specific DNA copy number in tumors using high-density genotyping arrays. (with Hao Chen and Haipeng Xing)

Download pdf, software.  

Most current statistical models for DNA copy number estimate total copy number, which does not distinguish between the underlying quantities of the two inherited chromosomes. This latter information, which we call parent specific copy number, is important for identifying allele specific amplifications and deletions, for quantifying normal cell contamination, and for giving a more complete molecular portrait of the tumor.  This paper proposes a Markov jump process for this problem, which yields an inference procedure that can efficiently process data from high-density genotyping arrays.

DNA copy number profiling in normal and tumor genomes.

Download pdf.  

In this book chapter, I survey some of the current statistical and computational challenges in DNA copy number analysis.

Local average likelihood ratio test statistics with applications in genomics and change-point detection. (with Hock Peng Chan)

Download pdf.  

Here we propose a multiple testing procedure based on the average of maximum likelihood ratios (ALR).  The neat thing is that the tail behavior of the ALR does not depend on the correlation structure between the individual test statistics, and thus its p-value is easy to assess in complex problems.  We apply this procedure to several problems that arise in genomics.

An Algorithm for the Identification of Insertion-Deletion Mutations from Next Generation Sequencing.  (with Georges Natsoulis et al.)

We propose a method for detecting short insertions and deletions using short-read genome re-sequencing data.  The method is based on scanning for characteristic features in the coverage depth and mismatch frequency profiles.

  

Courses

         Statistics 191                              Applied statistics.

         Statistics 203                              Introduction to regression models and analysis of variance.

         Statistics 205                              Nonparametric statistics.

         Statistics 215                              Stochastic processes in Biology.

         Statistics 345                              Special topics course on computational biology.  (Spring 2008)

         Statistics 345/Genetics 245        Computational algorithms for statistical genetics. (Spring 2009)