
Assistant Professor of Statistics, Stanford University
Snail mail: 390 Serra Mall, Department of Statistics, Stanford University, Stanford, CA 94305.
Email: nzhang@stanford.edu
Fax: 650-725-8977
My research is data driven, with the data coming primarily from current biological
applications. Today's data sets are so rich in structure and so high in
dimension that probing any one of them is bound to uncover interesting statistical
problems, as well as new perspectives on classic statistical concepts. I
enjoy the modeling part of my work as much as the methodology part.
I am currently focusing on the analysis of DNA copy number data from
high-density SNP chips and next generation sequencing experiments. These
applications motivate new methods in change-point detection, scan statistics,
and model and variable selection.
Li, F and Zhang, NR, 2009, Bayesian Variable Selection in
Structured High-Dimensional Covariate Spaces with Applications in Genomics.
JASA Theory and Methods, in press.
We study Bayesian variable selection procedures in regression problems when the number of variables is very large and when there is a priori structural information. Certain computational and statistical problems arise that are unique to such high-dimensional, structured settings, the most interesting being the phenomenon of phase transitions. This problem, however, can be mitigated by careful selection of hyper-parameters. The methods are illustrated on two different graph structures: the circular linear chain and the regular graph of degree k. They are also applied to a problem in genomics: the modeling of transcription factor binding sites in DNA sequences.
Zhang, NR, Senbabaoglu, Y and Li, J, 2009, Joint Estimation of DNA Copy Number from
Multiple Platforms. Bioinformatics, in press.
In this paper, we propose a statistical model for pooling information across different platforms for copy number detection in a single biological sample. The model yields an intuitive statistic and a fast scanning algorithm. The software that we developed based on this model has been chosen by the Cancer Genome Atlas project to process all of their samples. Also, using an existing paired-end sequencing data set, we conducted a systematic comparison of the detection accuracy of microarray-based platforms and showed the benefits of the cross-platform procedure.
Zhang, N.R., Siegmund, D.O., Ji, H., and Li, J. 2009, Detecting simultaneous change-points in multiple sequences. Accepted for publication in Biometrika.
Download pdf, supplementary, software.
The problem studied here is motivated by
the biological problem of detecting recurrent copy number variants in multiple
genome samples. We propose simple
scan and segmentation algorithms based on summing the chi-square statistics for
each individual sample, a statistic that arises out of the generalized likelihood ratio
for a model where the errors in each sample are independent. The simple
geometry of the statistic allows us to derive accurate analytic approximations
to the significance level of such scans.
We show using replicates and parent-child comparisons that pooling data
across samples results in more accurate detection of copy number variants, and
that a cross-sample joint segmentation provides concise, interpretable
summaries for downstream analysis.
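As a rough illustration of the idea of summing per-sample chi-square statistics, here is a minimal sketch rather than the software linked above. It assumes each sample has already been centered to baseline mean 0 and scaled to unit variance, and the function name `sum_chisq_scan` and its parameters are hypothetical. It computes only the scan statistic and does not attempt the analytic significance approximations derived in the paper.

```python
import math


def sum_chisq_scan(samples, max_width):
    """Scan all windows (s, t] with t - s <= max_width.

    samples: list of equal-length lists, one per sample. Each is assumed
    (for this sketch only) to have baseline mean 0 and unit variance.
    Returns (best_statistic, s, t) for the window with the largest sum
    of squared standardized window sums across samples.
    """
    length = len(samples[0])

    # Prefix sums let each window sum be computed in constant time.
    prefix = []
    for row in samples:
        acc = [0.0]
        for value in row:
            acc.append(acc[-1] + value)
        prefix.append(acc)

    best = (-math.inf, 0, 0)
    for s in range(length):
        for t in range(s + 1, min(length, s + max_width) + 1):
            width = t - s
            # Squared standardized window sum, added up over samples.
            stat = sum((p[t] - p[s]) ** 2 / width for p in prefix)
            if stat > best[0]:
                best = (stat, s, t)
    return best


# Two toy samples sharing an elevated segment at positions 2 and 3.
toy = [[0, 0, 3, 3, 0, 0], [0, 0, 2, 2, 0, 0]]
print(sum_chisq_scan(toy, max_width=3))
```

On the toy input, the window covering the shared elevated segment scores highest, since both samples contribute to the sum there.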
Chan, HP, Tu, IP and Zhang, NR, 2008, Boundary Crossing
Probability Computations in the Analysis of Scan Statistics, in Scan
Statistics: Methods and Applications, ed. Glaz, J., Pozdnyakov, V.
and Wallenstein, S., 89-105 (Boston: Birkhauser).
We review methods for obtaining accurate significance
level approximations for scan statistics, and show their application to
problems in epidemiology and genomics.
Lai, TL, Xing, H and Zhang, NR, 2008, Stochastic
segmentation models for array-based comparative genomic hybridization data
analysis. Biostatistics 9, 290-307.
We adapted stochastic change-point models for the analysis
of DNA copy number data. The novelty in our model is that it does not
restrict the number of underlying states of the hidden Markov model, and thus
is better suited for modeling cancer data, where fractional DNA copy number
changes are common. Exact equations for inference can be computed, with
fast approximation methods.
Zhang, NR, Wildermuth, MC, and
Speed, TP, 2008, Transcription factor binding site prediction with multivariate
gene expression data. Annals of Applied Statistics 2,
332-365.
Download pdf, software (begin by reading
file analysis_README).
We propose a
regression model that relates multivariate gene expression data to promoter
sequence data. Previous similar regression-based studies treated each sample in
a multi-sample experiment separately, thus losing sensitivity. We also propose
a change-point model for the position effect of sequence motifs on gene
expression. The methods are applied
to the analysis of data sets from yeast and Arabidopsis.
The Encode Consortium, 2007, Identification
and analysis of functional elements in 1% of the human genome by the ENCODE
pilot project. Nature 447, 799-816.
This is the seminal paper for the ENCyclopedia Of
DNA Elements (website). We proposed rigorous methods for
statistical inference on statistics that are computed along the genome
sequence. The statistical version of this paper is still under
review and can be accessed here.
Chan, HP and Zhang, NR, 2007, Scan statistics with weighted
observations. JASA Theory and Methods 102, 595-602.
Download pdf, Matlab code for
analysis in paper.
Here we look at fixed-window scans of marked Poisson processes, for which we derive significance level approximations. These kinds of weighted scan statistics are useful for motif search in biological sequences, an example of which we study in detail in the paper.
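To make the notion of a fixed-window weighted scan concrete, here is a minimal sketch, not the Matlab code linked above. It assumes event positions are sorted and weights (marks) are nonnegative, in which case it is enough to consider windows that begin at an event. The function name `weighted_window_scan` is hypothetical, and the sketch computes only the scan statistic, not the significance level approximations derived in the paper.

```python
def weighted_window_scan(times, weights, width):
    """Maximum total weight of events in any window [t, t + width).

    Assumes (for this sketch only) that times are sorted in increasing
    order and weights are nonnegative, so only windows starting at an
    event time need to be checked.
    Returns (best_total, start_of_best_window).
    """
    best_total, best_start = 0.0, None
    running = 0.0
    right = 0
    for left in range(len(times)):
        # Extend the right edge to cover every event inside the window.
        while right < len(times) and times[right] < times[left] + width:
            running += weights[right]
            right += 1
        if running > best_total:
            best_total, best_start = running, times[left]
        # Drop the leftmost event before sliding the window forward.
        running -= weights[left]
    return best_total, best_start


# Toy marked point pattern with a heavy cluster near position 4.
positions = [0.0, 1.0, 1.5, 4.0, 4.2, 4.4]
marks = [1.0, 2.0, 0.5, 1.0, 1.0, 3.0]
print(weighted_window_scan(positions, marks, width=1.0))
```

The two-pointer sweep visits each event once as a left edge and once as a right edge, so the scan runs in linear time after sorting.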
Zhang, NR and Siegmund, DO, 2006, A Modified Bayes
Information Criterion with Applications to the Analysis of Comparative Genomic
Hybridization Data. Biometrics 63, 22-32.
We derive a more accurate asymptotic approximation to the Bayes factor for change-point models, which gives a new form for the Bayes Information Criterion (modified BIC) whose penalty for model dimension depends on the change-point estimates. The modified BIC is easy to compute and is demonstrated in the analysis of DNA copy number data.
Non Parametric Methods for Genomic Inference. (with Peter Bickel, Nathan Boley, Ben Brown,
and Haiyan Huang)
In this paper we propose a segmentation-based block bootstrap method for doing inference on statistics that operate along a piece-wise stationary sequence. We give conditions under which this strategy yields consistent inference. In particular, this strategy is used for assessing significance of overlaps between two genomic features that are defined along the genome sequence. The methods in this paper are described in cook-book form in the appendix of The Encode Consortium (2007).
Importance Sampling of Word Patterns in DNA and Protein
Sequences. (with Hock Peng Chan and Louis Chen)
We provide a general importance sampling algorithm for efficient Monte Carlo evaluation of small p-values of pattern counting test statistics and apply it to word patterns of biological interest, in particular palindromes, inverted repeats, and patterns arising from position specific weight matrices.
Estimation of parent specific DNA copy number in tumors using high-density genotyping arrays.
(with Hao Chen and Haipeng Xing)
Most current statistical models for DNA copy number estimate total copy
number, which does not distinguish between the underlying quantities of the two
inherited chromosomes. This latter information, which we call parent specific copy number, is important for identifying
allele specific amplifications and deletions, for quantifying normal cell
contamination, and for giving a more complete molecular portrait of the
tumor. This paper proposes a Markov
jump process for this problem, which yields an inference procedure that can
efficiently process data from high-density genotyping arrays.
DNA copy number profiling in normal and tumor genomes.
In this book chapter, I survey some of the current
statistical and computational challenges in DNA copy number analysis.
Local average likelihood ratio test statistics with
applications in genomics and change-point detection. (with Hock Peng Chan)
Here we propose a multiple testing procedure based on the
average of maximum likelihood ratios (ALR). The neat thing is that the tail behavior
of the ALR does not depend on the correlation structure between the individual
test statistics, and thus its p-value is easy to assess in complex
problems. We apply this procedure
to several problems that arise in genomics.
An Algorithm for the Identification of Insertion-Deletion
Mutations from Next Generation Sequencing.
(with Georges Natsoulis et al.)
We propose a method for detecting short insertions and
deletions using short-read genome re-sequencing data. The method is based on scanning for
characteristic features in the coverage depth and mismatch frequency profiles.
Statistics 191 Applied statistics.
Statistics 203 Introduction to regression models and analysis of variance.
Statistics 205 Nonparametric statistics.
Statistics 215 Stochastic processes in Biology.
Statistics 345 Special topics course on computational biology. (Spring 2008)
Statistics 345/Genetics 245 Computational algorithms for statistical genetics. (Spring 2009)