Student Research Samples
We have put together some brief descriptions of student
research. They are meant to provide a general view of the
kinds of problems that are being tackled in the
department. You will find out why our students chose a
career in statistics rather than another field.
Sudeshna Adak
When I came to Stanford, I had undergraduate and master's degrees in
statistics. But I found that I was not really prepared to jump into
research and start writing my thesis. I had no idea what I
wanted to work on. I spent my first year at Stanford wandering aimlessly
and wondering if I would not have been better off in a ``real'' job. I
also read a lot of books (on statistics, among other things) that seemed
interesting. In those early days and even today, it always seems that
the hardest part of research is finding a problem; one that is both
interesting and that I think I can solve (with my capabilities). In this
very undecided state, I went to Dr. Iain Johnstone and said that I
wanted to work on nonparametric stuff. I did not have a clue as to what
stuff, though I don't think I said that explicitly to Iain. I knew that
Iain and David Donoho were working on wavelets and it seemed like a new
and exciting area of research. This got me started reading (with Iain's
help) about wavelets.
The aspect of wavelets that caught my interest was that they were
being proposed as a better methodology for analyzing data, an
alternative to better-known and well-established methods. For
instance, Iain and David had shown the usefulness of wavelet methods
in nonparametric function estimation. Wavelets were also being used for
their feature-localization properties where more standard methods could
not be used. For instance, Fourier methods had been in use for a long time in
studying the periodic behavior in signals. But Fourier analysis is
useful in identifying only global frequencies. So, if a string is
vibrating at a frequency of 5Hz, the Fourier transform of the resulting
sound will show a peak at 5Hz (or thereabouts). But when the frequencies
change over time, Fourier analysis has to be modified. For instance, if
you were to say ``Hello, how are you?'', then each syllable has its
own signature frequency. These sound units, each with its signature
frequency, are called phonemes. So, to use Fourier analysis, you have to divide
up the sentence ``Hello, how are you?'' into its phonemes. So, how do
you divide up a sentence like this into phonemes? This was the problem I
started working on for my thesis. I then found that the same kind of
problem existed in another category of signals - seismograms. Seismogram
records of earthquakes also exhibit frequencies changing over time
which can (I hope) tell the trained geophysicist a lot about the nature
and location of the earthquake. I felt I had found a simple and
interesting problem for my thesis.
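The limitation of Fourier analysis described above can be seen in a small sketch. The sampling rate, the signal length, and the second frequency (12 Hz) are made-up values chosen for illustration, and the naive transform below is written out only to keep the example self-contained. A steady 5 Hz tone gives a single global peak, while a tone that changes halfway through has to be cut into pieces before each piece shows its own peak.

```python
import cmath
import math

def dft_magnitudes(signal):
    """Return |DFT| for frequency bins 0..n/2 (naive O(n^2) transform)."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        mags.append(abs(total))
    return mags

def peak_frequency(signal, sample_rate):
    """Frequency in Hz of the largest non-constant DFT bin."""
    mags = dft_magnitudes(signal)
    k = max(range(1, len(mags)), key=lambda i: mags[i])
    return k * sample_rate / len(signal)

SAMPLE_RATE = 64  # samples per second (assumed for illustration)
N = 128           # two seconds of signal

# A steady 5 Hz tone.
steady = [math.sin(2 * math.pi * 5 * t / SAMPLE_RATE) for t in range(N)]

# A tone that switches from 5 Hz to 12 Hz halfway through.
changing = [math.sin(2 * math.pi * (5 if t < N // 2 else 12) * t / SAMPLE_RATE)
            for t in range(N)]

steady_peak = peak_frequency(steady, SAMPLE_RATE)
first_half_peak = peak_frequency(changing[:N // 2], SAMPLE_RATE)
second_half_peak = peak_frequency(changing[N // 2:], SAMPLE_RATE)
print(steady_peak, first_half_peak, second_half_peak)
```

Here the break point is known in advance. The hard part, and the subject of the thesis problem, is finding such break points from the data.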
To solve this problem of detecting local changes in the
frequency patterns, I considered what is intuitively the natural solution.
As the frequencies change over time, if you were to break up the signal
into very small segments, then each tiny segment would be approximately
homogeneous, i.e. the frequency patterns should remain roughly constant
within that time period. Now examine a segment by comparing its left and
right halves. If the frequency patterns of the two halves are similar to one
another, then the two halves can be merged. Otherwise, the segment
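should be divided in half. So, thinking of this in
mathematical terms, if $f_L(I)$ and $f_R(I)$ denote the frequency
patterns of the left and right halves of a segment $I$, then for every
segmentation $S$ of the signal, we can define a measure
$$D(S) = \sum_{I \in S} d\bigl(f_L(I), f_R(I)\bigr),$$
where $d(\cdot,\cdot)$ represents a measure of distance between the
frequency patterns.
This measure quantifies how far the segmentation $S$ is from the
ideal one in which every segment is homogeneous, i.e. $f_L(I)$ and
$f_R(I)$ are identical for every segment.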
The next step is to devise an algorithm to search over all possible
segmentations and minimize this measure. But this is computationally too
intensive, so we restrict ourselves to searching over the
segmentations whose break points fall at dyadic points. And the rest is history.
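The halving idea can be sketched roughly as follows. This is a greedy, top-down simplification and not the search over all dyadic segmentations described above. The distance function (an L1 difference between normalized power spectra), the threshold, the minimum segment length, and the test signal are all hypothetical choices made for illustration.

```python
import cmath
import math

def spectrum(segment):
    """Normalized power spectrum of a segment (naive DFT, bins 1..n/2)."""
    n = len(segment)
    power = []
    for k in range(1, n // 2 + 1):
        c = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(segment))
        power.append(abs(c) ** 2)
    total = sum(power) or 1.0  # avoid dividing by zero on a flat segment
    return [p / total for p in power]

def distance(left, right):
    """Hypothetical distance: L1 difference between the two spectra."""
    return sum(abs(a - b) for a, b in zip(spectrum(left), spectrum(right)))

def dyadic_segments(signal, start=0, end=None, min_len=16, threshold=0.5):
    """Keep halving [start, end) while its two halves look different."""
    if end is None:
        end = len(signal)
    length = end - start
    mid = start + length // 2
    if length < 2 * min_len:
        return [(start, end)]  # too short to split further
    if distance(signal[start:mid], signal[mid:end]) <= threshold:
        return [(start, end)]  # halves agree: keep the segment whole
    return (dyadic_segments(signal, start, mid, min_len, threshold)
            + dyadic_segments(signal, mid, end, min_len, threshold))

RATE, N = 64, 128  # assumed sampling rate and length
steady = [math.sin(2 * math.pi * 5 * t / RATE) for t in range(N)]
changing = [math.sin(2 * math.pi * (5 if t < N // 2 else 12) * t / RATE)
            for t in range(N)]

steady_segments = dyadic_segments(steady)
changing_segments = dyadic_segments(changing)
print(steady_segments, changing_segments)
```

On the steady tone the two halves have matching spectra, so the signal stays in one piece. On the changing tone the first comparison fails, the signal is cut at the midpoint, and each half is then found to be homogeneous.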
Figure 1 shows the seismogram of the Loma Prieta
earthquake which hit San Francisco (and Stanford) in 1989. In
Figure 2, I have shown the best (dyadic) segmentation
tree. Figure 3 shows the relation between the frequency
of vibration and time during the earthquake.
I spent the next year working on the finer points of the algorithm and
still am. The best part of my experience at Stanford was the
encouragement I received from Iain and other faculty members. Another
aspect of the Stanford curriculum that I enjoyed very much is that
after the first year (a very busy one), you are essentially free to
take any course in any department. This gave me, and many other
graduate students, a lot of freedom to take courses in other
departments and find problems of interest to which we can apply our
statistical methods.
One of the thesis-unrelated projects that I worked on was a GUI
(graphical user interface) for a paper on wavelets, Exact Risk
Analysis of Wavelet Regression (Marron et al. (1996)). The GUI is a
nifty thing that lets the user click buttons to select one of several
options, such as a data set in the database and the particular
wavelet basis to use. Once you have selected the
options, the program automatically puts up the results on a screen. As
a student in India, my computing experience had amounted to just a
little more than logging in and out. And here was this whole new
world. Writing the computer code for the GUI (and for other similar
GUIs) was a fascinating learning experience. (It felt a little like
what it must have been like when computers first came into use.)
I also spent a few weeks collaborating with Dr. John Johnson and Dr.
Daria Mochly-Rosen who were from the department of Molecular
Pharmacology in the Stanford Medical School. The doctors had come to
Iain with a problem in biostatistics and Iain brought it to my notice.
Dr. Johnson and Dr. Mochly-Rosen were looking at the effect of
prolonged phorbol ester treatment on the cardiac cells of neonatal
rats. The phorbol ester (PMA) is known to activate protein kinase C
(PKC), which controls the rate of contraction of the heart. Different
concentrations of PMA can effectively control the activity of
protein kinase C and, through that, the rate of contraction of cardiac
cells. The ultimate objective is to obtain a better understanding of
the role of protein kinase C in the modulation of cardiac functions, which may
reveal selective targets for therapies of cardiac disorders. The
biostatistical problem at hand was to figure out the relationship
between concentration of PMA and the contraction rate of the cardiac
cells (of new-born rats). This project actually brought me in touch
with ``real data'' where I could experiment with different models and
statistical methodologies such as generalized linear models, bootstrap
and nonparametric curve fitting.
I am now working as a post-doc in the Harvard School of Public Health
(where I will be an assistant professor from next fall). I spent four
wonderful years at the Department of Statistics at Stanford. I had the
chance to work on some fascinating projects and I always had the
feeling that I was learning a lot. And, I shall miss Stanford!
Gareth James
I have recently become interested in exploring the new developments
in classification problems. In a classification problem you
use some input data to assign (or predict) each observation to one of
several possible groups. A good example is the US Post Office Zip Code
data. With this data you have the pixels from a scanned image of a
handwritten digit. The aim is to use the data from the pixels (i.e.
position, shade, etc.) to predict the class of the image
(i.e. 0,1,...,9). However, classification problems abound in every
field from medicine to finance.
Here is a simple example to illustrate how one might proceed in
doing the classification. Consider the digit 2. Since
different people write differently, the digit might arrive in many
different shapes. The first step is to transform the digit into
a standard form. The first two rows below illustrate how the
original shape is transformed into a standard form which is then used
to classify the digit.
The picture below shows "before-and-after" results for the
digits 2,3,4.
There have been some exciting developments in this area
recently, with old classification techniques being used in new, more
adaptive ways that produce far better results. However, at present
there is not really a clear understanding of why a number of these new
methods work so well, which means there is plenty of scope for new
research.
Until I arrived at Stanford I had only been exposed to more
classical statistics (ANOVA, confidence intervals, hypothesis
testing, etc.) and it had not occurred to me that a statistician could
provide any insight for this sort of problem. However, I quickly
discovered that statistics is a far more diverse field and that this
is exactly the sort of thing that we are trained to tackle.
Catherine Sugar
Hi! My name is Catherine Sugar, and I am a third year PhD student.
Actually, this is my fifth year at Stanford; I started doing a PhD in
the mathematics department, but switched to statistics after two
years. I really liked theoretical mathematics, and still do, but as
time went on, I decided I really needed to do something with more
immediate applications and implications for helping people.
Statistics allows me to do mathematical research, some of it very
abstract, and at the same time work on applied consulting projects.
All in all, I am delighted with the decision I made to switch fields.
The problems I am working on are exciting and so is the atmosphere in
the statistics department.
My main statistical interests have to do with clustering and
classification problems. I am particularly interested in applications
to medical problems, and hope to get an academic position in
biostatistics when I finish my degree. At the moment I am working on
three different problems. The newest one has to do with identifying
groups of amino acids in HIV proteins that tend to vary at the same
time. If this can be done successfully, it will greatly increase
understanding of how these proteins work, and perhaps what can be done
to prevent adaptive mutations. I am just starting to work on this
problem, so don't have results to report yet, but it looks very
exciting.
I am also involved in a project with people in the nephrology
division of the medical school who study auto-immune diseases of the
kidney. People with these diseases can become extremely ill, and
often require dialysis and transplants to survive if aggressive
measures are not taken early in the illness. Unfortunately, these
treatments have many dangerous and unpleasant side effects.
Furthermore, although some patients ``go to hell in a hand-basket,''
some deteriorate very gently and gradually, and others show symptoms,
but retain normal kidney function for a very long time. Therefore,
you really only want to use drastic measures on the patients whose
condition will deteriorate rapidly. The trick is to predict, based on
data on initial medical tests, which patients will fall into which of
the three categories. Unfortunately, standard methods such as linear
and quadratic discriminant analysis do not work well in this context.
There are many, many initial variables, and very few data points. (In
fact we only have 40 patients with full data, approximately equally
split among the three groups.) As a result, the estimated covariance
matrices needed to do discriminant analysis tend to be very unstable,
or even singular. We are looking at ways to get around this problem
using more robust ``shrinkage'' estimators. The paper that is the
basis for this application was written by Jerry Friedman, a professor
in our department, and is called Regularized Discriminant
Analysis (JASA, March 1989). If the technique works well, it will
have implications for many similar medical problems.
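The instability described above, and the way shrinkage repairs it, can be illustrated with a tiny sketch. The data are made up, and the estimator shown (blending the sample covariance with a scaled identity matrix, with a hypothetical weight gamma) is one common form of shrinkage, not necessarily the exact estimator of the paper cited above.

```python
def covariance(rows):
    """Sample covariance matrix (list of lists) of the observations."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]

def shrink(cov, gamma):
    """Return (1 - gamma) * cov + gamma * (trace(cov) / p) * identity."""
    p = len(cov)
    scale = sum(cov[i][i] for i in range(p)) / p
    return [[(1 - gamma) * cov[i][j] + (gamma * scale if i == j else 0.0)
             for j in range(p)] for i in range(p)]

def det2(m):
    """Determinant of a 2 x 2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

# Made-up data: only two patients, each with two test results.
patients = [[1.0, 2.0], [3.0, 6.0]]

raw = covariance(patients)   # rank one, so it cannot be inverted
shrunk = shrink(raw, 0.2)    # gamma = 0.2 is an arbitrary choice

raw_det = det2(raw)
shrunk_det = det2(shrunk)
print(raw_det, shrunk_det)
```

With so few observations the raw estimate has determinant zero, so discriminant analysis cannot invert it. The shrunken estimate is invertible, at the cost of some bias toward a spherical covariance.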
The project that has taken most of my time over the last few months
is some collaborative work with people in the clinical pharmacology
division. They are interested in assessing the quality of life of
patients with various diseases, as a basis for making public policy
decisions about how to spend medical resources. (This sort of work
was the basis for the Oregon Health Care Initiative several years
ago.) In order to measure quality of life, it is necessary to break
the spectrum of patient conditions into discrete ``health states''
which the patients can evaluate. In the past, this division has been
done in a very subjective and inefficient manner, with experts in the
field defining what seem to them important aspects of health. We have
looked at a way to use cluster analysis to form health states in an
objective, data-driven manner. We are currently working with two data
sets. One is from the RAND corporation, and consists of patients with
varying degrees of depression. The other is from Janssen
pharmaceutical corporation (a big drug company in New Jersey) and
involves schizophrenia patients. Our procedure in both cases goes
approximately like this: Patients in the study are asked questions
from a standard and well-validated psychiatric questionnaire which
obtains information about their health and general functioning. You
can think of each patient's responses as a point in n-dimensional
Euclidean space. Thus you get a cloud of points representing the
whole data set. However, it is very hard to work in this many
dimensions, and in fact the data cloud appears to be lower dimensional
than the number of items on the questionnaire. Thus you come up with
a smaller number of composite scores for the patients which summarize
the important dimensions in the data. (This can be done by principal
components or factor analysis, for those who are interested.) In our
depression data sets, we only really needed two dimensions, one for
physical health, and one for mental health. Based on the composite
scores, we look for the best groups in the data--i.e. places where the
data is ``clumped'' together. The particular technique we used is
called k-means cluster analysis. Basically, if you tell the algorithm
you want k clusters, it tries to divide the data into k groups
where the points are, on average, as close together as possible. Once
we have identified the best clusters, the medical experts can make up
health state descriptions based on the responses of people in the
clusters to the different items on the original questionnaire. This
turns out to be much more efficient (i.e. requires fewer health states
to adequately describe a patient population) than traditional methods,
and still produces clinically sensible health states. It also has the
advantage of being very empirical, and hence less subjective.
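For readers who want to see the clustering step concretely, here is a minimal sketch of k-means (the standard alternating scheme often called Lloyd's algorithm). The composite scores and the starting centers are invented for illustration and are not taken from either study.

```python
import math

def kmeans(points, centers, iterations=20):
    """Alternate between assigning points and recomputing centers."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(len(centers)),
                      key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break  # nothing moved, so the algorithm has converged
        centers = new_centers
    return labels, centers

# Made-up (physical health, mental health) composite scores.
scores = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

# Hypothetical starting centers: the first and fourth patients.
labels, centers = kmeans(scores, [scores[0], scores[3]])
print(labels, centers)
```

The number of clusters k is fixed by the number of starting centers supplied, and the result can depend on where those centers start.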
This project has given me a number of wonderful opportunities. For
instance, I got to actually go to Janssen for a few days to consult
and do some of the data analysis. The data we were using is from a
major clinical trial of their most successful schizophrenia drug,
risperidone, and its estimated value is around $250 million!
They did not want to let it off their premises... I also have been to
Toronto to give a talk about this project, and have several papers on
it in progress.
In addition, the project has sparked a number of theoretically
interesting statistical questions. For example, good methods are
known for implementing the k-means clustering algorithm, but there is
not an established mechanism for deciding what is the ``right'' number
of clusters. I am currently studying this and related issues.
Lest all these projects give you the impression that I don't have a
life, let me tell you some of my outside interests. I cook fairly
seriously (all sorts of ethnic foods), am a confirmed bookworm (I
especially like 19th century British lit., mysteries, and
fantasy/scifi), enjoy hiking and visiting state/national parks
(Yosemite is pretty close, but I don't get to visit it as often as I
would like--see the projects above!) and play the violin, viola, and
piano. Well, I guess to be honest, I really only play the violin at
the moment (I play in the Stanford Symphony) because I'm too busy...