Sequoia Hall

Student Research Samples

We have put together some brief descriptions of student research. They are meant to provide a general view of the kinds of problems that are being tackled in the department. You will find out why our students chose a career in statistics rather than another field.

Sudeshna Adak
When I came to Stanford, I had an undergraduate and a master's degree in statistics. But I found that I was not really prepared to jump into research and start writing my thesis. I had no idea what I wanted to work on. I spent my first year at Stanford wandering aimlessly and wondering if I would not have been better off in a "real" job. I also read a lot of books (on statistics, among other things) that seemed interesting. In those early days, and even today, it always seems that the hardest part of research is finding a problem: one that is both interesting and that I think I can solve (with my capabilities). In this very undecided state, I went to Dr. Iain Johnstone and said that I wanted to work on nonparametric stuff. I did not have a clue as to what stuff, though I don't think I said that explicitly to Iain. I knew that Iain and David Donoho were working on wavelets, and it seemed like a new and exciting area of research. This got me started reading (with Iain's help) about wavelets.

The aspect of wavelets that caught my interest was that they were being proposed as a better methodology for analyzing data - an alternative to better-known and well-established methods. For instance, Iain and David had shown the usefulness of wavelet methods in nonparametric function estimation. Wavelets were also being used for their feature-localization properties where more standard methods could not be used. For instance, Fourier methods had been in use for a long time in studying the periodic behavior in signals. But Fourier analysis is useful in identifying only global frequencies. So, if a string is vibrating at a frequency of 5 Hz, the Fourier transform of the resulting sound will show a peak at 5 Hz (or thereabouts). But when the frequencies change over time, Fourier analysis has to be modified. For instance, if you were to say "Hello, how are you?", then each syllable has its own signature frequency. These letters/syllables with their signature frequencies are called phonemes. So, to use Fourier analysis, you have to divide up the sentence "Hello, how are you?" into its phonemes. So, how do you divide up a sentence like this into phonemes? This was the problem I started working on for my thesis. I then found that the same kind of problem existed in another category of signals - seismograms. Seismogram records of earthquakes also exhibit frequencies changing over time, which can (I hope) tell the trained geophysicist a lot about the nature and location of the earthquake. I felt I had found a simple and interesting problem for my thesis.
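
As a rough illustration of the point about global versus local frequencies, here is a minimal Python sketch. The sampling rate, the signal length, and the 20 Hz second tone are made-up values chosen for the example, and the direct DFT is used only to keep the code self-contained; none of this is taken from the thesis work itself.

```python
import cmath
import math

def dft_magnitudes(signal):
    """Return |X[k]| for k = 0 .. n//2 using the direct O(n^2) DFT definition."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        mags.append(abs(coeff))
    return mags

def dominant_frequency(signal, sample_rate):
    """Frequency (Hz) of the largest non-constant DFT coefficient."""
    mags = dft_magnitudes(signal)
    k_peak = max(range(1, len(mags)), key=lambda k: mags[k])
    return k_peak * sample_rate / len(signal)

sample_rate = 100   # samples per second (assumed for illustration)
n = 200             # two seconds of signal

# A "string" vibrating steadily at 5 Hz.
steady = [math.sin(2 * math.pi * 5 * t / sample_rate) for t in range(n)]

# A signal whose frequency changes halfway: 5 Hz, then 20 Hz (hypothetical).
changing = [math.sin(2 * math.pi * (5 if t < n // 2 else 20) * t / sample_rate)
            for t in range(n)]

peak_steady = dominant_frequency(steady, sample_rate)

# Analysing each half separately recovers the local frequencies.
peak_first = dominant_frequency(changing[: n // 2], sample_rate)
peak_second = dominant_frequency(changing[n // 2:], sample_rate)
```

Here the split point is known in advance. Finding it from the data alone is the segmentation problem described in the text.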

To solve this problem of detecting local changes in the frequency patterns, I considered what is intuitively the natural solution. As the frequencies change over time, if you were to break up the signal into very small segments, then each tiny segment would be approximately homogeneous, i.e. the frequency patterns should remain roughly constant within that time period. Examine a segment by comparing its left and right halves. If the frequency patterns of the two halves are similar to one another, then the two halves can be merged. Otherwise, the segment should be divided in half. So, thinking of this in mathematical terms, for every segmentation S of the signal we can define a measure built from a distance between the frequency patterns of the left and right halves of each segment in S. This measure quantifies how far the segmentation S is from the ideal one in which every segment is homogeneous, i.e. the frequency patterns of the left and right halves are identical for every segment.

The next step is to devise an algorithm to search over all possible segmentations and minimize this measure. But this is computationally too intensive, and so we restrict ourselves to searching over all segmentations that occur at dyadic points. And the rest is history.
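
The merge-or-split idea can be sketched in a few lines of Python. This is a simplified, greedy top-down version, not the minimization actually used in the thesis. The distance (Euclidean distance between normalised periodograms), the `threshold`, and the `min_len` parameter are all assumptions introduced for illustration.

```python
import cmath
import math

def spectrum(segment):
    """Normalised periodogram of a segment (sums to 1 unless the segment is flat)."""
    n = len(segment)
    power = []
    for k in range(1, n // 2 + 1):
        c = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(segment))
        power.append(abs(c) ** 2)
    total = sum(power)
    return [p / total for p in power] if total > 0 else power

def half_distance(segment):
    """Distance between the frequency patterns of the left and right halves."""
    mid = len(segment) // 2
    left, right = spectrum(segment[:mid]), spectrum(segment[mid:])
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(left, right)))

def dyadic_segmentation(signal, start=0, end=None, min_len=16, threshold=0.5):
    """Return (start, end) pairs; split points fall only at dyadic positions."""
    if end is None:
        end = len(signal)
    length = end - start
    if length < 2 * min_len or half_distance(signal[start:end]) <= threshold:
        return [(start, end)]          # halves look alike: keep them merged
    mid = start + length // 2          # halves differ: divide in half
    return (dyadic_segmentation(signal, start, mid, min_len, threshold)
            + dyadic_segmentation(signal, mid, end, min_len, threshold))

# Toy signal: 64 samples with period 8, then 64 samples with period 4.
signal = ([math.sin(2 * math.pi * t / 8) for t in range(64)]
          + [math.sin(2 * math.pi * t / 4) for t in range(64)])
segments = dyadic_segmentation(signal)
```

On the toy signal the only split is at the midpoint, where the frequency changes. A greedy rule like this can miss changes that happen to look alike in both halves of a long segment, which is one reason to minimize a measure over the whole dyadic tree instead.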

Figure 1 shows the seismogram of the Loma Prieta earthquake which hit San Francisco (and Stanford) in 1989. In Figure 2, I have shown the best (dyadic) segmentation tree. Figure 3 shows the relation between the frequency of vibration and time during the earthquake.


Figure 1: Loma Prieta Earthquake, San Francisco (1989)

Figure 2: Best (dyadic) Segmentation Tree
Figure 3: Estimated Time-Dependent Spectrum

I spent the next year working on the finer points of the algorithm, and I still am. The best part of my experience at Stanford was the encouragement I received from Iain and other faculty members. Another aspect of the Stanford curriculum that I enjoyed very much is that after the first year (a very busy one), you are essentially free to take any course in any department. This gave me, and many other graduate students as well, a lot of freedom to take courses in other departments and find problems of interest to which we can apply our statistical methods.

One of the thesis-unrelated projects that I worked on was a GUI (graphical user interface) for a paper on wavelets - Exact Risk Analysis of Wavelet Regression (Marron et al. (1996)). The GUI is a nifty thing that lets the user click buttons to select one of several options, such as a data set in the database and the particular wavelet basis to use. Once you have selected the options, the program automatically puts up the results on a screen. As a student in India, my computing experience had amounted to just a little more than logging in and out. And here was this whole new world. Writing the computer code for the GUI (and for other similar GUIs) was a fascinating learning experience. (It felt a little like what it must have been when computers first came into use.)

I also spent a few weeks collaborating with Dr. John Johnson and Dr. Daria Mochly-Rosen, who were from the Department of Molecular Pharmacology in the Stanford Medical School. The doctors had come to Iain with a problem in biostatistics and Iain brought it to my notice. Dr. Johnson and Dr. Mochly-Rosen were looking at the effect of prolonged phorbol ester treatment on the cardiac cells of neonatal rats. The phorbol ester (PMA) is known to activate protein kinase C (PKC), which controls the rate of contraction of the heart. Different concentrations of PMA can effectively control the activity of protein kinase C and, through that, the rate of contraction of cardiac cells. The ultimate objective is to obtain a better understanding of the role of protein kinase C in the modulation of cardiac functions, which may reveal selective targets for therapies of cardiac disorders. The biostatistical problem at hand was to figure out the relationship between the concentration of PMA and the contraction rate of the cardiac cells (of newborn rats). This project actually brought me in touch with "real data" where I could experiment with different models and statistical methodologies such as generalized linear models, the bootstrap and nonparametric curve fitting.

I am now working as a post-doc in the Harvard School of Public Health (where I will be an assistant professor from next fall). I spent four wonderful years at the Department of Statistics at Stanford. I had the chance to work on some fascinating projects and I always had the feeling that I was learning a lot. And, I shall miss Stanford!

Gareth James
I have recently become interested in exploring the new developments in classification problems. A classification problem is one where you are trying to use some input data to assign (or predict) an observation to one of several possible groups. A good example is the US Post Office Zip Code Data. With this data you have the pixels from a scanned image of a handwritten digit. The aim is to use the data from the pixels (i.e. position, shade, etc.) to predict the class of the image (i.e. 0,1,...,9). However, classification problems abound in every field from medicine to finance.

Here is a simple example to illustrate how one might proceed in doing the classification. Consider the digit 2. Since different people write differently, the digit might arrive in many different shapes. The first step is to transform the digit into a standard form. The first two rows below illustrate how the original shape is transformed into a standard form which is then used to classify the digit.



The picture below shows "before-and-after" results for the digits 2,3,4.

There have been some exciting developments in this area recently, with old classification techniques being used in new, more adaptive ways which produce far better results. However, at present there is not really a clear understanding of why a number of these new methods work so well, which means there is plenty of scope for new research.

Until I arrived at Stanford I had only been exposed to more classical statistics (ANOVA, confidence intervals, hypothesis testing, etc.) and it had not occurred to me that a statistician could provide any insight for this sort of problem. However, I quickly discovered that statistics is a far more diverse field and that this is exactly the sort of thing that we are trained to tackle.

Catherine Sugar
Hi! My name is Catherine Sugar, and I am a third year PhD student. Actually, this is my fifth year at Stanford; I started doing a PhD in the mathematics department, but switched to statistics after two years. I really liked theoretical mathematics, and still do, but as time went on, I decided I really needed to do something with more immediate applications and implications for helping people. Statistics allows me to do mathematical research, some of it very abstract, and at the same time work on applied consulting projects. All in all, I am delighted with the decision I made to switch fields. The problems I am working on are exciting and so is the atmosphere in the statistics department.

My main statistical interests have to do with clustering and classification problems. I am particularly interested in applications to medical problems, and hope to get an academic position in biostatistics when I finish my degree. At the moment I am working on three different problems. The newest one has to do with identifying groups of amino acids in HIV proteins that tend to vary at the same time. If this can be done successfully, it will greatly increase understanding of how these proteins work, and perhaps what can be done to prevent adaptive mutations. I am just starting to work on this problem, so I don't have results to report yet, but it looks very exciting.

I am also involved in a project with people in the nephrology division of the medical school who study autoimmune diseases of the kidney. People with these diseases can become extremely ill, and often require dialysis and transplants to survive if aggressive measures are not taken early in the illness. Unfortunately, these treatments have many dangerous and unpleasant side effects. Furthermore, although some patients "go to hell in a handbasket," some deteriorate very gently and gradually, and others show symptoms, but retain normal kidney function for a very long time. Therefore, you really only want to use drastic measures on the patients whose condition will deteriorate rapidly. The trick is to predict, based on data on initial medical tests, which patients will fall into which of the three categories. Unfortunately, standard methods such as linear and quadratic discriminant analysis do not work well in this context. There are many, many initial variables, and very few data points. (In fact we only have 40 patients with full data, approximately equally split among the three groups.) As a result, the estimated covariance matrices needed to do discriminant analysis tend to be very unstable, or even singular. We are looking at ways to get around this problem using more robust "shrinkage" estimators. The paper that is the basis for this application was written by Jerry Friedman, a professor in our department, and is called Regularized Discriminant Analysis (JASA, March 1989). If the technique works well, it will have implications for many similar medical problems.
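
To see why shrinkage helps when there are more variables than patients, here is a small Python sketch. It shows only one ingredient of this kind of regularization, shrinking a covariance estimate toward a multiple of the identity matrix, and is not the full procedure from the paper. The data values and the choice `gamma=0.2` are made up for illustration.

```python
def sample_covariance(rows):
    """Unbiased sample covariance matrix of the rows (observations)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]

def shrink_toward_identity(cov, gamma):
    """(1 - gamma) * cov + gamma * (trace(cov) / p) * I, with 0 <= gamma <= 1."""
    p = len(cov)
    scale = sum(cov[i][i] for i in range(p)) / p
    return [[(1 - gamma) * cov[i][j] + (gamma * scale if i == j else 0.0)
             for j in range(p)] for i in range(p)]

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n, det = len(a), 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0                      # numerically singular
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det

# Made-up data: 3 patients, 4 measurements each (fewer patients than variables).
patients = [[1.0, 2.0, 0.5, 3.0],
            [2.0, 1.0, 1.5, 2.0],
            [3.0, 4.0, 0.0, 1.0]]

cov = sample_covariance(patients)             # singular: cannot be inverted
shrunk = shrink_toward_identity(cov, gamma=0.2)   # invertible for gamma > 0
```

With three patients the sample covariance has rank at most two, so its determinant is zero. The shrunken version has a strictly positive determinant and can be used in a discriminant rule.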

The project that has taken most of my time over the last few months is some collaborative work with people in the clinical pharmacology division. They are interested in assessing the quality of life of patients with various diseases, as a basis for making public policy decisions about how to spend medical resources. (This sort of work was the basis for the Oregon Health Care Initiative several years ago.) In order to measure quality of life, it is necessary to break the spectrum of patient conditions into discrete "health states" which the patients can evaluate. In the past, this division has been done in a very subjective and inefficient manner, with experts in the field defining what seem to them important aspects of health. We have looked at a way to use cluster analysis to form health states in an objective, data-driven manner. We are currently working with two data sets. One is from the RAND Corporation, and consists of patients with varying degrees of depression. The other is from the Janssen pharmaceutical corporation (a big drug company in New Jersey) and involves schizophrenia patients. Our procedure in both cases goes approximately like this: Patients in the study are asked questions from a standard and well-validated psychiatric questionnaire which obtains information about their health and general functioning. You can think of each patient's responses as a point in n-dimensional Euclidean space, where n is the number of items on the questionnaire. Thus you get a cloud of points representing the whole data set. However, it is very hard to work in this many dimensions, and in fact the data cloud appears to be lower dimensional than the number of items on the questionnaire. Thus you come up with a smaller number of composite scores for the patients which summarize the important dimensions in the data. (This can be done by principal components or factor analysis, for those who are interested.) In our depression data sets, we only really needed two dimensions, one for physical health, and one for mental health.
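
For readers who want to see what a composite score looks like in practice, here is a minimal Python sketch of the first principal component, computed by power iteration. The questionnaire responses are invented, the function name is hypothetical, and only one component is extracted, whereas the analysis described above used two.

```python
import math

def first_principal_component(rows, iterations=200):
    """Leading eigenvector of the sample covariance matrix, via power iteration."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(p)]
           for i in range(p)]
    v = [1.0] * p                      # arbitrary starting direction
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Composite score of each patient: projection onto the leading direction.
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centred]
    return v, scores

# Made-up responses: 5 patients, 4 questionnaire items that tend to move together.
responses = [[1, 1, 2, 1],
             [2, 2, 2, 2],
             [3, 3, 4, 3],
             [4, 4, 4, 4],
             [5, 5, 6, 5]]

loadings, scores = first_principal_component(responses)
```

Each patient's four answers are reduced to a single number, and patients with higher answers across the board receive higher scores.
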
Based on the composite scores, we look for the best groups in the data--i.e. places where the data is "clumped" together. The particular technique we used is called k-means cluster analysis. Basically, if you tell the algorithm you want k clusters, it tries to divide the data into k groups where the points are, on average, as close together as possible. Once we have identified the best clusters, the medical experts can make up health state descriptions based on the responses of people in the clusters to the different items on the original questionnaire. This turns out to be much more efficient (i.e. requires fewer health states to adequately describe a patient population) than traditional methods, and still produces clinically sensible health states. It also has the advantage of being very empirical, and hence less subjective.
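
The k-means idea can be sketched as follows. This is the standard alternating scheme (often called Lloyd's algorithm), with a deliberately simple starting rule, and is not necessarily the implementation used in the project. The composite scores are invented for illustration.

```python
def k_means(points, k, iterations=100):
    """Lloyd's algorithm; the first k points serve as starting centres (an assumption)."""
    centres = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centre.
        labels = [min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centres[c])))
                  for p in points]
        # Update step: move each centre to the mean of its assigned points.
        new_centres = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centres.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centres.append(centres[c])   # keep an empty cluster's centre
        if new_centres == centres:
            break                                # assignments have stabilised
        centres = new_centres
    return labels, centres

# Made-up composite scores: (physical health, mental health) for six patients.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (0.9, 1.1), (8.8, 9.2), (9.1, 8.9)]
labels, centres = k_means(points, k=2)
```

The two resulting centres sit in the middle of the low-scoring and high-scoring groups. In the application, each such cluster would be handed to the medical experts to describe as a health state.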

This project has given me a number of wonderful opportunities. For instance, I got to actually go to Janssen for a few days to consult and do some of the data analysis. The data we were using is from a major clinical trial of their most successful schizophrenia drug, risperidone, and its estimated value is around $250 million! They did not want to let it off their premises... I also have been to Toronto to give a talk about this project, and have several papers on it in progress.

In addition, the project has sparked a number of theoretically interesting statistical questions. For example, good methods are known for implementing the k-means clustering algorithm, but there is not an established mechanism for deciding what is the "right" number of clusters. I am currently studying this and related issues.

Lest all these projects give you the impression that I don't have a life, let me tell you some of my outside interests. I cook fairly seriously (all sorts of ethnic foods), am a confirmed bookworm (I especially like 19th century British lit., mysteries, and fantasy/scifi), enjoy hiking and visiting state/national parks (Yosemite is pretty close, but I don't get to visit it as often as I would like--see the projects above!) and play the violin, viola, and piano. Well, I guess to be honest, I really only play the violin at the moment (I play in the Stanford Symphony) because I'm too busy...


Copyright 2004 Stanford University. All Rights Reserved. Stanford, CA 94305, (650) 723-2300