Stat 315c: Learning from Transposable data

Overview

Very commonly in statistics we arrange our data into rows and columns. There are n rows corresponding to n IID observations and p columns corresponding to p fixed variables, often split into predictors and one or more responses. The columns are named entities that we wish to study. The rows are anonymous, with no intrinsic interest, apart from what they tell us about the column variables. Lots of data does not fit this paradigm. Sometimes the specific rows in our data set are just as important as the columns.

This type of problem is not new, but it has become more prevalent recently. Here are some examples of what the rows, columns, and entries are:

The data can always be cast as triples (row ID, col ID, Val), which looks like a classic setup. But then the number of levels of each variable tends to grow with the sample size. We can reasonably expect the next batch of observations to bring a few new levels.
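As a rough illustration of the triple format, here is a minimal Python sketch. The toy ratings, the names in them, and the helpers to_triples and count_levels are all hypothetical and are not taken from any course data set. It casts a sparsely filled table into (row ID, col ID, Val) triples and counts the distinct row and column levels before and after a new batch arrives.

```python
from collections import namedtuple

# One observed entry of the data matrix: (row ID, col ID, Val).
Triple = namedtuple("Triple", ["row_id", "col_id", "val"])

# Hypothetical toy data: named rows, named columns, most entries unobserved.
matrix = {
    "alice": {"film_a": 5.0, "film_b": 3.0},
    "bob": {"film_b": 4.0},
    "carol": {"film_a": 2.0, "film_c": 1.0},
}


def to_triples(nested):
    """Flatten a {row: {col: val}} table into a list of triples."""
    return [
        Triple(row_id, col_id, val)
        for row_id, cols in nested.items()
        for col_id, val in cols.items()
    ]


def count_levels(triples):
    """Return (number of distinct row IDs, number of distinct col IDs)."""
    rows = {t.row_id for t in triples}
    cols = {t.col_id for t in triples}
    return len(rows), len(cols)


triples = to_triples(matrix)
first_batch = count_levels(triples)

# A hypothetical next batch: one entry reuses known levels,
# the other brings a new row level and a new column level.
new_batch = [Triple("bob", "film_a", 3.0), Triple("dave", "film_d", 4.0)]
second_batch = count_levels(triples + new_batch)

print(first_batch, second_batch)
```

In this toy case both the row and column level counts grow when the second batch is added, which is the feature that separates these data from a classic setup where categorical variables have a fixed set of levels.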

About the name

Some years ago I called this topic 'transposable data'. The reason is that both the data matrix and its transpose can be looked at as having named columns and disposable rows used to learn about the columns, depending on your goals. You could analyze X on Monday, Wednesday, and Friday while looking at X' on Tuesday and Thursday. I'm open to suggestions for a better name. Maybe 'My Big Fat Data Matrix' will do. Many of the data sets are sparsely sampled and so fit the dyadic data framework. But many other data settings are not dyadic. There is overlap with "small n, big p" problems but just as often it's "big n, big p".

Who should take it?

This course is aimed at people who want to learn about methods, old and new, for large data sets with named rows and columns. It is also useful for people looking for a field with opportunities for new research problems.

Instructor

Art Owen
Sequoia Hall 130
My userid is owenbuzzard on stat.stanfordbuzzard.edu (remember to remove the carrion eaters)
Office hour: Wed 11:00

Classes

M 2:15-3:05 in 160-315 & W 2:15-3:30 in McCullough 1126

Starts Wednesday April 2

[Monday changed to avoid conflict with Stat 252, data mining]


Topics


Readings

There is no text on this material. There is a web page of research articles for background reading.

TA


Evaluation

Be sure to give Axess a working email address:
I expect to send a small number of important emails about problem sets and the homework to that address. Most other announcements will be made in class.
Late penalties apply:
We will count days late on each problem set. Homework is late if it is not turned in in class. HW turned in on the due date but after class ends is one day late. The next day is two days late, and so on. Each day late is penalized by 10% of the homework value. Homework more than 3 days late will ordinarily get 0. If you're travelling, you can email a pdf file. For sickness, interviews and other events, up to 3 late days total are forgiven at the end of the quarter. (Work late enough to get zero does not get redeemed though.)

Problems

Problems (passwd given in class)