Very commonly in statistics we arrange our data into rows and columns. There are n rows corresponding to n IID observations and p columns corresponding to p fixed variables, often split into predictors and one or more responses. The columns are named entities that we wish to study. The rows are anonymous, with no intrinsic interest apart from what they tell us about the column variables. Lots of data does not fit this paradigm. Sometimes the specific rows in our data set are just as important as the columns.
This type of problem is not new, but it has recently become more prevalent. The data can always be cast as triples (row ID, col ID, Val), which looks like a classic setup. But then the number of levels of each variable tends to grow with the sample size: we can reasonably expect the next batch of observations to bring a few new levels. Here are some examples of what the rows, columns and entries are:
- Terms, documents, counts in information retrieval
- Genes, experiments, expression levels in microarray analysis
- Movies, customers, ratings in recommender systems
- Students, questions, correctness in item response theory
- Web pages and other web pages in link analysis
- Varieties and fertilizers in crop science
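
The triple representation mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the movie titles, customer names, ratings, and the helper names `levels` and `to_sparse` are all hypothetical and not part of the course material. It stores (row ID, col ID, Val) triples, arranges them as a sparse table, and shows how a new batch of observations can introduce new row and column levels.

```python
from collections import defaultdict

# Hypothetical toy data: (row ID, col ID, value) triples,
# here (movie, customer, rating) as in the recommender example.
triples = [
    ("Casablanca", "alice", 5),
    ("Vertigo", "alice", 4),
    ("Vertigo", "bob", 2),
    ("Casablanca", "carol", 3),
]

def levels(triples):
    """Return the sets of distinct row IDs and column IDs seen so far."""
    rows = {r for r, _, _ in triples}
    cols = {c for _, c, _ in triples}
    return rows, cols

def to_sparse(triples):
    """Arrange triples as a row -> {col: value} table.

    Unobserved (row, col) cells are simply absent rather than stored.
    """
    table = defaultdict(dict)
    for r, c, v in triples:
        table[r][c] = v
    return dict(table)

rows, cols = levels(triples)
table = to_sparse(triples)

# A later batch of observations (also made up) brings new levels
# of both the row variable and the column variable.
new_batch = [("Vertigo", "dave", 5), ("Psycho", "alice", 4)]
rows_after, cols_after = levels(triples + new_batch)
new_rows = rows_after - rows
new_cols = cols_after - cols
```

Unlike the classic n by p layout, neither the set of rows nor the set of columns is fixed in advance here, which is why both are recomputed from the triples after each batch.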
Starts Wednesday April 2
[Monday changed to avoid conflict with Stat 252, data mining]
I expect to send a small number of important emails about problem sets and the homework. Most other announcements will be made in class.
Late penalties apply:
We will count days late on each problem set. Homework turned in on the due date but after class ends is one day late. The next day is two days late, and so on. Each day late is penalized by 10% of the homework value. Homework more than 3 days late will ordinarily get 0. If you're travelling, you can email a PDF file. For sickness, interviews and other events, up to 3 late days total are forgiven at the end of the quarter. (Work late enough to get zero does not get redeemed, though.)