BTRY 694: Statistical Machine Learning

Tuesday, Thursday 11:40 - 12:55, Warren 145

Office Hours: 1186 Comstock, Wednesday 1 - 3.

Handouts, Homework and Announcements

A syllabus is now available.

Slides from lecture 1.

09/03/07 Homework 1 is now out and will be due September 18.

09/10/07 The Class Project/Competition has now started. Data is listed in the Data section below. The initial best MSE on the test set is 0.9403.

09/17/07 My office hours have moved one hour earlier, to Wednesday 1 - 3. If you cannot attend these hours, please feel free to send me an e-mail to arrange an appointment.

09/20/07 Homework 2 is now out and will be due October 4.

10/11/07 Homework 3 is now out and will be due October 30.

10/30/07 Homework 4 is now out and will be due November 29.

New Netflix performance - 0.6273.

Data

Boston Housing Data and documentation.

HP Spam Data; there is a training set, test set and a documentation file.

California Housing Data and documentation.

Netflix Data

train_ratings_all.dat The ratings that the users in the training data set gave to each of the 99 movies.

train_dates_all.dat The date on which each of the ratings above was made.

train_ratings_nomiss.dat The training-set user ratings for the first 14 movies -- i.e., those for which there are no missing values.

train_dates_nomiss.dat The corresponding dates for train_ratings_nomiss.dat.

train_y_date.dat The dates on which the training set users rated "Miss Congeniality".

train_y_rating.dat The ratings that the users in the training set gave to "Miss Congeniality".

test_ratings_all.dat The ratings that the users in the test data set gave to each of the 99 movies.

test_dates_all.dat The date on which each of the ratings above was made.

test_ratings_nomiss.dat The test-set user ratings for the first 14 movies -- i.e., those for which there are no missing values.

test_dates_nomiss.dat The corresponding dates for test_ratings_nomiss.dat.

test_y_date.dat The dates on which the test set users rated "Miss Congeniality".

movie_titles.txt Names and release dates for the 99 movies, given in the same order as the columns in the data above.
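The competition is scored by MSE on the "Miss Congeniality" ratings. As a rough illustration only, the Python sketch below computes the MSE of a constant predictor that always guesses the mean training rating. The function names and the toy ratings are assumptions made for this example and are not taken from the data files; the course itself uses R. Because no test-set ratings file is listed above, the sketch scores against a held-out slice of made-up training ratings.

```python
from statistics import mean


def mse(predictions, targets):
    """Mean squared error between two equal-length sequences."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    return mean((p - t) ** 2 for p, t in zip(predictions, targets))


def mean_baseline(train_y, n_predictions):
    """Predict the mean training rating for every user (hypothetical helper)."""
    return [mean(train_y)] * n_predictions


# Toy stand-ins for values that would come from train_y_rating.dat.
# These numbers are invented for illustration.
train_y = [3, 4, 5, 2, 4]
held_out_y = [4, 3, 5]

preds = mean_baseline(train_y, len(held_out_y))
baseline_mse = mse(preds, held_out_y)
print(baseline_mse)
```

Any model that uses the other movie ratings should aim to beat this kind of constant baseline on held-out data before being submitted.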

Readings and Resources

R Statistical Software

R-project website provides software and documentation for R.

Rintro a very basic "getting you started in R" tutorial. The data used in this tutorial is the Boston Housing Data.

Fox, Introduction to Statistical Computing in R: an online introduction to the basics of R.

Venables and Ripley, 2004, Modern Applied Statistics with S-plus, 4th Edition, Springer. Provides a good reference for the R and S-plus languages.

Perspectives on Statistics and Machine Learning

Friedman, Data Mining and Statistics: What's the Connection? Jerry Friedman on why Data Mining is Statistics.

Breiman, Statistical Modeling: The Two Cultures. Leo Breiman on why Statistics should be Data Mining. This is a very provocative article, with some very interesting discussion.

Hand, Classifier Technology and the Illusion of Progress. David Hand claims that data mining has not really paid off. Again a provocative article with some interesting discussion.

Data Set Selection is a marvelous satirical commentary on the practise of machine learning research. Awarded a special prize at NIPS 2003. The Journal of Machine Learning Gossip continues the tradition.

On Machine Learning/Data Mining

Introduction to Data Mining, a very quick overview put out by twocrows.

Vapnik, 2000, The Nature of Statistical Learning Theory, Springer. On the opposite end -- an extremely technical account of particular aspects of machine learning.

Lee, 2004, Bayesian Nonparametrics via Neural Networks, SIAM. A statistical perspective on one of the most commonly used tools.

Burges, 1998, A Tutorial on Support Vector Machines for Pattern Recognition. Kluwer. A nicely accessible introduction to SVMs from a CS point of view.

Also see texts used in the classes below.

Some Cute Applications

A very cute presentation on the use of image data. Thanks to Haim for finding this!

The Netflix Prize is a current opportunity for the highly inventive.

Other Classes at Cornell

A computer science list of courses (somewhat out of date).

CS 478 -- Machine Learning

CS 578 -- Empirical Machine Learning

MATH 774 -- Topics in Statistical Learning Theory

ORIE 474 -- Statistical Data Mining; a masters level course.

ORIE 674 -- this class is a trial replacement.

Note that CS also has several seminar series in machine learning and its applications.