Mohegan SkunkWorks

Sat, 07 Nov 2009 09:49:51 EST

Fourth NYAS Machine Learning Seminar

On Friday I attended the 4th Machine Learning Symposium organized by the New York Academy of Sciences (NYAS).

The Symposium program consisted of four main talks given by local experts in the area of machine learning, interspersed with four graduate student talks, a poster session and a terrific lunch.

Since I'm not really hindered by any overwhelming expertise in this area, I'll confine myself to a few breezy impressions of the main talks.

The first one was given by Bob Bell, from AT&T Bell Labs and a member of the team which won the Netflix prize.

What made the contest challenging was not only the huge size of the data set but also the fact that 99% of the data was missing. In addition, there were significant differences between training and test data. Regardless of whether a 10% improvement of a movie rating system should be worthy of a million-dollar prize, the contest provided a great way to test classifiers against real-world data.

One thing that stood out for me was that a relatively small number of users was responsible for 'most' of the ratings. He mentioned that they identified one user responsible for 5400 ratings on one particular day. This could be a data error on the Netflix side, where the time stamp was somehow misapplied. On the other hand, it sounds like someone was deliberately trying to affect a large swath of ratings.

The final classifier incorporated breakthroughs made by different teams in the earlier stages of this multi-year competition. One such breakthrough was to consider the genres of the movies someone has previously rated to determine future recommendations. That must seem rather obvious in retrospect. The other was a clever form of collaborative filtering that takes into account the time dependency of people's movie preferences.

An ensemble of previously validated classifiers was used to construct the final classifier, and the calculation to get the final result submitted to Netflix took almost a month, primarily because a power failure forced a restart of the calculation engine. In fact, the use of an ensemble of classifiers was mentioned as one of the main lessons learned from the contest. The other was the power of matrix factorization (i.e. treating users and preferences as independent parameters and using a matrix to link the two) as a computational tool.
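To make the matrix factorization idea a bit more concrete, here is a minimal sketch of how I picture it, not the winning team's actual method. Every user and every movie gets a short vector of latent factors, and a rating is predicted as their dot product. Only the ratings that are actually present are used for fitting, which is how the "99% missing" problem is sidestepped. The toy data, the function names and all parameter values (`k`, `lr`, `reg`, `epochs`) are my own assumptions.

```python
import random


def init_factors(count, k, rng):
    """Small random starting vectors, one per user or per movie."""
    return [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(count)]


def predict(P, Q, u, i):
    """Predicted rating: dot product of user u's and movie i's factors."""
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))


def rmse(P, Q, ratings):
    """Root mean squared error over the observed ratings only."""
    total = sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings)
    return (total / len(ratings)) ** 0.5


def factorize(ratings, n_users, n_items, k=2, epochs=2000,
              lr=0.01, reg=0.02, seed=0):
    """Fit the factors by stochastic gradient descent on observed ratings."""
    rng = random.Random(seed)
    P = init_factors(n_users, k, rng)   # user factors
    Q = init_factors(n_items, k, rng)   # movie factors
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(P, Q, u, i)
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Move both vectors to reduce the error, with a small
                # penalty (reg) that keeps the factors from growing.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


# Hypothetical toy data: (user, movie, rating). Users 0 and 1 like
# movies 0 and 1; users 2 and 3 like movies 2 and 3. The pairs (0, 1)
# and (2, 3) are deliberately left out as "missing" ratings.
RATINGS = [
    (0, 0, 5), (0, 2, 1), (0, 3, 1),
    (1, 0, 5), (1, 1, 5), (1, 2, 1), (1, 3, 1),
    (2, 0, 1), (2, 1, 1), (2, 2, 5),
    (3, 0, 1), (3, 1, 1), (3, 2, 5), (3, 3, 5),
]
```

Calling `P, Q = factorize(RATINGS, 4, 4)` and then `predict(P, Q, 0, 1)` gives an estimate for a rating that was never observed. The real contest entries were of course far more elaborate than this.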

Avrim Blum, from Carnegie Mellon, followed with a discussion of a new clustering method he has discovered. Unsupervised learning is obviously an important area of machine learning. We don't always have the benefit of fully analyzed training data. It would be a real breakthrough to have data 'speak for itself' in a meaningful way, without a priori constraints. Humans (or perhaps all animals in general) are good at classifying data across large conceptual categories. We obviously have no problem taking a pile of articles and splitting them into different piles based on just a few keywords.

One approach to detecting clusters using a machine would be to find a point which is at a minimum distance from a set of data points. Clearly there would be many such points, and each would correspond to the center of gravity of a cluster. Computationally this is a very hard problem (I think the speaker mentioned it was in fact NP-hard).
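Because finding the best centers exactly is so hard, the usual practical shortcut is a heuristic. The sketch below is Lloyd's k-means iteration, a standard textbook technique and not something from the talk: assign each point to its nearest center, move each center to the center of gravity of its points, and repeat. It only finds a local optimum, and the result depends on the starting centers, which are an input here. The function names are my own.

```python
def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def center_of_gravity(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def lloyd(points, centers, iterations=20):
    """Alternate between assigning points and moving centers."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        # Step 1: give every point to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda j: dist2(p, centers[j]))
            groups[nearest].append(p)
        # Step 2: move each center to the mean of its group
        # (a center with no points stays where it is).
        centers = [center_of_gravity(g) if g else c
                   for g, c in zip(groups, centers)]
    return centers, groups
```

For example, `lloyd(points, centers=[points[0], points[1]])` uses the first two data points as starting guesses for two clusters.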

From what I understood (which is always a good qualification), Blum's work states that large data sets can always be broken down into clusters by considering a characteristic distance d_crit, such that all points within d_crit of each other will belong to the same cluster and points belonging to another cluster will be around 5 × d_crit away. This would partition the data set into (say) two clusters. You can break these down further using the same algorithm (but obviously with different values of d_crit). This way you end up with a hierarchy of clusters.
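Here is a small sketch of how I read the d_crit idea. It is my own interpretation, not Blum's algorithm: assuming d_crit is already known, link any two points that lie within d_crit of each other and call every connected group of linked points a cluster. The function name `threshold_clusters` is made up for this illustration.

```python
from math import dist  # Euclidean distance, available since Python 3.8


def threshold_clusters(points, d_crit):
    """Group point indices so that points within d_crit are linked."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        # Start a new cluster from any point not yet assigned.
        start = unvisited.pop()
        stack, members = [start], [start]
        while stack:
            i = stack.pop()
            # Collect every unassigned point close enough to point i.
            near = [j for j in unvisited
                    if dist(points[i], points[j]) <= d_crit]
            for j in near:
                unvisited.remove(j)
                stack.append(j)
                members.append(j)
        clusters.append(sorted(members))
    return sorted(clusters)
```

Running the same function again on the points of a single cluster with a smaller d_crit would give the next level of the hierarchy. The hard part, choosing d_crit, is exactly what this sketch assumes away.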

OK, all done, right? Just tell me how to find d_crit and I'm on my way to constructing the giant cluster breakdown of the universe. Unfortunately, Prof. Blum was silent on the actual construction of d_crit. So I think his result is that if we have evidence of a characteristic measure along the lines described above, we in fact have a robust partitioning of the data space. That's a not insignificant result. In addition, I think it shouldn't be too difficult to take his work and apply it to a large corpus of text and try to establish d_crit for it.

After lunch and the poster session, Thorsten Joachims from Cornell continued with a talk on the application of Support Vector Machines (SVMs) for predicting structured outputs. SVMs are a form of supervised learning where a characteristic data set is used to 'train' a classifier. A simple classifier would label data in only one of two ways. In this talk, SVMs are used to classify data across multiple categories. Using such an approach, the results of ambiguous search terms for search engines (like SVM or Windows) could be grouped into categories of similar results. The breakthrough here, I believe, is a way to handle the computational complexity inherent in the use of SVMs for multi-classification. In addition, the algorithm is structured such that only domain-specific pieces need to be plugged in.
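For reference, the simple two-label case mentioned above can be sketched in a few lines. This is a plain linear SVM trained by subgradient descent on the hinge loss, a standard textbook approach and emphatically not the structured-output method from the talk. The function names and the values of `lr`, `lam` and `epochs` are my own assumptions.

```python
def train_linear_svm(samples, labels, epochs=100, lr=0.1, lam=0.01):
    """Fit weights w and bias b; labels must be +1 or -1."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The regulariser always shrinks the weights a little.
            w = [wi * (1 - lr * lam) for wi in w]
            # The hinge loss only pushes on points inside the margin.
            if margin < 1:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def classify(w, b, x):
    """Label a point by which side of the separating line it falls on."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1
```

After `w, b = train_linear_svm(samples, labels)` on linearly separable data, `classify(w, b, x)` returns +1 or -1 for a new point `x`. Going from two labels to many categories or structured outputs is where the computational difficulty comes in.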

The last talk of the day, by Phil Long from Google, was "On noise-tolerant learning using linear classifiers". The speaker's uncompromising mathematical rigor somewhat obscured the obvious practical implications of his research, at least for me. On the other hand, the applause at the end of his talk appeared to me to be almost as boisterous as it had been sedate for previous speakers. This leaves me with the impression that the rest of the audience had thoroughly enjoyed his talk.

What I was able to rescue from my notes and memory was that noise in data can be identified by assuming that the distribution of noisy data points is in fact either not random, or does not follow the same distribution as the real data (so-called malicious noise). In addition, boosting schemes like AdaBoost, which focus on misclassified data, can be very sensitive to noise.
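The sensitivity of AdaBoost to noise is easy to see in its reweighting step. The sketch below shows the standard AdaBoost weight update, with labels and predictions coded as +1 and -1. The small simulation around it is my own hypothetical setup, not something from the talk: one mislabeled point (index 0) is misclassified in every round, along with one other point that changes from round to round.

```python
import math


def adaboost_reweight(weights, labels, predictions):
    """One AdaBoost round: return the new weights and the learner's alpha."""
    # Weighted error of the weak learner (must lie strictly between 0 and 1).
    err = sum(w for w, y, h in zip(weights, labels, predictions) if y != h)
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified points (y * h == -1) gain weight, the others lose it.
    updated = [w * math.exp(-alpha * y * h)
               for w, y, h in zip(weights, labels, predictions)]
    total = sum(updated)
    return [w / total for w in updated], alpha


def noisy_point_weights(rounds=3, n=10):
    """Track the weight of a point that is misclassified in every round."""
    labels = [1] * n
    weights = [1.0 / n] * n
    history = []
    for r in range(1, rounds + 1):
        # Hypothetical weak learner: wrong on point 0 and on point r.
        predictions = [-1 if i in (0, r) else 1 for i in range(n)]
        weights, _ = adaboost_reweight(weights, labels, predictions)
        history.append(weights[0])
    return history
```

In this toy run the single noisy point starts with a tenth of the total weight and holds well over a third of it after three rounds, so later weak learners spend most of their effort on a point that can never be fit.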

All in all, not a bad way to spend time away from the office while watching Yankee fans lay siege to downtown Manhattan.
