Unnaturally Long Attention Span

A blog about Grad School at Stanford, Working in a Silicon Valley Internet Company, and Statistical Machine Learning. mike AT ai.stanford.edu

Ranking Colleges Using Google and OSS

It's that time of year again when many people are deciding which college they should attend come fall. Whether they are a high school senior, an aspiring professional, or a seasoned veteran seeking an MBA, the ultimate decision of which college to attend is based largely on reputation.

The perennial data source for college ratings that most students turn to is the US News America's Best Colleges Rankings. Rankings have their limits, but there is only so much information you can get from first-hand experience: taking the campus tour, talking with students, and doing your own due diligence is about as far as you can go short of attending the school. The US News ranking methodology is based on many factors: peer assessment, faculty resources, selectivity, class sizes, and financial aid, among many others. What does it really mean to rank a list of schools? I can define a personal value function for deciding which school I prefer and, by doing pairwise comparisons, sort a list of schools by which would be most beneficial to my personal goals. You can imagine, though, that integrating these value functions over a large sample of students would pretty quickly flatten the ranking distribution. While the US News Rankings strive to give a complete overall view of each school, there are a few weaknesses in the method, and some of them have drawn criticism of the rankings from both students and colleges.

First, many of the metrics are subjective, such as peer assessment and selectivity. This allows the rankings to be "smoothed out" toward what the editors expect by "reinterpreting" those subjective measures.

My major problem with the US News rankings, however, is that they are not free. In fact, only the top few schools' rankings are viewable; to see the whole list you have to pay $14.95.

So, to this end, I've decided to try my hand at generating my own rankings. Since I'm no expert in the field of evaluating colleges, I'm going to cheat and use statistical learning techniques. I'm going to do this with the help of just Google and some open source software. You won't even have to pay $14.95 to see the results!

First off, I found a list of American Universities from the Open Directory. I parsed out this page with a quick hand-written wrapper to get the names and URLs. Now, the fun part. What kind of "features" should I use for evaluating each school? This is where a bit of editorial control comes into play. I wanted to capture the essence of what the US News methodology uses, but in a completely automated way using Google. So for each feature, I defined a Google search query (shown in brackets) that would give me a rough approximation of that particular attribute:
  1. Peer assessment [link:www.stanford.edu] - This is how some search engines approximate "peer assessment", by counting the number of other pages citing you
  2. Size [site:www.stanford.edu] - a larger school would have a larger web, right? =)
  3. Number of faculty [dr. "home page" site:www.stanford.edu] - hopefully those professors have websites that mention "dr." and "home page"
  4. Scholarly publications ["Stanford University" in scholar.google.com]
  5. News mentions ["Stanford University" in news.google.com]
So then I just iterate over the list of schools and perform each of those queries using the Google API. Let it run for a few hours and I have all my data. Now, you may be thinking there's no way that five numbers can tell you everything you need to know to make a decision about a school. Well, let's take a look at what the data looks like.
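For the curious, the harvesting loop was shaped roughly like the sketch below. The Google SOAP Search API I used has since been retired, so count_results here is a hypothetical stand-in for any call that returns a hit count for a query, and the file names are placeholders of my own rather than the actual files I used.

    # Sketch of the feature-harvesting loop (not my original script).
    # count_results(query, engine) is a hypothetical stand-in for a search API
    # call that returns the estimated number of hits for a query.
    import csv

    def count_results(query, engine="web"):
        """Placeholder: swap in a real search API call that returns a hit count."""
        return 0

    def features_for(name, url):
        return {
            "name": name,
            "peerAssessment": count_results(f"link:{url}"),
            "sizeWeb": count_results(f"site:{url}"),
            "numFaculty": count_results(f'dr. "home page" site:{url}'),
            "scholarlyPubs": count_results(f'"{name}"', engine="scholar"),
            "newsMentions": count_results(f'"{name}"', engine="news"),
        }

    fields = ["name", "peerAssessment", "sizeWeb", "numFaculty", "scholarlyPubs", "newsMentions"]
    with open("american_universities.csv") as infile, \
         open("college_features.csv", "w", newline="") as outfile:
        writer = csv.DictWriter(outfile, fieldnames=fields)
        writer.writeheader()
        for school in csv.DictReader(infile):  # expects "name" and "url" columns
            writer.writerow(features_for(school["name"], school["url"]))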

First, I load the data into WEKA, a free, open source data mining software package written in Java. It implements many off-the-shelf classification and regression algorithms and provides both an API and a GUI. Let's take a look at a few slices of the data:
This figure plots newsMentions on the x-axis against scholarlyPublications on the y-axis. The points plotted are the 50 schools that have an overall score in the US News rankings (schools beyond 50th place don't get one). The color of the dots goes from low (blue) to high (orange), and the trend runs from blue in the lower-left to orange in the upper-right. As you can see, not only are the two Google queries correlated with each other, they also seem to be jointly correlated with the US News score.
Plotting numberFaculty against scholarlyPublications also shows a positive correlation. So maybe these queries weren't totally bogus and do carry some information.
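If you'd rather poke at the data outside of WEKA, a scatter plot along these lines is easy to reproduce. This is just a sketch: the CSV file and column names are my own placeholders (carried over from the harvesting sketch above), and it assumes the US News score for the labeled schools has been merged in as a usNewsScore column.

    # Sketch: newsMentions vs. scholarlyPubs for the 50 labeled schools,
    # colored by US News score (low = blue, high = warm colors).
    import pandas as pd
    import matplotlib.pyplot as plt

    data = pd.read_csv("college_features.csv")      # hypothetical file from the harvesting step
    labeled = data.dropna(subset=["usNewsScore"])    # keep only schools with a US News score

    plt.scatter(labeled["newsMentions"], labeled["scholarlyPubs"],
                c=labeled["usNewsScore"], cmap="coolwarm")
    plt.xlabel("newsMentions")
    plt.ylabel("scholarlyPublications")
    plt.colorbar(label="US News score")
    plt.show()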

The next step is to fit a statistical model to the training data. The basic idea is to train the model on the 50 colleges that have US News scores and then apply it to all 1700+ American colleges. The natural first step is to try to fit a line through the 5-dimensional feature space. The best-fit line is:
USNewsRank = -0.0003*peerAssessment + 0*sizeWeb + 0.0063*numFaculty + 0*scholarlyPubs + 0.0002*newsMentions + 68.7534
This simple model has a root mean squared error (RMSE) of 10.4223. So, in the linear model the size of the web and the number of scholarly publications play no role at all (their weights are zero). Fair enough, but can we do better?

The answer is yes. I next used a support vector machine model with a quadratic kernel function. This gave me an RMSE of 7.2724 on the training data. The quadratic kernel allows more complex dependencies in the training data to be modeled, which is why the training error is lower, but would this translate into better predictions on the larger data set? There is no quick answer to that, but we can see from the output what the model predicted.
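Here is a rough sketch of those two fits in modern terms. I actually did everything through WEKA; the version below assumes scikit-learn instead, along with the hypothetical college_features.csv from the harvesting sketch and a merged-in usNewsScore column, so don't expect it to reproduce the exact coefficients or RMSE numbers above.

    # Sketch: a linear baseline and a quadratic-kernel SVM regression,
    # trained on the ~50 labeled schools, then used to score every school.
    # File/column names are placeholders; these are scikit-learn models,
    # not the WEKA ones I actually used.
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.svm import SVR

    features = ["peerAssessment", "sizeWeb", "numFaculty", "scholarlyPubs", "newsMentions"]

    data = pd.read_csv("college_features.csv")
    labeled = data.dropna(subset=["usNewsScore"])
    X, y = labeled[features], labeled["usNewsScore"]

    lin = LinearRegression().fit(X, y)
    print("linear RMSE:", mean_squared_error(y, lin.predict(X)) ** 0.5)

    svm = SVR(kernel="poly", degree=2).fit(X, y)
    print("SVM RMSE:", mean_squared_error(y, svm.predict(X)) ** 0.5)

    # Score all ~1700 schools with the SVM and list the top 20.
    data["predicted"] = svm.predict(data[features])
    print(data.sort_values("predicted", ascending=False)
              .head(20)[["name", "usNewsScore", "predicted"]])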
Name                                    USNews   SVM
University of Washington                  57.4   98.929
Yale University                           98     98.081
Harvard University                        98     97.953
Massachusetts Institute of Technology     93     92.996
Stanford University                       93     92.922
National University                       --     92.523
Columbia University                       86     92.255
Princeton University                     100     90.609
New York University                       65     85.271
University of Chicago                     85     85.052
Indiana University                        --     83.973
University of Pennsylvania                93     83.91
Duke University                           93     79.487
University of Southern California         66     78.645
University of Pittsburgh                  --     78.274
Cornell University                        84     78.051
University of Florida                     --     77.864
University of Colorado                    --     76.877
The American College                      --     76.597
University of California, Berkeley        78     76.192

This table shows the top 20 scores given by my program, alongside the US News rating when available (i.e., when the school was in the top 50). As you can see, many schools received consistent marks across the two ratings. However, there are quite a few surprises. My program ranked University of Washington as the best school, while US News scored it only 57.4. Having visited UDub myself while I was working at Microsoft, I'm not completely surprised. It's a truly modern university that has recently been producing lots of good work--but let's not overgeneralize. I believe that "National University" being high in my rankings is a flaw in the methodology; there are probably many spurious matches to that query, like "Korea National University". Ditto for "The American College". They just had fortunate names. However, I think the scores for other schools that were unranked by US News, like University of Pittsburgh and Indiana University, are legitimate.

Now, is this a "good" ranking? It's hard to say, since there is no magic gold standard to compare it to. That is the problem with many machine learning experimental designs. But without squabbling too much over the details, I think this quick test shows that even some very basic statistics derived from Google queries can be rich enough in information to answer loosely-defined questions. Although I don't think my program is going to have high school guidance counselors worried about their jobs anytime soon, it does do a decent job. An advantage of this approach is that, with a few training examples, it can be personalized to rank all of the schools based on your own preferences.

Extracting useful knowledge by applying statistics to summary data (page counts) is one thing, but I've taken it to the next level by actually analyzing the stuff within the HTML page. The result of that work is a project called Diffbot, and you can check it out here.

If you're interested, take a look at the full ranking of all 1720 schools (in zipped CSV format).

The Untimely Death of MusicSearch

You might have noticed that I have disabled my MusicSearch script. The page was up for a grand total of 3 days (1/13-1/15). During that time, it got over 3400 hits from around 1100 unique visitors. That means the average person performed about 3 searches using the script. I'm not sure how it got so popular, but it seems that someone initially discovered my page through the blogspot network and posted it on a French news site. From there, the promise of free music started spreading, and my link started showing up on several other French, Spanish, and Greek sites as well as music forums. Some of those links are still there: (here, here, here). Since I don't read any of those languages, I don't really know what they are saying about my page. The Google Translation of the comments gives me this:
piranha - 15/01/2006 to 13:04:36 thank you for the bond!
Nicolas - 16/01/2006 to 09:41:45 it was surely a good idea. When that functioned
coconut - 17/01/2006 to 01:43:30 It was well yes, and there was also French song! Not like there bond
Although my script later used caching, at some point the operator of the radio.blog network started to take notice of the traffic and sent me the following e-mail:
From: astro@mubility.com
To: mike@cs.stanford.edu
Hello,

I'm the owner of radio.blog.club.
What can I do that make you remove your script ?

What you do will bring us nowhere. When you make it easy to download MP3
from my website, you take both of us to an illegal level. One day or another, someone will ask to shut the site off because of this, and we will have no choice that take it down.

I'm not sure this is what you want. So PLEASE, remove your script.

Regards.

-astro
So, I have complied with astro's wishes and removed my script, which was completely functional at the time. Radio.blog's system encourages people to upload their music to publicly accessible locations on the webserver. Their search engine at http://radioblogclub.com makes it easy to find anyone's radio.blog music.

Behind MusicSearch

Some people have been asking me how MusicSearch works and where the songs come from. Well, you know those embedded flash players that people put in their xanga/livejournal/myspace pages that blast embarrassing music every time you try to visit your buddy's page in a library or coffee shop? I took a look at the code for one of the popular ones, radio.blog. It turns out that the sound files are stored in an obvious, public directory. So the MusicSearch that I wrote is just a simplified interface on top of the radio.blog search engine, one that makes the public path to the audio file more explicit. The sound files are typically radio broadcast quality, 64kbps or lower, but give you a good sense of the song.

A New Home

I've started building a page at http://ai.stanford.edu/~mike. It's an early work in progress. Also mirrored at http://miketung.com.

MusicSearch

Try out this thingy I just wrote: http://www.streetpricer.com/MusicSearch
It works only for Windows users with Windows Media Player (WMP). The songs are full tracks, not short clips, which you can download to your computer using "Save as..." in WMP. You'll notice that it actually generates an ASX playlist that plays in WMP. If you hit the "next track" button, it'll play other similar songs on that playlist.

UPDATE: If you can't get it working, there are a few things you might need to do in WMP:
1. Turn off Shuffle in the Play menu
2. Turn on "Connect to the internet" in Tools > Options > Player
Hey, it only took me a few hours to write...

An Illusion


Count the number of men in the picture before and after it changes. Where does the extra man come from?