Unnaturally Long Attention Span

A blog about Grad School at Stanford, Working in a Silicon Valley Internet Company, and Statistical Machine Learning. mike AT ai.stanford.edu

Using Google Calendar in Microsoft Outlook




I've recently started using Google Calendar as my main calendaring application. It's neat because it's accessible wherever I go, and you can easily add appointments from email messages if you use Gmail. However, when I'm using my notebook, I like to use Outlook for email so that I can keep copies of everything locally. So, I started searching for a way to integrate Outlook with Google Calendar and found this guide: Incorporate Google Calendar in Outlook. It's a nice solution, but it involves downloading pieces of M$ crapware (Visual Studio 2005 Tools for Office Runtime, Office 2003 Update: Redistributable Primary Interop Assemblies) and an Outlook plug-in.

Not being a huge fan of installing unnecessary stuff, I found a much easier way to integrate Google Calendar with Outlook that meets my needs and requires installing nothing.

Behold!


How it's done:

  1. In Outlook, right-click on the folder you want the GCal link to be in and do "New Folder...". Call it whatever you like.
  2. Right-click on your newly created folder and select "Properties...". Under the "Home Page" tab, put in the address of Google Calendar (http://www.google.com/calendar/render?pli=1) and select "Show home page by default for this folder."
  3. Click OK, and you're all done.

One drawback of this method is that you are not really using the Outlook calendar, so it's not really a solution for you folks in the corporate world on MS Exchange. But if you're happy using Google as your main calendar store, this'll work fine!

When Disgruntled Laptops Attack Their Masters

So, this is what happened at work this morning. I think the pictures are pretty self-explanatory. Apparently, the smoke detectors in our 8-story tower are only for decorative purposes: despite the thick black smoke and the smell, nothing happened until someone manually pulled the fire alarm. Comforting, eh?


On the other hand, my personal notebook is also a Dell, and I've been using it for almost 4 years with no problems.

More coverage here:

Dell battery explodes at Yahoo HQ, hundreds evacuate (Engadget)

Flickr shots of the event

Valleywag

The Right Cognitive Testbed for AI - Babies

The AI community has had a hard enough time defining what AI is, let alone defining milestones for achieving a functional AI. I believe the obvious choice for a functional milestone is to achieve an AI equivalent of a newborn child. You might think this is a quite natural choice, but it differs from a lot of historical "milestones" of the AI community. An effective robotic baby is not going to help streamline your corporate environment, drive a war vehicle through enemy desert terrain, or handle urban assault situations. But then again, if you don't think a baby can cause mass destruction, you haven't spent enough time with one.

It's uncertain whether developing a functional baby cognitive model is the right direction toward human-level adult AI, but at least the progress is measurable, which can't be said for a lot of other approaches, such as animal models, games, or Turing-test-like setups. For example, just look at how the competition to create a Turing-test-passing chat bot has turned out. Has creating a world-champion chess computer advanced our knowledge of building a human-like AI at all? Not by much.

If you buy my argument so far that the baby model is the right approach, what does it actually involve? I will try to break down what I think the work in this track involves. I've cited the sources of the information at the end of this article.


This image shows how a baby's developing eye sees the world.

Below I have a timeline of an infant's cognitive development up to 1 year, and my own comments on what AI work is involved in emulating that functional behavior.

  • Between 1 and 2 months of age, infants become interested in new objects and will turn their gaze toward them. They also gaze longer at more complex objects and seem to thrive on novelty, as though trying to learn as much about the world as possible.
If you look at the image above, it suggests that during this time period, the sensors and the brain interface necessary to support them are still being constructed. An interesting cognitive feature--the ability to determine what is new--develops during this time. This ability to highlight "what is new" is a defining feature of being alive: living things respond to changes, not to steady states, so the central survival trait is the ability to detect and track changes. This feature is very complicated, affects us at many levels, and really deserves its own discussion. At a lower level, this ability allows us to detect the predator lurking in the field or in the dark alley. At a higher level, why does that new song sound so good now, but so lame a year later?

The ability to detect changes also implies the ability to filter out what's old, i.e., pattern recognition. Old things are, by definition, things that fall into a pattern. So I believe that the first step toward AI is to have generalized pattern recognition (knowing what's old) and differencing (tracking the new changes); see the toy sketch after this list.
  • At around 3 months of age, infants are able to anticipate coming events. For example, they may pull up their knees when placed on a changing table or smile with gleeful anticipation when put in a front pack for an outing.
The second cognitive ability that makes us living things is an internal prediction engine. The prediction engine kicks in at 3 months, which is when the sensors finally start collecting reliable data. Prediction implies that there is an internal mental model of the world at this point, however primitive. There has actually been a lot of work done on this component; we now have methods that can make certain predictions better than humans can. The key challenge, however, has always been in defining the inputs (how is this represented in the mind?), the outputs (how does this get translated into behavior?), and the structure of the prediction (does context play a role, and over what time periods?).
  • At around 4 months, babies develop keener vision. Babies' brains now are able to combine what they see with what they taste, hear, and feel (sensory integration). Infants wiggle their fingers, feel their fingers move, and see their fingers move. This contributes to an infant's sense of being an individual.
Sensory input development has finally stabilized, and now we start refining the outputs (fingers and toes). Up until this point, we have not seen any fruits from our labors--there are no outputs! AI research has been stunted because there is so much upfront cost in developing a cognitive model, while the benefits (driving a war machine through enemy towns, translating natural languages) rely on the outputs. The point where a baby sees his own finger move and realizes what's going on is an important one. It completes the loop between sensors, internal model, and actuators, and this loop creates a very powerful feedback cycle: do something, predict the output, see the result, match it against the internal prediction, and repeat. This is the fundamental property of local optimization.
  • Between 6 and 9 months of age, synapses grow rapidly. Babies become adept at recognizing the appearance, sound, and touch of familiar people. Also, babies are able to recall the memory of a person, like a parent, or object when that person or object is not present. This cognitive skill is called object permanence.
In the last step, I hinted at some kind of learning going on, and this leads naturally to the development of a memory to store learned results. The key questions here are "what do you store?" and "what do you forget?". There have been many different approaches to answering the question of what to store. One approach popularized by the press is that of creating a large "commonsense" database of knowledge that an AI can draw upon to do reasoning; the best example of this approach is the Cyc project. However, I don't think it is compatible with the goal of a functional baby AI. Most people, even adults, don't know the length of the Amazon River or the 25th president of the United States, so this type of knowledge is evidently not a prerequisite for intelligence. A key feature of human cognition is the ability to forget, and these are exactly the kinds of facts a functional AI should forget (i.e., filter out).


  • Babies observe others' behavior around 9 to 12 months of age. During this time, they also begin a discovery phase and become adept at searching drawers, cabinets, and other areas of interest. Your baby reveals more personality, becomes curious, and demonstrates varied emotions.
This marks the point where the baby is able to acquire completely new pieces of knowledge on its own. I think this is the point where it is effectively an "adult" AI. At this point, the baby has enough capability to learn to be a rocket scientist or computer programmer. The AI equivalent, I think, is one that can learn by simply crawling, reading, and understanding the entire internet.
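To make the "filter out the old, track the new" idea from the first bullet concrete, here is a toy sketch in JavaScript (the language of my other demos). It assumes the crudest possible notion of "old"--a running mean and standard deviation of a one-dimensional signal--and flags anything several deviations away as novel. Real pattern recognition is far richer; this is only to fix the intuition.

// Toy novelty detector: "old" = close to the running average of
// everything seen so far; "new" = a large deviation from it.
// Uses Welford's online algorithm for the running mean/variance.
function NoveltyDetector(threshold) {
  this.n = 0;
  this.mean = 0;
  this.m2 = 0;                      // running sum of squared deviations
  this.threshold = threshold || 3;  // how many std devs count as "new"
}

NoveltyDetector.prototype.observe = function (x) {
  // Judge novelty against the pattern learned so far...
  var isNew = false;
  if (this.n > 1) {
    var sd = Math.sqrt(this.m2 / (this.n - 1));
    isNew = sd > 0 && Math.abs(x - this.mean) > this.threshold * sd;
  }
  // ...then fold the sample into the running statistics.
  this.n++;
  var delta = x - this.mean;
  this.mean += delta / this.n;
  this.m2 += delta * (x - this.mean);
  return isNew;
};

// var d = new NoveltyDetector(3);
// [5, 5.1, 4.9, 5.0, 12].map(function (x) { return d.observe(x); });
// -> only the final sample (12) is flagged as new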

Sources:
[1] Gizmodo - Seeing the world through the eyes of a baby
[2] Yahoo! Health - Cognitive development between 1 and 12 months of age

LED Letters!

Want to use a cool LED-looking font while fooling spammers? Read on...

The other day I was doing some work in JavaScript, trying to fix some things in Diffbot, when I rediscovered a cool thing about element borders in HTML: adjacent borders actually meet at a 45° angle in most browsers. Here's what I mean:

This is a div element with borders.

Now, if you take two of these blocks and simply stack them on top of each other, you get a pattern that resembles the LED "8":

[Two stacked bordered blocks, forming the LED-style "8".]
Like their circuit-based cousins, these HTML LEDs consist of seven segments, which can be turned on or off to create a variety of characters. Having spent countless hours in the circuits lab during my undergrad working with those dreaded LEDs, I realized that you could design an entire display system using this as a base--you could go as far as creating a scrolling stock ticker! I wrote a quick JavaScript demo that turns any text into this form. To try it out, simply include led.js (less than 2k) and add the following call to your HTML <body>:
makeText("hello", parentElement);

Below you see an example output:




Try to select the above "hello" with your mouse--it's neither an image nor text.

The interesting thing about this is that you can use it to make text without actually having that text in the source code. This is great for preventing crawling robots and spammers from reading your text, while still allowing your human readers to see things fine. Some applications of this might be to cloak an email address, generate CAPTCHAs, or do evil search engine optimization by hiding text from Googlebot. This method might be better than the straightforward method of rendering your text as images because it requires the robot/spammer to have
  1. a JavaScript interpreter/browser
  2. the ability to snapshot/render a certain region of the screen
  3. OCR-like character recognition capability
The image-rendering obfuscation method, on the other hand, only requires #3. Obviously, a specific implementation can be defeated by reverse-engineering the HTML/JavaScript without these three components, but the resulting spamming algorithm would be implementation-specific, which would not scale well for the spammer.

The Nerve Center that is SF

Here's a video of an interesting experiment: someone graphed the locations of every GPS-equipped Yellow Cab in San Francisco over the course of a day. The intensity of the red represents how fast the cab is going.

http://clients.stamen.com/cabspotting/cabspotting_01.html


Doesn't this remind you of the videos of nerve firings in the brain?

Diffbot Invites

As I mentioned before, I'm launching a new website called Diffbot on April 1st. Diffbot is a cool new kind of web-based RSS reader and bookmark manager. Have a handful of sites that you read daily? Diffbot lets you know when those sites have updated and only shows you the portion that changed. It's also just a convenient place to put your bookmarks so that they are accessible wherever you go. This is still very much a work in progress, so we'd really appreciate any feedback on how we could do better. You can now sign up to get an invite when it's ready!

Engineering Software

Many of my technical readers out there have job titles that match the regular expression "(Sr.|Jr.)? Software Engineer (I)*". For the non-geek readers, that means we call ourselves "Software Engineers" :-). But what would you consider the difference to be between someone who is a "Software Engineer" and someone whose title is "Computer Programmer"? They seem like similar lines of work, yet one seems to imply a higher level of education, perhaps at least a 4-year college instead of a technical trade school. Historically speaking, there has been a huge difference between an engineer and a programmer. During the Second World War, "computers" were actually women who calculated artillery trajectories using desk calculators. These women became the first computer programmers when they were assigned to program the ENIAC, a room-sized computer with 18,000 vacuum tubes. Engineering, on the other hand, referred not to the people who operated the machines, but to those who designed the system and solved the larger problems.

Most major universities separate the School of Engineering from the School of Sciences. Engineering includes departments like chemical, mechanical, nuclear, bio, and electrical engineering. Science includes departments like chemistry, physics, biology, and computer science. Although it seems like there's a lot of duplication here with the sciences, supposedly this is because engineering has some common skill set. Engineering curricula require certain kinds of math, and engineers usually study topics like design, tolerances, robustness, production processes, and technical writing.

But let's return to the question: what is software engineering? One of the most common degrees that software engineers graduate with is Computer Science, which is not an engineering degree. What I'd like to argue--and this point has been made before--is that software engineering is not taught in formal education, and is consequently different from the other types of engineering. This is why recruiters at large companies complain about fresh college grads knowing nothing about debugging, why there are so many software internships, and why most software engineering jobs require a few to several years of prior experience--it's because that's when you actually learn some "software engineering"!

More fundamentally, I'd say software engineering differs from other types of engineering because the field itself is still in its prenatal state. Maybe that is one of the reasons it hasn't been developed in formal education: we don't really know yet what the best practices and fundamental formulas of software engineering are. The level of engineering that goes into building even the largest software projects is nothing like the level of engineering that goes into building something like a car; it's more like the amount required to build a snowman. Even when I was working at Microsoft on the Windows source code, the Hoover Dam of software engineering, there was very little engineering in place to manage the complexity and uncertainty. To get a sense of this, compare the reliability of your office building to Microsoft Office. If your plumbing or electrical system failed as often as my MS Outlook or Firefox crashes, you would be a very unhappy camper. Yet for information workers, both are equally important to their day.

Software engineering is a newer discipline than mechanical engineering, or even electrical engineering, and the software world is undergoing very rapid change right now. We haven't had time yet to sit back and understand the principles and fundamental formulae that govern software. There are lots of well-specified problems for which there is still no agreement on the best algorithm. Sure, there are small groups of people studying problems like software reliability, static source code analysis to identify software weaknesses, and theoretical guarantees for software correctness and performance. I think these efforts will become increasingly important.

I've explained what I think the distinction between software and other forms of engineering is, but why do I think this is an important issue? There's no problem with working in a field that is largely unstructured, complex, and ad hoc; that's part of the excitement of being in a brand new field. Life is great as a software engineer. The problem is that software is increasingly replacing the function of physical objects and electrical components. Your typewriter has been replaced by your word processor. The control center of your car has been replaced by a small computer running an embedded operating system. Your telephone has been replaced by a small computer that emulates the phone's functions. The stock market itself has been infused with tons of small programs trading trillions of your dollars.

Take the typewriter as an example: the mechanical engineer who designed it knows that unless the few joints between the key and the hammer fail, your keystroke will translate into a mark on the paper. The materials in the product have been carefully chosen with respect to their well-known structural flexibility, strength, and mass. I won't even begin to explain all the things that could go wrong between the time you hit a key on your computer and the time you see a letter appear on screen. In this "design", the components involved were chosen because they seem to work; it's crazy talk to try to estimate how reliable the design is.

Yet this fundamental unreliability is what we entrust to keep our airplanes in the air, our cars on the highway, and our bank accounts and financial markets secure. It's just a matter of time before a catastrophic software failure occurs (many major ones already have), or we decide to design responsible software. Software that works as reliably as a toaster. Software that just doesn't break, no matter what the user does.


What is Web 2.0?

Everyone in the blogosphere seems to have an opinion on what Web 2.0 means to them, and I will add my own here. Web 2.0 is the new software tradition to which we have all but transitioned. What differentiates this new tradition from the old one is that products focus more on design than on capability and features. Take for example products like Flickr, Firefox, Mac OS X, Office 12, and the slew of AJAX service-based microapps. A good example to illustrate this paradigm shift is MS Word. In earlier versions of Word (1-7), each release added more capability to the editor, evolving it from something like Notepad into the current Word 2003, while essentially keeping the interface consistent and familiar. However, if you look at Word 12, the first thing you will notice is a new interface focused on making tasks more efficient and discoverable. Even the data is stored in open formats, with the goal of making it easy for third parties to consume and access.

Why did this shift occur, and does it mean that software companies will eventually evolve into pure design companies? It's conceivable that, with technology becoming more accessible and powerful tools more widely available, the software company of the future may be staffed almost completely with artists, psychologists, anthropologists, and designers, with maybe a few technical school graduates to write the tools.

And also, why is it that features and capabilities are less emphasized now? Aren't those the cornerstone of the computer revolution--being empowered by technology?

The truth is, the software industry is stuck in a rut. You can see it across all specialties--office productivity software, the web, gaming--no new features, no new types of websites, no new gameplay; just more efficiency, more polygons, more pseudo-chrome. Since we have no new features to add, we have been keeping busy making things pretty and usable, to keep ourselves employed. Why this rut? Some might say it is because the industry has entered a stage of evolution rather than revolution: we've reached a critical mass, and now improvements come in small increments. I agree and disagree. I agree that is the state of the industry, but I think the cause of the feature drought is simply that we have no new technology.

Technology as a whole, even outside of IT, has actually slowed down. Where are the Bell Labs of today? PARC is a shell of its former self. Where are the new information theories, the new quantum theories, the new internets? Technological innovation flatlined once the wartime necessity that drove it went away--the internet itself is a wartime child.

Okay, I've gotten a little too caught up and started rambling, but I think the solution to this technological rut is clear: we need more fundamental research. We've reached the limit of how far we can milk the results of past research. Whether or not there is a wartime necessity, we need to do this basic research in order to improve the capabilities of our systems and to claim that things are still getting better.

So, what kind of new capabilities should be developed? Computers today are used almost solely to input, output, store, or transmit human data. But, instead of just being repositories and pipes for the data, I believe computers can consume and reason with data, much like a human can. How this can be implemented in our current market, I'll talk about later.

Ranking Freedom of Press

Here's another interesting ranking: the World Press Freedom Index. The list is topped by Denmark; at the bottom is North Korea. The US? 44th. Another interesting data point: the "United States of America (in Iraq)" [sic] is listed at rank 137. Defenders of freedom?

Using Statistics to Uncover Human Rights Violations

The Human Rights Data Analysis Group, which used to be incubated under the AAAS, has just released a study on the analysis of human rights violation datasets from Timor-Leste. I'm poring over their paper right now:

The Profile of Human Rights Violations in Timor-Leste, 1974-1999
A Report by the Benetech Human Rights Data Analysis Group to the Commission on Reception, Truth and Reconciliation of Timor-Leste
The actual datasets are available for download as well. I encourage you to check them out (zipped CSV files):

the Graveyard Census Database (GCD)
the fatal violations data from the Retrospective Mortality Survey (RMS)
the Multiple Systems Estimation (MSE) data file

Diffbot Launch Date

Some of you may know that I've been working in my spare time on a new type of news reader. Until now, the service has been a closed test involving about 50 users. Last week, though, Leith and I have finally decided on a launch date for a public beta. It's going to be April Fools Day, 2006. That gives us less than two months to get things tidy (and boy are there a lot of things!). Stay tuned...

Ask the Geek Grammar Lady #1

In the course of working with fellow geeks, I've noticed a few language usages that are unique to us. One of these is the tendency to classify problems and issues as either "trivial" or "hard". Now, you have to understand: part of this standard geek colloquialism comes from the math tradition of formal proofs, where "trivial" describes the obvious, non-interesting solution. "Hard" could be a shortening of "NP-hard", used to describe a class of problems that requires a lot of computation.

Next time you are tempted to use these labels, maybe consider some of the following more meaningful alternatives:

Instead of "This such-and-such problem is trivial"... consider replacing it with:
  • This such-and-such problem is easy to solve!
  • This such-and-such problem is small in scope.
  • This such-and-such problem is easy to talk about, but would require a team of grad students 10 months to implement.
Instead of "So, we all know such-and-such is hard"... consider replacing it with:
  • So, we all know such-and-such is computationally intensive.
  • So, we all know such-and-such is impenetrable.
  • So, we all know such-and-such is something I have no idea how to do.

Latest AIDS Cure Hoax

"Researchers believe they have found a new compound that could finally kill the HIV/AIDS virus, not just slow it down as current treatments do. While most of the community is still hesitant to comment on this until it passes peer review, initial results show that their method attacks and kills ALL variations of the virus. A fast track through the FDA could have one of the world's leading problems licked in less than a decade."

Funniest thing heard on Slashdot:
Theoretically, Chuck Norris' tears could cure AIDS, cancer, paraplegia, herpes, common cold, mouth ulcers, and hangovers. Too bad that it is impossible to make Chuck Norris cry...

Neurosurgery is Innate

Scientists have discovered a species of wasp, called Ampulex, that has evolved the ability to perform brain surgery on cockroaches. Not only that, but it has reverse-engineered the brain-to-body map of the roach in order to control its movements, a feat that human scientists have only managed in recent history. It seems hacking is not something only humans do.
The wasp slips her stinger through the roach's exoskeleton and directly into its brain. She apparently uses sensors along the sides of the stinger to guide it through the brain, a bit like a surgeon snaking his way to an appendix with a laparoscope. She continues to probe the roach's brain until she reaches one particular spot that appears to control the escape reflex. She injects a second venom that influences these neurons in such a way that the escape reflex disappears.

From the outside, the effect is surreal. The wasp does not paralyze the cockroach. In fact, the roach is able to lift up its front legs again and walk. But now it cannot move of its own accord. The wasp takes hold of one of the roach's antennae and leads it--in the words of Israeli scientists who study Ampulex--like a dog on a leash.

The zombie roach crawls where its master leads, which turns out to be the wasp's burrow. The roach creeps obediently into the burrow and sits there quietly, while the wasp plugs up the burrow with pebbles. Now the wasp turns to the roach once more and lays an egg on its underside. The roach does not resist. The egg hatches, and the larva chews a hole in the side of the roach. In it goes.

The larva grows inside the roach, devouring the organs of its host, for about eight days. It is then ready to weave itself a cocoon--which it makes within the roach as well. After four more weeks, the adult wasp emerges, breaking out of its cocoon and out of the roach as well. Seeing a full-grown wasp crawl out of a roach suddenly makes those Alien movies look pretty derivative.

Update on China Censorship Situation

It appears now that China has added http://google.cn to the "Great Firewall":

Tests on a Shanghai-based trace-route server, located at http://www.linkwan.com/vr2/, indicated that the site was being blocked at the government-operated backbone server. The analysis from the trace route said 'IP packets are being lost past network CHINANET backbone network at hop 4.'

On Google Censoring China

Many people have been commenting on Google's recently implemented censorship of Chinese search results. If you search at google.cn from within China, certain sensitive queries such as "falun gong" or "democracy" will receive censored results. Some have demanded that Google not comply with the Chinese government's request.

So, yesterday I was sitting in a small burrito shop on the corner of Rengstorff and Middlefield waiting for my order. I was scanning the local paper while observing the hyper-efficiency of the Mexican burrito assembly line. The reason I mention this is that a front page article featured a story of an American citizen's recent experience with the Chinese government. It really puts things in perspective.

Here's the full article. Some excerpts..
ET: Tell us about the trial.

Dr. Lee: The trial was conducted in such a way that I was denied every possible legal right. I have evidence which can prove my innocence and they did not allow me to present it. I tried to defend myself, but they did not allow me to do that, especially when I was talking about the reasons, why I was trying to reveal the truth of the persecution. They interrupted me many times; whenever I tried to speak, they would stop me. So I never got a chance to defend myself. And no evidence whatsoever was presented from me, even though I requested it many times.

...

ET: How were you treated?

Dr. Lee: When I was arrested in the very beginning, in Canton, and then was transferred to Guangzhou, they tortured me with handcuffs, which cut into my flesh to the bone; it was extremely painful. They did not allow me to sleep for 92 hours in total. They used the handcuffs as a tool to torture me. They pulled my arms upward, from the back, so it was extremely painful and I could not move at all. If I moved at all, the pain would get worse.

I conducted a hunger strike from the very beginning on January 22, 2003 until February 10—18 days in total.

I still insisted on practicing Falun Gong, so they handcuffed me on March 27, after the trial; the trial was March 21. They handcuffed me, and I started a hunger strike on the 30th, one more time. The reason I started was because I wanted to write an appeal letter, because the trial was unlawful, and I had to make three copies, plus one for myself, thus I had to write four copies of the appeal letter, and I was handcuffed.

On May 9 my appeal letter was denied, and I started a hunger strike again, because this was totally unlawful, so I wanted to protest again. After that, according to their law, that was the final decision, you had to be thrown into prison, so then I was transferred to Nanjing prison on the 12th of May. I was still doing a hunger strike to protest, and then on the 14th of May I received a phone call from the U.S. consulate and I said, "I have some materials over here, I want to give them to you. I will stop the hunger strike today, and if you do not get the materials in two weeks, I will start the hunger strike again." Because those materials are extremely important. They recorded all the details of the trial, what they (the authorities) said, what they did, why they did not allow me to do this and that, and also my appeal letters and how they tortured me. After two weeks, they did not send what I wanted, so I started a hunger strike again.

The forced feeding was extremely painful, and I resisted their treatment. So they just poked the big, thick tube into my nose and into my stomach. It was so irritating that I threw up several times. It was crazy—I was screaming as it was so painful. The cameraman, who was green—just from the police academy—he fainted, right at the scene. On June 2, the U.S. consulate received all the materials I wanted to give them.

They still had my Zhuan Falun [the central text for Falun Gong practitioners], and I said you need to give me this Zhuan Falun, or I won't stop the hunger strike. They started the force feeding again on June 3. Before they did, they yelled, "Taste the power of the people's democratic tyranny!" They were full of exultation before they were going to torture me, like it felt so good. They left the tubes in my nose, which I pulled out myself.

They used the forced feeding to torture me, actually.

They had these so-called group study sessions, anti-Falun Gong group study sessions. I was surrounded by 15 people, yelling at me. Falun Gong is this and that, so crazy, why do you believe in these things… calling me this every day. So I started a hunger strike again, because that was so bad. I was protesting against the torture.

I started a hunger strike again on July 14 to 18, and at that time they left the tube inside of my nose for 33 hours, and they had all of my body tied up on a bed. 33 hours I lied there without moving; it was extremely painful—agony.

Then came the brainwashing. From August, 2003 they started to force me to watch the videotapes. Every day, it was three hours of videotapes. Then they had condemnation meetings on Falun Gong, three-hour sessions; then the policemen would come and talk to me. This continued for three to four months, and they didn't see enough results, so they changed their strategy.

Starting at the end of 2003, they forced me to do slave labor, to make shoes. The shoes used a sort of industrial glue that contains benzene; it's very toxic and irritating, and I felt short of breath and had a headache, that sort of thing. The other thing is, I always tried to refuse to work, because I was innocent—I should not be doing this. They forced me to stand for 16 days, from morning to evening, stand in front of other prisoners; they verbally attacked me, insulted me, that kind of thing. If you didn't stand straight they would kick you and push you.

They tried to force me to confess to the crime, by using forced sitting. They say you have to sit there and think about your wrongdoings yourself, like repentance. You're forced to sit there, just sit there in a fixed position. The longest time was 48 days straight, with my heart problem surfacing as a result. After breakfast it was 7:30 a.m.; then you start sitting there, and at lunchtime you have lunch. Then you start sitting again after lunch, then have dinner, and then come back and sit there for some time and watch their CCTV (China Central Television). Every hour they gave me five minutes to walk around in the cell. The stool was this size [gestures to describes a stool less than a foot high, less than a foot wide, and around six inches deep], you sit there and then your bottom develops this hard callus—it's extremely painful, and your back is also very painful. I was exhausted for such a long time.

After maybe two to three weeks, my brain worked very slowly. When the consulate came, I found it very difficult to speak; somehow, my brain didn't work anymore!

ET: What was your day-to-day life like?

Dr. Lee: They have something for you all the time, like brainwashing sessions. It's always brainwashing, from the very beginning to the very end. They say you are a prisoner, you have committed a crime, that you have to be punished.

And this was an AMERICAN citizen.

Yeah, you don't want to mess with the Chinese government. I think what Google has done in order to bring their service to China is a positive step forward. They even display a message to the user when the results have been censored. Since many of the sensitive sites are filtered anyway by the Chinese government at the ISP level, showing the search results for those sites would have limited usefulness. Any change that we want to enact in China's humanitarian policies will have to happen at the government level, not through companies.

Thoughts on Adsense



An early text ad
GMail and Google Maps may be incremental improvements to web-based email and mapping, but the reason Google is as talked about today as it is is its success on Wall Street. Despite a few recent lackluster product releases, we give Google the benefit of the doubt because, hey, if they were able to more than quadruple their stock price, they must be pretty smart guys. Google owes the majority of its financial success to AdSense, Google's little text ads. (They have recently gone into graphical ads as well.)

How sound is the concept of AdSense? Imagine what your elevator pitch for AdSense would have been back in the day. Is it as useful as being able to buy books at bargain prices online or trading my used stuff with others? Well, I can only speak personally, but I have received little, if any, value from AdSense. I've never found anything "relevant" to what I'm looking for in an AdSense ad. I've definitely tried, and I still don't understand it. When I search for something on Google, the most relevant results are the natural search results--that is their whole justification. The ads inherently cannot be in the best interest of the user, and I've trained my mental ad-blocking circuits to simply ignore them.

But Google continues to milk money from AdSense, now up to $1.9 billion in the last quarter alone. Who's buying this stuff? I think the proper analogy is that Google is cashing in on the online gold rush it has created: it's selling picks and shovels to the miners even when there is no gold in the mountains to be found. In the American Gold Rush, the real winners, at least financially, were the hotel owners and city developers who provided the support framework for the many westward-bound migrants. As for the composition of AdSense traffic, I would conjecture that the majority are "mistake clicks" that do not lead to information the user actually wants to see. The design of the AdSense ad itself is optimized for these mistake clicks. For example, if you hover over the AdSense ads on the right, you'll see that the entire surface of the ad is a clickable area, even the empty space between the ads. This runs counter to the web convention that only underlined blue text represents a hyperlink. So the common behavior of underlining text with the cursor as you read will invoke the ad, causing the poor advertiser--who thinks he has bid his money on a legitimate visitor to his site--to pay money to Google for the user's action.

In addition to these "mistake clicks", others have estimated that up to 20% of AdSense clicks are malicious or spam clicks. Google would never admit to this being a serious problem, but advertisers themselves have noticed, by examining their web server logs, that much of the traffic coming in from these AdSense clicks is suspicious.

So there seem to be misconceptions at multiple levels of the AdSense system, among the user, Google, and the advertiser. How long can the parties involved remain in the dark? John Battelle, in a recent interview, said that this situation persists because, despite the false clicks, advertisers still believe they receive some net benefit from AdSense in terms of new users to their sites. However, he thinks there is some point at which these benefits will be outweighed by the costs of traffic acquisition.

Where will Google be when that point comes?

Ranking Colleges Using Google and OSS

It's that time of year again when many people are deciding which college they should attend come fall. Whether they are a high school senior, an aspiring professional, or a seasoned veteran seeking an MBA, the ultimate decision of which college to attend is based largely on reputation.

The perennial data source for college ratings that most students turn to is the US News America's Best Colleges rankings. Despite the limited usefulness of reading rankings, there is only so much information you can get from first-hand experience: taking the campus tour, talking with students, and doing your own due diligence is about as far as you can go short of attending the school. The US News ranking methodology is based on many factors: peer assessment, faculty resources, selectivity, class sizes, and financial aid, among others. But what does it really mean to rank a list of schools? I can define a personal value function for determining which school I prefer, and by doing pairwise comparisons I can sort a list of schools based on which would be most beneficial to my personal goals. You can imagine, though, that integrating these value functions over a large sample of students would pretty quickly produce a ranking distribution that was rather flat. While the US News rankings strive to give a complete overall view of each school, there are a few weaknesses to the method, and some of these weaknesses have stirred criticism of the rankings by both students and colleges.

First, many of the metrics are subjective, such as peer assessment and selectivity. This allows the rankings to be "smoothed out" toward what the editors expect by "reinterpreting" the subjective factors.

My major problem with the US News rankings, however, is that they are not free. In fact, only the top few schools' rankings are viewable; to see the whole list you have to pay $14.95.

So, to this end, I've decided to try my hand at generating my own rankings. Since I'm no expert in the field of evaluating colleges, I'm going to cheat and use statistical learning techniques, with the help of just Google and some open source software. You won't even have to pay $14.95 to see the results!

First off, I found a list of American universities from the Open Directory. I parsed out this page with a quick hand-written wrapper to get the names and URLs. Now, the fun part: what kind of "features" should I use for evaluating each school? This is where a bit of editorial control comes to bear. I wanted to capture the essence of the US News methodology, but in a completely automated way using Google. So for each feature, I defined a Google search query (shown in brackets) that gives a rough approximation of that particular attribute:
  1. Peer assessment [link:www.stanford.edu] - this is how some search engines approximate "peer assessment": by counting the number of other pages citing you
  2. Size [site:www.stanford.edu] - a larger school would have a larger web, right? =)
  3. Number of faculty [dr. "home page" site:www.stanford.edu] - hopefully those professors have websites that mention "dr." and "home page"
  4. Scholarly publications ["Stanford University" in scholar.google.com]
  5. News mentions ["Stanford University" in news.google.com]
So then I just iterate over the list of schools and perform each of those queries using the Google API (sketched below). Let it run for a few hours, and I have all my data. Now, you may be thinking there's no way that 5 numbers can tell you everything you need to make a decision about a school. Well, let's take a look at what the data looks like.
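Here, roughly, is the collection loop as a JavaScript sketch. The three count helpers are hypothetical stand-ins for whatever query mechanism you have access to (the Google API has its own calling conventions, and the Scholar and News counts came from their respective sites), so treat this as illustrative only:

// Hypothetical helpers -- each returns the estimated hit count for a query.
// Plug in a real search API or a results-page scraper here.
function googleResultCount(query) { /* parse "about N results" */ return 0; }
function scholarResultCount(query) { return 0; }  // scholar.google.com
function newsResultCount(query) { return 0; }     // news.google.com

// A couple of entries from the Open Directory list, for illustration:
var schools = [
  { name: "Stanford University", url: "www.stanford.edu" },
  { name: "University of Washington", url: "www.washington.edu" }
];

// Collect the five features for each school.
var rows = schools.map(function (s) {
  return {
    name: s.name,
    peerAssessment: googleResultCount("link:" + s.url),
    sizeWeb: googleResultCount("site:" + s.url),
    numFaculty: googleResultCount('dr. "home page" site:' + s.url),
    scholarlyPubs: scholarResultCount('"' + s.name + '"'),
    newsMentions: newsResultCount('"' + s.name + '"')
  };
});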

First, I load the data into WEKA, a free, open-source data mining package written in Java. It implements many off-the-shelf classification and regression algorithms, with both an API and a GUI. Let's take a look at a few slices of the data:
This figure plots newsMentions on the x-axis against scholarlyPublications on the y-axis. The points plotted are the 50 schools that have a score in the US News rankings (schools beyond 50th place don't have an overall score). The color of the dots goes from low (blue) to high (orange), and the trend runs from blue in the lower-left to orange in the upper-right. As you can see, not only are the two Google queries correlated with each other, they also seem to be jointly correlated with the US News score.
Plotting numberFaculty against scholarlyPublications also shows a positive correlation. So maybe these queries weren't totally bogus and carry some informational content.

The next step is to fit a statistical model to the training data. The basic idea is to train the model on the 50 colleges with US News scores and then apply it to all 1700+ American colleges. The natural first step is to try to fit a 5-dimensional line through the space. The best-fit line is:
USNewsRank = (-0.0003)·peerAssessment + (0)·sizeWeb + (0.0063)·numFaculty + (0)·scholarlyPubs + (0.0002)·newsMentions + 68.7534

This simple model has a root mean squared error (RMSE) of 10.4223. So, in the linear model, the size of the web and the number of scholarly publications play no role. Fair enough, but can we do better?
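For concreteness, here is the fitted line applied as code, with the coefficients exactly as reported above (the two zero-weight features drop out). The feature counts in the usage comment are made-up numbers, purely for illustration:

// Predicted US News-style score from the fitted linear model.
function predictUSNewsScore(f) {
  return -0.0003 * f.peerAssessment
       + 0.0063 * f.numFaculty
       + 0.0002 * f.newsMentions
       + 68.7534;
}

// predictUSNewsScore({ peerAssessment: 20000, numFaculty: 900, newsMentions: 5000 })
// -> 69.4234 (illustrative inputs, not real query counts)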

The answer is yes. I next used a support vector machine with a quadratic kernel, which gave an RMSE of 7.2724 on the training data. The quadratic kernel allows more complex dependencies in the training data to be modelled, which is why the training error is lower, but would this result in better evaluation on the larger data set? There is no quick answer to this, but we can see from the output what the model predicted.
Name                                   USNews   SVM
University of Washington               57.4     98.929
Yale University                        98       98.081
Harvard University                     98       97.953
Massachusetts Institute of Technology  93       92.996
Stanford University                    93       92.922
National University                    --       92.523
Columbia University                    86       92.255
Princeton University                   100      90.609
New York University                    65       85.271
University of Chicago                  85       85.052
Indiana University                     --       83.973
University of Pennsylvania             93       83.91
Duke University                        93       79.487
University of Southern California      66       78.645
University of Pittsburgh               --       78.274
Cornell University                     84       78.051
University of Florida                  --       77.864
University of Colorado                 --       76.877
The American College                   --       76.597
University of California, Berkeley     78       76.192

("--" marks schools without an overall US News score.)
This table shows the top 20 scores given by my program, alongside the US News rating when available (i.e., when the school was in the top 50). As you can see, many schools received consistent marks across the two ratings. However, there are quite a few surprises. My program ranked the University of Washington as the best school, where US News gave it only 57.4. Having visited UDub myself while I was working at Microsoft, I'm not completely surprised: it's a truly modern university that has recently been producing lots of good work--but let's not overgeneralize. I believe that "National University" placing high in my rankings is a flaw in the methodology; there are probably many spurious matches to that query, like "Korea National University". Ditto for "The American College". They just had fortunate names. However, I think the scores for other schools unranked by US News, like the University of Pittsburgh and Indiana University, are legitimate.

Now, is this a "good" ranking? It's hard to say, since there is no magic gold standard to compare it to; that is the problem with many machine learning experimental designs. But without squabbling too much over the details, I think this quick test shows that even some very basic statistics derived from Google queries can be rich enough in information to answer loosely-defined questions. Although I don't think my program is going to have high school guidance counselors worried about their jobs anytime soon, it does do a decent job. An advantage of this approach is that, given a few training examples, it can be personalized to rank all of the schools based on your own preferences.

Extracting useful knowledge by applying statistics to summary data (page counts) is one thing, but I've since taken it to the next level by actually analyzing the content within the HTML page. The result of that work is a project called Diffbot, and you can check it out here.

If you're interested, take a look at the full ranking of all 1720 schools (in zipped CSV format).

The Untimely Death of MusicSearch

You might have noticed that I have disabled my MusicSearch script. The page was up for a grand total of 3 days (1/13-1/15). During that time, it got over 3400 hits from around 1,100 unique visitors, meaning the average person performed about 3 searches using the script. I'm not sure how it got so popular, but it seems that someone initially discovered my page through the Blogspot network and posted it on a French news site. From there, the promise of free music started spreading, and my link started showing up at several other French, Spanish, and Greek sites as well as music forums. Some of those links are still there (here, here, here). Since I don't read any of those languages, I don't really know what they are saying about my page. The Google translation of the comments gives me this:
piranha - 15/01/2006 to 13:04:36 thank you for the bond! Nicolas - 16/01/2006 to 09:41:45 it was surely a good idea. When that functioned coconut - 17/01/2006 to 01:43:30 It was well yes, and there was also French song! Not like there bond
Although my script later used caching, at some point the operator of the radio.blog network took notice of the traffic and sent me the following e-mail:
From: astro@mubility.com
To: mike@cs.stanford.edu
Hello,

I'm the owner of radio.blog.club.
What can I do that make you remove your script ?

What you do will bring us nowhere. When you make it easy to download MP3
from my website, you take both of us to an illegal level. One day or another, someone will ask to shut the site off because of this, and we will have no choice that take it down.

I'm not sure this is what you want. So PLEASE, remove your script.

Regards.

-astro
So, I have complied with astro's wishes and removed my script, which was completely functional at the time. Radio.blog's system encourages people to upload their music to publicly accessible locations on the web server, and their search engine at http://radioblogclub.com makes it easy to find anyone's radio.blog music.

Behind MusicSearch

Some people have been asking me how MusicSearch works and where the songs come from. Well, you know those embedded Flash players that people put in their Xanga/LiveJournal/MySpace pages that blast embarrassing music every time you try to visit your buddy's page in a library or coffee shop? I took a look at the code for one of the popular ones, radio.blog. It turns out that the sound files are stored in an obvious, public directory. So the MusicSearch that I wrote is just a simplified interface on top of the radio.blog search engine, one that makes the public path to the audio file more explicit. The sound files are typically radio-broadcast quality, 64kbps or lower, but give you a good sense of the song.

A New Home

I've started building a page at http://ai.stanford.edu/~mike. It's an early work in progress. Also mirrored at http://miketung.com.

MusicSearch

Try out this thingy I just wrote: http://www.streetpricer.com/MusicSearch
It works only for Windows users with Windows Media Player (WMP). The songs are full tracks, not short clips, which you can download to your computer using "Save as..." in WMP. You'll notice that it actually generates an ASX playlist that plays in WMP; if you hit the "next track" button, it'll play other similar songs on the playlist.
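For reference, an ASX playlist is just a small XML file whose <entry> elements point at audio URLs. Here's a sketch of how a MusicSearch-style playlist could be built; the track title and URL below are made-up placeholders, not actual radio.blog paths:

// Build an ASX playlist string for a list of tracks.
function makeAsx(tracks) {
  var xml = '<asx version="3.0">\n';
  for (var i = 0; i < tracks.length; i++) {
    xml += '  <entry>\n' +
           '    <title>' + tracks[i].title + '</title>\n' +
           '    <ref href="' + tracks[i].url + '" />\n' +
           '  </entry>\n';
  }
  return xml + '</asx>';
}

// makeAsx([{ title: "Some Song", url: "http://example.com/radioblog/song.mp3" }]);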

UPDATE: If you can't get it working, there are a few things you might need to do in WMP:
1. Turn off Shuffle in the Play menu
2. Turn on "Connect to the Internet" in Tools > Options > Player
Hey, it only took me a few hours to write...

An Illusion


Count the number of men in the picture before and after it changes. Where does the extra man come from?