August 2, 2006

Ars Technica

Mining the New York Times with machines

By Nate Anderson

The discipline of text mining took a step forward recently as a team from the University of California-Irvine used a new technique called "topic modeling" to sift 330,000 articles from the New York Times archive. The team's goal was to have their computers sort the stories by topic, without requiring any human training or intervention. Computers have trouble understanding large fields of unstructured text without guidance, but the new approach enables them to engage in some unsupervised learning that could soon pay huge dividends for academics, corporations, and government security programs alike.

Text mining, the extraction of information from a large cluster of documents, differs from simple searching. In a traditional search, the user knows what she is looking for (words like "isobutane" or "Cell processor"). In text mining, by contrast, the user wants to understand what a whole set of documents contains, such as finding every corporate email dealing with legal matters. This requires the computer to parse the information and construct webs of topics and relationships between documents, something that goes far beyond a basic search.

This is where text mining software comes in. Unlike data mining, which is simpler for computers to handle, text mining has proved a tough nut to crack. Older approaches required lengthy training, but the virtue of the new topic modeling technique is that the computer can make sense of documents even when they contain information it has never seen before.

The UCI team managed this by programming their software to find patterns of words that occurred together in New York Times articles published between 2000 and 2002. Once these word patterns were indexed, the software turned them into topics and was able to construct a map of such topics over time. The team's example is a set of words that tended to appear in the same article: "rider," "bike," "race," and "Lance Armstrong." The topic for this story would obviously be the Tour de France, and the software could use its word patterns to chart how often the bike race was discussed in the newspaper.
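The co-occurrence intuition behind this can be sketched in a few lines of Python. To be clear, this is an illustrative toy, not the UCI team's actual statistical topic model: the articles, words, years, and threshold below are all invented for the example.

```python
from collections import Counter
from itertools import combinations

# Toy corpus: (year, words) pairs standing in for dated news articles.
# All data here is made up to mirror the Tour de France example.
articles = [
    (2000, ["rider", "bike", "race", "armstrong"]),
    (2000, ["election", "vote", "ballot"]),
    (2001, ["rider", "bike", "race", "armstrong", "tour"]),
    (2002, ["bike", "race", "armstrong", "rider"]),
    (2002, ["election", "vote", "campaign"]),
]

# Step 1: count how often each pair of words appears in the same article.
pair_counts = Counter()
for _, words in articles:
    for pair in combinations(sorted(set(words)), 2):
        pair_counts[pair] += 1

# Step 2: treat words that co-occur in at least 3 articles as one "topic"
# (a crude stand-in for the probabilistic grouping the real model does).
topic_words = set()
for (w1, w2), count in pair_counts.items():
    if count >= 3:
        topic_words.update([w1, w2])

# Step 3: chart how often the topic shows up in each year.
topic_by_year = Counter()
for year, words in articles:
    if topic_words & set(words):
        topic_by_year[year] += 1

print(sorted(topic_words))   # the cycling-related word cluster
print(dict(topic_by_year))   # how often that topic appeared per year
```

Running this groups "armstrong," "bike," "race," and "rider" into a single cluster and counts one matching article per year, which is the kind of topic-over-time map the article describes, just at toy scale.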

The new approach allows the computer to build its own topical database from a set of documents, and team researchers believe that it's a breakthrough for the field. "We have shown in a very practical way how a new text mining technique makes understanding huge volumes of text quicker and easier," said David Newman, a computer scientist at UCI. "To put it simply, text mining has made an evolutionary jump. In just a few short years, it could become a common and useful tool for everyone from medical doctors to advertisers; publishers to politicians."

With so much of the world's information stored as text, research organizations around the world have come to realize how important good text mining tools could turn out to be. The UK has a National Centre for Text Mining run by the universities of Manchester and Liverpool, and the journal Nature has pitched its own plan for an Open Text Mining Interface that would make it simpler for text mining software to process articles. Expect more text mining techniques to make their way out of the lab and into search engines over the next few years.
