August 8, 2006

The Orange County Register

'Data miners' at UCI moving beyond Google

By Colin Stewart

To Google or not to Google - that is not really a question any more.

Of course you Google, unless you don't use the Internet at all.

The top-ranked Internet search engine will help you find whatever you're hunting for: Birkenstock Arizona sandals, show times for "Talladega Nights," how a garbage disposal works.

But what happens if you're not quite sure what you're seeking?

Then you might want to turn to computer scientist David Newman and his colleagues at UC Irvine, who have helped develop software that searches huge expanses of text without being told what to find.

This form of "text mining" uses a technique, called statistical topic modeling, that is likely to have implications far beyond Internet searches. It can be used by marketers who want to study cultural trends, historians exploring the roots of modern society, doctors confronted with mountains of medical research, and intelligence agents analyzing masses of e-mail traffic among possible terrorists.

Topic modeling organizes data into categories by tracking and tabulating words that appear together frequently. For the computer user, scanning those categorized results is like browsing through a bookstore instead of ordering a particular book online.
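To make the idea of "tabulating words that appear together" concrete, here is a minimal Python sketch. It is only a simple co-occurrence tally, not the statistical model the UCI researchers use, and the four toy documents are invented for illustration (the words are borrowed from categories mentioned later in this article).

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy documents, invented for illustration only.
docs = [
    "silk linen thread cloth",
    "church virtue character",
    "linen silk fine thread",
    "virtue church sermon",
]

def cooccurrence_counts(documents):
    """Count how many documents each unordered word pair appears in together."""
    counts = Counter()
    for text in documents:
        # Use each distinct word once per document, sorted so that
        # ("linen", "silk") and ("silk", "linen") are the same key.
        words = sorted(set(text.split()))
        for pair in combinations(words, 2):
            counts[pair] += 1
    return counts

pair_counts = cooccurrence_counts(docs)

# Keep pairs that share at least two documents; these hint at categories.
frequent_pairs = [pair for pair, n in pair_counts.items() if n >= 2]
print(sorted(frequent_pairs))
```

On this toy input the frequent pairs split into a "cloth"-like group and a "church"-like group, with no pair linking the two. A real topic model goes further by estimating, for every word, how strongly it belongs to each category.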

"To put it simply, text mining has made an evolutionary jump," Newman said. "In just a few short years, it could become a common and useful tool for everyone from medical doctors to advertisers, from publishers to politicians."

Topic modeling, which is also being developed by computer scientists at several research universities, has not yet been adopted by established data-search companies. But within a few years, Newman predicts, it will be in use outside an academic setting.

"When people first hear about it, they think, 'Oh, yeah. It's Google,' " said UCI history professor Sharon Block, who has used Newman's program in her research. "But it's really '$10,000 Pyramid.' "

In "Pyramid," a game show that started airing in 1973, celebrities shouted words to contestants who tried to name a category that the words had in common. In topic modeling, the computer lists the words in a category, but people still need to come up with a name for the category.

Programming the computer to choose a meaningful label for each category is a subject of ongoing research, Newman said. Topic modeling can be used not only in different fields but also in different ways - to spot trends, to organize unfamiliar data, or to hunt for unnoticed connections.

Spotting trends: Advertisers, marketers and publishers could learn from the patterns of rising and falling interest in football, bicycling, the Oscars and corporations' quarterly earnings that were apparent in Newman's latest research. He used topic modeling to analyze 330,000 newspaper stories, mostly from The New York Times. The program created categories from frequently related words and names of people, places and organizations.

His research demonstrated that, measured by the number of words the Times devoted to these topics from 2000 to 2002:

- The popularity of pro football increased, from a yearly maximum of about 25,000 words a month to 40,000 words a month.

- Interest in the Tour de France declined slightly during that period, from a peak of about 14,000 words per month to about 12,000.

- Discussions of the Oscars nearly doubled from 2001, when the winner for best picture was "Gladiator," to the next year, when the Oscar went to "A Beautiful Mind."

- Interest in corporate earnings was highest in 2001, when the dot-com boom was crumbling.

Historian Block, who is Newman's wife, used topic modeling to spot trends in 82,000 articles and ads in the Pennsylvania Gazette from 1728 to 1800, including the time when Benjamin Franklin owned the newspaper.

Among her findings, she noted that discussions related to religion dropped when writing about fashion and trade increased - and vice versa. The 1750s were the peak for the category "cloth," which included the words "worsted," "silk," "linen," "fine" and "thread," and the low point for discussions using words such as "church," "virtue" and "character."

Organizing unfamiliar data: Researchers have used topic modeling to analyze the 250,000 e-mails that Enron surrendered to the U.S. Department of Justice.

Newman said it could help users of the Google Library Project, which aims to create a digital record of millions of books from libraries at Stanford, Harvard and Oxford universities, the New York Public Library and elsewhere.

National security agencies could also use topic modeling to organize large masses of uncategorized data, which explains why Newman presented his findings at May's Intelligence and Security Informatics conference in San Diego.

Hunting for new connections: UCI computer scientists will help professors at UCI Medical School track down studies that could be linked to their research into schizophrenia.

They will use topic modeling on a database of 17 million medical research papers to find research that links genes and the brain regions that are involved in schizophrenia.

Because of how topic modeling works, they expect to find research that is potentially useful even if the studies aren't directly related to schizophrenia - and don't even mention the disorder.
