CS 221 - Information Retrieval

Synopsis

Purpose. An introduction to information retrieval, including indexing, retrieving, classifying, and clustering text and multimedia documents.

Book. Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze

Evaluation. Homework/lab projects (1/2) + Summaries (1/4) + Quizzes (1/4)

Pedagogy:
- Lectures cover the assigned readings by placing them in context, giving examples, and engaging in Q&A.
- Homework projects are hands-on vehicles for learning the material. Collaboration and knowledge exchange are encouraged in the projects, but mindless copying of solutions (aka cheating) is not allowed.
- Papers cover the foundations of the field of Information Retrieval, right from the original source.

Instructor: Prof. Cristina Lopes, DBH 5076, lopes at ics dot uci dot edu
Reader: TBD

Lectures: Tue & Thu 9:30-10:50am, ICS 174
Office hours: Mondays and Wednesdays, 11am-12pm, ICS 408

Projects

Project descriptions

There will be 3 projects, the last one with several milestones. Projects are due by midnight on the due date. Late projects will be accepted with penalties.

Submission

See instructions in each project description.

Important dates

Assignment	Topic	Due date	Weight
1	Text processing	1/20	20%
2	Web crawling	2/3	20%
3	Search Engine	2/17, 3/3, 3/15	60%

Quizzes

There will be 4 quizzes throughout the course. Quizzes are held on Tuesdays during the lecture. Each quiz covers the material taught since the previous quiz. The quiz with the worst score will be discarded. No quiz make-ups.

Quiz	Date
1	1/22
2	2/5
3	2/19
4	3/5
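
The grading arithmetic can be sketched as follows. This is only an illustration, not the official grading procedure: it assumes every component is scored on a 0-100 scale, that the three remaining quizzes count equally, and that the project and summary scores passed in are already averages. The function names and sample scores are hypothetical.

```python
def quiz_average(scores):
    """Mean of the quiz scores after discarding the single lowest one."""
    if len(scores) < 2:
        raise ValueError("need at least two quiz scores to drop one")
    kept = sorted(scores)[1:]  # drop the worst score
    return sum(kept) / len(kept)


def course_score(projects, summaries, quizzes):
    """Weighted total: projects 1/2, summaries 1/4, quizzes 1/4."""
    return 0.5 * projects + 0.25 * summaries + 0.25 * quizzes


# Hypothetical scores: the 70 is the worst quiz and is discarded.
quizzes = quiz_average([70, 90, 80, 100])
total = course_score(projects=85, summaries=90, quizzes=quizzes)
print(quizzes, total)
```

Under these assumptions, a single bad quiz does not affect the total at all, since only the best three of the four quizzes enter the average.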

Paper Summaries

Summaries are due on Fridays. Summaries submitted up to one week late will incur a penalty of 35%. No summaries will be accepted more than one week past their due date.

Each article should be summarized in no more than one page, with the following structure: (a) objective summary of the article (do not inject your views here, be objective); (b) short personal commentary about the article (your views here).

Submit one PDF file per week containing all the summaries for that week.

Please name your paper summary files like this:
LastName_WeekNumber.pdf
starting with WeekNumber=1 for the first week.
Files that don't follow this convention may be missed by the instructors.

Include your full name and student ID in the summary itself.

Turn in summaries via EEE Dropbox.

Syllabus:


Week	Date	Topic	Weekly materials	Deliverables	Notes
1	1/8	Web Search Basics	Textbook Chapter 19: Web Search Basics (no need to summarize) 1. Wikipedia entry on Vannevar Bush 2. "As We May Think" The Atlantic Monthly, July, 1945. (reprinted in ACM CHI Interactions, March 1996)	Summaries	Slides
	1/10				The Web Slides
2	1/15	Text Processing Search Engine Optimization	3. "Stuff I've seen: A system for personal information retrieval and re-use" by S. Dumais, E. Cutrell, J. Cadiz, G. Jancke, R. Sarin, and D. Robbins, SIGIR, 2003 Commentary: "This paper addresses an increasingly important problem - how to search and manage personal collections of electronic information. ... it addresses an important user-centered problem. ...this paper presents a practical user interface to make the system useful. ..., the paper includes large scale, user-oriented testing that demonstrates the efficacy of the system. ..., the evaluation uses both quantitative and qualitative data to make its case. I think this paper is destined to be a classic because it may eventually define how people manage their files for a decade. Moreover, it is well-written and can serve as a good model for developers doing system design and evaluation, and for students learning about IR systems and evaluation." 4. "Simple, Proven Approaches to Text Retrieval" by Robertson and Jones Commentary: "This paper provides a brief but well informed and technically accurate overview of the state of the art in text retrieval, at least up to 1997. It introduces the ideas of terms and matching, term weighting strategies, relevance weighting, a little on data structures and the evidence for their effectiveness. In my view it does an exemplary job of introducing the terminology of IR and the main issues in text retrieval for a numerate and technically well informed audience. It also has a very well chosen list of references."	Summaries	Slides Slides
	1/17				Slides Slides
3	1/22	Web crawling	Textbook Chapter 20 : Web Crawling and Indices (no need to summarize) 5. "The Web As a Graph" by R. Kumar, P Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, E. Upfal, PODS 2000 Abstract: "The pages and hyperlinks of the World-Wide Web may be viewed as nodes and edges in a directed graph. This graph has about a billion nodes today, several billion links, and appears to grow exponentially with time. There are many reasons -- mathematical, sociological, and commercial -- for studying the evolution of this graph. We first review a set of algorithms that operate on the Web graph, addressing problems from Web search, automatic community discovery, and classification. We then recall a number of measurements and properties of the Web graph. Noting that traditional random graph models do not explain these observations, we propose a new family of random graph models."	Summaries	Slides
	1/24				Slides More
4	1/29*	Web Crawling	On Tuesday, 1/29 Prof. Chen Li will share his experiences of doing search-related research and commercializing the results. The work is mainly conducted in the iPubmed project (http://ipubmed.ics.uci.edu), which can support instant, fuzzy search on more than 21 million medical publications. He has been doing a startup, called SRCH2, to commercialize the techniques. The company has spent the last few years developing a new full text search software from the ground up. Here are some of the things SRCH2 can do: it resides wholly in-memory, deploys multi-threaded queries, uses cached forward indexing to enable instant recommendations, does rapid geo search, error correction, customizable rankings, with real-time updates, all in parallel, in scale. Each feature addresses significant pain points for our growing list of enterprise mobile, social, and e-commerce clients. He will share experiences of doing research commercialization. 6. "How Google Code Search Worked " by Russ Cox (January 2012) Commentary: Google code search has been a great resource for developers, but it has just been shut down. This blog post explains how it worked.	Summaries	*invited lecturer
	1/31				Slides
5	2/5	Index Construction and Scoring	Textbook Chapter 4 : Index Construction 7. The unreasonable effectiveness of data Commentary: Three Google researchers summarize the benefits of data-driven problem-solving in an essay that borrows the title from another famous paper that proposes the opposite.	Summaries	Slides
	2/7				Compression MapReduce Hadoop
6	2/12	Querying, Scoring, Term Weighting and the Vector Space model	Textbook Chapter 1 : Boolean Retrieval Textbook Chapter 6 : Scoring, term weighting & the vector space model 8. "A vector space model for automatic indexing" by Salton, Wong, and Yang	Summaries	Slides Slides
	2/14				Slides
7	2/19	Search Engine Evaluation Vector Space Model	10. "MapReduce: Simplified Data Processing on Large Clusters" by Jeffrey Dean and Sanjay Ghemawat Commentary: the paper that revolutionized modern data processing and made "cloud computing" trendy, and a great example of how programming language concepts can be applied to the design of real systems.	Summaries	Slides
	2/21				Slides
8	2/26	Link Analysis	Textbook Chapter 21 : Link Analysis 9. "The Anatomy of a Large-Scale Hypertextual Web Search Engine" by S. Brin and L. Page (this link is to the long version; the short version was published in WWW1998) Commentary: "This paper (and the work it reports) has had more impact on everyday life than any other in the IR area. A major contribution of the paper is the recognition that some relevant search results are greatly more valued by searchers than others. By reflecting this in their evaluation procedures, Brin and Page were able to see the true value of web-specific methods like anchor text. The paper presents a highly efficient, scalable implementation of a ranking method which now delivers very high quality results to a billion people over billions of pages at about 6,000 queries per second. It also hints at the technology which Google users now take for granted: spam rejection, high speed query-based summaries, source clustering, and context(location)-sensitive search. IR and bibliometrics researchers had done it all (relevance, proximity, link analysis, efficiency, scalability, summarization, evaluation) before 1998 but this paper showed how to make it work on the web. For any non-IR engineer attempting to build a web-based retrieval system from scratch, this must be the first port of call."	Summaries	Slides
	2/28
9	3/5	Matrix decompositions and latent semantic indexing	Textbook Chapter 18 : Matrix Decompositions and latent semantic indexing Additional tutorial on LSA, with code 11. "Indexing by latent semantic analysis" by Deerwester, Dumais, et al. Commentary: "IR, as a field, hasn't directly considered the issue of semantic knowledge representation. The above paper is one of the few that does in the following way. LSI is latent semantic analysis (LSA) applied to document retrieval. LSA is actually a variant of a growing ensemble of cognitively-motivated models referred to by the term "semantic space". LSA has an encouraging track record of compatibility with human information processing across a variety of information processing tasks. LSA seems to capture the meaning of words in a way which accords with the representations we carry around in our heads. Finally, the above paper is often cited and interest in LSI seems to have increased markedly in recent years. The above paper has also made an impact outside our field. For example, recent work on latent semantic kernels (machine learning) draws heavily on LSI."	Summaries	Slides
	3/7
10	3/12	Matrix decompositions and latent semantic indexing	Textbook Chapter 8 : Evaluation in Information Retrieval 12. "Unsupervised Named-Entity Extraction from the Web: An Experimental Study" (Etzioni et al.) Commentary: "This paper represents a new generation of IR work that attempts to do more than build a bag of words for information retrieval, but also attempts to make some sense of the information as well."	Summaries	Slides Slides
	3/14

Exam: no exam

Academic Honesty

I trust all students are honest and do not cheat. Those who break my trust at any point will get an F in the course - no excuses or apologies will be accepted. Additional penalties may also be imposed by the department and the university. Very severe incidents of academic dishonesty can result in suspension or expulsion from the university.

So don't risk it! If, for some reason, you can't do the homework on time or can't study for a quiz, you're better off skipping it than cheating. Do the math!

Students with Disability

Any student who feels he or she may need an accommodation based on the impact of a disability should contact me privately to discuss his or her specific needs. Also contact the Disability Services Center at (949) 824-7494 as soon as possible to better ensure that such accommodations are implemented in a timely fashion.