Information Retrieval Assignments

INFX 141 / CS 121 • DAVID G. KAY • UC IRVINE • WINTER 2015

Assignment #5
Search Engine

[Still subject to clarifying changes]

Like the last assignment, this one comes in two alternative versions, a Developer version (appearing immediately below) and an Analyst version (appearing later in this document). Both versions have milestones (intermediate deadlines), so mark your calendars and plan accordingly.

You may do this assignment individually or in groups of 2 or 3. As before, the expectations of each size group are the same; shared labor is offset by communication and coordination costs. All group members receive the same score except in truly extraordinary circumstances. You may use Java or Python to write the software you need for either part; if you'd like to use other tools not mentioned here, check with us. Use Piazza for general questions whose answers can benefit everybody.

DEVELOPER VERSION

In this assignment you will develop an entire search engine for a large collection of books: those available for free from Project Gutenberg. Here is a summary of the milestones:

Milestone	Due Date	Description	Deliverables	Points	Evaluation Criteria
M1	23 February	Get the data	Screen shot (.jpg)	10	Were you able to get the data?
M2	2 March	Lucene up and running	ZIP file with code	20	Were you able to get the demo up and running with the small changes? Did you provide a correct index?
M3	12 March	Search engine	Code and ZIP file with pictures	20	Were you able to index the entire collection? Does your search seem to work? Did you add the Author and Title fields?
M4	12–16 March	Demonstrations		50	Does your search engine work for the TA's queries? Do you understand how it works?
Extra Credit	12 March	Extra credit Web UI	Show at demo	15	Does the UI work?

[Developer] M1. Get the data

Get the entire collection of English ebooks from Project Gutenberg. The collection is available here as an ISO file: http://www.gutenberg.org/wiki/Gutenberg:The_CD_and_DVD_Project . (Just download the contents of the April 2010 DVD; don't include earlier versions.)

Deliverable: A screen shot (in JPG form) showing the folder structure of the collection with evidence that it's located on your computer (i.e., a directory path should be displayed that includes your computer's name).

[Developer] M2. Get the Lucene demo up and running

Download the Apache Lucene text search engine library from lucene.apache.org. Be sure to get the latest version, 4.10.3, and to download both (a) the package containing the source code, so you can see all the example code, and (b) the binary package, so you don't need to build Lucene from scratch. Place both packages in the same folder. If you'd like to try this in Python with the PyLucene wrapper, follow the additional instructions at http://lucene.apache.org/pylucene/ (and note that according to their page, PyLucene is compatible with Java Lucene version 4.10.1).

Once you have everything in place, navigate to the simple demo folder at contrib/demo. Own it—that is, copy and paste the code to your own project. Then:

Change the demo so that the Usage messages for both IndexFiles and SearchFiles end with a smiley face :-)
Run IndexFiles over the two test files, then query the resulting index with SearchFiles. The two test files can be found here: http://www.ics.uci.edu/~kay/courses/i141/hw/asst5-test-text-files.zip.

Deliverable: The modified IndexFiles.java and SearchFiles.java, or the equivalent Python files, and the Lucene index folder (and its contents) for the two files in the test run. Zip the index folder. Add any additional files you need to convey information to the TA.

[Developer] M3. Search engine for the Gutenberg collection

Make sure you understand the demo code in M2: Study the Lucene API and documentation. Here are some additional notes and requirements:

Ignore ETEXT* in the Gutenberg collection.
Study the raw data; you will notice that things aren't completely consistent. Some books are provided in Rich Text Format; others are plain text, zipped. Focus on the plain text books, i.e., the ones with the .txt extension that you find inside ZIP files.
Add Title and Author fields to your Lucene "Documents." Include more fields you think will improve the quality of search results. (The challenge here is to devise how to scrape that information out of the raw data.)
Boost matches in the Title and Author fields, and any other fields you see fit.
Luke is a very useful tool for inspecting Lucene indexes; download it at https://github.com/DmitryKey/luke.
To unzip files in Java, look at java.util.zip.
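
If you are working in Python rather than Java, the standard zipfile module plays the role of java.util.zip. Here is a minimal sketch of the scraping step, under several assumptions you should check against the raw data: that ETEXT* names are folders you can prune while walking the collection, that each book starts with header lines of the form "Title: ..." and "Author: ...", and that decoding as UTF-8 with replacement characters is acceptable. The function names (zip_paths, books_in_zip, extract_fields) are hypothetical, and some books will not match this pattern at all.

```python
import io
import os
import re
import zipfile

# Assumed header layout: lines such as "Title: ..." and "Author: ..." near the
# top of the file. Not every book in the collection is guaranteed to have them.
HEADER_RE = re.compile(r"^(Title|Author):[ \t]*(.+?)[ \t\r]*$", re.MULTILINE)


def extract_fields(text, max_chars=5000):
    """Scrape 'title' and 'author' from the first max_chars characters of a book."""
    fields = {"title": None, "author": None}
    for name, value in HEADER_RE.findall(text[:max_chars]):
        key = name.lower()
        if fields[key] is None:  # keep only the first occurrence of each field
            fields[key] = value
    return fields


def books_in_zip(zip_source):
    """Yield (member_name, fields, text) for every .txt member of one ZIP archive.

    zip_source may be a file path or a file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        for member in archive.namelist():
            if not member.lower().endswith(".txt"):
                continue  # skip RTF and anything else that is not plain text
            # Encoding is an assumption; undecodable bytes become U+FFFD.
            text = archive.read(member).decode("utf-8", errors="replace")
            yield member, extract_fields(text), text


def zip_paths(root):
    """Yield the path of every .zip file under root, pruning ETEXT* folders."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place tells os.walk not to descend into them.
        dirnames[:] = [d for d in dirnames if not d.upper().startswith("ETEXT")]
        for filename in filenames:
            if filename.lower().endswith(".zip"):
                yield os.path.join(dirpath, filename)
```

Each (fields, text) pair would then become one Lucene "Document," with the title and author stored in their own fields. Books for which extract_fields returns None values are the inconsistent cases you will need to handle some other way.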

Deliverable: Your source code as a Zip archive, plus one or more screen shots (JPG) of your index as seen by Luke. Make sure the picture(s) show the structure and the total number of documents in the index.

[Developer] M4. Demo

Keep an eye out for a message arranging the demonstrations and follow the instructions in that message.

[Developer] Extra credit: Give your search engine a Web UI

Instead of a command line interface, let people search the index with a web interface. Use whatever web framework you're familiar with. Tomcat is known to be a good container for Lucene apps.

The extra credit work will be assessed during the live demo; there is no separate deliverable. If you choose to do it, make sure to tell the TA that you did it.

ANALYST VERSION

In this assignment you will develop a search engine for a collection of poems, specifically haiku. The collection you will search is the Harold G. Henderson Memorial Award Collection available for free from the Haiku Society of America. This will be a physical search engine, not an electronic one, and you are going to be the agent of computation. Here is a summary of the milestones:

Milestone	Due Date	Description	Deliverables	Points	Evaluation Criteria
M1	23 February	Get the data	PDF report	10	Were you able to get the data? Are the poems properly identified? Does the layout of the poems look as if it will work well for mechanical search?
M2	2 March	Build the Index	PDF report	20	Is the index you provided correct? Does it map all words? Does it map the words to all poems in which they occur? Is TF-IDF computed correctly?
M3	12 March	Search engine	PDF report	20	Does your search produce the correct poems with the correct ranking? Does your mechanical search device exist, with pictures to prove it?
M4	12–16 March	Demonstrations		50	Are you able to search, given the TA's queries? Are your search results correct? Is your search process fast (shooting for 30 seconds for a three-term query)?
Extra Credit	12 March	Extra credit: better search	Show at demo	20	Is your search able to retrieve documents containing words not in the query but related to it?

[Analyst] Overview

Your "search engine" will be a physical one, which you will manipulate to carry out searches "by hand." We envision it something like this:

[Pictures: a Rolodex; a binder with tabs]

At the end, you will show the TA the components of your engine and use it to search for specific queries the TA will give, e.g., "lawless mother cooking." When the TA gives you a query, you will play the computer's role: search the index, make some calculations, and produce a ranked list of poems. This is your search process; you are the computing agent.

Your search should be as fast as possible and it should produce a ranked list of poems for each query, ordered by how well they satisfy the query (using TF-IDF as the scoring heuristic).
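
To make the ranking step concrete, here is a minimal Python sketch of the calculation you will be carrying out by hand. It assumes that a poem's score is the sum of the TF-IDF weights of the query words it contains, which is one common reading of "TF-IDF as the scoring heuristic"; if a different scoring rule is specified elsewhere in the course, use that instead. The poem identifiers and weights in TOY_INDEX are invented for illustration and are not taken from the collection.

```python
from collections import defaultdict

# Invented example: word -> {poem identifier: TF-IDF weight}.
TOY_INDEX = {
    "lawless": {"H07": 1.5},
    "mother": {"H03": 1.0, "H07": 0.5},
    "cooking": {"H03": 0.25},
}


def rank(query, index):
    """Return [(poem_id, score), ...] sorted from best match to worst.

    Assumed scoring rule: add up the weights of every query word found in
    the poem. Ties are broken by poem identifier so the order is repeatable.
    """
    scores = defaultdict(float)
    for term in query.lower().split():
        # Words missing from the index simply contribute nothing.
        for poem_id, weight in index.get(term, {}).items():
            scores[poem_id] += weight
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for poem_id, score in rank("lawless mother cooking", TOY_INDEX):
        print(poem_id, score)
```

With these invented weights, poem H07 (1.5 + 0.5 = 2.0) ranks ahead of H03 (1.0 + 0.25 = 1.25). The physical version of the inner loop is one lookup in the rolodex per query word, followed by the additions on your calculator.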

Materials: You will need:

All the haiku [poems] on the page cited above.
A binder, sheets of white paper, and labels (possibly small Post-Its)
A rolodex, or another binder with tabs
A calculator or calculator app

Construction: The binder holds the collection of poems, one or more per sheet. Make sure to place identifiers on labels, sticking out past the page edge. You can decide what kind of identifier you need. The rolodex (or second binder) holds the index to the words.

[Analyst] M1. Get the data

Get the entire collection of haiku poems from the cited page and place them in a document. Make sure to give identifiers to all poems. This document will be the source of what you will be printing and filing in the binder. As such, think carefully about how many poems you place per page, and what kinds of identifiers you will use.

Deliverable: A PDF document with the collection of properly identified poems.

[Analyst] M2. Build the index

This is going to be the most time-consuming step of this project, so you should start it as early as possible, ideally in the first week of the project. Here is what you need to do: Scan through the poems and build an inverted index that maps words to poems. Do this in a document. Besides mapping words to poems, the index should also have the TF-IDF of the words in each poem. Consider doing a little bit of programming to help you compute all this data. (Partial credit will be given for incomplete indexes. If you show a complete index at milestone M4, your score for that milestone will improve.)
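
As an example of the "little bit of programming" that could help, here is a minimal Python sketch that builds an inverted index with TF-IDF weights. The weighting is an assumption: tf is the raw count of the word in the poem and idf is log10(N / df), where N is the number of poems and df is the number of poems containing the word. If the course defines TF-IDF differently, substitute that formula. The tokenizer (lowercase, letters and apostrophes only) is also an assumption, and the function names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into words (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def build_index(poems):
    """Build an inverted index: word -> {poem_id: TF-IDF weight}.

    poems is a dict of poem_id -> poem text.
    Assumed weighting: tf = raw count, idf = log10(N / df), rounded to 3 places.
    """
    counts = {poem_id: Counter(tokenize(text)) for poem_id, text in poems.items()}

    # Document frequency: in how many poems does each word occur?
    doc_freq = Counter()
    for counter in counts.values():
        doc_freq.update(counter.keys())

    total = len(poems)
    index = defaultdict(dict)
    for poem_id, counter in counts.items():
        for word, tf in counter.items():
            idf = math.log10(total / doc_freq[word])
            index[word][poem_id] = round(tf * idf, 3)
    return dict(index)


if __name__ == "__main__":
    # Placeholder lines, not poems from the collection.
    demo = {"H01": "winter rain falls", "H02": "rain rain on the roof"}
    for word, postings in sorted(build_index(demo).items()):
        print(word, postings)
```

Note that under this formula a word occurring in every poem gets a weight of 0, since log10(N / N) = 0. Sorting the output alphabetically, as in the usage lines, gives you something close to what you will print and file behind the rolodex tabs.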

Deliverable: A PDF document with the index.

[Analyst] M3. Physical search engine

Print and mount the poems document and the index document into the physical devices, i.e., the rolodex and the binder. Place the necessary labels.

Deliverable: A PDF document that explains the search process for the query “lawless mother cooking”. Include plenty of pictures of how your physical device is being used at each step. Explain the scoring process clearly.

[Analyst] M4. Demo

Bring your physical devices to your appointment with the TA.

[Analyst] Extra credit: Word associations

Sometimes people don’t know what they are searching for and use the wrong words. For example, users could say “mother” when they really mean “father” or “parent”; they could say “woods” for “forest”, and so on. This collection of poems has several words that are associated with each other. Find those words and add their occurrences to the index in a way that makes sense. The goal is for you to index those associated words so that if a user searches, for example, for “father,” the poems that have occurrences of “mother” will also be retrieved, possibly with a lower rank than poems with occurrences of “father”. (Do this not just for “mother” and “father” but for several other words in the collection that are associated with each other.) The goal here is to capture the deeper semantics of concepts beyond the face value of words.

The extra credit work will be assessed during the live demo; there is no separate deliverable. If you choose to do it, make sure to tell the TA that you did it.
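
One way to think about the word-association idea is as a post-processing pass over an index you have already built. The Python sketch below is only one possible design, not a required one: it assumes the index maps each word to {poem identifier: weight}, that you supply the association list yourself, and that associated matches are discounted by an arbitrary factor of 0.5 so they rank below direct matches. The function name, the discount, and the sample data are all hypothetical.

```python
def add_associations(index, associations, discount=0.5):
    """Return a new index in which each word also lists poems that contain
    its associated words, at a discounted weight.

    index:        word -> {poem_id: weight}
    associations: word -> list of associated words (chosen by you)
    discount:     multiplier for associated matches (0.5 is an arbitrary choice)
    """
    # Copy the postings so the original index is left untouched.
    expanded = {word: dict(postings) for word, postings in index.items()}
    for word, related_words in associations.items():
        target = expanded.setdefault(word, {})
        for related in related_words:
            # Read from the original index so associations do not cascade.
            for poem_id, weight in index.get(related, {}).items():
                discounted = weight * discount
                # Keep the larger weight if the poem is already listed.
                if discounted > target.get(poem_id, 0.0):
                    target[poem_id] = discounted
    return expanded


if __name__ == "__main__":
    # Invented identifiers and weights, for illustration only.
    index = {"mother": {"H03": 1.0}, "father": {"H09": 2.0}}
    associations = {"father": ["mother"], "mother": ["father"]}
    print(add_associations(index, associations))
```

In this invented example, a search for "father" now also retrieves H03 (the "mother" poem) at weight 0.5, below H09 at 2.0. In the physical index, the equivalent is writing the extra poem identifiers, with their reduced weights, on the card for each associated word.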

David G. Kay, kay@uci.edu
Wednesday, February 25, 2015 7:25 PM