INFX 141 / CS 121 • DAVID G. KAY • UC IRVINE • WINTER 2015
Assignment #5
Search Engine
[Still subject to clarifying changes]
Like the last assignment, this one comes in two alternative versions, a Developer version (appearing immediately below) and an Analyst version (appearing later in this document). Both versions have milestones (intermediate deadlines), so mark your calendars and plan accordingly.
You may do this assignment individually or in groups of 2 or 3. As before, the expectations of each size group are the same; shared labor is offset by communication and coordination costs. All group members receive the same score except in truly extraordinary circumstances. You may use Java or Python to write the software you need for either part; if you'd like to use other tools not mentioned here, check with us. Use Piazza for general questions whose answers can benefit everybody.
DEVELOPER VERSION
In this assignment you will develop an entire search engine for a large collection of books: those available free from Project Gutenberg. Here is a summary of the milestones:
Milestone | Due Date | Description | Deliverables | Points | Evaluation Criteria |
M1 | 23 February | Get the data | Screen shot (.jpg) | 10 | Were you able to get the data? |
M2 | 2 March | Lucene up and running | ZIP file with code | 20 | Were you able to get the demo up and running with the small changes? Did you provide a correct index? |
M3 | 12 March | Search engine | Code and ZIP file with pictures | 20 | Were you able to index the entire collection? Does your search seem to work? Did you add the Author and Title fields? |
M4 | 12–16 March | Demonstrations | | 50 | Does your search engine work for the TA's queries? Do you understand how it works? |
Extra Credit | 12 March | Extra credit Web UI | Show at demo | 15 | Does the UI work? |
[Developer] M1. Get the data
Get the entire collection of English ebooks from Project Gutenberg. The collection is available here as an ISO file: http://www.gutenberg.org/wiki/Gutenberg:The_CD_and_DVD_Project . (Just download the contents of the April 2010 DVD; don't include earlier versions.)
Deliverable: A screen shot (in JPG form) showing the folder structure of the collection with evidence that it's located on your computer (i.e., a directory path should be displayed that includes your computer's name).
[Developer] M2. Get the Lucene demo up and running
Download the Apache Lucene text search engine library from lucene.apache.org. Be sure to get the latest version, 4.10.3, and to download both (a) the package containing the source code, so you can see all the example code, and (b) the binary package, so you don't need to build Lucene from scratch. Place both packages in the same folder.
If you'd like to try this in Python with the PyLucene wrapper, follow the additional instructions at http://lucene.apache.org/pylucene/ (and note that, according to that page, PyLucene is compatible with Java Lucene version 4.10.1).
Once you have everything in place, navigate to the simple demo folder at contrib/demo.
Own it—that is, copy and paste the code to your own project. Then:
http://www.ics.uci.edu/~kay/courses/i141/hw/asst5-test-text-files.zip
Deliverable: the modified IndexFiles.java and SearchFiles.java, or the equivalent Python files, and the Lucene index folder (and its contents) for the couple of files in the test run. Zip the index folder. Add any additional files you need to convey information to the TA.
[Developer] M3. Search engine for the Gutenberg collection
Make sure you understand the demo code in M2: Study the Lucene API and documentation. Here are some additional notes and requirements:
ETEXT* in the Gutenberg collection.
.txt extension that you find inside ZIP files.
https://github.com/DmitryKey/luke
java.util.zip
Deliverable: Your source code as a Zip archive, plus one or more screen shots (JPG) of your index as seen by Luke. Make sure the picture(s) show the structure and the total number of documents in the index.
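For groups working in Python, the following is a minimal sketch of two preparatory steps for M3: pulling the .txt entries out of a ZIP archive and picking out Title and Author values to store as index fields. It uses Python's standard zipfile module in place of java.util.zip. The function names iter_txt_members and extract_fields are hypothetical, and the code assumes (you should verify this against the actual files) that each book begins with header lines of the form "Title: ..." and "Author: ...". Adding the extracted text and fields to a Lucene document is not shown.

```python
import io
import re
import zipfile

# Assumption: header lines look like "Title: ..." and "Author: ...".
FIELD_RE = re.compile(r"^(Title|Author):\s*(.+)$", re.MULTILINE)


def iter_txt_members(zip_source):
    """Yield (member_name, text) for every .txt entry in one ZIP archive.

    zip_source may be a file path or a file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".txt"):
                raw = archive.read(name)
                # Not every file is valid UTF-8; replace undecodable bytes
                # instead of failing on them.
                yield name, raw.decode("utf-8", errors="replace")


def extract_fields(text, header_chars=5000):
    """Return a dict with 'title' and/or 'author' found near the top of text."""
    fields = {}
    for key, value in FIELD_RE.findall(text[:header_chars]):
        # Keep only the first occurrence of each field.
        fields.setdefault(key.lower(), value.strip())
    return fields
```

A caller would loop over the ZIP files in the collection, call iter_txt_members on each one, and pass every text through extract_fields before indexing it. Files without a recognizable header simply produce an empty dict, so decide how your index should treat a missing Author or Title.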
[Developer] M4. Demo
Keep an eye out for a message arranging the demonstrations and follow the instructions in that message.
[Developer] Extra credit: Give your search engine a Web UI
Instead of a command line interface, let people search the index with a web interface. Use whatever web framework you're familiar with. Tomcat is known to be a good container for Lucene apps.
The extra credit work will be assessed during the live demo; there is no separate deliverable. If you choose to do it, make sure to tell the TA that you did it.
ANALYST VERSION
In this assignment you will develop a search engine for a collection of poems, specifically haiku. The collection you will search is the Harold G. Henderson Memorial Award Collection available for free from the Haiku Society of America. This will be a physical search engine, not an electronic one, and you are going to be the agent of computation. Here is a summary of the milestones:
Milestone | Due Date | Description | Deliverables | Points | Evaluation Criteria |
M1 | 23 February | Get the data | PDF report | 10 | Were you able to get the data? Are the poems properly identified? Does the layout of the poems look as if it will work well for mechanical search? |
M2 | 2 March | Build the Index | PDF report | 20 | Is the index you provided correct: Does it map all words? Does it map the words to all poems in which they occur? Is TF-IDF computed correctly? |
M3 | 12 March | Search engine | PDF report | 20 | Does your search produce the correct poems with the correct ranking? Does your mechanical search device exist, with pictures to prove it? |
M4 | 12–16 March | Demonstrations | | 50 | Are you able to search, given the TA's queries? Are your search results correct? Is your search process fast (shooting for 30 seconds for a three-term query)? |
Extra Credit | 12 March | Extra credit: better search | Show at demo | 20 | Is your search able to retrieve documents containing words not in the query but related to it? |
[Analyst] Overview
Your "search engine" will be a physical one, which you will manipulate to carry out searches "by hand." We envision it something like this:
At the end, you will show the TA the components of your engine and use it to search for specific queries the TA will give, e.g., "lawless mother cooking." When the TA gives you a query, you will play the computer's role: search the index, make some calculations, and produce a ranked list of poems. This is your search process; you are the computing agent.
Your search should be as fast as possible and it should produce a ranked list of poems for each query, ordered by how well they satisfy the query (using TF-IDF as the scoring heuristic).
Materials: You will need:
Construction: The binder holds the collection of poems, one or more per sheet. Make sure to place identifiers on labels, sticking out past the page edge. You can decide what kind of identifier you need. The rolodex (or second binder) holds the index to the words.
[Analyst] M1. Get the data
Get the entire collection of haiku poems from the cited page and place them in a document. Make sure to give identifiers to all poems. This document will be the source of what you will be printing and filing in the binder. As such, think carefully about how many poems you place per page, and what kinds of identifiers you will use.
Deliverable: A PDF document with the collection of properly identified poems.
[Analyst] M2. Build the index
This is going to be the most time-consuming step of this project, so start it as early as possible, ideally in the first week of the project. Here is what you need to do: Scan through the poems and build an inverted index that maps words to poems. Do this in a document. Besides mapping words to poems, the index should also have the TF-IDF of the words in each poem. Consider doing a little bit of programming to help you compute all this data. (Partial credit will be given for incomplete indexes. If you show a complete index at milestone M4, your score for that milestone will improve.)
Deliverable: A PDF document with the index.
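If you do write a small program to compute the index data, it might look something like the sketch below. It builds an inverted index from poem identifiers to TF-IDF weights. The weighting used here is an assumption, not the course's official definition: tf is the raw count of the word in the poem and idf is log10(N / df), where N is the number of poems and df is the number of poems containing the word. Check the formula given in class and adjust accordingly. The names tokenize and build_index and the identifiers H1 and H2 are made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into words (apostrophes kept inside words)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def build_index(poems):
    """Build an inverted index: word -> {poem_id: tf-idf weight}.

    poems maps a poem identifier to the poem's text.
    Assumed weighting: raw term count times log10(N / document frequency).
    """
    term_counts = {pid: Counter(tokenize(text)) for pid, text in poems.items()}

    # Document frequency: in how many poems does each word occur?
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    n = len(poems)
    index = defaultdict(dict)
    for pid, counts in term_counts.items():
        for word, tf in counts.items():
            index[word][pid] = tf * math.log10(n / doc_freq[word])
    return dict(index)


if __name__ == "__main__":
    sample = {"H1": "old pond frog", "H2": "old crow"}
    for word, postings in sorted(build_index(sample).items()):
        print(word, postings)
```

Note that under this weighting a word that occurs in every poem gets a weight of zero, which is worth keeping in mind when you decide how to lay out the printed index.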
[Analyst] M3. Physical search engine
Print and mount the poems document and the index document into the physical devices, i.e., the rolodex and the binder. Place the necessary labels.
Deliverable: A PDF document that explains the search process for the query “lawless mother cooking”. Include plenty of pictures of how your physical device is being used at each step. Explain the scoring process clearly.
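To check a hand-computed ranking, it can help to run the same scoring steps in a few lines of code. The sketch below assumes one simple scoring rule, which may differ from the one your group adopts: a poem's score is the sum of the TF-IDF weights of the query words it contains. The function name rank, the poem identifiers, and the weights in the example are all made up for illustration.

```python
from collections import defaultdict


def rank(index, query):
    """Return [(poem_id, score), ...] sorted from best to worst match.

    index maps word -> {poem_id: tf-idf weight}.
    Assumed scoring rule: sum the weights of the query words found in each poem.
    """
    scores = defaultdict(float)
    for word in query.lower().split():
        # Words missing from the index contribute nothing.
        for pid, weight in index.get(word, {}).items():
            scores[pid] += weight
    # Highest score first; ties broken by poem identifier for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Made-up weights, for illustration only.
    toy_index = {
        "mother": {"H3": 1.0, "H7": 0.5},
        "cooking": {"H7": 0.75},
    }
    print(rank(toy_index, "lawless mother cooking"))
```

Each step maps onto a physical action: looking up a word's card in the rolodex, copying down its poems and weights, adding up the weights per poem, and sorting the totals.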
[Analyst] M4. Demo
Bring your physical devices to your appointment with the TA.
[Analyst] Extra credit: Word associations
Sometimes people don’t know what they are searching for and use the wrong words. For example, users could say “mother” when they really mean “father” or “parent”; they could say “woods” for “forest”, and so on. This collection of poems has several words that are associated with each other. Find those words and add their occurrences to the index in a way that makes sense. The goal is for you to index those associated words so that if a user searches, for example, for “father,” the poems that have occurrences of “mother” will also be retrieved, possibly with a lower rank than poems with occurrences of “father”. (Do this not just for “mother” and “father” but for several other words in the collection that are associated with each other.) The goal here is to capture the deeper semantics of concepts beyond the face value of words.
The extra credit work will be assessed during the live demo; there is no separate deliverable. If you choose to do it, make sure to tell the TA that you did it.
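One possible way to fold word associations into the index is sketched below; it is an illustration under stated assumptions, not a required design. It assumes you have drawn up the association table by hand, and it copies each associated word's postings into the entry for the original word at a reduced weight, so that associated poems are retrieved but tend to rank lower. The function name add_associations, the discount factor of 0.5, and the example identifiers and weights are all hypothetical choices.

```python
def add_associations(index, associations, discount=0.5):
    """Return a copy of index extended with discounted postings for associated words.

    index maps word -> {poem_id: weight}.
    associations maps word -> list of associated words (assumed hand-built).
    discount scales the borrowed weights; 0.5 is an arbitrary illustrative value.
    """
    expanded = {word: dict(postings) for word, postings in index.items()}
    for word, related_words in associations.items():
        target = expanded.setdefault(word, {})
        for related in related_words:
            # Read from the original index so borrowed postings do not chain.
            for pid, weight in index.get(related, {}).items():
                # Keep whichever is larger: the direct weight or the borrowed one.
                target[pid] = max(target.get(pid, 0.0), weight * discount)
    return expanded


if __name__ == "__main__":
    # Made-up weights, for illustration only.
    toy_index = {"mother": {"H3": 1.0}, "father": {"H5": 2.0}}
    toy_assoc = {"father": ["mother"], "mother": ["father"]}
    print(add_associations(toy_index, toy_assoc))
```

On paper, this corresponds to adding extra entries to a word's index card, marked so that you can tell direct occurrences from associated ones when you explain the ranking to the TA.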