The term is "000000000". There are 12 occurrences of that term in 5 documents. The term was found on the Wikipedia pages for "Sinclair_Coefficients" 1 time, "Free_Software_Foundation" 4 times, "Apollonian_gasket" 4 times, "List_of_Sunderland_A.F.C._managers" 2 times and "Finite_field_arithmetic" 1 time.
The output of this program is two binary files.
The first file is a table in which each row has a term and a pointer into the second file.
The second file is a binary representation of the posting lists. Use the pointers in the first table to do random-access lookups into it.
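The two-file layout described above can be sketched as follows. This is only one possible arrangement, not a required format: the class name PostingFiles, its method names, and the byte layout (a 4-byte posting count followed by 4-byte document id and 4-byte term frequency pairs, with an 8-byte offset stored per term) are all assumptions made for illustration.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedMap;

// Hypothetical helper: writes a term table and a posting file, then reads them back.
final class PostingFiles {

    // index maps term -> (document id -> term frequency).
    static void write(File termTable, File postingFile,
                      SortedMap<String, SortedMap<Integer, Integer>> index) {
        try (DataOutputStream terms = new DataOutputStream(
                 new BufferedOutputStream(new FileOutputStream(termTable)));
             DataOutputStream posts = new DataOutputStream(
                 new BufferedOutputStream(new FileOutputStream(postingFile)))) {
            long offset = 0; // byte position where the next posting list starts
            terms.writeInt(index.size());
            for (Map.Entry<String, SortedMap<Integer, Integer>> e : index.entrySet()) {
                // Term table row: the term and a pointer into the posting file.
                terms.writeUTF(e.getKey());
                terms.writeLong(offset);
                // Posting list: count, then (docId, termFrequency) pairs.
                SortedMap<Integer, Integer> postings = e.getValue();
                posts.writeInt(postings.size());
                for (Map.Entry<Integer, Integer> p : postings.entrySet()) {
                    posts.writeInt(p.getKey());
                    posts.writeInt(p.getValue());
                }
                offset += 4 + 8L * postings.size();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    // Loads the whole term table into memory as term -> byte offset.
    static Map<String, Long> loadTermTable(File termTable) {
        Map<String, Long> offsets = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(
                 new BufferedInputStream(new FileInputStream(termTable)))) {
            int rows = in.readInt();
            for (int i = 0; i < rows; i++) {
                String term = in.readUTF();
                offsets.put(term, in.readLong());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return offsets;
    }

    // Seeks straight to one posting list without reading the rest of the file.
    static Map<Integer, Integer> readPostings(File postingFile, long offset) {
        Map<Integer, Integer> postings = new LinkedHashMap<>();
        try (RandomAccessFile raf = new RandomAccessFile(postingFile, "r")) {
            raf.seek(offset);
            int count = raf.readInt();
            for (int i = 0; i < count; i++) {
                int docId = raf.readInt();
                postings.put(docId, raf.readInt());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return postings;
    }
}
```

With this layout, a lookup is loadTermTable once at startup, then readPostings(postingFile, offsets.get(term)) per query term. Fixed-width integers keep the offset arithmetic simple; a compressed encoding would shrink the file but needs a different offset calculation.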
Write a Java Program: Score a query
This part calculates the cosine ranking score for each candidate document.
The input to this part is the output of the previous program, plus the document table, plus a query from a user.
The output is a ranked list of the ten most relevant web pages in Wikipedia.
Do not create an accumulator for any term which occurs more than 50,000 times.
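A minimal sketch of accumulator-based cosine scoring with the 50,000-occurrence cutoff is shown here. Several things are assumptions, not part of the assignment: the class name CosineScorer, the weighting scheme (log term frequency times inverse document frequency), the in-memory posting map standing in for the on-disk files, and the precomputed document vector lengths. It also reads the cutoff rule as "skip such a term entirely", which is one possible interpretation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical scorer: one accumulator per candidate document.
final class CosineScorer {
    static final int MAX_OCCURRENCES = 50_000; // cutoff from the assignment
    static final int TOP_K = 10;               // size of the ranked list

    // postings: term -> (document id -> term frequency)
    // docLengths: document id -> Euclidean length of its weight vector (assumed precomputed)
    static List<Integer> rank(List<String> queryTerms,
                              Map<String, Map<Integer, Integer>> postings,
                              Map<Integer, Double> docLengths,
                              int totalDocs) {
        Map<Integer, Double> accumulators = new HashMap<>();
        for (String term : queryTerms) {
            Map<Integer, Integer> list = postings.get(term);
            if (list == null) {
                continue; // term does not occur in the collection
            }
            long occurrences = 0;
            for (int tf : list.values()) {
                occurrences += tf;
            }
            if (occurrences > MAX_OCCURRENCES) {
                continue; // too common: create no accumulators for this term
            }
            double idf = Math.log((double) totalDocs / list.size());
            for (Map.Entry<Integer, Integer> p : list.entrySet()) {
                double docWeight = (1 + Math.log(p.getValue())) * idf;
                // The query weight is taken to be idf (each query term counted once).
                accumulators.merge(p.getKey(), docWeight * idf, Double::sum);
            }
        }
        // Divide by document length. The query length is the same for every
        // document, so leaving it out does not change the ranking.
        accumulators.replaceAll((doc, score) -> score / docLengths.getOrDefault(doc, 1.0));

        List<Map.Entry<Integer, Double>> scored = new ArrayList<>(accumulators.entrySet());
        scored.sort((x, y) -> {
            int byScore = Double.compare(y.getValue(), x.getValue());
            return byScore != 0 ? byScore : Integer.compare(x.getKey(), y.getKey());
        });
        List<Integer> top = new ArrayList<>();
        for (int i = 0; i < scored.size() && i < TOP_K; i++) {
            top.add(scored.get(i).getKey());
        }
        return top;
    }
}
```

Skipping very frequent terms bounds the number of accumulators, which is what keeps memory use and response time low for queries that contain common words.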
Extra Credit
Create a web-based user interface which collects a query from a user and displays the results.
This can be either a web page or a browser extension (harder).
What to turn in:
A sources.zip file containing your source code.
A report.txt file containing this information:
Names of your team members.
Size of each on-disk data structure that you are using. For example, if you are using 3 binary files to store your data structures, give a description of each file and its size in megabytes.
Approximate size of your main in-memory data structures. For example, if you are keeping a lookup table in memory, you should report the size (in MB) and a description of this data structure.
A sample query that your program can process fast and the amount of time it takes to respond.
A sample query that takes longer to process (compared to other typical queries) and the amount of time it takes to respond.
With an efficient implementation, your program should return results in a fraction of a second. We expect 15 seconds to be the maximum time to wait for a query response. If your program takes longer than that, something is not working right.
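The response times and file sizes asked for in the report could be measured with a small helper like the one below. The class name ReportMetrics and both method names are hypothetical, and the query is passed in as a Supplier only for illustration.

```java
import java.io.File;
import java.util.function.Supplier;

// Hypothetical helper for collecting the numbers that go into report.txt.
final class ReportMetrics {

    // Runs the query once and returns the elapsed wall-clock time in milliseconds.
    static <T> double timeMillis(Supplier<T> query) {
        long start = System.nanoTime();
        query.get();
        return (System.nanoTime() - start) / 1_000_000.0;
    }

    // Size of an on-disk data structure in megabytes (0.0 if the file does not exist).
    static double sizeInMegabytes(File file) {
        return file.length() / (1024.0 * 1024.0);
    }
}
```

A single run can be noisy because of disk caching and JIT warm-up, so timing the same query a few times and reporting a typical value gives a fairer number.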
Make sure to attend the discussion session on March 9th. All of the details will be discussed there.
Your programs will be reviewed by Yasser on Monday, March 16th from 10 am to 12 pm in the ICS third floor lab. If you have a conflict with this time, send him an email (before March 16th) to schedule another time.
Submitting your assignment
Schedule an appointment with Yasser to review your final program.