Submit a postings.txt file in which each line contains the postings of one term, in the format below. All terms must be lowercase and must come from the content (not the tags) of the web page.
<term, cf, df> <docid1, tf1> <docid2, tf2> ....
where cf is the collection frequency of the term (the total number of occurrences of the term across all documents) and df is the document frequency of the term (the number of documents containing the term).
tf1 is the term frequency of the term in the document with docid1, and so on.
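As an illustration of the line format, here is a minimal Python sketch that builds postings from a small in-memory collection and prints one line per term. The names build_postings and format_line, the sample documents, and the tokenizer (lowercase runs of letters and digits) are assumptions made for this example, not part of the assignment; the text passed in is assumed to have had its HTML tags stripped already.

```python
import re
from collections import Counter, defaultdict


def build_postings(docs):
    """Map each term to {docid: tf}. `docs` maps docid -> visible page text."""
    postings = defaultdict(dict)
    for docid, text in docs.items():
        # Assumed tokenizer: lowercase, then take runs of letters and digits.
        counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
        for term, tf in counts.items():
            postings[term][docid] = tf
    return postings


def format_line(term, doc_tfs):
    """Render one postings line: <term, cf, df> <docid1, tf1> <docid2, tf2> ..."""
    cf = sum(doc_tfs.values())  # total occurrences across the collection
    df = len(doc_tfs)           # number of documents containing the term
    pairs = " ".join(f"<{docid}, {tf}>" for docid, tf in sorted(doc_tfs.items()))
    return f"<{term}, {cf}, {df}> {pairs}"


# Hypothetical two-document collection, for illustration only.
docs = {1: "The cat sat", 2: "the cat and the dog"}
postings = build_postings(docs)
lines = [format_line(term, postings[term]) for term in sorted(postings)]
print("\n".join(lines))
# The line for "the" is: <the, 3, 2> <1, 1> <2, 2>
```

Sorting the terms and docids is a choice made here so the output is reproducible; the format above does not state a required order.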
Before uploading this file, truncate it and only keep the LAST 1000 lines.
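One possible way to do the truncation is sketched below in Python. The function names keep_last_lines and truncate_file are hypothetical; the sketch relies on collections.deque with maxlen, which discards older items as new ones arrive, so the whole file is never held in memory at once.

```python
from collections import deque


def keep_last_lines(lines, n=1000):
    """Return the last n items of any iterable of lines."""
    return list(deque(lines, maxlen=n))


def truncate_file(path, n=1000):
    """Rewrite the file at `path` so that only its last n lines remain."""
    with open(path, encoding="utf-8") as f:
        tail = deque(f, maxlen=n)  # lines keep their trailing newlines
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(tail)
```

Calling truncate_file("postings.txt") would then leave only the last 1000 lines in place. A file with fewer than 1000 lines is left unchanged in content.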
Submit a report.txt file with this content:
Total number of distinct terms.
List of the 10 words with the lowest document frequency.
List of the 10 words with the highest document frequency.
First name of member 1 of your team: ...
List of (at most 1000) URLs containing the first name of member 1 of your team: ...
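The report figures can be derived from the same term-to-postings mapping used for postings.txt. The Python sketch below is one way to do it under several assumptions: postings maps each term to a {docid: tf} dictionary, ties in document frequency are broken alphabetically, docid_to_url is a hypothetical lookup table from docid to URL, and "containing" is read as "the page content contains the name as a term". None of these choices is specified by the assignment.

```python
def report_stats(postings, k=10):
    """Return (number of distinct terms, k lowest-df terms, k highest-df terms)."""
    df = {term: len(doc_tfs) for term, doc_tfs in postings.items()}
    # Assumed tie-break: alphabetical order among terms with equal df.
    lowest = sorted(df, key=lambda t: (df[t], t))[:k]
    highest = sorted(df, key=lambda t: (-df[t], t))[:k]
    return len(df), lowest, highest


def urls_for_name(postings, docid_to_url, first_name, limit=1000):
    """Return up to `limit` URLs of documents whose content contains the name."""
    docids = sorted(postings.get(first_name.lower(), {}))
    return [docid_to_url[d] for d in docids[:limit]]


# Hypothetical data, for illustration only.
postings = {
    "the": {1: 1, 2: 2},
    "cat": {1: 1, 2: 1},
    "sat": {1: 1},
    "dog": {2: 1},
    "and": {2: 1},
}
docid_to_url = {1: "http://example.com/a", 2: "http://example.com/b"}

total, lowest, highest = report_stats(postings, k=2)
urls = urls_for_name(postings, docid_to_url, "Dog")
```

The name is lowercased before the lookup because all terms in the index are lowercase.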