1 The entire HTML source is currently saved to allow experimentation with feature extraction methods. As we will describe in Section 5, storing a short summary of each file is sufficient.

2 In addition, it contains a function to retrieve and locally store the HTML source of all links accessible from the current page. Syskill & Webert analyzes the HTML source of a page to determine whether the page matches the user's profile. To avoid network transmission overhead during our experiments, we prefetch all pages. This also ensures that experiments are repeatable if a page changes or is no longer accessible.
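
The prefetching step described in this footnote could be sketched roughly as follows. This is an illustrative sketch only, not the Syskill & Webert implementation: the names `LinkCollector`, `extract_links`, and `prefetch`, the `fetch` callable, and the dictionary used as a local store are all hypothetical. The code collects the link targets from a page's HTML source and stores each linked page once, so that later runs read the stored copy rather than the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collect the href targets of <a> tags in document order (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL and drop any #fragment.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute not in self.links:
                    self.links.append(absolute)


def extract_links(html_source, base_url):
    """Return the distinct absolute URLs linked from html_source."""
    parser = LinkCollector(base_url)
    parser.feed(html_source)
    parser.close()
    return parser.links


def prefetch(links, fetch, cache):
    """Store the source of every link not already cached.

    `fetch` is an assumed callable mapping a URL to its HTML source;
    `cache` is any dict-like local store. Pages already present are left
    untouched, so repeated experiments see the same content.
    """
    for url in links:
        if url not in cache:
            cache[url] = fetch(url)
    return cache
```

In practice `fetch` might wrap `urllib.request.urlopen`, and the cache might be a directory of files rather than an in-memory dictionary; both choices are assumptions made here for brevity.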

3 Although it would be possible to use an incremental learning algorithm, our current implementation is not incremental.

4 We have created a local copy of this page, removing many of the in-line graphics, since access to the full page at the remote site was slow and unreliable.

5 We experimented with a variant of TF-IDF that operated on 128 informative words. It did not perform as well as TF-IDF operating on all words.
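
To make the comparison in this footnote concrete, the sketch below computes TF-IDF weights either over all words or over a restricted vocabulary. It is a minimal illustration under stated assumptions, not the weighting used in the experiments: the function name `tfidf_vectors` is hypothetical, the weight is the common textbook form tf × log(N / df), and the method for choosing the informative words is not specified here, so the vocabulary is simply passed in as a parameter.

```python
import math
from collections import Counter


def tfidf_vectors(documents, vocabulary=None):
    """Return one {word: weight} dict per document.

    documents:  list of token lists, one per page.
    vocabulary: optional set of words to keep (for example, a set of
                informative words); None keeps every word.
    Assumed weighting: term frequency * log(N / document frequency).
    """
    n_docs = len(documents)

    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in documents:
        counts = Counter(tokens)
        vector = {}
        for word, tf in counts.items():
            if vocabulary is not None and word not in vocabulary:
                continue  # word is outside the restricted feature set
            vector[word] = tf * math.log(n_docs / doc_freq[word])
        vectors.append(vector)
    return vectors
```

Calling `tfidf_vectors(docs)` corresponds to operating on all words, while `tfidf_vectors(docs, vocabulary=selected_words)` corresponds to the restricted variant, where `selected_words` stands for whatever set of informative words has been chosen.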

6 This differs slightly from the previous average for naive Bayes because a different set of random examples was used in this experiment.

7 Of course, some pages may be considered interesting to a user not because of their content but because they point to many interesting pages. We do not believe Syskill & Webert users consider such pages interesting, because Syskill & Webert requires a user to associate an index page with each topic, and this index page already contains many links.