Assignment 8. Disease Diagnosis
In this assignment you will apply standard Machine Learning algorithms, as implemented in the Weka suite, to proteomic data. Some classifiers make additional unstated assumptions about the data. If those assumptions are not satisfied, the algorithm will fail, so you may need to try several classifiers.
  1. Weka should be installed on the ICS machines. If you want to run it on your own machine, search for Weka with any search engine and download the free, open-source software. Every time I download the software, it is a bit different: they change the interface and update the algorithms.
  2. Get the file mouse.arff, which is on masterhit. This is a file of proteomic data generated by a Ciphergen machine at UCI in Steve Lipkin's laboratory.
  3. Run the classifiers Naive Bayes, IB1, and J48, plus one more classifier of your choice, and report the results on predicting the disease.
  4. Turn in a *.doc file with the following information. The algorithms should be evaluated via 10-fold cross-validation, unless this is too expensive. Do not simply paste the results from Weka into the *.doc file. Weka produces a lot of output; include only the information that you interpret. At a minimum you should report the generalization accuracy, precision, and recall for each algorithm, which algorithm performs best, and, if possible, why. Accuracy, precision, and recall can all be computed from the confusion matrix. Also, for each algorithm, give a short description (a few sentences) of what the algorithm computes. You may need to do a little background reading for this, but much of it was covered in ics171. You can also find information on the web and in the documentation for Weka.
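
As a rough illustration of how accuracy, precision, and recall follow from a confusion matrix, here is a minimal Python sketch. It assumes the layout Weka prints, with rows as actual classes and columns as predicted classes. The function name `metrics_from_confusion` and the counts in the example are hypothetical; they are not taken from mouse.arff or from any Weka run.

```python
def metrics_from_confusion(matrix, positive=0):
    """Return (accuracy, precision, recall) for one class.

    Assumes matrix[i][j] is the number of instances whose actual class
    is i and whose predicted class is j (rows = actual, columns = predicted).
    `positive` is the index of the class treated as the positive class.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))       # diagonal entries

    true_pos = matrix[positive][positive]
    actual_pos = sum(matrix[positive])                  # row sum
    predicted_pos = sum(row[positive] for row in matrix)  # column sum

    accuracy = correct / total if total else 0.0
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return accuracy, precision, recall


if __name__ == "__main__":
    # Hypothetical two-class counts, for illustration only.
    example = [[40, 10],
               [5, 45]]
    acc, prec, rec = metrics_from_confusion(example, positive=0)
    print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
```

Precision and recall are per-class quantities, so with a two-class disease label it is worth stating in the report which class is treated as positive.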