Note: Start this homework early. You may have to use the computers at school, but you can also download the free, open-source Weka software.

Also, read over all the questions before you start; otherwise you may have to repeat some experiments.

In this homework you will use the Weka software to analyze the iris data set, a standard data set used to evaluate statistical algorithms. This data set is provided with Weka.

For each of the algorithms below, report the accuracies using 2-fold and 10-fold cross-validation.

    1. ZeroR, which is the dumbest algorithm of all. It’s the baseline.

    2. k-Nearest-Neighbor (IBk) with k = 1, k = 3, and k = 5. In the IBL folder.

    3. J48: the decision tree algorithm. In the Trees folder.

    4. PART: the algorithm for generating rules. In the Rules folder.

    5. Naïve Bayes: a statistical approach using Bayes' rule. In the Bayes folder.
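
As a reminder of what k-fold cross-validation does with the data, here is a minimal Python sketch (not Weka code). The function name `k_fold_indices` is made up for illustration, and the figure of 150 instances is the size of the standard iris data set. Weka typically also shuffles and stratifies the data before splitting, which this sketch leaves out.

```python
def k_fold_indices(n_instances, k):
    """Return k (train, test) pairs of index lists covering range(n_instances)."""
    indices = list(range(n_instances))
    # Fold sizes differ by at most one when k does not divide n_instances.
    fold_sizes = [n_instances // k + (1 if i < n_instances % k else 0)
                  for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]                    # held-out fold
        train = indices[:start] + indices[start + size:]      # everything else
        folds.append((train, test))
        start += size
    return folds


# Compare how much data each setting trains on (150 = iris data set size).
for k in (2, 10):
    train, test = k_fold_indices(150, k)[0]
    print(f"{k}-fold: train on {len(train)}, test on {len(test)}")
```

Each instance is used for testing exactly once; the accuracy reported for k-fold CV is computed over all k held-out folds.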

    1. Do you expect that 2-fold or 10-fold CV will yield a higher estimate of the accuracy of the algorithm?
    2. Why?
    3. Does your data support this conclusion? Be specific.
  1. Which learning methods produced interpretable results?
  2. From the 10-fold CV data, order the algorithms by accuracies.
  3. For the remaining questions, only consider the decision tree algorithm with 10-fold CV. Report the confusion matrix for the decision tree algorithm.
  4. List, in order, the classes predicted with highest precision, i.e. the probability that the example was of class "a" given that the algorithm predicted it was of class "a". Show how the probabilities were computed.
  5. List, in order, the classes predicted with highest recall, i.e. the probability that the example was predicted to be of class "a" given that it was of class "a". Show how the probabilities were computed.
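
The precision and recall definitions above can be sketched in Python as follows. The matrix below is a made-up example, not actual Weka output for the iris data; the labels "a", "b", "c" and the function names are hypothetical. It assumes the layout Weka prints, where rows are the actual classes and columns are the predicted classes.

```python
# Hypothetical 3-class confusion matrix (NOT real results).
# Rows = actual class, columns = predicted class.
labels = ["a", "b", "c"]
matrix = [
    [10, 0, 0],
    [1, 8, 1],
    [0, 2, 8],
]


def precision(matrix, j):
    """P(actual = j | predicted = j): diagonal entry over its column sum."""
    predicted_j = sum(row[j] for row in matrix)
    return matrix[j][j] / predicted_j if predicted_j else 0.0


def recall(matrix, i):
    """P(predicted = i | actual = i): diagonal entry over its row sum."""
    actual_i = sum(matrix[i])
    return matrix[i][i] / actual_i if actual_i else 0.0


for idx, label in enumerate(labels):
    print(f"class {label}: precision = {precision(matrix, idx):.3f}, "
          f"recall = {recall(matrix, idx):.3f}")
```

Sorting the classes by these values gives the orderings asked for; in your answer, show the same column-sum and row-sum arithmetic using your own confusion matrix.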