Assignment 2, CS 175: Project in Artificial Intelligence

Due at 12 noon Wednesday October 10th


Instructions for Assignment 2

The goal of this week's assignment is to construct some basic "building blocks" for our learning algorithms, in particular to build a "nearest-neighbor classifier", and to get you further acquainted with MATLAB and its plotting capabilities in particular.

The assignment consists of three parts. In part 1, you will load in two simulated data sets and look at their properties. In part 2, you will implement a k-nearest neighbor classification algorithm, in part 3 you will visualize the errors that this classifier makes, and in part 4 you will test if and how the accuracy of this model can change as a function of k.

Before you begin the programming part of the Assignment you should review the following in the MATLAB documentation. Some of this will be review from the tutorial material you covered in Week 1 and some will be new material. You can find these topics in MATLAB software under the "Help -> MATLAB Help" menu item.


Part 1: Classification Data Sets

You need to download two different ".mat" files called simdata1.mat and simdata2.mat. You should save each of these in your local working directory. The two files consist of two different classification data sets, to be referred to as SimulatedData1 and SimulatedData2. Each data set is stored as a binary MATLAB file, simdata1.mat and simdata2.mat. You can load them into MATLAB simply by typing (for simdata1 for example):
>> load simdata1;
>> simdata1

simdata1 = 

      shortname: 'Simulated Data 1'
    numfeatures: 2
     classnames: [2x6 char]
     numclasses: 2
    description: [1x66 char]
       features: [200x2 double]
    classlabels: [200x1 double]

>> simdata1.numclasses

ans =

     2

>> simdata1.description

ans =

Simulated separable 2-class 2-dimensional data, CS 175, Fall 2007
where here we see that simdata1 is a MATLAB structure with several different fields. The fields tell us that "simdata1" has 2 classes, 2 features, and contains some other background information. simdata1.features contains the actual feature values, which has dimension 200 rows by 2 columns. Each row is a particular example, the  two columns are the two feature values for this feature (the "feature vector"). simdata1.classlabels contains the classlabels for the corresponding features in simdata1.features, i.e., the first entry is the class label for the first row in simdata1.features and the class labels are assumed to take value either 1 or 2.

Your first assignment is to load in the data, and write a simple function which takes as input a data structure with the same fields as "simdata1" above, and produces a figure which plots a 2-dimensional plot of the feature vectors, i.e., the x-axis is feature 1 (column 1) and the y-axis is feature 2 (column 2) (this is called a "scatter plot"). The function should also take 2 optional arguments, namely 2 integers which specify which features (columns) are to be plotted: this is useful for data sets where there are more than 2 features, since we can only plot two at a time in this manner. If these two arguments are not specified, the function should choose features 1 and 2 to plot by default.

The function should also plot the points from each class in different colors (you can assume if you wish that there are only 2 classes, named "1" and "2": most of our project assignments will likely only involve 2-class problems). The commands

>> plot(d(:,x), d(:,y), 'r*')
will plot the values of feature y against feature x, using red "stars". You will need to figure out how to generalize this so that it plots data points from class 1 in one color and data points from class 2 in another color. The "hold" command is useful for keeping the results of several plot commands on the screen. You will also need to use the logical subscripting feature of MATLAB to find which examples belong to which class.

Your function needs have the following header information (you need to give it this exact name and have these arguments in this order).

  function classplot(data, x, y);
  %   function classplot(data, x, y);
  %
  %  brief description of what the function does
  %  ......
  %                            Your Name, CS 175, date
  %
  %  Inputs
  %    data:  (a structure with the same fields as described above:
  %           your comment header should describe the structure explicitly)
  %           Note that if you are only using certain fields in the structure
  %           in the function below, you need only define these fields in the input comments
 
      --------   Your code goes here -------

Feel free to add other features such as labeling the axis, adding a title to the plot, etc.

Once you have written this function you should be able to call this function with as classplot(simdata1, 1, 2)  and see that  the data from 2 classes can be perfectly separated in the first dimension (the x-axis).  If you call classplot(simdata2, 1, 2)  you should see that the 2 classes are now overlapped and each cloud of points has a"non-linear" shape, and the 2 classes can not be separated with a single line.
 

Part 2: k-Nearest Neighbor Classifier

You are to implement a MATLAB function which implements the k-NN classifier. The pseudocode for this algorithm is roughly as follows. Note that there is no "training" as such for this classifier, i.e., it just takes a training data set and uses it as a "memory" for classifying new feature vectors.
   Read in the training data set Dtrain

       y = feature vector to be classified

       kneighbors =  k-nearest neighbors to y in Dtrain

       kclasses = class values of the kneighbors

       kvote = the most common target value in kclasses

       predicted_class(y) = kvote
This is a fairly high-level description. The only tricky part is finding the k-neighbors. You should use the Euclidean distance to define "closeness". Also remember that in Assignment 1 you already implemented a function to find the single closest neighbor, so you just need to generalize that function to find the k nearest neighbors.

You are to implement a MATLAB function which takes as input a labeled training data set (with labels) and a test data set (without labels) on which you will make predictions. The function returns the class predictions for the test data set. Note that to call this code you will need to extract the feature data matrix and the class label vector from the general structure in the last section - you may want to write a simple additional function that will do this for you.

  function [class_predictions] = knn(traindata,trainlabels,k, testdata)
  % function [class_predictions] = knn(traindata,trainlabels,k, testdata)
  %
  %  a brief description of what the function does
  %  ......
  %                            Your Name, CS 175, date
  %
  %  Inputs
  %     traindata: a N1 x d vector of feature data (the "memory" for kNN)
  %     trainlabels: a N1 x 1 vector of classlabels for traindata
  %     k:     an odd positive integer indicating the number of neighbors to use
  %     testdata: a N2 x d vector of feature data for testing the knn classifier 
  %
  %  Outputs
  %     class_predictions: N2 x 1 vector of predicted class values

      --------   Your code goes here -------
Please make sure that you implement your function so that it accepts input arguments in exactly the format specified above.   
 
 

Part 3: Plotting the Errors for a k-Nearest Neighbor Classifier

In this section of the assignment you will write a function that (1) builds a kNN classifier (using the code from Section 2), then (2) calculates the error rate for the classifier and prints this to the screen, and finally (3) plots the testdata using the first 2 features where the data from each class is color-coded (same as in the code for the first part of this assignment) and then also overlays circles on each of the data points that were misclassified by the classifier. The top few lines of your code should look like:

function knn_plot(traindata,trainlabels,k,testdata,testlabels);
% function knn_plot(traindata,trainlabels,k,testdata,testlabels);
%
% Predicts class-labels for the data in testdata using the k nearest
% neighbors in traindata, and then plots the data (using the first
% 2 dimensions or first 2 features), displaying the data from each
% class in different colors, and overlaying circles on the points
% that were incorrectly classified.
%
%  Inputs
%     traindata: a N1 x d vector of feature data (the "memory" for kNN)
%     trainlabels: a N1 x 1 vector of classlabels for traindata
%     k:     an odd positive integer indicating the number of neighbors to use
%     testdata: a N2 x d vector of feature data for testing the knn classifier 
%     trainlabels: a N2 x 1 vector of classlabels for traindata

You may find it useful to use the following in plotting the circles:
plot(xvalues, yvalues,'bo','MarkerSize',10);

You should then test this code with both simdata1 and simdata2 with different values of k, and where you split the data in simdata1 and simdata2 into training and testing subsets (e.g., the first 50% of the rows of the feature data for training, the 2nd 50% for testing - or randomly partition the rows into training and testing subsets). You should find that there are no errors produced on simdata1 (for different k-values and different partitions of the data) but that the error rate for simdata2 is typically 7 to 10% or higher.
 

Part 4: Calculating the Error-Rate of the k-Nearest Neighbor Classifier on SimulatedData2, as the number of neighbors k is varied

You are to implement a function that will take in a training data set with labels, a test data set with labels, and calculate the errors of the k-nearest-neighbor classifier for different values of k, k going from 1, 3, 5, 7, to a value kmax. The function returns a vector of error-rates of the classifier on the test data set for each value of k. The error rate is defined as the number of points incorrectly classified divided by the total number of points, and expressed as a percentage. The function also takes an optional parameter "plotflag", which if set to 1 causes the function to plot the accuracy as a function of k. The pseudocode for this is as follows:
 
       Read in the training data set Dtrain, and Dtest
       For k = 1, 3, 5, ... Kmax (odd numbers)   

             classify each point in Dtest using the k nearest neighbors in Dtest
             error_k = 100*(number of points incorrectly classified)/(number of points in Dtest)
       end

You are to implement a MATLAB function which implements this pseudocode.  Your MATLAB function must have this input/output format:
  function [errors] = knn_error_rates(traindata,trainlabels, testdata, testlabels,kmax,plotflag)
  % function [errors] = knn_error_rates(traindata,trainlabels, testdata, testlabels,kmax,plotflag)
  %
  %  a brief description of what the function does
  %  ......
  %                            Your Name, CS 175, date
  %
  %  Inputs
  %     traindata: a N1 x d vector of feature data (the "memory" for kNN)
  %     trainlabels: a N1 x 1 vector of classlabels for traindata
  %     testdata: a N2 x d vector of feature data for testing the knn classifier 
  %     testlabels: a N2 x 1 vector of classlabels for traindata
  %     kmax: an odd positive integer indicating the maximum number of neighbors 
  %     plotflag: (optional argument) if 1, the error-rates versus k is plotted,
  %                                   otherwise no plot.
  %
  %  Outputs
  %     errors: r x 1 vector of error-rates on testdata, where r is the
  %                 number of values of k that are tested.

      --------   Your code goes here -------
Please make sure that you implement your function so that it accepts input arguments in exactly the format specified above. Note that there are many possible tricks in terms of trying to speed code like this up. You could for example calculate the Kmax nearest neighbors first, store the results, and the just use these results to generate the results for kmax-2, kmax-4, (without recalculating distances), and so on down to k = 1. However, you should at least use vectorization where possible, e.g., in computing the Euclidean distance between a vector and each row in a matrix. The less efficient you make your code, the longer it will take to run the experiment below.

Traindata and testdata are to be defined as follows. Traindata is the first 1000 rows of simdata2.features and testdata is the second 1000 rows, i.e.,

traindata = simdata2.features(1:1000,:);

testdata = simdata2.features(1001:2000,:);

(trainlabels and testlabels are defined as the corresponding class label values in simdata2)

You are to use your code to calculate the error-rates on the test data for all values of k = 1,3, 5, .. 75 using these training and test data sets.
Write a brief one-page summary in of your interpretation of the results in Part 4, i.e., how the classification error-rate varies (or does not vary) as a function of k. Include in your document a graph of test error-rates versus k (k=1, 3, 5, .... 49),  produced by your code in Part 3 (e.g., you can simply cut and paste the graph from MATLAB to Word, just select "Copy Figure" under the "Edit" menu in the Figure window in MATLAB).
 

 

Part 5 (Optional, Extra Credit) Plotting the Classification Boundaries of the k-Nearest Neighbor Classifier:
A useful function would be to automatically plot the decision boundary that is implicitly defined by a training data set, i.e., given a training data set with class labels (for up to 5 classes) it draws the resulting decision boundary for a kNN classifier (where k is specified by the user). You may want to look at the voronoi.m function in MATLAB to see if it provides any ideas on how to do this. Your function should be called knn_decision_boundaries.m

Note: Part 5  is for extra-credit only - if the program works completely correctly and is well-documented you will get 1 additional bonus point (out of 100) towards your grade in the class. No partial credit on this. You should only attempt this if you are confident that you understand MATLAB well at this point and have completed and tested all of the other parts of this assignment,)


What to Turn In