The assignment consists of four parts, plus an optional extra-credit
part. In Part 1, you will load in two simulated data sets and look at
their properties. In Part 2, you will implement a k-nearest neighbor
classification algorithm. In Part 3, you will visualize the errors that
this classifier makes, and in Part 4 you will test if and how the
accuracy of this model changes as a function of k.
Before you begin the programming part of the Assignment you should
review the following in the MATLAB documentation. Some of this will be
review from the tutorial material you covered in Week 1 and some will
be new material. You can find these topics in MATLAB software under the "Help -> MATLAB Help" menu item.
>> load simdata1;
>> simdata1
simdata1 =
    shortname: 'Simulated Data 1'
    numfeatures: 2
    classnames: [2x6 char]
    numclasses: 2
    description: [1x66 char]
    features: [200x2 double]
    classlabels: [200x1 double]
>> simdata1.numclasses
ans =
    2
>> simdata1.description
ans =
Simulated separable 2-class 2-dimensional data, CS 175, Fall 2007
Here we see that simdata1 is a MATLAB structure with several different fields. The fields tell us that "simdata1" has 2 classes and 2 features, and contains some other background information. simdata1.features contains the actual feature values and has dimension 200 rows by 2 columns. Each row is a particular example; the two columns are the two feature values for that example (the "feature vector"). simdata1.classlabels contains the class labels for the corresponding rows in simdata1.features, i.e., the first entry is the class label for the first row in simdata1.features. The class labels are assumed to take the value 1 or 2.
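To make the field layout concrete, here is a minimal sketch that builds a small stand-in structure with the same field names and reads values out of it. The variable name toy and all of its values are invented for illustration; they are not the contents of simdata1.

```matlab
% Hypothetical stand-in for simdata1 (all values invented for illustration)
toy.shortname   = 'Toy Data';
toy.numfeatures = 2;
toy.classnames  = ['class1'; 'class2'];   % 2x6 char array, one name per row
toy.numclasses  = 2;
toy.description = 'Four made-up points in two classes';
toy.features    = [0.1 0.2; 0.3 0.1; 2.0 2.1; 2.2 1.9];   % 4 examples x 2 features
toy.classlabels = [1; 1; 2; 2];                            % one label per row

[numexamples, numcols] = size(toy.features);   % rows = examples, columns = features
third_example = toy.features(3, :);            % feature vector of example 3
third_label   = toy.classlabels(3);            % class label of example 3
```

Row i of the features field and entry i of the classlabels field always describe the same example, which is why they can be indexed with the same row number.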
Your first task is to load in the data and write a simple function that takes as input a data structure with the same fields as "simdata1" above and produces a 2-dimensional plot of the feature vectors (a "scatter plot"), i.e., the x-axis is feature 1 (column 1) and the y-axis is feature 2 (column 2). The function should also take 2 optional arguments, namely 2 integers that specify which features (columns) are to be plotted. This is useful for data sets with more than 2 features, since we can only plot two at a time in this manner. If these two arguments are not specified, the function should plot features 1 and 2 by default.
The function should also plot the points from each class in different colors (you can assume, if you wish, that there are only 2 classes, named "1" and "2"; most of our project assignments will likely involve only 2-class problems). The command
>> plot(d(:,x), d(:,y), 'r*')
will plot the values of feature y against feature x, using red "stars". You will need to figure out how to generalize this so that it plots data points from class 1 in one color and data points from class 2 in another color. The "hold" command is useful for keeping the results of several plot commands on the screen. You will also need to use the logical subscripting feature of MATLAB to find which examples belong to which class.
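As an illustration of logical subscripting combined with "hold", the following sketch plots two classes in different colors. It is only one possible way to do it, and the matrix d, the vector labels, and the mask names in1 and in2 are made-up names and values, not part of the assignment data.

```matlab
% Made-up data: d is the feature matrix, labels holds the class of each row
d      = [0.1 0.2; 0.3 0.1; 2.0 2.1; 2.2 1.9; 0.2 0.4];
labels = [1; 1; 2; 2; 1];
x = 1;  y = 2;                       % columns (features) to plot

in1 = (labels == 1);                 % logical mask: true for class-1 rows
in2 = (labels == 2);                 % logical mask: true for class-2 rows

figure;
plot(d(in1, x), d(in1, y), 'r*');    % class 1 as red stars
hold on;                             % keep the first plot on the screen
plot(d(in2, x), d(in2, y), 'bo');    % class 2 as blue circles
hold off;
xlabel(sprintf('feature %d', x));
ylabel(sprintf('feature %d', y));
```

Indexing with a logical mask such as d(in1, x) selects only the rows where the mask is true, so no explicit loop over the examples is needed.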
Your function needs to have the following header information (you need to give it this exact name and have these arguments in this order).
function classplot(data, x, y);
% function classplot(data, x, y);
%
% brief description of what the function does
% ......
% Your Name, CS 175, date
%
% Inputs
% data: (a structure with the same fields as described above:
% your comment header should describe the structure explicitly)
% Note that if you are only using certain fields in the structure
% in the function below, you need only define these fields in the input comments
-------- Your code goes here -------
Feel free to add other features such as labeling the axes, adding a title to the plot, etc.
Read in the training data set Dtrain
y = feature vector to be classified
kneighbors = k-nearest neighbors to y in Dtrain
kclasses = class values of the kneighbors
kvote = the most common target value in kclasses
predicted_class(y) = kvote
This is a fairly high-level description. The only tricky part is finding the k nearest neighbors. You should use the Euclidean distance to define "closeness". Also remember that in Assignment 1 you already implemented a function to find the single closest neighbor, so you just need to generalize that function to find the k nearest neighbors.
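The pseudocode above can be sketched for a single feature vector as follows. This is a minimal illustration on made-up data, not a reference solution: the training points, labels, and the choice of sort plus mode are assumptions made for the example.

```matlab
% Made-up training set: 6 examples, 2 features, class labels 1 or 2
Dtrain      = [0 0; 0 1; 1 0; 5 5; 5 6; 6 5];
trainlabels = [1; 1; 1; 2; 2; 2];
y = [0.5 0.5];                        % feature vector to be classified
k = 3;

% Euclidean distance from y to every row of Dtrain, without a loop
diffs = bsxfun(@minus, Dtrain, y);    % subtract y from each row
dists = sqrt(sum(diffs.^2, 2));       % N x 1 column of distances

[~, order] = sort(dists);             % ascending order, so nearest first
kclasses = trainlabels(order(1:k));   % class labels of the k nearest neighbors
kvote    = mode(kclasses);            % most common label among them
```

When two labels are equally common, mode returns the smaller one; with two classes and an odd k such ties cannot occur.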
You are to implement a MATLAB function that takes as input a labeled training data set and an unlabeled test data set on which you will make predictions. The function returns the class predictions for the test data set. Note that to call this code you will need to extract the feature data matrix and the class label vector from the general structure described in the previous section; you may want to write a simple additional function that does this for you.
function [class_predictions] = knn(traindata,trainlabels,k, testdata)
% function [class_predictions] = knn(traindata,trainlabels,k, testdata)
%
% a brief description of what the function does
% ......
% Your Name, CS 175, date
%
% Inputs
% traindata: a N1 x d matrix of feature data (the "memory" for kNN)
% trainlabels: a N1 x 1 vector of classlabels for traindata
% k: an odd positive integer indicating the number of neighbors to use
% testdata: a N2 x d matrix of feature data for testing the knn classifier
%
% Outputs
% class_predictions: N2 x 1 vector of predicted class values
-------- Your code goes here -------
Please make sure that you implement your function so that it accepts input arguments in exactly the format specified above.
Read in the training data set Dtrain and the test data set Dtest
For k = 1, 3, 5, ... Kmax (odd numbers)
classify each point in Dtest using the k nearest neighbors in Dtrain
error_k = 100*(number of points incorrectly classified)/(number of points in Dtest)
end
You are to implement a MATLAB function which implements this pseudocode. Your MATLAB function must have this input/output format:
function [errors] = knn_error_rates(traindata,trainlabels, testdata, testlabels,kmax,plotflag)
% function [errors] = knn_error_rates(traindata,trainlabels, testdata, testlabels,kmax,plotflag)
%
% a brief description of what the function does
% ......
% Your Name, CS 175, date
%
% Inputs
% traindata: a N1 x d matrix of feature data (the "memory" for kNN)
% trainlabels: a N1 x 1 vector of classlabels for traindata
% testdata: a N2 x d matrix of feature data for testing the knn classifier
% testlabels: a N2 x 1 vector of classlabels for testdata
% kmax: an odd positive integer indicating the maximum number of neighbors
% plotflag: (optional argument) if 1, the error-rates versus k is plotted,
% otherwise no plot.
%
% Outputs
% errors: r x 1 vector of error-rates on testdata, where r is the
% number of values of k that are tested.
-------- Your code goes here -------
Please make sure that you implement your function so that it accepts input arguments in exactly the format specified above. Note that there are many possible tricks for speeding up code like this. You could, for example, calculate the Kmax nearest neighbors first, store the results, and then just use these stored results to generate the predictions for kmax-2, kmax-4, and so on down to k = 1, without recalculating distances. At a minimum, you should use vectorization where possible, e.g., in computing the Euclidean distance between a vector and each row in a matrix. The less efficient you make your code, the longer it will take to run the experiment below.
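The idea of computing the kmax nearest neighbors once and reusing them for every smaller k can be sketched as below. This is one possible arrangement on made-up data, not the required implementation; the variable names nearest and ks are invented for the example, and squared distances are used because they produce the same neighbor ordering as Euclidean distances.

```matlab
% Made-up data: two well-separated classes
traindata   = [0 0; 0 1; 1 0; 1 1; 5 5; 5 6; 6 5; 6 6];
trainlabels = [1; 1; 1; 1; 2; 2; 2; 2];
testdata    = [0.5 0.5; 5.5 5.5; 0.2 0.9; 5.8 5.1];
testlabels  = [1; 2; 1; 2];
kmax = 5;

ks = (1:2:kmax)';                      % odd values k = 1, 3, 5
N2 = size(testdata, 1);
nearest = zeros(N2, kmax);             % labels of the kmax nearest neighbors, per test point

for i = 1:N2
    diffs = bsxfun(@minus, traindata, testdata(i, :));
    [~, order] = sort(sum(diffs.^2, 2));           % sort once per test point
    nearest(i, :) = trainlabels(order(1:kmax))';   % store labels, nearest first
end

errors = zeros(numel(ks), 1);
for j = 1:numel(ks)
    predictions = mode(nearest(:, 1:ks(j)), 2);    % vote over the first k stored labels
    errors(j) = 100 * sum(predictions ~= testlabels) / N2;
end
```

Because the stored labels are ordered from nearest to farthest, the first k columns of nearest are exactly the k nearest neighbors for any k up to kmax, so the distances are computed only once.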
Traindata and testdata are to be defined as follows. Traindata is the first 1000 rows of simdata2.features and testdata is the second 1000 rows, i.e.,
traindata = simdata2.features(1:1000,:);
testdata = simdata2.features(1001:2000,:);
(trainlabels and testlabels are defined as the corresponding class label values in simdata2)
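The split, including the labels, can be sketched as follows. So that the snippet runs on its own, it uses a randomly generated stand-in structure named sim rather than simdata2; the stand-in and its values are assumptions for illustration, and only the features and classlabels field names are taken from the structure described earlier.

```matlab
% Hypothetical stand-in for simdata2: 2000 random examples with 2 features
sim.features    = randn(2000, 2);
sim.classlabels = 1 + (rand(2000, 1) > 0.5);   % random labels, each 1 or 2

traindata   = sim.features(1:1000, :);         % first 1000 rows
testdata    = sim.features(1001:2000, :);      % second 1000 rows
trainlabels = sim.classlabels(1:1000);         % labels for the same rows as traindata
testlabels  = sim.classlabels(1001:2000);      % labels for the same rows as testdata
```

Using the same row ranges for features and labels keeps each example paired with its own class label.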
You are to use your code to calculate the error-rates on the test data
for all values of k = 1, 3, 5, ..., 75 using these training and test
data sets.
Write a brief one-page summary of your interpretation of the results in
Part 4, i.e., how the classification error-rate varies (or does not
vary) as a function of k. Include in your document a graph of test
error-rates versus k (k = 1, 3, 5, ..., 49), produced by your code in
Part 3 (e.g., you can simply cut and paste the graph from MATLAB to
Word: just select "Copy Figure" under the "Edit" menu in the Figure
window in MATLAB).
Part 5 (Optional, Extra Credit) Plotting the Classification Boundaries
of the k-Nearest Neighbor Classifier:
A useful function would be one that automatically plots the decision
boundary that is implicitly defined by a training data set, i.e., given
a training data set with class labels (for up to 5 classes) it draws
the resulting decision boundary for a kNN classifier (where k is
specified by the user). You may want to look at the voronoi.m function
in MATLAB to see if it provides any ideas on how to do this. Your
function should be called knn_decision_boundaries.m
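One simple way to see the decision regions, sketched here on made-up data, is to classify every point on a dense grid and display the predicted labels as an image. This is a grid-evaluation approach rather than one based on voronoi.m, and the grid range, grid resolution, and variable names are assumptions chosen for the example.

```matlab
% Made-up training data with two classes
traindata   = [0 0; 0 1; 1 0; 1 1; 3 3; 3 4; 4 3; 4 4];
trainlabels = [1; 1; 1; 1; 2; 2; 2; 2];
k = 3;

% Regular grid covering the training data
[gx, gy] = meshgrid(linspace(-1, 5, 61), linspace(-1, 5, 61));
gridpts  = [gx(:), gy(:)];
gridpred = zeros(size(gridpts, 1), 1);

for i = 1:size(gridpts, 1)
    diffs = bsxfun(@minus, traindata, gridpts(i, :));
    [~, order] = sort(sum(diffs.^2, 2));            % nearest training points first
    gridpred(i) = mode(trainlabels(order(1:k)));    % kNN vote for this grid point
end
Z = reshape(gridpred, size(gx));                    % predicted label at each grid cell

figure;
imagesc(gx(1, :), gy(:, 1), Z);                     % one color per predicted class
axis xy;                                            % put the y-axis origin at the bottom
hold on;
plot(traindata(trainlabels == 1, 1), traindata(trainlabels == 1, 2), 'k*');
plot(traindata(trainlabels == 2, 1), traindata(trainlabels == 2, 2), 'ko');
hold off;
```

The edges between colored regions approximate the decision boundary, and a finer grid gives a sharper approximation at the cost of more classifications.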
Note: Part 5 is for extra credit only. If the program works completely
correctly and is well-documented, you will get 1 additional bonus point
(out of 100) towards your grade in the class. There is no partial
credit for this part. You should only attempt it if you are confident
that you understand MATLAB well at this point and have completed and
tested all of the other parts of this assignment.