For the experiments below you will use the same type of data we used to train the nearest neighbor classifier in Assignment 2. The data can be found in the files sampledata1.mat and sampledata2.mat. sampledata1 contains a dataset of size 20 x 2, with a vector of targets of size 20 x 1 (with values +1 and -1, corresponding to the 2 classes), and sampledata2 contains a dataset of size 40 x 2 and targets of size 40 x 1.
Part 1: Calculating Perceptron Outputs and Errors
You will implement a function that takes a set of weights and a set of feature vectors and returns the classification decisions as calculated by the perceptron using those weights. This is a simple function. Your weight vector should have dimensions 1 x (d+1), i.e., a row vector with d+1 elements. Here d is the number of features, and components 1 to d of the weight vector correspond to features 1 to d. The (d+1)th component of the weight vector is the weight for the constant input (where the constant input is set to 1).
The other argument to this function is a matrix called "data", of dimension N x (d+1), where the first d columns are the d different features and the (d+1)th column is a column of all 1's, for the constant input. It is probably useful if you just define this "augmented" version of the feature data (with the extra column of 1's) early on in your perceptron learning experiments and functions (you can just define it at the MATLAB prompt). N is the total number of training examples.
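As a rough illustration of the "augmented" data matrix and the thresholding step, here is a minimal sketch you could type at the MATLAB prompt. The toy matrix X, the example weights, and the choice to assign an output of exactly 0 to class +1 are all assumptions made for this example; they are not taken from sampledata1.mat or from the class notes.

```matlab
% Hypothetical toy data: N = 4 examples, d = 2 features (not from sampledata1.mat)
X = [0.5 1.0; 2.0 0.3; -1.0 -0.5; -2.0 1.5];
N = size(X, 1);

% Augment with a column of ones for the constant input
data = [X ones(N, 1)];            % N x (d+1)

% Example weight row vector, 1 x (d+1); the last entry multiplies the constant input
weights = [1 -1 0.25];

% Weighted sums for all N examples at once: (N x (d+1)) * ((d+1) x 1) -> N x 1
activations = data * weights';

% Threshold at zero; a sum of exactly 0 is assigned to +1 here (an assumption)
thresholded_outputs = ones(N, 1);
thresholded_outputs(activations < 0) = -1;
```

Computing all N weighted sums with one matrix product avoids an explicit loop over the examples.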
Your function needs to have the following header information (you need to give it this exact name and have these arguments in this order).
function [thresholded_outputs] = perceptron(weights,data)
% function [thresholded_outputs] = perceptron(weights,data)
%
% brief description of what the function does
% ......
% Your Name, CS 175, date
%
% Inputs
% weights: 1 x (d+1) row vector of weights
% data: N x (d+1) matrix of training data
%
% Outputs
% thresholded_outputs: N x 1 vector of thresholded perceptron outputs (each entry must be +1 or -1)
-------- Your code goes here -------
Part 2: Calculating the Classification Error and Mean Square Error (MSE) for a Perceptron
The next function to be implemented takes a vector of weights and a set of data vectors and targets as described earlier, and then calculates both the mean-square error and the classification error of the perceptron outputs when compared to the target values (we will be discussing in class how these are calculated). For the mean-square error calculation you use "unthresholded" outputs, namely sigmoid(weights * inputs) as discussed in class. For the classification error you threshold these real-valued numbers (to get a class prediction, +1 or -1) and then compare this to the true targets. For perceptron models, the true targets will be +1 and -1 (this will be assumed by the function below; if you wish, you could add some code to check that the targets provided as inputs do indeed only take values +1 or -1). If you wish, this function can call perceptron.m.
Your function needs to have the following header information (you need to give it this exact name and have these arguments in this order).
function [cerror, mse] = perceptron_error(weights,data,targets)
% function [cerror, mse] = perceptron_error(weights,data,targets)
%
% brief description of what the function does
% ......
% Your Name, CS 175, date
%
% Inputs
% weights: 1 x (d+1) row vector of weights
% data: N x (d+1) matrix of training data
% targets: N x 1 vector of target values (+1 or -1)
%
% Outputs
% cerror: the percentage of examples misclassified (between 0 and 100), using thresholded perceptron outputs
% mse: the mean-square error (sum of squared errors between the targets and the unthresholded outputs, divided by N)
-------- Your code goes here -------
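The two error measures described above could be computed roughly as follows. This is only a sketch under stated assumptions: the toy data and weights are made up, and tanh is used as the "sigmoid" so that the unthresholded outputs lie between -1 and +1 like the targets. If the sigmoid defined in class is different, substitute that definition.

```matlab
% Hypothetical toy inputs (already augmented with a column of ones)
data    = [1 2 1; -1 -2 1; 2 -1 1; -2 1 1];
targets = [1; -1; 1; 1];
weights = [0.5 0.5 0];

% ASSUMPTION: the "sigmoid" is tanh, so unthresholded outputs lie in (-1, 1)
% and are comparable to the +1/-1 targets.
sigmoid = @(a) tanh(a);

outputs = sigmoid(data * weights');          % N x 1 unthresholded outputs

% Mean-square error: sum of squared errors divided by N
mse = sum((targets - outputs) .^ 2) / numel(targets);

% Classification error: percentage of thresholded outputs that differ from the targets
predicted = ones(size(outputs));
predicted(outputs < 0) = -1;
cerror = 100 * sum(predicted ~= targets) / numel(targets);
```

In this toy example only the fourth point falls on the wrong side of the boundary, so one of the four examples is misclassified.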
Initialize each weight to a small randomly chosen value
iteration=0;
While (convergence_criterion not achieved)
    for i=1:N
        calculate the output of the network for input i
        for j = 1: d+1
            update weight j (see class slides for equation)
        end
    end
    calculate convergence_criterion
    ++iteration
    (optional) plot current location of decision boundary
end
This is fairly high-level pseudo-code. The details on how to update the weights and how to define the convergence criterion will need to be specified by you.
function [weights,mse,acc] = learn_perceptron(data,targets,rate,threshold,init_method,random_seed,plotflag,k)
% function [weights,mse,acc] = learn_perceptron(data,targets,rate,threshold,init_method,random_seed,plotflag,k)
%
% brief description of what the function does
% ......
% Your Name, CS 175, date
%
% Inputs
% data: N x (d+1) matrix of training data
% targets: N x 1 vector of target values (+1 or -1)
% rate: learning rate for the perceptron algorithm (e.g., rate = 0.001)
% threshold: if the reduction in MSE from one iteration to the next is *less*
% than threshold, then halt learning (e.g., threshold = 0.000001)
% init_method: method used to initialize the weights (1 = random, 2 = half
% way between 2 random points in each group, 3 = half way between
% the centroids in each group)
% random_seed: this is an integer used to "seed" the random number generator
% for either methods 1 or 2 for initialization (this is useful
% to be able to recreate a particular run exactly)
% plotflag: 1 means plotting is turned on, default value is 0
% k: how many iterations between plotting (e.g., k = 100)
%
% Outputs
% weights: 1 x (d+1) row vector of learned weights
% mse: mean squared error for learned weights (sum of squared errors divided by N)
% acc: classification accuracy for learned weights (percentage, between 0 and 100)
-------- Your code goes here -------
The input argument "rate" specifies the learning rate of the algorithm. As discussed in class, you may have to experiment a little to find a good setting for the rate for a given problem. If the rate is too small, the algorithm may converge *very* slowly. If the rate is too large it may not converge at all since it may take steps which are too large. Rate values between 0.001 and 0.0001 work reasonably well for the data sets in sampledata1 and sampledata2. Note that if the rate is too large, the weights can quickly diverge to very large values (you will see the mean-squared error grow very quickly), indicating that you should try a smaller value. The mean square error from iteration to iteration will *increase* rather than decrease when this happens, and you should put a check in your algorithm so that it halts if the mean square error increases.
There are several different ways to define a convergence criterion for the algorithm. Essentially one wants to halt the learning once it appears that the perceptron has settled into a minimum of the error surface. This will be reflected by the fact that the mean square error is hardly changing from one iteration to the next. One simple way to implement this is to compare the value of the mean-squared error on one iteration to the value at the previous iteration. If the decrease in error is less than the input argument "threshold", then halt, where "threshold" is a very small value which we can set. The smaller the threshold is, the more stringent we are in determining convergence and the more iterations the algorithm will take. A threshold value of 0.00001 or smaller is fairly typical.
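The rate, the halt-on-increase check, and the threshold-based convergence test can be put together along the following lines. This is a sketch, not the required solution: the toy data, the fixed starting weights, the iteration cap max_iter, the use of tanh as the sigmoid, and the per-example gradient-descent update are all assumptions made for illustration (the class slides give the actual update equation).

```matlab
% Hypothetical, linearly separable toy data (augmented with a column of ones)
data    = [2 1 1; 1 2 1; 2 2 1; -1 -2 1; -2 -1 1; -2 -2 1];
targets = [1; 1; 1; -1; -1; -1];
N       = size(data, 1);

rate      = 0.01;                     % learning rate (illustrative value)
threshold = 1e-6;                     % convergence threshold on the MSE decrease
max_iter  = 10000;                    % safety cap, not part of the assignment

% ASSUMPTIONS: tanh as the sigmoid, and a per-example gradient-descent update
% on the squared error.
weights  = [0.01 -0.01 0.02];         % small fixed starting weights, for repeatability
prev_mse = sum((targets - tanh(data * weights')) .^ 2) / N;
mse_history = prev_mse;

for iteration = 1:max_iter
    for i = 1:N
        x = data(i, :);
        o = tanh(x * weights');
        weights = weights + rate * (targets(i) - o) * (1 - o^2) * x;
    end
    mse = sum((targets - tanh(data * weights')) .^ 2) / N;
    mse_history(end + 1) = mse;
    if mse > prev_mse
        disp('MSE increased: halting (try a smaller rate)');
        break;
    end
    if prev_mse - mse < threshold
        break;                        % decrease smaller than threshold: converged
    end
    prev_mse = mse;
end
```

Storing the MSE at every iteration in mse_history also makes it easy to see afterwards whether a run stopped because it converged or because the error started to grow.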
As discussed in class, we need to select an initial set of weights for the perceptron to start the learning process. To help you with this, you can call the function initialize_weights175.m. You will need to figure out how to call this function (it is quite simple). Basically it gives you 3 different ways to initialize the weights: method 1 is a random set of weights, method 2 selects 2 points randomly (1 from each class) and uses weights that define a decision boundary half way between these points, and method 3 is like method 2 except that instead of 2 random points it selects the centroids (or 2-dimensional mean) from each class. You can experiment with these methods and will find out that there can be quite a difference in how quickly the algorithm converges to a solution depending on where it is started from.
The input argument "random_seed" is an integer that specifies the value which is used by the function to seed the pseudo-random number generator. The random number generator in initialize_weights175.m uses this number in selecting the initial weights (in methods 1 and 2), so specifying the same seed in two different invocations of the algorithm will generate the same set of initial weights (allowing us to repeat exactly a particular experiment).
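The effect of seeding can be checked with a few lines at the MATLAB prompt. The sketch below does not call initialize_weights175.m (its interface is not shown here); it uses MATLAB's rng and randn directly, and the seed value and the 0.01 scaling are arbitrary choices for illustration.

```matlab
random_seed = 42;                    % illustrative seed value

rng(random_seed);                    % seed MATLAB's random number generator
w_first = 0.01 * randn(1, 3);        % small random weights for d = 2

rng(random_seed);                    % same seed again
w_second = 0.01 * randn(1, 3);

same = isequal(w_first, w_second);   % true: the two draws are identical
```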
You can ignore the input arguments plotflag and k until we get to Part 5 below.
Part 4: Plotting the Rate of Convergence of the Algorithm
From plotting the data in Assignment 2 you should have found that the data in sampledata1.mat is linearly separable and that the data in sampledata2.mat is not. So we expect that the perceptron can learn to classify sampledata1 with zero error, but will not be able to get zero classification error on sampledata2.
Now further modify your function learn_perceptron.m so that when plotflag=1, your function plots (on 2 separate figures, after the algorithm has converged) the value of the mean-square error function and the value of the classification accuracy, as a function of the iteration number, i.e., a graph where the x-axis is the number of iterations (going from 1 to the total number of iterations taken by the algorithm) and the y-axis is the mean-square error function or the classification accuracy (one for each graph). To do this you will need to store in an array the value of the mean-squared error, and the accuracy, at each iteration as learning proceeds. This will allow you to see how the perceptron converged to a solution, for a given data set and a given set of control parameters (such as the learning rate). If you wish, include on the plot a text string that prints out the name of the data set being used, the value of the learning rate, the convergence threshold, and any other information you like (this will make it easier to read the plots when you start generating multiple plots for different data sets and different learning rates).
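The two convergence plots might be produced along these lines. The history arrays below are made-up numbers standing in for the values your learning loop would store at each iteration, and the variable names (mse_history, acc_history, dataset_name) are hypothetical.

```matlab
% Hypothetical per-iteration histories; in learn_perceptron these would be
% filled in inside the training loop, one entry per iteration.
mse_history = [0.95 0.60 0.41 0.33 0.30 0.29];
acc_history = [55 70 85 95 100 100];
rate = 0.001;  threshold = 1e-6;  dataset_name = 'sampledata1';

iters = 1:numel(mse_history);
label = sprintf('%s, rate = %g, threshold = %g', dataset_name, rate, threshold);

% Figure 1: mean-square error versus iteration number
figure;
plot(iters, mse_history, '-o');
xlabel('Iteration');  ylabel('Mean-square error');
title(label);

% Figure 2: classification accuracy versus iteration number
figure;
plot(iters, acc_history, '-o');
xlabel('Iteration');  ylabel('Classification accuracy (%)');
title(label);
```

Putting the data set name and control parameters in the title keeps the plots readable once you have several of them open.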
Specifically, if plotflag=1, your code should do the following (in addition to the plots from Part 4):
Iteration 0: create a figure and plot the perceptron decision boundary superposed on the data (where Iteration 0 corresponds to the initial randomly chosen weights)
Iteration k: plot the current perceptron decision boundary superposed on the data
Iteration 2k: plot the current perceptron decision boundary superposed on the data
Iteration 3k: plot the current perceptron decision boundary superposed on the data
and so on until convergence.
You should now again test your learning algorithm on the 2 data sets for the Assignment and plot figures of the decision boundary superposed on the data at (for example) every 50 iterations (i.e., k=50). You should be able to see the decision boundaries improving as the algorithm converges.
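For 2-dimensional data, one way to draw a decision boundary on top of the data is to solve w1*x1 + w2*x2 + w3 = 0 for x2. The sketch below uses made-up data and weights, and it assumes w2 is nonzero (a vertical boundary would need separate handling).

```matlab
% Hypothetical 2-D data (augmented) and weights, for illustration only
data    = [2 1 1; 1 2 1; -1 -2 1; -2 -1 1];
targets = [1; 1; -1; -1];
weights = [1 1 -0.5];                 % boundary: x1 + x2 - 0.5 = 0

pos = targets == 1;
figure;  hold on;
plot(data(pos, 1),  data(pos, 2),  'b+');    % class +1
plot(data(~pos, 1), data(~pos, 2), 'ro');    % class -1

% On the boundary, w1*x1 + w2*x2 + w3 = 0, so x2 = -(w1*x1 + w3) / w2.
% ASSUMPTION: w2 is nonzero.
x1 = [min(data(:, 1)) max(data(:, 1))];
x2 = -(weights(1) * x1 + weights(3)) / weights(2);
plot(x1, x2, 'k-');
xlabel('Feature 1');  ylabel('Feature 2');
hold off;
```

Inside the learning loop, a test such as mod(iteration, k) == 0 could be used to decide when to redraw the boundary.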
Let's say we calculate the change in mean squared error (MSE) and the error has decreased. In this case, we adjust the learning rate as follows:
nu <- beta x nu
i.e., we multiply the current learning rate by beta (where beta > 1) to get a new (larger) learning rate.
If, however, the MSE has not decreased, we adjust the learning rate as follows:
nu <- delta x nu
i.e., we multiply the current learning rate by delta (where delta < 1) to get a new (smaller) learning rate. The hope is that by reducing the learning rate, we can take a smaller step and reduce the MSE.
The simplest way to perform this update is after each iteration, i.e., start out with some initial learning rate, and adjust it after one pass through all the examples. An alternative is to adjust the rate more often (one could adjust it after each example), although adjusting too often may slow down the algorithm since it will require calculating the MSE each time.
The automated adjustment of the learning rate should be enabled by setting an "auto_adjust" parameter to 1. If this flag is set to 0, then there is no adjustment and the same initial value of nu is used throughout learning.
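A per-iteration version of this adjustment might look like the sketch below. The values beta = 1.1 and delta = 0.5, the toy data, the fixed number of 50 passes, and the tanh-based update are all assumptions chosen for illustration; only the rule "grow the rate when the MSE decreased, shrink it otherwise" comes from the description above.

```matlab
% Hypothetical toy data (augmented with a column of ones)
data    = [2 1 1; 1 2 1; -1 -2 1; -2 -1 1];
targets = [1; 1; -1; -1];
N = size(data, 1);

nu          = 0.01;     % initial learning rate
beta        = 1.1;      % ASSUMED growth factor (> 1)
delta       = 0.5;      % ASSUMED shrink factor (< 1)
auto_adjust = 1;        % 1 = adapt nu after each pass, 0 = keep nu fixed

weights    = [0.01 -0.01 0.02];
prev_mse   = sum((targets - tanh(data * weights')) .^ 2) / N;
nu_history = nu;

for iteration = 1:50
    for i = 1:N
        x = data(i, :);
        o = tanh(x * weights');       % ASSUMPTION: tanh as the sigmoid
        weights = weights + nu * (targets(i) - o) * (1 - o^2) * x;
    end
    mse = sum((targets - tanh(data * weights')) .^ 2) / N;
    if auto_adjust == 1
        if mse < prev_mse
            nu = beta * nu;           % error decreased: take larger steps
        else
            nu = delta * nu;          % error did not decrease: take smaller steps
        end
    end
    prev_mse = mse;
    nu_history(end + 1) = nu;
end
```

Keeping nu_history alongside the MSE history lets you see how the rate evolved during a run.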
You can experiment with this method of automatically setting the learning rate. For example, on a single plot, plot the MSE as a function of number of iterations, with the auto-adjust both turned on and turned off (i.e., there will be 2 curves).
Another related general idea is called "line-search". Calculate the
direction to move using the gradient in the usual manner. Then move a
small amount epsilon in that direction and compute the MSE at the new
point in weight-space. Continue to move by small steps epsilon (or
multiple of epsilon) along the same "line" of direction (the same
gradient direction) until the point that the MSE increases: at that
step the algorithm "backs up" to the previous point. A new gradient
("line direction") is then computed at this point, and the algorithm
continues. The main difference between this and the other approach
above is that here we keep moving "along the same line" until the MSE
increases, whereas above we recompute the direction to move (the
gradient) after each step.
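The line-search idea can be sketched as follows. The toy data, the step size epsilon, the caps on the number of lines and steps, and the use of tanh with a batch gradient of the MSE are assumptions for illustration, and mse_of is a hypothetical helper defined only for this example.

```matlab
% Hypothetical toy data (augmented with a column of ones)
data    = [2 1 1; 1 2 1; -1 -2 1; -2 -1 1];
targets = [1; 1; -1; -1];
N = size(data, 1);

% ASSUMPTION: tanh as the sigmoid; mse_of is a hypothetical helper
mse_of = @(w) sum((targets - tanh(data * w')) .^ 2) / N;

weights   = [0.01 -0.01 0.02];
epsilon   = 0.01;                     % small step size along the line
mse_start = mse_of(weights);

for outer = 1:20                      % number of line directions tried (illustrative)
    % Negative gradient of the MSE with respect to the weights, 1 x (d+1)
    o = tanh(data * weights');
    direction = (2 / N) * ((targets - o) .* (1 - o .^ 2))' * data;

    % Keep stepping along the same direction while the MSE keeps decreasing
    current_mse = mse_of(weights);
    for step = 1:1000                 % cap on steps per line (safety)
        candidate     = weights + epsilon * direction;
        candidate_mse = mse_of(candidate);
        if candidate_mse >= current_mse
            break;                    % "back up": keep the previous point
        end
        weights     = candidate;
        current_mse = candidate_mse;
    end
end
mse_end = mse_of(weights);
```

Because a step is accepted only when it lowers the MSE, the error in this sketch never increases from one accepted point to the next.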