Assignment 3: CS 175: Project in Artificial Intelligence

Due at 9:30am Thursday October 18th 2007


Instructions for Assignment 3

In this assignment you will construct a working learning algorithm, which can learn a classification model to classify feature vectors into 2 classes. The classification learning algorithm you will implement is the incremental gradient descent perceptron learning algorithm that we will be covering in class. In part 1 you will implement a function that calculates the perceptron output and error, given a vector of perceptron weights and a data set.  In part 2 you will write a simple function to visualize these weights. In part 3 you will design and implement the incremental gradient descent perceptron learning algorithm and in parts 4 and 5 you will develop some simple functions to display the learning capabilities of your algorithm. Part 6 is an optional extra part of the assignment, where you can  investigate an automated method for adjusting the learning rate online: we will not grade your code for this part of the assignment (however, feel free to turn it in and include it in your report if you wish).

For the experiments below you will use the same type of data we used to train the nearest neighbor classifier in Assignment 2. The data can be found in the files sampledata1.mat and sampledata2.mat. Sampledata1 contains a dataset of size 20 x 2, with a vector of targets of size 20 x 1 (with values +1 and -1, corresponding to the 2 classes), and sampledata2 contains a dataset of size 40x2 and targets of size 40 x 1.
 

Part 1: Calculating Perceptron Outputs and Errors

You will implement a function that takes a set of weights and set of feature vectors and returns the classification decisions as calculated by the perceptron using those weights. This is a simple function. Your weight vector should have dimensions 1 x (d+1), i.e., a row vector with d+1 elements. Here d is the number of features, and components 1 to d of the weight vector correspond to features 1 to d. The d+1th component of the weight vector is the weight for the constant input (where the constant input is set to 1).

The other argument to this function is a matrix called "data", of dimension N x (d+1), where the first d columns are the d different features and the (d+1)th column is a column of all 1's, for the constant input. It is probably useful if you just define this "augmented" version of the feature data (with the extra column of 1's) early on in your perceptron learning experiments and functions (you can just define it at the MATLAB prompt).  N is the total number of training examples.

Your function needs have the following header information (you need to give it this exact name and have these arguments in this order).
 

  function [thresholded_outputs] = perceptron(weights,data)
  % function [thresholded_outputs] = perceptron(weights,data)
  %
  %  brief description of what the function does
  %  ......
  %                            Your Name, CS 175, date
  %
  %  Inputs
  %     weights: 1 x (d+1) row vector of weights
  %     data:  N x (d+1) matrix of training data
  %
  %  Outputs
  %     thresholded_outputs: N x 1 vector of thresholded perceptron outputs (each entry must be +1 or -1)
 
      --------   Your code goes here -------
 

 
 
 

Part 2: Calculating the Classification Error and Mean Square Error (MSE) for a Perceptron

The next function to be implemented takes a vector of weights and a set of data vectors and targets as described earlier, and then calculates both the mean-square error and the classification error of the perceptron outputs when compared to the target values (we will be discussing in class how these are calculated). For the mean-square error calculation you use "unthresholded" outputs, namely sigmoid(weights * inputs) as discussed in class. For the classification error you threshold these real-valued numbers (to get a class prediction, +1 or -1) and then compare this to the true targets. For perceptron models, the true targets will be +1 and -1 (this will be assumed by the function below - if you wish you could add some code to check that the targets provided as inputs do indeed only take valus +1 or -1).. If you wish this function can call perceptron.m

Your function needs have the following header information (you need to give it this exact name and have these arguments in this order).
 

 function [cerror, mse] = perceptron_error(weights,data,targets)
  % function [cerror, mse] = perceptron_error(weights,data,targets)
  %
  %  brief description of what the function does
  %  ......
  %                            Your Name, CS 175, date
  %
  %  Inputs
  %     weights: 1 x (d+1) row vector of weights
  %     data: N x (d+1) matrix of training data    
  %     targets:  N x 1 vector of target values (+1 or -1)
  %                                   
  %  Outputs
  %     cerror: the percentage of examples misclassified (between 0 and 100), using thresholded perceptron outputs
  %     mse:  the mean-square error (sum of squared errors between the targets and the unthresholded outputs, divided by N)
 
      --------   Your code goes here -------
 

 
 
 

Part 3: The Perceptron Learning Algorithm, using Incremental Gradient Descent Learning

You are to implement a MATLAB function which "trains" a perceptron classification model based on incremental gradient descent as discussed in class. To summarize, the pseudocode for this algorithm is roughly as follows (please also refer to our discussion in class and class lecture notes for Lectures 5 and 6):
   Initialize each weight to a small randomly chosen value
   iteration=0;
   While (convergence_criterion  not achieved)
        for i=1:N
        calculate the output of the network for input i
                for j = 1: d+1
                        update weight j (see class slides for equation)
                end
        end
        calculate convergence_criterion
        ++iteration
        (optional) plot current location of decision boundary
   end

This is fairly high-level pseudo-code. The details on how to update the weights and how to define the convergence criterion will need to specified by you.
 
function [weights,mse,acc] = learn_perceptron(data,targets,rate,threshold,init_method,random_seed,plotflag,k)
% function [weights,mse,acc] = learn_perceptron(data,targets,rate,threshold,init_method,random_seed,plotflag,k)
%
% brief description of what the function does
% ......
% Your Name, CS 175, date
%
% Inputs
% data: N x (d+1) matrix of training data
% targets: N x 1 vector of target values (+1 or -1)
% rate: learning rate for the perceptron algorithm (e.g., rate = 0.001)
% threshold: if the reduction in MSE from one iteration to the next is *less*
% than threshold, then halt learning (e.g., threshold = 0.000001)
% init_method: method used to initialize the weights (1 = random, 2 = half
% way between 2 random points in each group, 3 = half way between
% the centroids in each group)
% random_seed: this is an integer used to "seed" the random number generator
% for either methods 1 or 2 for initialization (this is useful
% to be able to recreate a particular run exactly)
% plotflag: 1 means plotting is turned on, default value is 0
% k: how many iterations between plotting (e.g., k = 100)
%
% Outputs
% weights: 1 x (d+1) row vector of learned weights
% mse: mean squared error for learned weights (sum of squared errors divided by N)
% acc: classification accuracy for learned weights (percentage, between 0 and 100)

-------- Your code goes here -------


The input argument "rate" specifies the learning rate of the algorithm. As discussed in class, you may have to experiment a little to find a good setting for the rate for a given problem. If the rate is too small, the algorithm may converge *very* slowly. If the rate is too large it may not converge at all since it may take steps which are too large. Rate values between 0.001 and 0.0001 work reasonably well for the data sets in sampledata1 and sampledata2. Note that if the rate is too large, the weights can quickly diverge to very large values (you will see the mean-squared error grow very quickly), indicating that you should try a smaller value. The mean square error from iteration to iteration will *increase* rather than decrease when this happens, and you should put a check in your algorithm so that it halts if the mean square error increases.

There are several different ways to define a convergence criterion for the algorithm. Essentially one wants to halt the learning once it appears that the perceptron has settled into a global minimum of the error surface. This will be reflected by the fact that the mean square error is hardly changing from one iteration to the next. One simple way to implement this is to compare the value of the mean-squared error on one iteration to the value at the previous iteration. If the decrease in error is less than the input argument "threshold", then halt, where "threshold" is a very small threshold value which we can set. The smaller the threshold is, the more stringent we are in determining convergence and the more iterations the algorithm will take. A threshold value of  0.00001 or smaller  is fairly typical.

As discussed in class, we need to select an initial set of weights for the perceptron to start the learning process. To help you with this, you can call the function initialize_weights175.m. You will need to figure out how to call this function (it is quite simple). Basically it gives you 3 different ways to initialize the weights: method 1 is a random set of weights, method 2 selects 2 points randomly (1 from each class) and uses weights that define a decision boundary half way between these points, and method 3 is like method 2 except that instead of 2 random points it selects the centroids (or 2-dimensional mean) from each class. You can experiment with these methods and will find out that there can be quite a difference in how quickly the algorithm converges to a solution depending on where it is started from.

The input argument  "random_seed" is an integer that specifies the value which is used by the function to seed the pseudo-random number generator. The random number generator in initialize_weights175.m uses this number in selecting the initial weights (in methods 1 and 2), so specifying the same seed in two different invocations of the algorithm will generate the same set of initial weights (allowing us to repeat exactly a particular experiment).

You can ignore the input arguments plotflag and k until we get to Part 5 below.
 
 

Part 4:  Plotting the Rate of Convergence of the Algorithm

From plotting the data in Assignment 2 you should have found that the data in sampledata1.mat  is linearly separable and that the data in sampledata2.mat is not. So we expect that the perceptron can learn to classify simdata1 with zero error, but will not be able to get zero classification error on simdata 2.

Now further modify your function learn_percptron.m so that when plotflag=1, your function plots (on 2 separate figures, after the algorithm has converged) the value of the mean-square error function and the value of the classification accuracy, as a function of the iteration number, i.e., a graph where the x-axis is the number of iterations (going from 1 to the total number of iterations taken by the algorithm) and the y-axis is the mean-square error function or the classification accuracy (one for each graph). To do this you will need to store in an array the value of the mean-squared error, and the accuracy, at each iteration as learning proceeds. This will allow you to see how the perceptron converged to a solution, for a given data set and a given set of control parameters (such as the learning rate). If you wish, include on the plot a text string that prints out the name of the data set being used, the value of the learning rate, the convergence threshold, and any other information you like (this will make it easier to read the plots when you start generating multiple plots for different data sets and different learning rates).
 
 

Part 5: Visualizing the Decision Boundaries

In this part of the assignment you are to further modify your learn_perceptron.m function from above to now be able to plot the location of the decision boundary for the weights every kth iteration.  To assist you in doing this I have provided a simple function called weightplot175.m. You will need  to figure out how to call this function to draw a decision boundary for the perceptron and to show the training data (it is quite straightforward to do this), and then figure out how to call it from within your learn_perceptron.m function, and finally you only want to call it every kth iteration (e.g., k=100), the reason being that if it takes 5000 iterations to converge, you don't want to plot the position of the decision boundary for all 5000 iterations.  This may require some experimentation in MATLAB.

Specifically, if plotflag=1, your code should do the following (in addition to the plots from Part 4)
Iteration 0: create a figure and plot the perceptron decision boundary superposed on the data (where Iteration 0 corresponds to the initial randomly chosen weights)
Iteration k: plot the current perceptron decision boundary superposed on the data
Iteration 2k: plot the current perceptron decision boundary superposed on the data
Iteration 3k: plot the current perceptron decision boundary superposed on the data
and so on until convergence.
 

You should now again test your learning algorithm on the 2 data sets for the Assignment and plot figures of the decision boundary superposed on the data at (for example) every 50 iterations (i.e., k=50). You should be able to see the decision boundaries improving as the algorithm converges.
 
 
 
 

(Optional) Part 6: Automatically Adjusting the Learning Rate

As discussed in class, there is no general theory on what the learning rate should be. Generally speaking, the theory suggests that it should decrease as learning proceeds. One simple heuristic idea is as follows: let delta be some constant between 0 and 1 (delta=0.5 is a useful starting point), and let beta be some number slightly greater than 1 (beta = 1.1 is often used).

Lets say we calculate the change in  mean squared error (MSE) and the error has decreased. In this case, we adjust the learning rate as follows:

     nu  <-   beta x nu
i.e., we multiply the current learning rate by beta to get a new (larger) learning rate.

If, however, NMSE has not decreased we adjust the learning rate as follows:

     nu  <-   delta x nu
i.e., we multiply the current learning rate by delta to get a new (smaller) learning rate. The hope is that by reducing the learning rate, we can take a smaller step and reduce the MSE.

The simplest way to perform this update is after each iteration, i.e., start out with some initial learning rate, and adjust it after one pass through all the examples. An alternative is to adjust the rate more often (one could adjust it after each example), although adjusting too often may slow down the algorithm since it will require calculating the MSE each time.

The automated adjustment of the learning rate should be enabled by setting an "auto_adjust" parameter to 1. If this flag is set to 0, then there is no adjustment and the same initial value of nu is used throughout learning.

You can experiment with this method of automatically setting the learning rate. For example, on a single plot, plot the MSE as a function of number of iterations, with both the auto-adjust turned and turned-off (i.e., there will be 2 curves).

Another related general idea is called "line-search". Calculate the direction to move using the gradient in the usual manner. Then move a small amount epsilon in that direction and compute the MSE at the new point in weight-space. Continue to move by small steps epsilon (or multiple of epsilon) along the same "line" of direction (the same gradient direction) until the point that the MSE increases: at that step the algorithm "backs up" to the previous point. A new gradient ("line direction") is then computed at this point, and the algorithm continues. The main difference between this and the other approach above is that here we keep moving "along the same line" until the MSE increases, whereas above we recompute the direction to move (the gradient) after each step.


What to Turn In