Classification and discovery of local patterns in transactional data

Darya Chudova


Introduction

This is a summary of a final project for ICS 278 (Data Mining) at UC Irvine, taught by Professor Padhraic Smyth in Fall 1999. For more details on this project please read the full final project report.


Background on the Data Set

The transactional data I used for the project record purchases made over a two-year period in the stores of a large retail chain. There are ~500,000 transaction records in total, corresponding to ~190,000 different customers. The data were supplied by a commercial firm and are therefore considered confidential.


Description of the Task

My goal was classification and the discovery of local patterns in the transactional data. The class labels were assigned according to how often a customer made purchases in the store over the two-year period. The inputs were constructed as the frequency of purchases made by each customer in each department or, alternatively, as the customer's relative spending in each department. The data set consisted of 8150 examples with 55 continuous attributes and was built from a subset of 50,000 transactions.
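The two input representations described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the record layout `(customer_id, department, amount)`, the function name `build_features`, and the toy department names are hypothetical and not taken from the confidential data set, and "frequency" is read here as a raw purchase count per department. Label assignment is left out because the summary does not give the rule used.

```python
from collections import defaultdict

def build_features(transactions, departments):
    """Return {customer: (purchase_counts, spend_shares)} per department.

    transactions: iterable of (customer_id, department, amount) tuples
                  (a hypothetical layout, assumed for this sketch).
    departments:  fixed ordering of department names, one attribute each.
    """
    counts = defaultdict(lambda: defaultdict(int))
    spend = defaultdict(lambda: defaultdict(float))
    for customer, department, amount in transactions:
        counts[customer][department] += 1
        spend[customer][department] += amount

    features = {}
    for customer in counts:
        total_spend = sum(spend[customer].values())
        # Representation 1: number of purchases in each department.
        freq = [counts[customer][d] for d in departments]
        # Representation 2: share of the customer's spending per department.
        share = [spend[customer][d] / total_spend if total_spend else 0.0
                 for d in departments]
        features[customer] = (freq, share)
    return features

# Toy records, invented for illustration only.
transactions = [
    ("c1", "shoes", 40.0),
    ("c1", "shoes", 60.0),
    ("c1", "toys", 100.0),
    ("c2", "toys", 25.0),
]
departments = ["shoes", "toys"]
features = build_features(transactions, departments)
```

With 55 departments in `departments`, either vector would give one 55-attribute example per customer, which matches the shape of the data set described above.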


Description of the Data Mining Algorithm

I used the decision tree induction algorithm C5.0, available as an ICS module. C5.0 decision trees create piecewise constant decision boundaries in the original attribute space. I also used C5.0 to produce a set of local rules based on the decision trees.
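To make the link between a tree and its rules concrete, here is a minimal sketch that is not C5.0 itself. It uses a small hand-built tree whose attribute names, thresholds, and class labels are all hypothetical. Each test compares one attribute with a threshold, so every leaf covers an axis-aligned region with a single constant prediction, and each root-to-leaf path can be read off as one rule. C5.0's own rule generation is more involved than this path enumeration.

```python
# Hand-built toy tree; attribute names, thresholds and labels are hypothetical.
tree = {
    "attr": "shoes_share", "threshold": 0.3,
    "left": {"label": "infrequent"},            # shoes_share <= 0.3
    "right": {                                   # shoes_share > 0.3
        "attr": "toys_freq", "threshold": 2,
        "left": {"label": "infrequent"},        # toys_freq <= 2
        "right": {"label": "frequent"},         # toys_freq > 2
    },
}

def predict(node, example):
    """Follow threshold tests from the root until a leaf is reached."""
    while "label" not in node:
        go_left = example[node["attr"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]

def extract_rules(node, conditions=()):
    """Return one (conditions, label) rule per root-to-leaf path."""
    if "label" in node:
        return [(list(conditions), node["label"])]
    attr, thr = node["attr"], node["threshold"]
    return (extract_rules(node["left"], conditions + ((attr, "<=", thr),))
            + extract_rules(node["right"], conditions + ((attr, ">", thr),)))

rules = extract_rules(tree)
for conditions, label in rules:
    text = " and ".join(f"{a} {op} {t}" for a, op, t in conditions)
    print(f"if {text} then {label}")
```

The number of conditions in a rule corresponds to the rule length used as a quality measure in the methodology.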


Experimental Methodology

I used cross-validation to estimate the performance of the classifier on unseen data. With a total of 8150 examples, I used 10 folds, each preserving the proportion of negative and positive examples found in the full data set. To assess the decision trees, I looked at the CV error, the standard deviation of the CV error, and the false negative and false positive rates. To assess the quality of the local rules, I looked at the total number of rules, rule length, and rule confidence.
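The evaluation procedure above can be sketched as follows. This is an illustrative stand-in rather than the tooling used in the project. The helper names (`stratified_folds`, `error_rates`, `cross_validate`), the 0/1 label encoding, and the majority-class learner are assumptions made for the example. The false negative rate is taken as the fraction of positives predicted negative, and the false positive rate as the fraction of negatives predicted positive.

```python
import random
from statistics import mean, stdev

def stratified_folds(labels, k=10, seed=0):
    """Deal the indices of each class round-robin into k folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

def error_rates(y_true, y_pred):
    """Return (error, false_negative_rate, false_positive_rate) for 0/1 labels."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    pos = sum(y_true)
    neg = len(y_true) - pos
    error = (fn + fp) / len(y_true)
    return error, (fn / pos if pos else 0.0), (fp / neg if neg else 0.0)

def cross_validate(X, y, fit, k=10):
    """Return mean and standard deviation of the per-fold error.

    fit(X_train, y_train) must return a function mapping one example
    to a predicted 0/1 label (a hypothetical interface for this sketch).
    """
    fold_errors = []
    for held_out in stratified_folds(y, k):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        preds = [model(X[i]) for i in held_out]
        fold_errors.append(error_rates([y[i] for i in held_out], preds)[0])
    return mean(fold_errors), stdev(fold_errors)

# Placeholder learner: always predicts the majority class of the training set.
def fit_majority(X_train, y_train):
    majority = 1 if 2 * sum(y_train) >= len(y_train) else 0
    return lambda x: majority

# Toy data: 30 positive and 70 negative examples with a dummy attribute.
y = [1] * 30 + [0] * 70
X = [[0.0]] * 100
cv_error, cv_std = cross_validate(X, y, fit_majority, k=10)
```

Because every fold keeps the 30/70 class ratio, the majority-class learner makes the same error on each fold, which is a quick check that the stratification behaves as intended.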


Results and Interpretation

Classification

The high generalization ability of the classifier is supported by the following results:

Local Rules

See also possible extensions of work.


A full archive of the project can be found at 278Report.zip