Classification and discovery of local patterns in transactional data
Introduction
This is a summary of a final project for ICS 278 (Data Mining) at UC Irvine,
taught by Professor Padhraic Smyth in Fall 1999. For more details on this
project, please read the full final project report.
Background on the Data Set
The transactional data I used for the project reflect purchases made over a
two-year period in the stores of a large retail chain. There are a total of
~500,000 transactional records, corresponding to ~190,000 different
customers. The data were supplied by a commercial firm and are therefore
considered confidential.
Description of the Task
I aimed at classification and discovery of local patterns in the transactional
data. The labels for classification were assigned based on how often a
customer made purchases in the store over the two-year period. The input
attributes were constructed either as the frequency of purchases made by each
customer in each department or, alternatively, as the customer's relative
spending in each department. The data set consisted of 8150 examples with 55
continuous attributes and was built from a subset of 50,000 transactions.
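The two attribute encodings can be sketched as follows. This is a minimal
illustration, not the project's actual preprocessing code: the function name
build_features, the (customer, department, amount) record layout, and the toy
data are all assumptions, and "frequency" is taken here to mean a raw purchase
count per department.

```python
from collections import defaultdict

def build_features(transactions, departments):
    """Return {customer: (purchase_counts, spending_shares)} per department."""
    index = {d: i for i, d in enumerate(departments)}
    counts = defaultdict(lambda: [0] * len(departments))
    spend = defaultdict(lambda: [0.0] * len(departments))
    for customer, dept, amount in transactions:
        counts[customer][index[dept]] += 1       # encoding 1: purchase frequency
        spend[customer][index[dept]] += amount   # raw spending, normalized below
    features = {}
    for customer in counts:
        total = sum(spend[customer])
        # encoding 2: share of the customer's total spending per department
        shares = [s / total if total else 0.0 for s in spend[customer]]
        features[customer] = (counts[customer], shares)
    return features

# Hypothetical toy records: (customer_id, department, amount)
transactions = [
    ("c1", "shoes", 30.0),
    ("c1", "shoes", 10.0),
    ("c1", "toys", 60.0),
    ("c2", "toys", 25.0),
]
departments = ["shoes", "toys"]
features = build_features(transactions, departments)
```

In the project each customer would have 55 such attributes, one per
department, under whichever of the two encodings was chosen.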
Description of the Data Mining Algorithm
I used the C5.0 decision tree induction algorithm,
available as an ICS module. C5.0 decision trees create piecewise constant
decision boundaries in the original attribute space. I also used C5.0 to
produce a set of local rules based on the decision trees.
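To make the relation between a tree and its rules concrete, here is a small
sketch that reads one rule off each root-to-leaf path of a hand-written tree.
It is not C5.0's rule-generation procedure, which also simplifies and prunes
rules; the tree, the attribute names, and the dictionary layout are invented
for illustration.

```python
# Hypothetical tree: internal nodes test one attribute against a threshold,
# leaves hold a class label.
tree = {
    "attr": "shoes_share", "threshold": 0.5,
    "left": {"label": "negative"},                # shoes_share <= 0.5
    "right": {                                    # shoes_share > 0.5
        "attr": "toys_count", "threshold": 3,
        "left": {"label": "negative"},
        "right": {"label": "positive"},
    },
}

def predict(node, example):
    """Follow threshold tests to a leaf; the prediction is constant per region."""
    while "label" not in node:
        branch = "left" if example[node["attr"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

def tree_to_rules(node, conditions=()):
    """Return one (conditions, label) rule per root-to-leaf path."""
    if "label" in node:
        return [(list(conditions), node["label"])]
    test_le = (node["attr"], "<=", node["threshold"])
    test_gt = (node["attr"], ">", node["threshold"])
    return (tree_to_rules(node["left"], conditions + (test_le,))
            + tree_to_rules(node["right"], conditions + (test_gt,)))

rules = tree_to_rules(tree)
for conditions, label in rules:
    print(conditions, "->", label)
```

Each rule describes one axis-parallel region of the attribute space, which is
why the tree's output is constant within a region and changes only at the
thresholds.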
Experimental Methodology
I used cross-validation (CV) to estimate the performance of the classifier on
unseen data. For the total of 8150 examples, I used 10 folds, with
each fold preserving the proportion of negative and positive examples found
in the original full data set. To estimate the
performance of the decision trees, I looked at the CV error, the standard
deviation of the CV error, the false negative rate, and the false positive
rate. To estimate the quality of the local rules, I looked at the total number
of rules, rule length, and rule confidence.
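The evaluation procedure can be sketched in plain Python as below. This is an
illustrative reconstruction under stated assumptions, not the code used in the
project: labels are encoded as 1 (positive) and 0 (negative), the standard
deviation is taken across the per-fold errors, and a trivial majority-class
learner (fit_majority, a hypothetical stand-in) replaces C5.0.

```python
import random
from statistics import mean, stdev

def stratified_folds(labels, k=10, seed=0):
    """Assign example indices to k folds, dealing each class out evenly."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)   # round-robin keeps class ratios per fold
    return folds

def error_rates(y_true, y_pred):
    """Return (error, false negative rate, false positive rate) for 0/1 labels."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    pos = sum(y_true)
    neg = len(y_true) - pos
    err = (fn + fp) / len(y_true)
    return err, (fn / pos if pos else 0.0), (fp / neg if neg else 0.0)

def cross_validate(X, y, fit, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, and summarize."""
    per_fold = []
    for held_out in stratified_folds(y, k, seed):
        test = set(held_out)
        train = [i for i in range(len(y)) if i not in test]
        model = fit([X[i] for i in train], [y[i] for i in train])
        preds = [model(X[i]) for i in held_out]
        per_fold.append(error_rates([y[i] for i in held_out], preds))
    errors = [r[0] for r in per_fold]
    return {
        "cv_error": mean(errors),
        "cv_error_std": stdev(errors),
        "fn_rate": mean(r[1] for r in per_fold),
        "fp_rate": mean(r[2] for r in per_fold),
    }

# Hypothetical stand-in learner: always predict the majority training class.
def fit_majority(X_train, y_train):
    majority = int(sum(y_train) * 2 >= len(y_train))
    return lambda x: majority

y = [1] * 30 + [0] * 70
X = [[float(i)] for i in range(100)]
summary = cross_validate(X, y, fit_majority)
```

With 8150 examples and 10 folds, each held-out fold would contain 815
examples with roughly the same class proportions as the full data set.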
Results and Interpretation
Classification
The high generalization ability of the classifier is supported by the
following results:
- High average CV accuracy
- Low variance of classification accuracy across folds
Local Rules
- Good generalization ability: an average of 45 rules for ~7000 patterns
- Relatively short rules: most of the rules don't involve more than 3-4
  departments
- High confidence of rules: half the rules have a confidence greater than 0.95
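As an illustration of the rule-quality measures, the sketch below computes
the length, coverage, and confidence of a single rule on labelled examples.
It assumes that confidence is the fraction of covered examples whose label
matches the rule's prediction and that length is the number of distinct
attributes (departments) tested; C5.0 reports its own confidence estimate,
which may be computed differently. The rule, attribute names, and data are
hypothetical.

```python
def covers(conditions, example):
    """True if the example satisfies every (attribute, op, threshold) test."""
    for attr, op, threshold in conditions:
        value = example[attr]
        if op == "<=" and not value <= threshold:
            return False
        if op == ">" and not value > threshold:
            return False
    return True

def rule_stats(rule, examples, labels):
    """Return (length, coverage, confidence) of one rule."""
    conditions, predicted = rule
    covered = [y for x, y in zip(examples, labels) if covers(conditions, x)]
    correct = sum(1 for y in covered if y == predicted)
    confidence = correct / len(covered) if covered else 0.0
    length = len({attr for attr, _, _ in conditions})  # distinct attributes
    return length, len(covered), confidence

# Hypothetical rule and data
rule = ([("shoes_share", ">", 0.5), ("toys_count", ">", 3)], "positive")
examples = [
    {"shoes_share": 0.9, "toys_count": 5},
    {"shoes_share": 0.7, "toys_count": 4},
    {"shoes_share": 0.6, "toys_count": 8},
    {"shoes_share": 0.8, "toys_count": 6},
    {"shoes_share": 0.2, "toys_count": 9},
]
labels = ["positive", "positive", "negative", "positive", "positive"]
length, coverage, confidence = rule_stats(rule, examples, labels)
```

Here the rule tests two attributes, covers four of the five examples, and is
correct on three of them.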
See also possible extensions of work
A full archive of the project can be found at 278Report.zip