Project Report for ICS
278, Fall 1999
Instructor: Prof. Padhraic
Smyth
Data Description
I worked with transactional data that were supplied by one of the commercial firms and were thus considered confidential. The data reflect the purchases made during two years at a large chain of retail department stores. The information available for each transaction includes:
Task Definition
I used the information on how frequently the customers made purchases in each department or how much money they spent at each department as attributes in my classification task. For any given customer I defined purchasing frequency profile and spending profile, which is described in more detail below. I used C5.0 decision tree inducer to
Algorithm Specification
I used decision tree induction algorithm from the MLC++ library available at UCI. The original paper Induction of Decision Trees by Quinlan describing ID3 algorithm dates back to 1985 and since that time decision tree algorithms are widely used in practice. There has been a wide range of refinements made for the tree induction algorithms. A more recent and comprehensive work that is considered classical in decision trees is Classification and Regression Trees (CART) by Breiman, Friedman and Olshen (1994). MLC++ library contains implementations of the C4.5 and C5.0 decision tree algorithms that both use sophisticated pruning mechanisms.
The high level outline of the algorithm on the continuous attributes is as follows:
Learn Decision Tree (Tree &T){
Grow Tree (&T);}
Prune Tree (&T);
Grow Tree (Tree &T){
While (Not All Leafs of T are Pure){
In the Grow Tree procedure the purity of the leaf node could be judged by for example the entropy or the GINI index. The tree is grown till all training patterns are correctly classified. Once the tree is grown and the classification labels are assigned to its leaves it should be pruned and this is exactly the purpose of the Prune Tree routine. The idea of pruning is that we might sacrifice some of the correctly classified patterns in the training data for the simplicity of the tree, which will make it more biased. A lot of heuristics have been proposed to do the pruning, the most complicated in my opinion being the one proposed by Breiman et al is based on the idea of the internal cross validation.
The main advantage of the decision trees that immediately makes it the number one technique in such spheres as, for example, medical diagnostics and credit card applications evaluation is the interpretability of the created model. Namely, it is easy to read the rules of the type if the FOREIGN_STUDENT=YES and YEARS_IN_COUNTRY < 2 then CREDIT_CARD=NO from the tree. In the example above I used attributes FOREIGN_STUDENT (takes 2 values YES and NO) and YEARS_IN_COUNTRY (integer) and CREDICT_CARD is a classification label taking 2 values YES and NO. It becomes even more important for a doctor in the ER to know how the system came to a certain conclusion but not the final conclusion on its own. Now take for example neural networks that are known to be able to approximate arbitrary function with any prespecified accuracy. Note that decision trees wont have such ability since they are only capable of creating piece wise constant decision boundaries. Neural networks in general will be much more accurate but absolutely uninterpretable which tremendously limits their applicability. In my project Id like to be able to come up with an interpretable model and this clearly makes decision trees the number one choice.
Finally, even if it might not be possible to learn a tree with a satisfying global accuracy one could look at the local regions in the original state space returned by a decision tree where classification can be done with satisfactory accuracy. I considered such an example in my original proposal.
Decision trees also have some limitations that one should be aware of. The trees are usually grown so that the most informative attributes are selected first and the growth continues till the tree that makes no classification errors. The attributes and their splitting values are selected greedily and there is no guarantee that the true best attribute and its splitting value will be selected. Secondly, the complexity of learning is slightly higher than linear in the parameters of the data (namely, N*(log N) where N is the number of data points), so it might be problematic to scale it up for large data sets.
Data Preprocessing
Defining Customer Profile
I considered two sets of input data for classification and rule finding. In both cases, one training pattern corresponds to aggregated information about all transactions of a given customer.
The first data set contains each customers purchasing frequency profiles. For this data set, attribute i, i =1,.., Ndep of training pattern p, p = 1, .., Ncust , fip was calculated as a ratio of number of items purchased by customer p in department i to the total number of items purchased by customer p.
The second data set contains customers spending profiles. For this data set, attribute i, i =1,.., Ndep of training pattern p, p = 1, .., Ncust , Pip was calculated as a ratio of amount spent by customer p in department i to the total amount spent by customer p.Generally speaking, the profiles are defined as relative frequency of purchasing in each department and relative spending in each department respectively. The total number of departments in both data sets Ndep was equal to 55.
I worked with a subset of all transactions in the database, that corresponded to a total of 50000 transactions. Profile based data sets contained Ncust = 8150 patterns.
Frequency Profile Vs. Spending Profile
In order to understand possible differences in the predictive power of each data set, I studied the correlation between corresponding columns in both sets (correlation between frequency of making a purchase in the department and relative spending in the department). The results summarized in a chart below not surprisingly show very high correlation and so the classification accuracy based on two data sets was expected to be very similar. Note, that correlation with a value of 0 here denotes departments with constant frequency / spending across all customers (actually, zero frequency and spending). This might be a poor representation as results may appear deceiving at the first sight...
Fig.1 Frequency / Spending profiles correlation
Data Analysis Results
I used the data sets described above to generate different splits into
training / test sets, and to perform 10-fold cross validation for decision
tree induction algorithm. I was interested in the following cross validated
parameters: average classification error, standard deviation of the CV
error, average number of false positives (infrequent customers classified
as being frequent) and false negatives (frequent customers classified as
infrequent). The last two parameters were of interest since the ratio of
the number of positive examples to the number of negative examples in the
original data set was 0.16. When doing cross validation, I preserved this
ratio within each test / train set.
Classification Results
The experiments show very good performance in terms of the overall accuracy.
An important factor here is not only the value of the error itself, but
also a low standard deviation of error across different folds. This shows
the stability of the classification quality on different train / test splits.
In terms of estimating the accuracy of the classifier I think it would
be better to look at the rate of false positives / false negatives rather
than the overall error because of the unbalanced training / test sets.
The highest fraction of error appears to be the false positive rate (~12%),
which stands for infrequent customers incorrectly classified as frequent.
Note, that missing a frequent customer in this classification is a relatively
rear event (~3.2%), which might be more important in marketing applications,
where there is a need to identify a set of potentially frequent customers
without missing any. Frequency profile showed a slightly better performance
across all folds, which could be attributed to the presence of additional
(higher order) information in these data, not captured by the correlation
analysis above.
| CV Error | Standard Deviation of CV Error | CV False Positive | CV False Negative | |
| Frequency Profile | 4.34% | 0.77% | 12.07% | 3.17% |
| Spending Profile | 4.71% | 0.66% | 12.71% | 3.53% |
Table 1. Cross validated error of decision tree classifier
Local Rules
On average, the algorithm produced 45 rules when trained to classify a set of over 7000 training patterns. Out of those 45 rules, 13 identify infrequent customers. Total number of extracted rules was fairly low and in my opinion this means that the differences between frequent and infrequent customers can be found on a generic enough level. Along with the low CV classification error and low variation of the error across the folds, it allows to conclude that there is a significant amount of predictive power in the input data.
I analyzed frequency of occurrence of each department in the set of rules, i.e. the number of times each department was used as a part of a rule. The most frequent departments in this experiment appeared to be "LADIES SHIRTS" (occurs in 15 rules) and "SHOES" (occurs in 10 rules). It is interesting to note, that "SHOES" department occurred as "has purchases in the department" / "doesn't have purchases in the department" binary test in 9 out of 10 rules. "LADIES SHIRTS" occurred as a binary test in 11 out of 15 rules.
Fig. 2 Occurrence of departments in classification rules
Another interesting parameter of produced rules is their length. The
shorter rules can be viewed as more generic, and the rules produced based
on Customer Spending Profile appear not to involve more than 3-4 departments.
This opens a way to researching methods to identify "sufficient portions"
of customer profile, that could provide enough information to classify
a customer into a particular category. It may happen that a narrow subset
of customer characteristics would be sufficient for a successful classification.
Fig.3 Rule Length (Customer Spending profile)
The last aspect of the rules produced by C5.0 that I'd like to cover
here is the confidence level for each rule. 24 rules out of 45 had a confidence
level greater than 0.95. Only a couple of rules (they appeared to be the
ones with just one department as a left-hand side) had a confidence level
of less than 0.8.
Fig.4 Rule Confidence levels
The work done in the frameworks of the project may be viewed as a basis for further research, especially in the area of extracting local patterns from transactional data. This project was solely devoted to discrimination between frequent and infrequent customers. There might be many more tasks of practical interest posed with respect to this database, for example which items are most often purchased together and so on.
There are also some generic open questions about the quality of extracted rules. These questions include: