From Computer Data to Human Knowledge: A Cognitive Approach to Knowledge Discovery and Data Mining
Michael J. Pazzani
Department Name: Information and Computer Science
Institution: University of California, Irvine
Contact Information
Michael J. Pazzani
Information and Computer Science
444 Computer Science Bldg.
University of California
Irvine, CA 92697-3425
Phone: (949) 824-7405
Fax: (949) 824-4035
Email: pazzani@ics.uci.edu
http://www.ics.uci.edu/~pazzani
WWW PAGE
http://www.ics.uci.edu/~pazzani/CognitiveKDD.html
Project Award Information
Keywords
Knowledge Discovery in Databases; Cognitive Science; Comprehensibility of Learned Models
Project Summary
The research is concerned with intelligent decision aids that can be developed by data mining techniques. Experience has shown that such systems can learn accurate models, but that experts in areas where those models are used in decision aids are often reluctant to trust them because they do not use the same tests, intermediate conclusions, or abstractions that the experts have grown to trust. Experts also want models that are stable under small changes in the data being analyzed. Psychologists have uncovered numerous factors that simplify the learning, understanding, and communication of category and process information by humans. This research seeks to explore these psychological principles in light of the output of existing KDD algorithms and to develop and evaluate new KDD algorithms that will provide output that is easy for people to learn, use, and communicate to others. With the results of this research, it should be possible to make such decision aids more "human centered", so that they will be used more often and more effectively in practice.
This is a joint project between Dorrit Billman (Georgia Institute of Technology) and Michael Pazzani (University of California, Irvine). Billman and Pazzani have complementary abilities that should produce the synergy of interdisciplinary work at its best, along with a common pool of assumptions and knowledge to facilitate communication and interaction. Pazzani is a computer scientist who has done psychology experiments, while Billman is a psychologist who has done computational work. This collaboration will bring cognitive principles into the field of Knowledge Discovery in Databases.
Goals, Objectives, and Targeted Activities
Our activities for the project this year fall under two main themes.
Indication of Success
The primary result this year is a constrained form of regression, Independent Sign Regression, that produces linear models that generalize about as well as those produced by multiple linear regression (see Table 1), yet that subjects are more willing to use (see Table 2).
Table 1. Predictive Mean Squared Error of Regression Routines.
Database | Multiple Linear Regression | Independent Sign Regression
Alzheimer | 0.184 | 0.166
Autompg | 10.6 | 10.5
Baseball | 8.74e+5 | 8.55e+5
CS Dept | 0.244 | 0.213
Housing | 23.7 | 27.6
Pollution | 3.53e+3 | 1.6e+3
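The comparison in Table 1 is easier to read as a relative change per database. The short sketch below does that arithmetic on the values copied from the table. The names `table1` and `relative_change` are introduced here for illustration only and are not part of the project's software.

```python
# Predictive mean squared error from Table 1:
# database -> (Multiple Linear Regression, Independent Sign Regression)
table1 = {
    "Alzheimer": (0.184, 0.166),
    "Autompg": (10.6, 10.5),
    "Baseball": (8.74e5, 8.55e5),
    "CS Dept": (0.244, 0.213),
    "Housing": (23.7, 27.6),
    "Pollution": (3.53e3, 1.6e3),
}


def relative_change(mlr, isr):
    """Fractional change in error when moving from MLR to ISR (negative = lower error)."""
    return (isr - mlr) / mlr


# Databases on which the constrained model has lower / higher error.
lower = [name for name, (mlr, isr) in table1.items() if isr < mlr]
higher = [name for name, (mlr, isr) in table1.items() if isr > mlr]

for name, (mlr, isr) in table1.items():
    print(f"{name:10s} {relative_change(mlr, isr):+.1%}")
print("ISR lower on:", lower)
print("ISR higher on:", higher)
```

On these figures the constrained model has the lower error on five of the six databases, and Housing is the one database where it is higher.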
Table 2. Average Subject Ratings for Linear Equations.
Regression Algorithm | Mean Rating
Multiple Linear Regression | -0.816
Independent Sign Regression | 0.603
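This summary does not spell out the constraint behind Independent Sign Regression. As a minimal sketch, assume that each coefficient must carry the same sign as that predictor's covariance with the target when the predictor is considered on its own. This is one plausible reading of the name, not necessarily the project's implementation. The fitting method (coordinate descent with clipping at zero), the function names, and the toy data are all assumptions made for illustration.

```python
def mean(values):
    return sum(values) / len(values)


def independent_signs(X, y):
    """Sign (+1, -1 or 0) of each predictor's covariance with y, taken alone."""
    y_bar = mean(y)
    signs = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        c_bar = mean(col)
        cov = sum((c - c_bar) * (t - y_bar) for c, t in zip(col, y))
        signs.append((cov > 0) - (cov < 0))
    return signs


def sign_constrained_fit(X, y, sweeps=200):
    """Least squares with each coefficient restricted to its 'independent' sign.

    Hypothetical helper: cyclic coordinate descent on centered data, clipping
    any coefficient that would cross zero against its allowed sign.
    """
    p = len(X[0])
    signs = independent_signs(X, y)
    col_means = [mean([row[j] for row in X]) for j in range(p)]
    y_bar = mean(y)
    cols = [[row[j] - col_means[j] for row in X] for j in range(p)]
    coefs = [0.0] * p
    resid = [t - y_bar for t in y]  # residual of the current (all-zero) model
    for _ in range(sweeps):
        for j in range(p):
            norm = sum(c * c for c in cols[j])
            if norm == 0 or signs[j] == 0:
                continue  # constant or uncorrelated predictor stays at zero
            # Best unconstrained value for coefficient j, others held fixed.
            new = coefs[j] + sum(c * r for c, r in zip(cols[j], resid)) / norm
            if new * signs[j] < 0:
                new = 0.0  # wrong sign: clip to the boundary
            delta = new - coefs[j]
            resid = [r - delta * c for r, c in zip(resid, cols[j])]
            coefs[j] = new
    intercept = y_bar - sum(b * m for b, m in zip(coefs, col_means))
    return intercept, coefs


# Toy data: x2 nearly duplicates x1, and y = 2*x1 - 0.5*x2 exactly, so an
# unconstrained fit gives x2 a negative weight even though x2 rises with y.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [1.2, 1.9, 3.1, 3.8, 5.2, 5.9]
X = [[a, b] for a, b in zip(x1, x2)]
y = [2 * a - 0.5 * b for a, b in zip(x1, x2)]

intercept, coefs = sign_constrained_fit(X, y)
print("independent signs:", independent_signs(X, y))
print("constrained coefficients:", coefs)
```

On the toy data the constrained fit drops the second predictor to zero instead of giving it a negative weight that contradicts its positive relationship with the target. Counterintuitive signs of this kind are the sort of thing that can make an equation harder for an expert to accept.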
Project Impact
Although this research project was initiated only on Sept. 30, 1998, it has already attracted the attention of several corporations: the PI has been invited to speak at Microsoft Research Laboratories in Redmond, WA, and at HNC Software in San Diego, CA.
GPRA Outcome Goals
We believe the Independent Sign Regression algorithm represents an advance that will be applicable to a wide variety of situations, particularly those in which people with little knowledge of the assumptions behind data mining algorithms apply these algorithms to diverse data sets.
Project References
Area Background
The goal of data mining is to investigate algorithms for providing insight into some phenomenon by analyzing a database of examples of that phenomenon. The specific focus of our investigation is to constrain and bias algorithms for creating models of data so that these models are understandable and coherent to users of knowledge discovery and data mining systems.
Large databases are being collected in science, business, and medicine due to advances in methods for collecting, storing, and integrating data. The potential benefit of these rich information sources has scarcely been tapped and the societal effects scarcely envisioned. Not only could this provide new discovery methods in science and new decision-making tools for business, but also new bases for policy making in health or economics, new tools for medical diagnosis, new information about products for consumers, and a host of other possibilities.
In response to the availability of these very large databases, a variety of techniques have been developed and applied to recover useful information. These techniques have drawn from statistics, pattern recognition, machine learning, and neural networks to build models describing regularities in the data. The goal of this modeling is to help people understand the data by discovering predictive or descriptive models. However, to date research in data mining has not paid attention to the cognitive factors that make the resulting models coherent, credible, easy to learn, easy to use, and easy to communicate to others. Without attention to the human user, the social benefits of data mining cannot be fully realized.
We anticipate that principles of human learning and reasoning will guide the design of new data mining algorithms to produce models that are easier for users of KDD systems to understand and that properties of learning algorithms will add to the understanding of human psychological processes.
Area References
Davies, J. & Billman, D. (1996). Consistent Contrast in Unsupervised Learning. In Program of the Eighteenth Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Erlbaum.
Draper, N. & Smith, H. (1981). Applied Regression Analysis. John Wiley & Sons.
Fayyad, U.M., Piatetsky-Shapiro, G., & Smyth, P. (1996). From Data Mining to Knowledge Discovery: An Overview. In Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, Ramasamy Uthurusamy (Eds.): Advances in Knowledge Discovery and Data Mining (pp. 1-34). AAAI/MIT Press.
Kelley, H. (1971). Causal schemata and the attribution process. In E. Jones, D. Kanouse, H. Kelley, R. Nisbett, S. Valins, & B. Weiner (Eds.), Attribution: Perceiving the causes of behavior (pp. 151-174). Morristown, NJ: General Learning Press.
Murphy, G.L. & Allopenna, P.D. (1994). The locus of knowledge effects in concept learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 20, 904-919.
Pazzani, M. (1991a). The influence of prior knowledge on concept acquisition: Experimental and computational results. Journal of Experimental Psychology: Learning, Memory & Cognition, 17, 3.
Spiegelhalter, D., Dawid, P., Lauritzen, S., & Cowell, R. (1993). Bayesian Analysis in Expert Systems. Statistical Science, 8, 219-283.
Tufte, E.R. (1990). Envisioning Information. Cheshire, CT: Graphics Press.