Pointers to Software and Data Sets, ICS 278, Data Mining

Spring 2006

 

Publicly-Available Software

WEKA software: Java-based package containing a variety of data mining and machine learning algorithms.

R statistical computing environment: powerful environment for statistical computing, widely used by statistical researchers.

MATLAB user-contributed programs: a small subset of the many functions and scripts available on the Web for MATLAB.

Statlib/MATLAB routines: some MATLAB programs available from Statlib, e.g., the edatoolbox is quite useful.

KDNuggets software list: extensive list of pointers to software packages for data mining and machine learning (mostly commercial, but some free).

Software for Graphical Models/Bayesian networks: very comprehensive (as of April 2006) comparison of different software packages for graphical models.

JUNG: open-source Java project for graph/network analysis and visualization.

SVMlight: widely used and very efficient implementation of SVM algorithms. Here are some other links to SVM software packages.

Topic Modeling for Text Documents: MATLAB topic modeling code from Mark Steyvers and Tom Griffiths. Also David Blei's implementation of LDA in C, and Yee Whye Teh's hierarchical Dirichlet process modeling code.

MALLET: comprehensive Java-based software for statistical natural language processing.

Publicly-Available Data Sets

UCI Machine Learning Archive: widely used testbed of data sets for machine learning and data mining, mostly relatively small, classification-oriented data sets.

UCI KDD Archive: contains somewhat larger, more complex data sets than those found in the UCI ML archive.

StatLib: contains pointers to many data sets used in statistics.

KDNuggets Data Sets list: pointers to data sets and archives of data sets.