UCI Source Code Data Sets

Welcome to the UCI Source Code Data Sets

This page is a repository of data sets we have curated during our research on large-scale analysis of source code. The data sets are available for other researchers and individuals to use. Please refer to the terms of use that come with each data set for any restrictions.

Currently available data sets:

  • [04-22-2010] SDS_source-repo-18k is a tarball of the Sourcerer Code Repository as archived on 04-22-2010. It contains 18,000 Java projects (~390GB).
  • [06-04-2010] Koders-log-2007 is a compressed file (7z format) containing a Microsoft SQL Server database backup with one year of usage data from Koders.com, an Internet-scale code search engine (~188MB).
  • [11-14-2013] sourcerer-maven-aug12 is a compressed tarball (.tar.gz) containing 2,232 projects from the Maven Central repository (~80GB).
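
Because the tarballs above are tens to hundreds of gigabytes, it can help to look at the first few entries before unpacking anything. The following is a minimal sketch of one way to do that with Python's standard `tarfile` module, assuming a gzip-compressed tarball such as the `.tar.gz` described for sourcerer-maven-aug12. The helper names `list_members` and `make_demo_archive`, and the member paths in the demo archive, are hypothetical and exist only for illustration; the internal layout of the real archives is not described on this page.

```python
import io
import tarfile
from itertools import islice


def list_members(fileobj, limit=10):
    """Return the names of the first `limit` members of a gzip-compressed tarball.

    Mode "r|gz" reads the archive as a forward-only stream, so nothing is
    extracted to disk and the archive is never loaded into memory as a whole.
    """
    with tarfile.open(fileobj=fileobj, mode="r|gz") as archive:
        return [member.name for member in islice(archive, limit)]


def make_demo_archive():
    """Build a tiny in-memory .tar.gz so the sketch runs without the real data.

    The member names below are made up; they are NOT the layout of any
    data set listed on this page.
    """
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name in ("demo/a/Foo.java", "demo/b/Bar.java", "demo/README"):
            payload = b"// placeholder\n"
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    # Peek at the first two entries of the demo archive.
    print(list_members(make_demo_archive(), limit=2))
```

To try this on a downloaded data set, one would pass a file opened in binary mode (for example `open(path, "rb")`, where `path` is wherever the tarball was saved) instead of the demo archive.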

Questions, Issues and More Information

Please use the issue tracker on GitHub.

Citation Policy

If you publish material based on data sets obtained from this repository, then, in your acknowledgments, please note the assistance you received by using this repository. This will help others to obtain the same data sets and replicate your experiments. We suggest the following pseudo-APA reference format for referring to this repository:

C. Lopes, S. Bajracharya, J. Ossher, P. Baldi (2010). UCI Source Code Data Sets [http://www.ics.uci.edu/~lopes/datasets]. Irvine, CA: University of California, Bren School of Information and Computer Sciences.

Here is a BibTeX citation as well:

    @misc{Lopes+Bajracharya+Ossher+Baldi:2010,
    author = "C. Lopes and S. Bajracharya and J. Ossher and P. Baldi",
    year = "2010",
    title = "{UCI} Source Code Data Sets",
    url = "http://www.ics.uci.edu/$\sim$lopes/datasets/",
    institution = "University of California, Irvine, 
       Bren School of Information and Computer Sciences" }


This work has been partially supported by the National Science Foundation.

(c) the mondego group