Download page for Sourcerer Source Code Data Set: SDS_source-repo-18k

"SDS_source-repo-18k" is a tarball containing the source code of about 18,000 open source Java projects, collected from open source repositories and stored in Sourcerer's repository format. We are releasing it so that it can serve as a reference collection of open source projects for research purposes. The tarball was archived on 04-22-2010.

"SDS_source-repo-18k" is a part of the UCI Source Code Data Sets.


By downloading and using the Sourcerer repository, you agree to abide by the following terms of usage.

  1. The source code contained in the tarball was collected from various open source projects, and you should adhere to the respective licenses that come with those projects.
  2. You will use the file strictly for non-commercial and non-profit work (e.g., research or personal use). Any commercial use of this file is prohibited.

Citation Policy

This data set should be cited according to the general Citation Policy.

Publications relevant to this data set

  1. S. Bajracharya, J. Ossher, and C. Lopes. Sourcerer: An internet-scale software repository. In Proceedings of the 2009 ICSE Workshop on Search-Driven Development-Users, Infrastructure, Tools and Evaluation, pages 1-4. IEEE Computer Society, 2009.
    Author = {Sushil Bajracharya and Joel Ossher and Cristina Lopes},
    Booktitle = {Proceedings of the 2009 {ICSE} Workshop on {Search-Driven} 
      {Development-Users,} Infrastructure, Tools and Evaluation},
    Pages = {1--4},
    Publisher = {{IEEE} Computer Society},
    Title = {{Sourcerer: An internet-scale software repository}},
    Year = {2009}}
  2. Linstead, E., Bajracharya, S., Ngo, T., Rigor, P., Lopes, C. V., Baldi, P. (2009). Sourcerer: Mining and Searching Internet-Scale Software Repositories. Data Mining and Knowledge Discovery, 18(2), 300-336.
    journal={Data Mining and Knowledge Discovery},
    title={Sourcerer: mining and searching internet-scale software repositories},
    publisher={Springer US},
    keywords={Mining software; Program understanding; Code search; Software analysis; 
              Author-topic probabilistic modeling; Code retrieval},
    author={Linstead, Erik and Bajracharya, Sushil and Ngo, Trung and Rigor, Paul and Lopes, Cristina and Baldi, Pierre},
  3. Baldi, P., Lopes, C. V., Linstead, E., Bajracharya, S. (2008). A Theory of Aspects as Latent Topics. In Proceedings of the 23rd ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages and Applications (OOPSLA). (pp. 543-562).
     author = {Baldi, Pierre F. and Lopes, Cristina V. and Linstead, Erik J. and Bajracharya, Sushil K.},
     title = {A Theory of Aspects As Latent Topics},
     booktitle = {Proceedings of the 23rd ACM SIGPLAN Conference on Object-oriented Programming 
                  Systems Languages and Applications},
     series = {OOPSLA '08},
     year = {2008},
     isbn = {978-1-60558-215-3},
     location = {Nashville, TN, USA},
     pages = {543--562},
     numpages = {20},
     url = {},
     doi = {10.1145/1449764.1449807},
     acmid = {1449807},
     publisher = {ACM},
     address = {New York, NY, USA},
     keywords = {aspect-oriented programming, scattering, tangling, topic models},

Related Datasets

See CWI's version of this corpus and additional artifacts produced from it, including complete ASTs.

(c) the mondego group