DNAzip: DNA sequence compression using a reference genome

The demonstration code is available for download as DNAzip.

The code is written in standard C++. The provided makefile uses GNU C++, but other C++ compilers should work. The resulting executable is "perftest". As a demonstration program, it is currently hard-coded to look for a specific reference genome, reference SNP map, and genome to be compressed.
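To illustrate the general idea of compressing a genome against a reference, here is a minimal sketch. It is not DNAzip's actual algorithm or file format. It assumes the target and reference have equal length and differ only by single-base substitutions, and the names Substitution, diff_against_reference, and apply_to_reference are hypothetical, introduced only for this example.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical record (not a DNAzip type): one single-base difference
// between the target genome and the reference.
struct Substitution {
    std::size_t position;  // 0-based offset into the reference
    char base;             // base observed in the target genome
};

// "Compress": record every position where the target differs from the
// reference. Assumption: equal lengths, substitutions only (no indels).
std::vector<Substitution> diff_against_reference(const std::string& reference,
                                                 const std::string& target) {
    if (reference.size() != target.size()) {
        throw std::invalid_argument("sketch handles equal-length sequences only");
    }
    std::vector<Substitution> diffs;
    for (std::size_t i = 0; i < reference.size(); ++i) {
        if (reference[i] != target[i]) {
            diffs.push_back({i, target[i]});
        }
    }
    return diffs;
}

// "Uncompress": rebuild the target by applying the recorded differences
// to a copy of the reference.
std::string apply_to_reference(const std::string& reference,
                               const std::vector<Substitution>& diffs) {
    std::string result = reference;
    for (const Substitution& d : diffs) {
        result.at(d.position) = d.base;  // at() throws on an out-of-range position
    }
    return result;
}
```

Calling diff_against_reference and then apply_to_reference with the same reference should reproduce the target exactly, mirroring the compress-then-uncompress round trip that a demonstration program of this kind performs. The saving comes from storing only the list of differences, which is short when the two sequences are nearly identical.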

To run the code, you will need the following two datasets:

All files generated by the "perftest" executable are created in the "files" subdirectory. The program compresses the genome and then uncompresses it. The following files will be created:

Future work

We have plans to enhance the code into a more flexible genome compression library.

Any questions about the use of this code should be directed to Xiaohui Xie or Chen Li.

For citation, please refer to the following paper:

Human genomes as email attachments, Christley S, Lu Y, Li C, and Xie X, Bioinformatics. 2009 25:274-5. It was the most downloaded article on the website of the journal Bioinformatics for two months.

Additional info


Development of DNAzip is partially supported by funding from the National Science Foundation.