DNAzip: DNA sequence compression using a reference genome
The demonstration code can be found here: DNAzip.
The code is written in standard C++. The provided makefile uses GNU C++,
but other C++ compilers should work. The resulting executable is
"perftest". As a demonstration program, it is currently hard-coded to
look for a specific reference genome, reference SNP map, and genome to
be compressed.
To run the code, you will need the following three datasets:
- The human reference genome
(hg18, 840MB).
This is exactly the same data as provided by UCSC
goldenPath,
so you will likely get a faster download directly from UCSC. There are
25 FASTA files, one for each chromosome and the mitochondrial genome
(which typically isn't used). They should be uncompressed and placed
in a "chr" subdirectory under the source code.
- dbSNP
database, 140MB, used as the reference SNP map. The SNP data is
based upon NCBI dbSNP build 129, but we use the UCSC SNP track as it
has pre-processed all of the needed data into a single table. The UCSC
Table Browser
can be used to acquire the SNPs (group: Variation and Repeats, track:
SNPs (129), table: snp129). There are 25 FASTA files, one for each
chromosome and the mitochondrial genome. These should be in a "dbSNP"
subdirectory under the source code.
- James Watson's
genome,
418MB, used as the example genome to be compressed. The data is from
the Nature
publication and was kindly provided by David Wheeler. There are
two files, one which lists all of the indels (JWB-indels-submission)
and the other which lists the SNPs (JWB-snps-submission.txt). These
should be in a "files" subdirectory under the source code.
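Since the program expects fixed locations, it can help to confirm the layout before running it. The following is a minimal sketch, assuming C++17 and `std::filesystem`; the function name `missingDataDirs` is hypothetical and is not part of the DNAzip source. It only checks that the three subdirectories named above ("chr", "dbSNP", "files") exist, not that they contain the right files.

```cpp
#include <filesystem>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper (not part of DNAzip): returns the names of the
// expected data subdirectories that do not exist under `root`.
std::vector<std::string> missingDataDirs(const fs::path& root) {
    // Subdirectory names taken from the dataset descriptions above.
    const std::vector<std::string> required = {"chr", "dbSNP", "files"};
    std::vector<std::string> missing;
    for (const auto& name : required) {
        // is_directory() returns false for paths that do not exist.
        if (!fs::is_directory(root / name)) {
            missing.push_back(name);
        }
    }
    return missing;
}
```

Calling `missingDataDirs(".")` from the source directory returns an empty vector when all three subdirectories are present.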
All files generated by the "perftest" executable will be created in
the "files" subdirectory. The program compresses the genome and then
decompresses it. The following files will be created:
- JWB-unified-file.txt: The variation data for the JW genome,
post-processed into a more compact text representation with the
unneeded fields removed. This data is the basis for
compression and is ~84MB for the JW genome.
- JWB-in-posFreq4.txt: The 4-mer frequency table for the JW genome.
- JWB-unified-compression.txt: This is the resulting compressed
genome. It should be ~4MB for the JW genome.
- JWB-unified-DeCompression.txt: This is the resulting decompressed
genome. It should be equivalent to the original file
(JWB-unified-file.txt).
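One way to confirm the round trip is to compare the decompressed file with the original byte for byte. The following is a minimal sketch in standard C++; the function name `filesIdentical` is hypothetical and is not part of the DNAzip source. It assumes that "equivalent" here means the two files are byte-identical, which the text above does not state explicitly.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper (not part of DNAzip): returns true only when both
// files can be opened and contain exactly the same sequence of bytes.
bool filesIdentical(const std::string& pathA, const std::string& pathB) {
    std::ifstream a(pathA, std::ios::binary);
    std::ifstream b(pathB, std::ios::binary);
    if (!a || !b) {
        return false;  // treat an unreadable file as a mismatch
    }
    std::istreambuf_iterator<char> itA(a), itB(b), end;
    // Walk both streams in lockstep and stop at the first differing byte.
    while (itA != end && itB != end) {
        if (*itA != *itB) {
            return false;
        }
        ++itA;
        ++itB;
    }
    // Identical only if both streams ended at the same position.
    return itA == end && itB == end;
}
```

For the files described above, the check would be `filesIdentical("files/JWB-unified-file.txt", "files/JWB-unified-DeCompression.txt")`, run from the source directory after "perftest" has finished.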
Future work
We have plans to enhance the code into a more flexible genome
compression library.
Any questions about the use of this code should be directed
to Xiaohui Xie
or Chen Li.
For citation, please refer to the following paper:
Christley S, Lu Y, Li C, and Xie X. Human genomes as email attachments. Bioinformatics. 2009; 25:274-5. It was the most downloaded article on the Web site of the journal Bioinformatics for two months.
Additional info
Funding
Development of DNAzip is partially supported by funding from the National Science Foundation.