OligoZip: A Tool for Reconstructing Genome Sequences Using Illumina-Solexa or Similar Sequencing Data.

Softberry's new OligoZip tool for processing short reads generated by Solexa sequencing machines provides effective solutions to the following tasks:

  1. De novo reconstruction of genomic sequence;
  2. Reconstruction of genomic sequences based on a reference genome from the same or a closely related species;
  3. Mutation profiling and SNP discovery in a given set of genes.

The OligoZip tool uses an L-plet hashing technique to achieve fast data processing, and it takes read quality information into account.
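The idea of hashing L-plets (substrings of length L) while respecting read quality can be illustrated with a minimal Python sketch. This is not OligoZip's actual implementation: the function name `build_lplet_index`, the parameters `L` and `min_qual`, and the rule of skipping any L-plet that contains a low-quality base are all assumptions made for illustration.

```python
from collections import defaultdict

def build_lplet_index(reads, quals, L=12, min_qual=20):
    """Map each L-plet to the (read_id, offset) pairs where it occurs.

    Hypothetical quality rule: an L-plet is skipped if any of its bases
    has a quality score below min_qual.
    """
    index = defaultdict(list)
    for read_id, (seq, qual) in enumerate(zip(reads, quals)):
        for pos in range(len(seq) - L + 1):
            if min(qual[pos:pos + L]) < min_qual:
                continue  # low-quality window, do not index it
            index[seq[pos:pos + L]].append((read_id, pos))
    return index

# Two toy reads; the last base of the second read has low quality.
reads = ["ACGTACGTAC", "GTACGTACGG"]
quals = [[30] * 10, [30] * 9 + [5]]
index = build_lplet_index(reads, quals, L=4)

# Reads sharing an L-plet can be found with a single dictionary lookup.
shared = index["ACGT"]
```

A hash lookup of this kind finds candidate overlaps between reads without comparing every read against every other read, which is where the speed of such approaches usually comes from.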

(1) De novo sequence reconstruction was tested by assembling several phage and bacterial genomes and demonstrated superior clustering power compared with an earlier published approach (Bioinformatics, 2007, 23(4):500-501). Simulated error-free 25-mers of bacteriophage PhiX174 and coronavirus SARS TOR2 were assembled perfectly, and on the Haemophilus influenzae genome, contigs assembled by OligoZip were almost twice as long as those assembled by the published SSAKE software.
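To make the notion of assembling simulated error-free reads concrete, the following Python sketch extends a seed read greedily through exact suffix-prefix overlaps. It is a generic textbook-style illustration, not OligoZip's clustering method; the function names, the toy sequence, and the read length of 8 are hypothetical.

```python
def overlap(a, b, min_overlap):
    """Length of the longest suffix of a that equals a prefix of b,
    or 0 if no overlap of at least min_overlap exists."""
    for n in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_extend(reads, min_overlap=5):
    """Extend the first read to the right, always taking the unused read
    with the longest exact overlap, until no read overlaps any more."""
    unused = list(reads)
    contig = unused.pop(0)
    while unused:
        best = max(unused, key=lambda r: overlap(contig, r, min_overlap))
        n = overlap(contig, best, min_overlap)
        if n == 0:
            break
        contig += best[n:]
        unused.remove(best)
    return contig

# Simulated error-free reads: every 8-mer of a short toy sequence.
genome = "ATGGCGTACCTTAGGCATCGA"
reads = [genome[i:i + 8] for i in range(len(genome) - 8 + 1)]
contig = greedy_extend(reads, min_overlap=5)
```

On repeat-free, error-free input such as this toy sequence, the greedy walk recovers the original sequence. Repeats longer than the overlap are exactly where this simple strategy breaks contigs.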

(2) To test reconstruction of bacterial sequences using a reference genome, we assembled the genomic sequence of Methanopyrus kandleri TAG11 on the known genome of Methanopyrus kandleri AV19. Solexa reads, about 6 million each for AV19 and TAG11, were produced by the sequencing lab of the Harvard Partners HealthCare Center for Genetics and Genomics. The AV19 genome itself was assembled perfectly, with one extra contig that turned out to be the genome of phage φX174. TAG11 reads were assembled into several hundred contigs. Similar results were achieved on five other Methanopyrus strains. The following link shows an alignment of a fragment of one of the assembled contigs to the reference genome in the Softberry Genome Comparison Viewer.

Annotation of the aligned parts of the reference and assembled genomes by the automatic FGENESB pipeline produces similar results (see link), indicating that no ORF distortions such as frameshifts or premature termination codons were introduced during sequence assembly.

(3) As an example of SNP discovery using Solexa reads, we took reads from a population sample covering a fragment of the human EPS8 gene with a known C->T substitution, marked with an asterisk on the sequence shown here (link). Approximately 40% of the 268 reads mapped to this fragment support the C->T SNP. This demonstrates the possibility of using OligoZip for SNP discovery on Solexa data.
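The kind of evidence described above, the fraction of mapped reads that support an alternative base at one position, can be computed as in the following Python sketch. It assumes ungapped alignments given as (start, sequence) pairs with 0-based coordinates. The reference string, the reads, and the function name `allele_fractions` are invented for illustration and are not EPS8 data or OligoZip output.

```python
from collections import Counter

def allele_fractions(mapped_reads, position):
    """Fraction of each base observed at a 0-based reference position.

    mapped_reads: list of (start, sequence) ungapped alignments.
    Reads that do not cover the position are ignored.
    """
    counts = Counter()
    for start, seq in mapped_reads:
        offset = position - start
        if 0 <= offset < len(seq):
            counts[seq[offset]] += 1
    depth = sum(counts.values())
    if depth == 0:
        return {}
    return {base: n / depth for base, n in counts.items()}

# Toy reference with a C at position 4; two of five covering reads carry a T.
reference = "GATTCCAGT"
mapped_reads = [
    (0, "GATTC"),
    (2, "TTCCA"),
    (3, "TCCAG"),
    (1, "ATTTC"),  # T at reference position 4
    (4, "TCAGT"),  # T at reference position 4
    (6, "AGT"),    # does not cover position 4
]
fractions = allele_fractions(mapped_reads, 4)
ref_base = reference[4]
alt_support = {b: f for b, f in fractions.items() if b != ref_base}
```

In this toy case two of the five covering reads carry a T, so `alt_support` reports a T fraction of 0.4. A real SNP caller would also weigh base qualities and sequencing error rates before reporting a variant.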

OligoZip is currently being improved to increase the average length of assembled contigs and to resolve some issues with repeated sequences.


Solovyev et al. (2008) SeqZip: a tool for reconstruction of genome sequences using Solexa/Illumina machine data. The Pacific Symposium on Biocomputing 2008 (see poster PDF).

