OligoZip: A Tool for Reconstructing Sequences Using Next-Generation Sequencing Technology Data
OligoZip general algorithm
An algorithm for ab initio genome assembly from data produced by next-generation sequencing machines (Illumina/Solexa, etc.). In this description, a group of assembled reads is called a read "cluster", unused reads are "free reads", and the set of unused reads is the "free reads pool". The algorithm begins with an empty cluster.
The clustering algorithm produces sequence contigs, either ab initio or using a reference genome (see algorithm scheme).
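The terms above (cluster, free reads, free reads pool) can be illustrated with a minimal greedy sketch. This is not OligoZip's actual procedure, which is not detailed here; it only assumes that a cluster is seeded from the pool and grown by exact suffix-prefix overlaps. The names `best_extension`, `assemble` and the `min_overlap` parameter are hypothetical.

```python
def best_extension(contig, free_pool, min_overlap):
    """Find the free read whose prefix has the longest exact overlap
    with the 3' end of the contig. Returns (read, overlap) or None."""
    best = None
    for read in free_pool:
        longest = min(len(read), len(contig))
        # Try the longest possible overlap first, then shorter ones.
        for k in range(longest, min_overlap - 1, -1):
            if contig.endswith(read[:k]):
                if best is None or k > best[1]:
                    best = (read, k)
                break
    return best


def assemble(reads, min_overlap=4):
    """Greedy clustering sketch (an assumption, not the OligoZip method):
    each cluster starts empty, is seeded with one free read, and is
    extended to the right until no free read overlaps its end."""
    free_pool = list(reads)           # the "free reads pool"
    clusters = []
    while free_pool:
        contig = free_pool.pop(0)     # seed the empty cluster
        while True:
            hit = best_extension(contig, free_pool, min_overlap)
            if hit is None:
                break
            read, k = hit
            contig += read[k:]        # append the non-overlapping tail
            free_pool.remove(read)    # the read is no longer free
        clusters.append(contig)
    return clusters


# Toy example: five 8-bp reads drawn from one 21-bp sequence.
toy_reads = ["ATGGCGTA", "CGTACCTG", "CCTGAAGT", "AAGTCTTA", "AGTCTTAG"]
print(assemble(toy_reads, min_overlap=4))
```

The sketch extends only in one direction and ignores sequencing errors, reverse complements and repeats, all of which a real assembler has to handle.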
When using OligoZip, please cite:
Vorobyev D., Seledtsov I., Solovyev V. De novo assembling next generation sequences. http://linux5.softberry.com/cgi-bin/berry/programs/OligoZip
The set of programs of ab initio assembly:
Testing the system
(1) De novo sequence reconstruction was tested by assembling several phage and bacterial genomes and showed superior clustering power compared to an earlier published approach (Bioinformatics, 2007, 23(4):500-501). Simulated error-free 25-mers of bacteriophage φX174 and coronavirus SARS TOR2 were assembled perfectly, and on the Haemophilus influenzae genome, contigs assembled by OligoZip were almost twice as long as those assembled by the published SSAKE software.
(2) To test reconstruction of bacterial sequences using a reference genome, we assembled the genomic sequence of Methanopyrus kandleri TAG11 on the known genome of Methanopyrus kandleri AV19. Solexa reads, about 6 million each for AV19 and TAG11, were produced by the sequencing lab of the Harvard Partners HealthCare Center for Genetics and Genomics. The AV19 genome itself was assembled perfectly, with one extra contig that turned out to be the genome of phage φX174. TAG11 reads were assembled into several hundred contigs. Similar results were achieved on five other Methanopyrus strains. The following link shows an alignment of a fragment of one of the assembled contigs to the reference genome in the Softberry Genome Comparison Viewer.
Annotation of the aligned parts of the reference and assembled genomes by the automatic FGENESB pipeline produces similar results (see link), indicating that no ORF distortions such as frameshifts or premature termination codons were introduced in the process of sequence assembly.
(3) As an example of SNP finding using Solexa reads, we took reads from a population sample covering a fragment of the human EPS8 gene with a known C->T substitution, marked with an asterisk on the sequence (see link). Approximately 40% of the 268 reads mapped to this fragment support the C->T SNP. This demonstrates the feasibility of using OligoZip for SNP discovery on Solexa data.
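The idea of read support for a SNP can be sketched as a simple allele-fraction count at one reference position. This is an illustration only, not the OligoZip SNP procedure. The function names and the `min_fraction` threshold are hypothetical, and the counts of 161 C and 107 T are invented solely to match "approximately 40% of 268 reads".

```python
from collections import Counter

VALID_BASES = set("ACGT")


def allele_support(bases_at_site):
    """Return {base: fraction of reads} for the bases that mapped reads
    show at a single reference position. Non-ACGT symbols are ignored."""
    counts = Counter(b.upper() for b in bases_at_site if b.upper() in VALID_BASES)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {base: n / total for base, n in counts.items()}


def candidate_snps(bases_at_site, ref_base, min_fraction=0.2):
    """Non-reference alleles supported by at least min_fraction of reads.
    The threshold is an arbitrary illustrative value."""
    support = allele_support(bases_at_site)
    return {
        base: frac
        for base, frac in support.items()
        if base != ref_base.upper() and frac >= min_fraction
    }


# Hypothetical pileup: 268 reads, invented split of 161 C and 107 T.
pileup = ["C"] * 161 + ["T"] * 107
for base, frac in candidate_snps(pileup, ref_base="C").items():
    print(f"C->{base}: {frac:.1%} of {len(pileup)} reads")
```

In a population sample, the fraction of supporting reads also reflects allele frequency, so a fixed threshold like this one is only a starting point.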
Tests of the OligoZip assembly software on Arabidopsis chromosome 1.
The sequence of Arabidopsis Chr1 (about 30 Mb) was randomly fragmented into 350-bp reads. To test how coverage and fragment length affect assembly speed and quality, we started with a 5-Mb chromosome fragment at low coverage and then increased both the coverage and the length of the fragment.
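The relation between coverage, fragment length and number of reads in such a test follows from coverage = reads x read length / fragment length. The sketch below assumes reads are sampled uniformly at random from one strand, which may differ from how the test reads were actually generated; `reads_needed` and `sample_reads` are hypothetical helper names.

```python
import math
import random


def reads_needed(fragment_length, coverage, read_length):
    """Number of reads of read_length needed to reach the given mean coverage."""
    return math.ceil(coverage * fragment_length / read_length)


def sample_reads(sequence, coverage, read_length, seed=0):
    """Draw reads at uniformly random start positions (an assumption)
    until the requested mean coverage is reached."""
    if len(sequence) < read_length:
        raise ValueError("sequence is shorter than the read length")
    rng = random.Random(seed)
    last_start = len(sequence) - read_length
    count = reads_needed(len(sequence), coverage, read_length)
    starts = (rng.randint(0, last_start) for _ in range(count))
    return [sequence[s:s + read_length] for s in starts]


# Read counts implied by the test settings described above.
for length_mb, cov in [(5, 10), (5, 20), (10, 10), (30, 10)]:
    n = reads_needed(length_mb * 1_000_000, cov, 350)
    print(f"{length_mb} Mb at {cov}x with 350-bp reads: {n} reads")
```

With uniform sampling, the ends of the fragment receive less coverage than the middle, so a real simulation may want to treat the ends separately.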
Speed was measured on a single 2.2-GHz dual-core processor running Linux (a fairly standard computer). A computer farm and/or more powerful processors would allow faster and larger-scale assembly.
a) Increasing the coverage of the 5-Mb fragment from 10x to 20x decreased the number of clusters (contigs) from 114 to 12 and increased computing time from 16 min to 4 hours. Assembled contigs covered ~90% of the sequence; 99% of the sequence was covered when fragments of several chimeric contigs are counted together with the good contigs.
NOTE: Chromosome 1 has 15 big gaps represented as polyN tracts: five gaps of about 60 bp each and 10 gaps of thousands of Ns each, as well as other small disruptive NNN fragments.
b) 10-Mb fragment, 10x coverage -> 34 min, 215 contigs; entire Chromosome 1 (30 Mb), 10x coverage -> 2 hours, 1003 contigs.
The number of contigs is roughly proportional to the length of the assembled fragment. At 20x coverage, a 90% decrease in the number of contigs should be expected, with their average length about 10 times larger (620,000 bp); such contigs would contain ~100 genes on average, a very good size for reliable annotation.
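The proportionality claim can be checked with simple arithmetic on the 10x figures quoted in (a) and (b). The sketch below only divides contig counts by fragment length and applies the expected 90% decrease; the helper names are hypothetical.

```python
def contigs_per_mb(runs):
    """Contig density (contigs per Mb) for each (fragment_length_mb, contigs) pair."""
    return [contigs / length_mb for length_mb, contigs in runs]


def projected_contigs(contigs, decrease=0.9):
    """Contig count after the expected fractional decrease (90% by default)."""
    return round(contigs * (1 - decrease))


# 10x runs quoted in the text: 5 Mb, 10 Mb and the entire 30-Mb chromosome.
runs_10x = [(5, 114), (10, 215), (30, 1003)]
for (length_mb, contigs), density in zip(runs_10x, contigs_per_mb(runs_10x)):
    print(f"{length_mb} Mb: {contigs} contigs, {density:.1f} contigs per Mb")

print("Chromosome 1 at 20x, if the 90% decrease holds:", projected_contigs(1003))
```

The densities fall in the same range but are not identical, which is what "roughly proportional" allows for.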
Table 1. Results of ab initio assembly of a 5-Mb fragment of Arabidopsis Chromosome 1:
| oligos coverage | ai | chimeras | length_of_ai_contigs min (bp) | length_of_ai_contigs max (bp) | length_of_ai_contigs average (bp) | glued | length_of_glued_contigs min (bp) | length_of_glued_contigs max (bp) | length_of_glued_contigs average (bp) | cov_good (%) | cov_all (%) | time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10x | 110 | 6 (5.5%) | 538 | 225588 | 45492.6 | 23 | 2943 | 726676 | 217326.9 | 89.41 | 99.99 | ~10 min |
| 15x | 13 | 2 (15.4%) | 355 | 1354730 | 384577.9 | 9 | 4823 | 1521262 | 555394.0 | 72.38 | 100.00 | ~1 hour 40 min |
| 20x | 8 | 1 (12.5%) | 4812 | 1861885 | 624837.6 | 6 | 4812 | 2543555 | 833030.2 | 98.41 | 100.00 | ~3 hours 30 min |

ai - ab initio assembled contigs
chimeras - chimeric contigs (different parts of a contig map to different regions of the 5 Mb chromosome fragment)
glued - ai contigs that overlap on the reference genome and have sticky ends can be glued together
cov_good - coverage of the 5 Mb genome fragment by "good" contigs (those that are not chimeras); most "good" contigs align perfectly to the genome fragment; some contigs have small problems, but >99% of their length still aligns very well within the same region of the chromosome fragment and covers >99% of that region
cov_all - coverage of the 5 Mb genome fragment by all alignments, including alternative alignments, of all contigs, including chimeras
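The difference between cov_good and cov_all can be illustrated as the union of alignment intervals on the reference fragment, computed with and without chimeric contigs. This is a sketch under the assumption that each alignment is reduced to a half-open (start, end) interval on the reference; it is not the script used to produce the table, and the toy coordinates are invented.

```python
def percent_covered(intervals, region_length):
    """Percent of [0, region_length) covered by the union of
    half-open (start, end) intervals. Overlaps are counted once."""
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        # Clip each interval to the region.
        start, end = max(start, 0), min(end, region_length)
        if start >= end:
            continue
        if cur_end is None or start > cur_end:
            # Disjoint from the current merged block: close it, open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return 100.0 * covered / region_length


# Invented toy alignments: (start, end, is_chimera) on a 1000-bp region.
alignments = [(0, 400, False), (300, 600, False), (600, 900, True)]
good = [(s, e) for s, e, chimera in alignments if not chimera]
everything = [(s, e) for s, e, _ in alignments]
print("cov_good:", percent_covered(good, 1000))
print("cov_all: ", percent_covered(everything, 1000))
```

Counting overlaps once matters here because glued and alternative alignments overlap on the reference by definition.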
Table 2. Results of ab initio assembly of Arabidopsis Chromosome 1:
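| oligos coverage | ai | chimeras | length_of_ai_contigs min (bp) | length_of_ai_contigs max (bp) | length_of_ai_contigs average (bp) | glued | length_of_glued_contigs min (bp) | length_of_glued_contigs max (bp) | length_of_glued_contigs average (bp) | cov_good (%) | cov_all (%) | time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10x | 840 | 38 (4.5%) | 350 | 314116 | 35597.83 | 331 | 350 | 1826464 | 90209.18 | 88.18 | 99.91 | ~2 hours |
| 15x | 306 | 33 (10.8%) | 351 | 2932495 | 97628.63 | | | | | | | ~18 hours |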
At 15x coverage, the contigs average about 100 kb and reach up to about 3 Mb. Contig size is limited by repeats, which can also create long chimeras. Contig size can be improved, and many chimeras resolved, by applying techniques that use mate pairs, ESTs and gene prediction. Note also that Chromosome 1 contains 15 large gaps (polyN tracts), as mentioned above.
© 2022 www.softberry.com