OligoZip: A Tool for Reconstructing Sequences and Transcriptome Analysis Using Next-Generation Sequencing Technology Data.

Softberry's new OligoZip tool processes short reads from next-generation sequencing machines, such as Solexa/Illumina, and provides effective solutions to the following tasks:

  1. De novo reconstruction of genomic sequence;
  2. Reconstruction of sequences based on reference genome from same or close species;
  3. Mutation profiling and SNP discovery in a given set of genes;
  4. Analysis of transcriptome sequence data, with estimates of gene expression levels and identification of the gene structure of expressed splice variants.

OligoZip uses an L-plet hashing technique to achieve fast data processing, and it takes read quality information into account.
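
The idea of L-plet hashing can be illustrated with a small Python sketch. This is not OligoZip's implementation: the function names, the choice of L and the toy reads are assumptions made for illustration, and read quality is ignored here. Every substring of length L is stored in a hash table, so reads that share an L-plet with a query are found by lookup rather than by pairwise comparison.

```python
from collections import defaultdict


def build_lplet_index(reads, L):
    """Map each L-plet (substring of length L) to the (read id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for read_id, seq in enumerate(reads):
        for offset in range(len(seq) - L + 1):
            index[seq[offset:offset + L]].append((read_id, offset))
    return index


def reads_sharing_lplet(index, query, L):
    """Return ids of indexed reads that share at least one L-plet with the query."""
    hits = set()
    for offset in range(len(query) - L + 1):
        # .get() avoids creating empty entries in the defaultdict
        for read_id, _ in index.get(query[offset:offset + L], ()):
            hits.add(read_id)
    return hits


# Toy data (hypothetical reads, not taken from any real run)
reads = ["ACGTACGTGA", "GTGATTCCAA", "TTTTTTTTTT"]
index = build_lplet_index(reads, L=4)
print(sorted(reads_sharing_lplet(index, "CGTGATTC", L=4)))
```

Candidate overlaps found this way would still need to be verified by alignment before reads are clustered into contigs.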

The set of programs for ab initio assembly:

  1. Initially cleans the oligos of adapter sequences, repeats, and low-quality data
  2. Uses a clustering algorithm to produce sequence contigs, either ab initio or using a reference genome (see algorithm scheme)

Analysis of transcriptome data proceeds through the following steps:

  1. make FASTA files from four-line-format sequence files
  2. concatenate all *.fa files (from the same set) into one file
  3. remove leading and trailing runs of Ns and skip short reads
  4. convert reads from text to binary format
  5. map reads to chromosomes
  6. sort reads by chromosome
  7. make a profile of chromosome coverage by reads
  8. align reads that have no exact mapping (for splice site discovery)
  9. make files of potential splice sites
  10. run FGENESH gene identification, taking into account mapped reads in exons and splice sites
  11. calculate expression values for identified genes from the number of reads mapped to a particular gene sequence and to the whole genome
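
Steps 1 and 3 of this list can be sketched in a few lines of Python. This is an illustration only, not the OligoZip code: it assumes the four-line format is FASTQ-style (name line starting with "@", sequence, "+", quality string), and the function names and the min_len threshold are hypothetical.

```python
def fastq_to_fasta(fastq_lines, min_len=20):
    """Yield (name, sequence) pairs for reads that pass N-trimming and the length filter.

    min_len is an illustrative threshold, not a value taken from OligoZip.
    """
    lines = [line.rstrip("\n") for line in fastq_lines]
    # Each record spans four lines: @name, sequence, '+', quality string.
    for i in range(0, len(lines) - 3, 4):
        header, seq = lines[i], lines[i + 1]
        if not header.startswith("@"):
            raise ValueError(f"record at line {i + 1} does not start with '@'")
        trimmed = seq.upper().strip("N")  # drop leading and trailing runs of N
        if len(trimmed) >= min_len:       # skip reads that are too short
            yield header[1:], trimmed


def format_fasta(records):
    """Render (name, sequence) pairs as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


# Two toy records: the second one is too short after trimming and is skipped.
example = """@read1
NNACGTACGTACNN
+
IIIIIIIIIIIIII
@read2
NNNNACGNNN
+
IIIIIIIIII
""".splitlines()

print(format_fasta(fastq_to_fasta(example, min_len=5)), end="")
```

The quality strings are dropped here because FASTA carries no quality information; a pipeline that uses read quality later would need to keep them in a separate file.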

Testing the system

(1) De novo sequence reconstruction was tested by assembling several phage and bacterial genomes and was demonstrated to have superior clustering power compared to an earlier published approach (Bioinformatics, 2007, 23(4):500-501). Simulated error-free 25-mers of bacteriophage φX174 and SARS coronavirus TOR2 were assembled perfectly, and on the Haemophilus influenzae genome, contigs assembled by OligoZip were almost twice as long as those assembled by the published SSAKE software.

(2) To test reconstruction of bacterial sequences using a reference genome, we assembled the genomic sequence of Methanopyrus kandleri TAG11 on the known genome of Methanopyrus kandleri AV19. Solexa reads, about 6 million each for AV19 and TAG11, were produced by the sequencing lab of the Harvard Partners HealthCare Center for Genetics and Genomics. The AV19 genome itself was assembled perfectly, with one extra contig that turned out to be the genome of phage φX174. TAG11 reads were assembled into several hundred contigs. Similar results were achieved on five other Methanopyrus strains. An alignment of a fragment of one of the assembled contigs to the reference genome can be viewed in the Softberry Genome Comparison Viewer (see link).

Annotation of the aligned parts of the reference and assembled genomes by the automatic FGENESB pipeline produces similar results (see link), indicating that no ORF distortions such as frameshifts or premature termination codons were introduced in the process of sequence assembly.

(3) As an example of SNP finding using Solexa reads, we took reads from a population sample covering a fragment of the human EPS8 gene with a known C->T substitution, marked with an asterisk in the sequence (see link). Approximately 40% of the 268 reads mapped to this fragment support the occurrence of the C->T SNP. This demonstrates the feasibility of using OligoZip for SNP discovery on Solexa data.
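
The kind of count behind a statement such as "40% of reads support the SNP" can be sketched as follows. This is a simplified illustration, not OligoZip's SNP caller: it assumes ungapped read alignments given as (start, sequence) pairs in 0-based reference coordinates, and the function name and toy reads are hypothetical.

```python
from collections import Counter


def allele_fractions(alignments, position):
    """Fraction of reads showing each base at a reference position.

    alignments: iterable of (start, sequence) for ungapped alignments,
    with start as the 0-based reference coordinate of the first base.
    """
    counts = Counter()
    for start, seq in alignments:
        offset = position - start
        if 0 <= offset < len(seq):  # the read covers the position
            counts[seq[offset]] += 1
    depth = sum(counts.values())
    if depth == 0:
        return {}
    return {base: n / depth for base, n in counts.items()}


# Toy alignments (hypothetical): five reads cover position 10, one does not.
toy = [(8, "AACGT"), (10, "TGGA"), (6, "GGAAC"), (9, "ATGG"), (7, "GAAC"), (11, "GGA")]
print(allele_fractions(toy, position=10))
```

A real caller would also weight each base by its quality score and require a minimum depth before reporting a variant.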

Tests of OligoZip assembly on Arabidopsis chromosome 1.

The sequence of Arabidopsis Chr1 (about 30 Mb) was randomly fragmented into 350-bp reads. To test how coverage and length affect assembly speed and quality, we started with a 5-Mb chromosome fragment at low coverage and proceeded by increasing the coverage and the length of the fragment.
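
Random fragmentation at a chosen coverage can be simulated along the following lines. This is a sketch of the general procedure, not the script used for the tests above: it assumes error-free, fixed-length reads with uniformly random start positions, and the function name, seed and toy genome are illustrative.

```python
import random


def simulate_reads(genome, read_len, coverage, seed=0):
    """Sample fixed-length, error-free reads at uniformly random positions.

    The number of reads is chosen so that reads * read_len is about
    coverage * genome length.
    """
    if len(genome) < read_len:
        raise ValueError("genome is shorter than the read length")
    rng = random.Random(seed)
    n_reads = round(coverage * len(genome) / read_len)
    last_start = len(genome) - read_len
    starts = (rng.randint(0, last_start) for _ in range(n_reads))
    return [genome[s:s + read_len] for s in starts]


# Toy genome of 10,000 random bases (hypothetical stand-in for a chromosome fragment)
_rng = random.Random(1)
genome = "".join(_rng.choice("ACGT") for _ in range(10_000))
reads = simulate_reads(genome, read_len=350, coverage=10)
print(len(reads), "reads of", len(reads[0]), "bp")
```

Real reference sequences contain polyN gaps, so a fuller simulation would also need a rule for reads that fall on such tracts.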

Speed was measured on a single 2.2-GHz dual-core processor running Linux (a fairly standard computer). A computer farm and/or more powerful processors would allow faster, scaled-up assembly.

a) Increasing coverage of the 5-Mb fragment from 10x to 20x decreased the number of clusters (contigs) from 114 to 12 and increased computing time from 16 min to 4 hours.
Assembled fragments covered ~90% of the sequence, and 99% of the sequence was covered by well-assembled contigs together with fragments of several chimeric assemblies.

NOTE: Chromosome 1 has 15 large gaps represented as polyN tracts (five gaps of about 60 bp each and 10 gaps of thousands of Ns each), as well as other small disruptive NNN fragments.

b) 10-Mb fragment, 10x coverage -> 34 min, 215 contigs; entire Chromosome 1 (30 Mb), 10x coverage -> 2 hours, 1003 contigs

The number of contigs is roughly proportional to the length of the assembled fragment. At 20x coverage, a 90% decrease in the number of contigs should be expected, with their average length about 10 times larger (620,000 bp); such fragments would contain ~100 genes on average, a very good size for reliable annotation.
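
The arithmetic behind these two statements can be checked with a short sketch. The contig counts are the 10x figures quoted in (a) and (b) above; the helper names are hypothetical, and the length estimate assumes that the total assembled length stays the same when the number of contigs drops.

```python
# (fragment length in Mb, number of contigs at 10x coverage), as quoted in the text
runs_10x = [(5, 114), (10, 215), (30, 1003)]


def contigs_per_mb(length_mb, n_contigs):
    """Contig density; roughly constant if contig count scales with fragment length."""
    return n_contigs / length_mb


def length_gain(fractional_decrease):
    """Factor by which average contig length grows when the contig count drops
    by the given fraction, assuming the total assembled length is unchanged."""
    return 1.0 / (1.0 - fractional_decrease)


for length_mb, n_contigs in runs_10x:
    print(f"{length_mb:>2} Mb: {contigs_per_mb(length_mb, n_contigs):.1f} contigs per Mb")

print(f"90% fewer contigs -> average length x{length_gain(0.9):.0f}")
```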

Table 1. Results of ab initio assembly of a 5-Mb fragment of Arabidopsis Chromosome 1:

oligos      ai    chimeras  length_of_ai_contigs        glued  length_of_glued_contigs     cov_good  cov_all  time
coverage                      min       max    average           min       max    average       (%)      (%)

10x        110   6 ( 5.5%)    538    225588    45492.6     23   2943    726676   217326.9     89.41    99.99  ~10 min
15x         13   2 (15.4%)    355   1354730   384577.9      9   4823   1521262   555394.0     72.38   100.00  ~1 hour 40 min
20x          8   1 (12.5%)   4812   1861885   624837.6      6   4812   2543555   833030.2     98.41   100.00  ~3 hours 30 min

ai -       ab initio assembled contigs
chimeras - chimeric contigs (different parts of a contig map to different regions of 5 Mb chromosome fragment)
glued -    ai contigs that overlap on reference genome and have sticky ends can be glued together
cov_good - coverage of 5 Mb genome fragment by "good" contigs (those that are not chimeras) - most "good"
           contigs align perfectly to the genome fragment; some contigs have small problems, but >99% of their
           length still aligns very well within the same region of the chromosome fragment and covers >99% of that region
cov_all -  coverage of 5 Mb genome fragment by all alignments, including alternative alignments, of all contigs,
           including chimeras
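
Coverage figures such as cov_good and cov_all can be computed by merging the alignment intervals of contigs on the reference. The sketch below shows that calculation only; the interval representation (half-open, 0-based start and end), the function name and the toy numbers are assumptions and are not taken from the OligoZip evaluation scripts.

```python
def covered_fraction(intervals, genome_length):
    """Fraction of the genome covered by a set of (start, end) half-open intervals.

    Overlapping or touching intervals are merged so that no base is counted twice.
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged block and open a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / genome_length


# Toy alignments (hypothetical): (start, end, is_chimera)
alignments = [(0, 100, False), (50, 150, False), (300, 400, True)]
good = [(s, e) for s, e, chimera in alignments if not chimera]
everything = [(s, e) for s, e, _ in alignments]
print(f"cov_good = {100 * covered_fraction(good, 1000):.2f}%")
print(f"cov_all  = {100 * covered_fraction(everything, 1000):.2f}%")
```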

Minimum number of contigs that cover the following part of the 5-Mb fragment sequence:
cov. 10x   50   67   79
cov. 15x    6    7    8
cov. 20x    3    4    5

Table 2. Results of ab initio assembly of the entire Arabidopsis Chromosome 1:

oligos      ai    chimeras  length_of_ai_contigs        glued  length_of_glued_contigs     cov_good  cov_all  time
coverage                      min       max    average           min       max    average       (%)      (%)

10x        840  38 ( 4.5%)    350    314116   35597.83    331    350   1826464   90209.18     88.18    99.91  ~2 hours
15x        306  33 (10.8%)    351   2932495   97628.63                                                        ~18 hours

Minimum number of contigs that cover the following part of the entire Chromosome 1:
cov. 10x  271  372  461
cov. 15x   33   45   61

We can see that 15x coverage produces contigs with an average size of about 100 kb and a maximum of nearly 3 Mb. Contig size is restricted by the occurrence of repeats, which can also create long chimeras. The assembly can be improved, and many chimeras resolved, by using mate-pairs, ESTs and gene prediction. Note also that Chromosome 1 contains 15 large gaps (polyN tracts), as mentioned above.

There are several techniques that we believe would further improve the assembly:

  1. Using the sequence of a related known genome as a reference sequence. The best approach would be a combination of ab initio assembly and reference genome mapping, as the analyzed genome might have unique sequences that don't map to the reference genome.
  2. Joining contigs with overlapping ends: to minimize the occurrence of chimeras, our current assembly rules prohibit elongation of a contig if coverage is low or there are ambiguous oligos at its end, so some contigs with overlapping ends remain separate and can be joined together.
  3. Using mate-paired libraries that allow sequences of both ends of each fragment to be determined sequentially. Mate-pairs should be assembled in the same orientation and at the expected distance. Using end read pairs of fixed-size mate-pair libraries, we can determine the relative orientations of all contigs, estimate the gap size for each adjacent contig pair, and use this information during assembly, or reassemble incorrect assemblies by validating orientations and gap sizes.
  4. Using EST sequences: They can be used in a manner similar to mate-pairs. ESTs can define the mutual orientation of reads that show good similarity with a given EST sequence, and reads belonging to one EST put certain restrictions on the reads' localization in a genome (overlapping reads should overlap in the contig, and non-overlapping reads should be closer than the maximum size of a gene). Also, we can map EST sequences to the assembled contigs and identify misassembled contigs, which will not meet the standard for an EST mapped on a genome: it should map either as one uninterrupted fragment or as a set of fragments with corresponding conserved splice sites at their intron ends.
  5. Using predicted proteins: We can run the FGENESH++ annotation pipeline on assembled contigs, which will determine the closest homologous protein for each predicted one, usually from some known genome. If a contig is misassembled, only a part of a certain protein will be predicted at a given location. Other part(s) of the same protein will be predicted at a different location in the contig or even in different contigs. We can join or reassemble some contigs based on the assumption that reads corresponding to the same protein, found through its known close homolog, should lie close together.
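
The mate-pair validation described in technique 3 can be sketched as a simple consistency check. This is an illustration under stated assumptions, not OligoZip code: the MateHit record, the function name and the numeric values are hypothetical, and the orientation rule (both mates on the same strand) follows the description above; other library types would need a different rule.

```python
from dataclasses import dataclass


@dataclass
class MateHit:
    contig: str
    position: int  # 0-based leftmost coordinate of the read on the contig
    strand: str    # "+" or "-"


def pair_is_consistent(a, b, read_len, insert_size, tolerance):
    """Check one mate-pair against the expected orientation and distance.

    Pairs whose mates fall on different contigs are not judged here; they are
    evidence for ordering and orienting contigs rather than for validation.
    """
    if a.contig != b.contig:
        return False
    if a.strand != b.strand:  # expected: same orientation
        return False
    left, right = sorted((a, b), key=lambda hit: hit.position)
    span = right.position + read_len - left.position  # outer distance between mates
    return abs(span - insert_size) <= tolerance


# Toy hits (hypothetical library: 50-bp reads, 2700-bp fragments, +/-300 bp)
ok = pair_is_consistent(MateHit("c1", 100, "+"), MateHit("c1", 2750, "+"), 50, 2700, 300)
too_far = pair_is_consistent(MateHit("c1", 100, "+"), MateHit("c1", 5000, "+"), 50, 2700, 300)
print(ok, too_far)
```

A contig region where many pairs fail this check is a candidate misassembly, while pairs that span two contigs suggest how those contigs should be ordered and how large the gap between them is.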

The five techniques listed above should significantly improve the accuracy of the assembly and the lengths of the resulting contigs. Enough mate-pair data (3) can potentially resolve repeated regions and make it possible to assemble entire chromosomes.

We believe that using OligoZip in combination with our EST_MAP program and the Fgenesh++ genome annotation pipeline, the most accurate pipeline for eukaryotic genome annotation, provides a significant advantage over pure assembly methods, as EST mapping and gene prediction can be used to significantly improve de novo assembly.

© 2022 www.softberry.com