HomeAll softwareProductsNew ProductsServicesManagement teamCorporate ProfileContact

Test online

Gene finding
Gene finding with similarity
Gene finding in Bacteria
Gene finding in Viruses
Next Generation
Gene search
Gene explorer
Protein location
RNA structure
Protein structure
Multiple alignment
Analysis of expression data
Plant promoter database
Search and map repeats
Extracting known SNPs



Analyze RNASeq Next Generation Sequencing Data:

  • Accurate alignment of high-throughput RNA-seq data to a reference genome (ReadsMap);
  • De novo transcriptome reads assembly into RNA transcripts (TransSeq);

ReadsMap - Mapping sequence reads (spliced or unspliced) to chromosome (contigs)

ReadsMap is a fast short read aligner that quickly maps/aligns large sets of short DNA sequences. Multiple processors can be used optionally to achieve greater alignment speed. On initial stage we map “exonic” reads that demonstrate high-quality, non-interrupted alignment to a genomic sequence. Potentially, this step would map most of the reads to a genome, and the remaining “non-mapped” group would be small enough to be subjected to more thorough analysis. At the second step, we use a modified variant of our EST_MAP program to align these “non-mapped” reads using splice site matrices and producing very accurate alignment with gaps. This reads will indentify potential exon-intron boundaries.

Accuracy of reads mapping was estimated on a set of read pairs (each read is 76 base long and generated with 2% of random mutations) from 1356 mRNA sequences of human chromosome 22. We computed performance of ReadsMap and popular Bowtie read mapping program (Genome Biology, 2009, 10:R25) in aligning 1963694 of such reads to chromosome 22 sequence.

Table 1. ReadsMap performance
READS Aligned (%) Alignments True Sn Sp
Unspliced 1479130 (100%) 1610551 14748464 0.997 0.92
Spliced 484311 (99.95%) 507887 477170 0.985 0.94
All 1963441 (99.99) 2118534 1952916 0.994 0.92

Extra alignments affecting specificity of mapping are often produced due to occurrence of alternatively spliced variant of mRNA.

On the same data we have estimated Bowtie (v2) performance. As the Bowtie gap penalty is proportional to gap size it does not produce full alignment across introns that are thousands nucleotides long. Therefore we made comparison by accounting the accuracy of alignment of the most long alignment fragment.

Table 2. ReadsMap vs Bowtie performance
ReadsMap Aligned (%) Alignments True Sn Sp
Unspliced 1479130 (100%) 1610551 1479129 1.00 0.92
Spliced 484311 (99.95%) 507887 477170 0.999 0.94
All 1963441 (99.99) 2118534 1952916 0.999 0.93
Bowtie Aligned (%) Alignments True Sn Sp
Unspliced 1479130 (100%) 1610551 14748464 0.997 0.99
Spliced 484311 (99.95%) 507887 477170 0.375 0.99
All 1963441 (99.99) 2118534 1952916 0.834 0.99

We can conclude that the current version of Bowtie is not suitable for mapping spliced reads from RNASeq data having Sn=0.38 compared with ReadsMap Sn=0.99 (Table2).

In Fig 1 a typical example of a spliced read alignment is presented. While the longest fragment of alignment is located by Bowtie in the right place, the short fragment is aligned incorrectly.

Fig 1. ReadsMap alignment:

          ................(..)................|||||| ................(..)................ 
          ----------------(..)----------------CAAATT ----------------(..)---------------- 
          1         1         1         1         5         7         7         7         
   16277748  16277758  16277768  16277778  16277788  16277798  16277808  16277818

Bowtie alignment:

          1        11  16277729  16277739  16277747  16277752  16277762  16277772
          ................(..)................  0|0| ... |||||||||||||||||||||||||||||||||
          ----------------(..)----------------  CAAA tta CATTATGACGACTAGAAACAGCATACTCTCTGG
          1         1         1         1         3        11        21        31
   16277782  16277792  16277802  16277812  16277820  16277830  51304553  51304563
          ||||||||||||||||||||||||||||||||||||  ................(..)................
          CCGTCTGTCCAGATAGATCTTGAGAAGATACATCAA  ----------------(..)----------------
         41        51        61        71        77        77        77        77

ReadsMap can exactly locate even very short alignment fragments separated by long introns as it uses information from alignments of all set of reads.

TransSeq - Program for assembling transcripts from short reads (RNASeq data)

Our algorithm for contigs assembling comes to clustering of reads.

The algorithm begins with an empty cluster. At the initial step, the first cluster takes the first read from the pool of free reads (that initially contains all the reads). The consensus of the single-read cluster is identical to read sequence. Further, the next read of the pool is to be added to the cluster assuming the alignment of this read to consensus meets certain criteria. Once the second read (as well as all the following ones) is added to the cluster, the consensus is recalculated. All reads included into cluster are excluded from the free reads pool.

To speed up the search for similar sequences in the file with reads, the algorithm utilizes the L-plet (word of length L) method. The algorithm attempts to take only those reads that contain a certain (preliminary defined) number of L-plets identical to terminal fragment of the consensus (of a certain length).

In the case, when there is an information on reads pairness, the cluster is being assembled in two stages. At the first, the cluster is assembled without considering the information on pairs. As soon as contig reaches the length that guarantees the aligning of both reads of a pair, the algorithm starts to check whether reads are mapped to contig in allowed distance. That being the case, both reads of a pair are to be accepted to the cluster and excluded from the pool, and contig is lengthening. If the contig assembling is stopped prior to step that uses the information on pairs, the contig is to be disassembled and reads are to be returned to the pool.

The contig lengthening process continues till the free pool contains reads, aligning of which to consensus allows to include them into current cluster.

The consensus of cluster is output as contig.

The assembling of following clusters (as well as contigs) occurs in the same manner. Clusters are being assembled of the reads that were not used during the assemblage of previous clusters.

The assemblage process is stopped when the pool of free reads is exhausted.

In the case of RNA splicing variants assemblage, the clustering algorithm is featured by a minor difference. Clusters, as for regular algorithm variant, are being initialized with reads of the free pool (that were not used previously). Meanwhile, the reads used for lengthening of a certain contig, in contrast to regular algorithm, can be used further for lengthening of any other contig. Thus, the algorithm mostly guarantees the recovery of any exon, that is covered by a number of reads, as well as adjacent exons.


Program operates with the array of paired sequences that are organized into file in multifasta format.
Meanwhile, the paired sequences are written to file sequentially (even/odd).
In configuration file for raw sequences the following data are to be specified:
à. Number of sequences in the file (field READS:)
b. Length of sequences (READ_LEN:)
c. Type of pairness (in current version the READ_TYPE:PAIR_ENDS only is supported)
d. Average distance between far-out ends of pairs (LEN_AVER1:)
e. Standard deviation for distance between far-out ends of pairs (LEN_STDEV1:)

The example of configuration file is in the sample.di file

Accuracy evaluation

To evaluate the accuracy we used the artificially created set of sequences.
For these purposes the layout of human chromosome 22 was taken (http://genome.ucsc.edu/, h19 set), and on this basis the file with 1334 mRNA sequences was created.
On the basis of these sequences, pairs of reads with random number of mutations were generated. Thus 1960300 short reads were obtained.

Contigs obtained with use of TransSeq were aligned to original set of mRNAs. Alignment with homology not less than 0.98 and coverage not less than 90% of contig length was considered to be valid. If multiple alignments were obtained, the one with the best score was chosen.

The following attributes were evaluated:
1. Number of found mRNAs - mRNAs for which the corresponding contigs were found
2. Number of found contigs
3. Number of valid contigs - contigs that correspond to original mRNAs
4. Sensitivity (Sn) - ratio of mRNAs number, for which the corresponding contigs were found, to that of original mRNAs
5. Specificity (Sp) - ratio of valid contigs number to that of all found contigs
6. Excessiveness - ratio of found mRNAs number to that of valid contigs


Original number of mRNAs 1334
Number of found mRNAs 1299
Number of found contigs 8511
Number of valid contigs 7789
Number of invalid contigs 722
Sn 0.973
Sp 0.915
Excessiveness = 0,166

In following pictures the examples of original mRNAs layout (brown color) and results of mapping of obtained contigs to chromosome 22 with use of est_map program are shown.

Layout region uc002zlh.1 and uc010gqp.2 (- strand) (excessiveness of obtained contigs)
Click here to view large image
Layout region uc002zvn.2 uc002zvo.2 uc010gtk.1 (- strand) (alternative splicing)
Click here to view large image
Layout region uc011aih.1, uc002zve.2 uc002zvf.2 (+ strand) (alternative mRNAs).
Click here to view large image
© 2023 www.softberry.com