Analyze RNASeq Next Generation Sequencing Data:
ReadsMap - Mapping sequence reads (spliced or unspliced) to chromosome (contigs)
ReadsMap is a fast short read aligner that quickly maps/aligns large sets of short DNA sequences. Multiple processors can be used optionally to achieve greater alignment speed. On initial stage we map “exonic” reads that demonstrate high-quality, non-interrupted alignment to a genomic sequence. Potentially, this step would map most of the reads to a genome, and the remaining “non-mapped” group would be small enough to be subjected to more thorough analysis. At the second step, we use a modified variant of our EST_MAP program to align these “non-mapped” reads using splice site matrices and producing very accurate alignment with gaps. This reads will indentify potential exon-intron boundaries.
Accuracy of reads mapping was estimated on a set of read pairs (each read is 76 base long and generated with 2% of random mutations) from 1356 mRNA sequences of human chromosome 22. We computed performance of ReadsMap and popular Bowtie read mapping program (Genome Biology, 2009, 10:R25) in aligning 1963694 of such reads to chromosome 22 sequence.
Extra alignments affecting specificity of mapping are often produced due to occurrence of alternatively spliced variant of mRNA.
On the same data we have estimated Bowtie (v2) performance. As the Bowtie gap penalty is proportional to gap size it does not produce full alignment across introns that are thousands nucleotides long. Therefore we made comparison by accounting the accuracy of alignment of the most long alignment fragment.
We can conclude that the current version of Bowtie is not suitable for mapping spliced reads from RNASeq data having Sn=0.38 compared with ReadsMap Sn=0.99 (Table2).
In Fig 1 a typical example of a spliced read alignment is presented. While the longest fragment of alignment is located by Bowtie in the right place, the short fragment is aligned incorrectly.
Fig 1. ReadsMap alignment:
nnnnnnnnnnnnnnnn(..)gtcagaaagtaactggCAAATT]ctatgtataaaattgt(..)taatgtaaacacttac[ ................(..)................|||||| ................(..)................ ----------------(..)----------------CAAATT ----------------(..)---------------- 1 1 1 1 5 7 7 7 16277748 16277758 16277768 16277778 16277788 16277798 16277808 16277818 ACATTATGACGACTAGAAACAGCATACTCTCTGGCCGTCTGTCCAGATAGATCTTGAGAAGATACATCAAtgttttgctc ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||.......... ACATTATGACGACTAGAAACAGCATACTCTCTGGCCGTCTGTCCAGATAGATCTTGAGAAGATACATCAA----------
1 11 16277729 16277739 16277747 16277752 16277762 16277772 nnnnnnnnnnnnnnnn(..)ttttaatgtaaacact?[TACA]---[CATTATGACGACTAGAAACAGCATACTCTCTGG ................(..)................ 0|0| ... ||||||||||||||||||||||||||||||||| ----------------(..)---------------- CAAA tta CATTATGACGACTAGAAACAGCATACTCTCTGG 1 1 1 1 3 11 21 31 16277782 16277792 16277802 16277812 16277820 16277830 51304553 51304563 CCGTCTGTCCAGATAGATCTTGAGAAGATACATCAA]?tgttttgctcaagtag(..)nnnnnnnnnnnnnnnn |||||||||||||||||||||||||||||||||||| ................(..)................ CCGTCTGTCCAGATAGATCTTGAGAAGATACATCAA ----------------(..)---------------- 41 51 61 71 77 77 77 77
ReadsMap can exactly locate even very short alignment fragments separated by long introns as it uses information from alignments of all set of reads.
TransSeq - Program for assembling transcripts from short reads (RNASeq data)
Our algorithm for contigs assembling comes to clustering of reads.
The algorithm begins with an empty cluster. At the initial step, the first cluster takes the first read from the pool of free reads (that initially contains all the reads). The consensus of the single-read cluster is identical to read sequence. Further, the next read of the pool is to be added to the cluster assuming the alignment of this read to consensus meets certain criteria. Once the second read (as well as all the following ones) is added to the cluster, the consensus is recalculated. All reads included into cluster are excluded from the free reads pool.
To speed up the search for similar sequences in the file with reads, the algorithm utilizes the L-plet (word of length L) method. The algorithm attempts to take only those reads that contain a certain (preliminary defined) number of L-plets identical to terminal fragment of the consensus (of a certain length).
In the case, when there is an information on reads pairness, the cluster is being assembled in two stages. At the first, the cluster is assembled without considering the information on pairs. As soon as contig reaches the length that guarantees the aligning of both reads of a pair, the algorithm starts to check whether reads are mapped to contig in allowed distance. That being the case, both reads of a pair are to be accepted to the cluster and excluded from the pool, and contig is lengthening. If the contig assembling is stopped prior to step that uses the information on pairs, the contig is to be disassembled and reads are to be returned to the pool.
The contig lengthening process continues till the free pool contains reads, aligning of which to consensus allows to include them into current cluster.
The consensus of cluster is output as contig.
The assembling of following clusters (as well as contigs) occurs in the same manner. Clusters are being assembled of the reads that were not used during the assemblage of previous clusters.
The assemblage process is stopped when the pool of free reads is exhausted.
In the case of RNA splicing variants assemblage, the clustering algorithm is featured by a minor difference. Clusters, as for regular algorithm variant, are being initialized with reads of the free pool (that were not used previously). Meanwhile, the reads used for lengthening of a certain contig, in contrast to regular algorithm, can be used further for lengthening of any other contig. Thus, the algorithm mostly guarantees the recovery of any exon, that is covered by a number of reads, as well as adjacent exons.
Program operates with the array of paired sequences that are organized into file in multifasta format.
To evaluate the accuracy we used the artificially created set of sequences.
Contigs obtained with use of TransSeq were aligned to original set of mRNAs. Alignment with homology not less than 0.98 and coverage not less than 90% of contig length was considered to be valid. If multiple alignments were obtained, the one with the best score was chosen.
The following attributes were evaluated:
Original number of mRNAs 1334
In following pictures the examples of original mRNAs layout (brown color) and results of mapping of obtained contigs to chromosome 22 with use of est_map program are shown.
|© 2020 www.softberry.com|