TransSeq - Program for assembling transcripts from short reads (RNASeq data)
Our algorithm for contigs assembling comes to clustering of reads.
The algorithm begins with an empty cluster. At the initial step, the first cluster takes the first read from the pool of free reads (that initially contains all the reads). The consensus of the single-read cluster is identical to read sequence. Further, the next read of the pool is to be added to the cluster assuming the alignment of this read to consensus meets certain criteria. Once the second read (as well as all the following ones) is added to the cluster, the consensus is recalculated. All reads included into cluster are excluded from the free reads pool.
To speed up the search for similar sequences in the file with reads, the algorithm utilizes the L-plet (word of length L) method. The algorithm attempts to take only those reads that contain a certain (preliminary defined) number of L-plets identical to terminal fragment of the consensus (of a certain length).
In the case, when there is an information on reads pairness, the cluster is being assembled in two stages. At the first, the cluster is assembled without considering the information on pairs. As soon as contig reaches the length that guarantees the aligning of both reads of a pair, the algorithm starts to check whether reads are mapped to contig in allowed distance. That being the case, both reads of a pair are to be accepted to the cluster and excluded from the pool, and contig is lengthening. If the contig assembling is stopped prior to step that uses the information on pairs, the contig is to be disassembled and reads are to be returned to the pool.
The contig lengthening process continues till the free pool contains reads, aligning of which to consensus allows to include them into current cluster.
The consensus of cluster is output as contig.
The assembling of following clusters (as well as contigs) occurs in the same manner. Clusters are being assembled of the reads that were not used during the assemblage of previous clusters.
The assemblage process is stopped when the pool of free reads is exhausted.
In the case of RNA splicing variants assemblage, the clustering algorithm is featured by a minor difference. Clusters, as for regular algorithm variant, are being initialized with reads of the free pool (that were not used previously). Meanwhile, the reads used for lengthening of a certain contig, in contrast to regular algorithm, can be used further for lengthening of any other contig. Thus, the algorithm mostly guarantees the recovery of any exon, that is covered by a number of reads, as well as adjacent exons.
Program operates with the array of paired sequences that are organized into file in multifasta format.
To evaluate the accuracy we used the artificially created set of sequences.
Contigs obtained with use of TransSeq were aligned to original set of mRNAs. Alignment with homology not less than 0.98 and coverage not less than 90% of contig length was considered to be valid. If multiple alignments were obtained, the one with the best score was chosen.
The following attributes were evaluated:
Original number of mRNAs 1334
In following pictures the examples of original mRNAs layout (brown color) and results of mapping of obtained contigs to chromosome 22 with use of est_map program are shown.
|© 2020 www.softberry.com|