Prot_map is a new, fast tool for aligning proteins to a genome and accurately reconstructing exon-intron gene structure
Prot_map maps a set of protein sequences onto a genomic sequence, producing gene structures and
the corresponding alignments of coding exons with similar or identical protein queries. Prot_map takes a genomic
sequence and a set of protein sequences as input. It reconstructs the gene structure on the
basis of an identical or similar protein, rather than producing a set of unordered alignment fragments as the Blast program does.
The program is very fast, and it produces gene structures with accuracy similar to that of the slow GeneWise program
(which in practice requires knowing the genomic location of the protein) (Table 1).
You can significantly improve the accuracy of gene reconstruction further by running the Fgenesh+ program on
the results of Prot_map (i.e. a fragment of genomic sequence and the protein sequence mapped onto it) (Table 2).
Prot_map is used (1) in the Fgenesh++ pipeline for automatic annotation of new genomic sequences and (2) to generate a set of genes in new genomes (without known genes) for training the parameters of gene-finding programs. (3) It is also very useful for finding pseudogenes, by selecting the corrupted gene structures that result from mapping a set of known proteins.
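The hand-off to Fgenesh+ described above needs a genomic fragment around the mapped protein. A minimal Python sketch of cutting such a fragment follows. It assumes 1-based inclusive coordinates, as the alignment bounds in the Prot_map report appear to use; the function name `cut_fragment` and the `flank` margin are hypothetical and are not part of Prot_map or Fgenesh+.

```python
def cut_fragment(genome: str, start: int, end: int, flank: int = 0):
    """Return (fragment, fragment_start) for a mapped region.

    start and end are 1-based inclusive genomic coordinates of the mapped
    region. flank is an illustrative margin added on both sides and clipped
    to the sequence ends. fragment_start is the 1-based genomic coordinate
    of the first base of the returned fragment.
    """
    if not 1 <= start <= end <= len(genome):
        raise ValueError("bounds must satisfy 1 <= start <= end <= len(genome)")
    frag_start = max(1, start - flank)
    frag_end = min(len(genome), end + flank)
    # Convert the 1-based inclusive range to a 0-based half-open slice.
    return genome[frag_start - 1:frag_end], frag_start


if __name__ == "__main__":
    toy_genome = "ACGT" * 5
    fragment, offset = cut_fragment(toy_genome, start=5, end=8, flank=2)
    print(fragment, offset)
```

Returning the fragment start along with the fragment makes it possible to translate coordinates predicted on the fragment back to the original sequence.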
Figure 1. Example of mapping a protein sequence onto human chromosome 19.
L:3000000 Sequence Chr19 [cut:1 3000000]
[DD] Sequence: 1( 1), S: 105.56, L:1739
IPI:IPI00170643.1|SWISS-PROT:Q8TEK3-1 Tax_Id=9606 Splice isoform 2 of Q8TEK3
Summ of block lengths: 1284, Alignment bounds:
On first sequence: start 2146727, end 2167197, length 20471
On second sequence: start 263, end 1682, length 1420
Blocks of alignment: 21
1 E: 2146727 70 [ca GT] P: 2146727 263 L: 23, G: 101.574 S:14.75
2 E: 2147573 107 [AG GT] P: 2147575 287 L: 35, G: 103.465, S:18.56
3 E: 2148934 42 [AG GT] P: 2148934 322 L: 14, G: 103.043, S:11.68
4 E: 2150399 111 [AG GT] P: 2150399 336 L: 37, G: 102.130, S:18.82
5 E: 2150620 235 [AG GT] P: 2150620 373 L: 78, G: 101.500, S:27.15
6 E: 2151098 114 [AG GT] P: 2151100 452 L: 37, G: 106.924, S:19.76
7 E: 2151750 92 [AG GT] P: 2151752 490 L: 30, G: 101.424, S:16.82
8 E: 2153538 102 [AG GT] P: 2153538 520 L: 34, G: 100.496, S:17.73
9 E: 2153848 138 [AG GT] P: 2153848 554 L: 46, G: 99.003, S:20.30
10 E: 2154470 126 [AG GT] P: 2154470 600 L: 42, G: 101.283, S:19.87
……………………………………………………………………………………………………………………………………………………………………………………
1 11 2146713 2146723 2146739 2146769
gatcacagaggctgg(..)agtgtctgtgtttca?[GGRIVSSKPFAPLNFRINSRNLSg
---------------(..)evdhqlkerfanmke GGRIVSSKPFAPLNFRINSRNLS-
248 248 249 259 267 277
2146797 2146806 2147558 2147568 2147581 2147611
]gtaagaaactctcat(..)ctgtggctcctgcag[acIGTIMRVVELSPLKGSVSWTGK
---------------(..)--------------- -dIGTIMRVVELSPLKGSVSWTGK
286 286 286 286 289 299
2147641 2147671 2147686 2148919 2148926 2148937
PVSYYLHTIDRTI]gtgagtatctcgctg(..)ctttcttctttttag[LENYFSSLKNP
PVSYYLHTIDRTI ---------------(..)--------------- LENYFSSLKNP
309 319 322 322 322 323
2148967 2148982 2150384 2150391 2150402 2150432
KLR]gtaagtttgtgtgtt(..)ctgctctccttccag[EEQEAARRRQQRESKSNAATP
KLR ---------------(..)--------------- EEQEAARRRQQRESKSNAATP
333 336 336 336 337 347
2150462 2150492 2150513 2150523 2150609 2150619
TKGPEGKVAGPADAPM]gtaaggccccagcct(..)ccttgtgtcctccag[DSGAEEEK
TKGPEGKVAGPADAPM ---------------(..)--------------- DSGAEEEK
357 367 373 373 373 373
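The block lines in Figure 1 can be read programmatically. The Python sketch below is one way to do it, based on an interpretation inferred from the figure rather than from a format specification: `E:` is taken as the genomic start and nucleotide length of the exon, the bracketed pair as the flanking splice-site dinucleotides, `P:` as the genomic and protein start of the aligned block, `L:` as the number of aligned residues, and `S:` as a score. The meaning of `G:` is not stated in the text, so it is kept as an uninterpreted number. The names `Block` and `parse_block` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Pattern for one block line; the comma between G and S is optional
# because the first line of the figure omits it.
BLOCK_RE = re.compile(
    r"^\s*(\d+)\s+E:\s*(\d+)\s+(\d+)\s+\[(\S+)\s+(\S+)\]\s+"
    r"P:\s*(\d+)\s+(\d+)\s+L:\s*(\d+),\s*G:\s*([\d.]+),?\s*S:\s*([\d.]+)"
)


class Block(NamedTuple):
    index: int
    exon_start: int        # assumed: 1-based genomic start of the exon
    exon_length: int       # assumed: exon length in nucleotides
    left_site: str         # dinucleotide before the exon, e.g. "AG"
    right_site: str        # dinucleotide after the exon, e.g. "GT"
    align_start: int       # assumed: genomic start of the aligned block
    protein_start: int     # assumed: protein position where the block starts
    residues: int          # assumed: number of aligned amino acids
    g_value: float         # meaning not stated in the source text
    score: float

    @property
    def exon_end(self) -> int:
        # Inclusive end under the 1-based assumption above.
        return self.exon_start + self.exon_length - 1


def parse_block(line: str) -> Optional[Block]:
    """Parse one block line; return None if the line does not match."""
    m = BLOCK_RE.match(line)
    if m is None:
        return None
    g = m.groups()
    return Block(int(g[0]), int(g[1]), int(g[2]), g[3], g[4],
                 int(g[5]), int(g[6]), int(g[7]), float(g[8]), float(g[9]))


if __name__ == "__main__":
    line = "2 E: 2147573 107 [AG GT] P: 2147575 287 L: 35, G: 103.465, S:18.56"
    print(parse_block(line))
```

Under this reading the protein start of each block equals the previous block's protein start plus its residue count (263 + 23 + 1 = 287 for blocks 1 and 2, then 287 + 35 = 322 for block 3), which is consistent with the figure.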
Table 1. Speed of processing sequences by Prot_map, Fgenesh+ and GeneWise.
Sequence set | Fgenesh+ | Prot_map | GeneWise |
88 sequences of genes < 20 kb | ~1 min | ~1 min | ~90 min |
8 sequences of genes > 400000 kb | ~1 min | ~1 min | ~1200 min |
Table 2. Comparison of the accuracy of gene identification programs: ab initio prediction with Fgenesh, and prediction with protein support by Fgenesh+, GeneWise and Prot_map, on a set of human genes using mouse or Drosophila homologous proteins.
%CG (correct genes) is the percentage of genes predicted exactly.
Mouse homologs: 60% < similarity level < 80% - 1425 sequences
Program | Sn ex | Sno ex | Sp ex | Sn nuc | Sp nuc | CC | %CG |
Fgenesh | 83.4 | 90.9 | 86.8 | 93.2 | 94.9 | 0.937 | 30 |
GeneWise | 88.1 | 96.5 | 90.5 | 97.8 | 99.2 | 0.984 | 43 |
Fgenesh+ | 93.9 | 97.9 | 94.9 | 98.4 | 99.3 | 0.988 | 65 |
Prot_map | 87.0 | 96.5 | 86.6 | 97.0 | 98.5 | 0.976 | 40 |
Drosophila homologs: similarity level > 80% - 66 sequences.
Program | Sn ex | Sno ex | Sp ex | Sn nuc | Sp nuc | CC | %CG |
Fgenesh | 90.5 | 93.8 | 95.1 | 97.9 | 96.9 | 0.950 | 55 |
GeneWise | 79.3 | 83.9 | 86.8 | 97.3 | 99.5 | 0.985 | 23 |
Fgenesh+ | 95.1 | 97.8 | 97.0 | 98.9 | 99.5 | 0.9914 | 70 |
Prot_map | 86.4 | 95.3 | 88.1 | 97.6 | 99.0 | 0.982 | 41 |
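The page does not define the column measures in Table 2. Assuming they follow the definitions commonly used in gene-prediction benchmarks (sensitivity = TP / (TP + FN), specificity = TP / (TP + FP), and CC as the correlation coefficient over coding and non-coding nucleotides), the Python sketch below shows how exon-level and nucleotide-level values could be computed for one sequence. The function names are hypothetical, exons are 1-based inclusive `(start, end)` pairs, and "Sno ex" is left out because its definition is not given here.

```python
from math import sqrt


def exon_level(true_exons, pred_exons):
    """Exon-level (Sn, Sp): an exon counts only if both ends match exactly."""
    true_set, pred_set = set(true_exons), set(pred_exons)
    exact = len(true_set & pred_set)
    return exact / len(true_set), exact / len(pred_set)


def nucleotide_level(true_exons, pred_exons, seq_length):
    """Nucleotide-level (Sn, Sp, CC) over a sequence of seq_length bases."""
    def covered(exons):
        # Set of all coding positions, treating ends as inclusive.
        return {pos for start, end in exons for pos in range(start, end + 1)}

    true_pos, pred_pos = covered(true_exons), covered(pred_exons)
    tp = len(true_pos & pred_pos)
    fp = len(pred_pos - true_pos)
    fn = len(true_pos - pred_pos)
    tn = seq_length - tp - fp - fn
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    cc = (tp * tn - fp * fn) / sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)
    )
    return sn, sp, cc


if __name__ == "__main__":
    annotated = [(1, 10), (21, 30)]
    predicted = [(1, 10), (21, 28)]   # second exon ends two bases early
    print(exon_level(annotated, predicted))
    print(nucleotide_level(annotated, predicted, seq_length=100))
```

The toy example illustrates why exon-level values in Table 2 are lower than nucleotide-level ones: a prediction that misses an exon boundary by a few bases loses the whole exon at the exon level but only a few positions at the nucleotide level.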