FGENESB Suite of Bacterial Operon and Gene Finding Programs
FGENESB is a package for automatic annotation of bacterial genomes
that includes the following features:
Automatic training of gene finding parameters for new bacterial genomes using
only genomic DNA as input (optionally, pre-learned parameters from a related
organism can be used)
Mapping of tRNA and rRNA genes
Highly accurate Markov chain-based gene prediction
Prediction of promoters and terminators
Operon prediction based on distances between ORFs and frequencies of different
genes neighboring each other in known bacterial genomes, as well as on promoter
and terminator predictions
Automatic annotation of predicted genes by homology with COG and NR databases.
The FGENESB gene prediction algorithm is based on Markov chain models
of coding regions and of translation initiation and termination sites. In the
comparison on B. subtilis shown in Table 1 at the bottom of this page, it was
more accurate than two other popular gene prediction programs. The package
includes options to work with sets of sequences, such as scaffolds of bacterial
genomes or short sequencing reads extracted from bacterial communities. For
community sequence annotation, we include the ABsplit program, which separates
archaebacterial and eubacterial sequences. FGENESB was used in the first
published bacterial community annotation project: see
Tyson et al. (2004) Nature 428(6978), 37-43.
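To illustrate what a Markov chain model of coding regions looks like, here is a minimal sketch of an in-frame (3-periodic) Markov chain trained on known coding sequences. It is not the FGENESB implementation: the function names, the pseudocount smoothing and the assumption that every sequence starts at codon position 0 are ours, chosen only for illustration.

```python
import math
from collections import defaultdict

BASES = "ACGT"

def train_inframe_markov(coding_seqs, order=5, pseudocount=1.0):
    """Estimate P(base | previous `order` bases, codon position).

    Assumption: every training sequence starts at codon position 0.
    Returns a function log_prob(context, base, phase).
    """
    # One table of context -> next-base counts per codon position (phase).
    counts = [defaultdict(lambda: defaultdict(float)) for _ in range(3)]
    for seq in coding_seqs:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            if base in BASES and all(c in BASES for c in context):
                counts[i % 3][context][base] += 1.0

    def log_prob(context, base, phase):
        row = counts[phase].get(context, {})
        total = sum(row.values()) + 4 * pseudocount
        # Pseudocounts keep unseen contexts at a uniform 1/4 probability.
        return math.log((row.get(base, 0.0) + pseudocount) / total)

    return log_prob

def score_orf(seq, log_prob, order=5):
    """Log-likelihood of `seq` under the model, assuming it starts in frame."""
    seq = seq.upper()
    return sum(
        log_prob(seq[i - order:i], seq[i], i % 3)
        for i in range(order, len(seq))
    )
```

A candidate ORF read in the correct frame should score higher than the same region read out of frame. A real gene finder would compare this score against a non-coding background model and combine it with start and stop site models.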
The final annotation can be presented in GenBank format so that it can be
read by visualization software such as Artemis or Softberry Bacterial
Genome Explorer.
Step-by-Step Description of FGENESB annotation.
Please note that the web version of FGENESB includes only the gene
prediction and greatly simplified operon prediction portions of the program,
represented by steps 3 and 4 below. The complete FGENESB package is available
only for local installation.
STEP 1. Finds all potential ribosomal RNA genes using BLAST
against bacterial and/or archaeal rRNA databases, and masks detected rRNA genes.
STEP 2. Predicts tRNA genes using tRNAscan-SE
program (Washington University) and masks detected tRNA genes.
STEP 3. Makes initial predictions of long ORFs that are used
as a starting point for calculating gene prediction parameters. Iterates
until the parameters stabilize. Generates parameters such as 5th-order in-frame
Markov chains for coding regions, 2nd-order Markov models for the regions around
the start codon, the upstream RBS site and the stop codon, and probability
distributions of ORF lengths.
STEP 4. Predicts operons based only on distances between
predicted genes.
STEP 5. Runs BLASTP for predicted proteins against COG
database, cog.pro.
STEP 6. Uses information about conservation of neighboring
gene pairs in known genomes to improve operon prediction.
STEP 7. Runs BLASTP against NR for proteins with no COG
hits.
STEP 8. Predicts potential promoters (BPROM
program) and terminators (BTERM) in the upstream and downstream regions,
respectively, of predicted genes. BTERM predicts bacterial rho-independent
terminators, with energy scoring based on a discriminant function of hairpin
elements.
STEP 9. Refines operon predictions using predicted promoters
and terminators as additional evidence.
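The distance-only operon prediction of Step 4 can be pictured with a simple rule: adjacent genes on the same strand are placed in one operon when the gap between them is small. The sketch below uses a fixed cutoff, which is a simplification of ours. The `Gene` type, the `group_by_distance` name and the `max_gap` value are hypothetical and do not reflect FGENESB's actual scoring or parameters.

```python
from typing import List, NamedTuple

class Gene(NamedTuple):
    name: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"

def group_by_distance(genes: List[Gene], max_gap: int = 100) -> List[List[Gene]]:
    """Group genes into candidate operons using only intergenic distance.

    max_gap is a hypothetical cutoff in base pairs, not an FGENESB setting.
    """
    operons: List[List[Gene]] = []
    for gene in sorted(genes, key=lambda g: g.start):
        if operons:
            prev = operons[-1][-1]
            gap = gene.start - prev.end - 1  # negative when genes overlap
            if gene.strand == prev.strand and gap <= max_gap:
                operons[-1].append(gene)
                continue
        # Strand change or large gap: start a new candidate operon.
        operons.append([gene])
    return operons
```

Later steps, as described above, would then revise such groups using conserved gene neighborhoods and predicted promoters and terminators.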
Example of FGENESB output:
1 1 Op 1 21/0.000 + CDS 407 - 1747 1311 ## COG0593 ATPase involved in DNA
+ Term 1786 - 1823 3.2
+ Prom 1847 - 1906 10.5
2 1 Op 2 3/0.019 + CDS 1926 - 3065 1237 ## COG0592 DNA polymerase
+ Term 3074 - 3122 9.1
+ Prom 3105 - 3164 4.0
3 2 Op 1 4/0.002 + CDS 3193 - 3405 278 ## COG2501 Uncharacterized ACR
4 2 Op 2 4/0.002 + CDS 3418 - 4545 899 ## COG1195 Recombinational DNA
5 2 Op 3 16/0.000 + CDS 4578 - 6506 2148 ## COG0187 DNA gyrase (topoisomerase II) B subunit
+ Term 6516 - 6551 4.7
+ Prom 6512 - 6571 2.3
6 2 Op 4 . + CDS 6595 - 9066 2957 ## COG0188 DNA gyrase (topoisomerase II) A subunit
+ Term 9067 - 9098 3.4
+ SSU_RRNA 9308 - 10861 100.0 # AY138279 [D:1..1554] # 16S ribosomal RNA # Bacillus cereus
+ TRNA 10992 - 11068 101.2 # Ile GAT 0 0
+ TRNA 11077 - 11152 93.9 # Ala TGC 0 0
+ LSU_RRNA 11233 - 14154 99.0 # AF267882 [D:1..2922] # 23S ribosomal RNA # Bacillus
7 3 Op 1 . - CDS 14175 - 14363 158
+ 5S_RRNA 14205 - 14315 97.0 # AE017026 [D:165635..165750] # 5S ribosomal RNA # Bacillus
8 3 Op 2 . - CDS 14353 - 15249 351 ## Similar_to_GB
9 3 Op 3 . - CDS 15170 - 15352 99
- Prom 15373 - 15432 6.9
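Output in this tabular form can be read with a short script. The sketch below is based only on the sample above: it assumes whitespace-separated columns, and the field names (`gene`, `operon`, `position`, `op_field`, `score`, `note`) are our own labels rather than official FGENESB column names. The `op_field` column (for example `21/0.000` or `.`) is kept as a raw string because its meaning is not described here.

```python
import re

# Patterns inferred from the sample output only; real output may differ.
CDS_RE = re.compile(
    r"^(?P<gene>\d+)\s+(?P<operon>\d+)\s+Op\s+(?P<position>\d+)\s+"
    r"(?P<op_field>\S+)\s+(?P<strand>[+-])\s+CDS\s+"
    r"(?P<start>\d+)\s+-\s+(?P<end>\d+)\s+(?P<score>\d+(?:\.\d+)?)"
    r"(?:\s+##\s*(?P<note>.*))?$"
)
FEATURE_RE = re.compile(
    r"^(?P<strand>[+-])\s+(?P<kind>\w+)\s+"
    r"(?P<start>\d+)\s+-\s+(?P<end>\d+)\s+(?P<score>\d+(?:\.\d+)?)"
    r"(?:\s+#\s*(?P<note>.*))?$"
)

def parse_line(line):
    """Parse one output line into a dict, or return None if unrecognized."""
    text = line.strip()
    match = CDS_RE.match(text)
    if match:
        record = match.groupdict()
        record["kind"] = "CDS"
        for key in ("gene", "operon", "position"):
            record[key] = int(record[key])
    else:
        # Non-CDS features: Term, Prom, TRNA, SSU_RRNA, LSU_RRNA, 5S_RRNA.
        match = FEATURE_RE.match(text)
        if not match:
            return None
        record = match.groupdict()
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    record["score"] = float(record["score"])
    record["note"] = record["note"] or ""
    return record
```

For example, `parse_line("+ Term 1786 - 1823 3.2")` yields a record with `kind` set to `"Term"`, while header or free-text lines yield `None` and can be skipped.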
Example of FGENESB output in GenBank format:
TAAGVIIRMPVDQISQMGRNTQGVRLIRLEDEQEVATVAKAQKDDEEETSEEVSSEE"
/transl_table=11
terminator 9067..9098
/gene="GyrA"
gene 9308..10861
/gene="AY138279 [D:1..1554]"
rRNA 9308..10861
/gene="AY138279 [D:1..1554]"
/product="16S ribosomal RNA"
/note="AY138279 [D:1..1554]"
gene 10992..11068
/gene="Ile GAT"
tRNA 10992..11068
/gene="Ile GAT"
/product="tRNA-Val"
/note="Ile GAT 0 0"
gene 11077..11152
/gene="Ala TGC"
tRNA 11077..11152
/gene="Ala TGC"
/product="tRNA-Val"
/note="Ala TGC 0 0"
gene 11233..14154
/gene="AF267882 [D:1..2922]"
rRNA 11233..14154
/gene="AF267882 [D:1..2922]"
/product="23S ribosomal RNA"
Table 1. Accuracy of prediction estimated on the B. subtilis sequence:
Frequency of genes starting from a start codon other than the first: 19.1%
Borodovsky et al.
(see the GeneMark web pages) calculated accuracy for all genes and
constructed three sets of difficult short genes (L <= 300 bp) that have protein
similarity support. These genes were used to demonstrate that short genes can
also be predicted reasonably well. The first set (51set) has 51 genes with at
least 10 strong similarities to known proteins. The second (72set) has 72 genes
with at least two strong similarities, and the third (123set) has 123 genes with
at least one protein homolog.
Here are the prediction results on these three sets for GeneMarkS
and Glimmer (calculated in Nucleic Acids Research, 2001, Vol. 29, No. 12, 2607-2618)
and FgenesB (calculated by Softberry, three iterations of the fgenesB-train script):
               Sn, exact          Sn, exact+overlapping
               predictions (%)    predictions (%)
123set:
Glimmer        57.0               91.1
GeneMarkS      82.9               91.9
FgenesB        89.3               98.4
72set:
Glimmer        57.0               91.7
GeneMarkS      88.9               94.4
FgenesB        91.5               98.6
51set:
Glimmer        51.0               88.2
GeneMarkS      90.2               94.1
FgenesB        92.0               98.0
All genes of B. subtilis genome (GenBank annotation):
Glimmer        62.4               98.1
GeneMarkS      83.2               96.7
FgenesB        83.8               98.7
Please note that many genes in GenBank were annotated using the GeneMark
program, which should result in overestimation of its accuracy.
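For readers who want to compute comparable figures on their own data, the two sensitivity (Sn) columns can be sketched as follows. The definitions are our assumptions, since the table does not spell them out: a prediction is "exact" when start, end and strand all match the annotated gene, and "overlapping" when it shares the strand and at least one base with the annotated gene without matching exactly. The function name and the tuple layout are hypothetical.

```python
def sensitivity(annotated, predicted):
    """Return (exact Sn, exact+overlapping Sn) as percentages.

    Genes are (start, end, strand) tuples with inclusive coordinates.
    The match definitions are assumptions, not taken from the cited paper.
    """
    if not annotated:
        raise ValueError("annotated gene list is empty")
    predicted_set = set(predicted)
    exact = overlapping = 0
    for start, end, strand in annotated:
        if (start, end, strand) in predicted_set:
            exact += 1
        elif any(
            p_strand == strand and p_start <= end and start <= p_end
            for p_start, p_end, p_strand in predicted
        ):
            overlapping += 1
    total = len(annotated)
    return 100.0 * exact / total, 100.0 * (exact + overlapping) / total
```

With four annotated genes of which one is predicted exactly and one only partially, this returns 25.0 and 50.0.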