Program for mapping tandem repeat regions in protein sequences.

TandemRep maps tandem repeats by searching for regions with uniform diplet composition. The search is initiated for regions flanked by short ideal repeated elements.

The tandem search algorithm consists of the following stages:
1) Find a pair of identical l-plets C_{1} and C_{2} such that the distance between C_{1} and C_{2} does not exceed a predefined N. The region between and including C_{1} and C_{2} is denoted R_{1}, with length L_{1}. If C_{1} and C_{2} overlap, the tandem unit size can be found trivially; jump to step 5.
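
Step 1 can be illustrated with a minimal Python sketch. The function names, the 0-based coordinates, the default values of `l` and `max_dist`, and the reading of "distance" as the difference between start positions are all assumptions made for illustration; they are not taken from the program itself. The sketch also assumes that the "trivial" unit size of two overlapping l-plets is simply their offset.

```python
from collections import defaultdict

def find_lplet_pairs(seq, l=3, max_dist=50):
    """Return sorted (i, j) start positions (0-based, i < j) of identical
    l-plets whose start positions are at most max_dist apart."""
    positions = defaultdict(list)
    for i in range(len(seq) - l + 1):
        positions[seq[i:i + l]].append(i)
    pairs = []
    for starts in positions.values():
        for a in range(len(starts)):
            for b in range(a + 1, len(starts)):
                if starts[b] - starts[a] > max_dist:
                    break  # starts are sorted, later ones are even farther
                pairs.append((starts[a], starts[b]))
    return sorted(pairs)

def trivial_unit(i, j, l):
    """If the two l-plets overlap (offset < l), return the offset as the
    unit size (assumed interpretation); otherwise return None."""
    return j - i if j - i < l else None
```

For example, in `"EKEKEK"` with `l=3` the l-plet `EKE` occurs at positions 0 and 2, the two copies overlap, and the offset 2 is the unit size.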

2) Assuming that the flanks of C_{1} and C_{2} do not contain insertions/deletions, extend C_{1} and C_{2} synchronously, allowing 1 mismatch per several matches. The extended C_{1} and C_{2} are denoted C_{3} and C_{4}. After this operation the region is denoted R_{2}, with length L_{2} (>= L_{1}). If the extension was performed without mismatches and C_{3} and C_{4} overlap, we have an ideal tandem whose unit size can again be found trivially; jump to step 5. If the extension was performed with mismatches and C_{3} and C_{4} overlap, we have an almost ideal tandem whose unit size can be found according to step 4 (jump to step 4). Proceed if C_{3} and C_{4} do not overlap.
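
The synchronous extension of step 2 might look roughly like the sketch below. The exact mismatch rule is not given in the text, so the sketch assumes that a mismatch is tolerated only when at least `min_run` matching positions separate it from the previous mismatch, and that a mismatch at the very end of an extension is not kept. The names `extend_pair`, `copies_overlap` and `min_run` are hypothetical.

```python
def extend_pair(seq, i, j, l, min_run=4):
    """Extend two identical l-plets starting at i and j (i < j, 0-based)
    in both directions without gaps.

    Returns (start, end, mismatches): the first copy is seq[start:end],
    the second copy is seq[start + d:end + d] with d = j - i.
    """
    d = j - i
    mismatches = 0

    def walk(pos, step):
        nonlocal mismatches
        last_good = pos          # last position confirmed as matching
        run, pending = l, 0      # the seed counts as l matches
        pos += step
        while pos >= 0 and pos + d < len(seq):
            if seq[pos] == seq[pos + d]:
                run += 1
                last_good = pos
                mismatches += pending   # a tolerated mismatch is now enclosed
                pending = 0
            elif pending == 0 and run >= min_run:
                run, pending = 0, 1     # tolerate one mismatch (assumed rule)
            else:
                break
            pos += step
        return last_good

    start = walk(i, -1)
    end = walk(i + l - 1, +1) + 1
    return start, end, mismatches

def copies_overlap(start, end, d):
    """True if the extended copies overlap, i.e. each is longer than the offset."""
    return end - start > d
```

With this convention, an extension that returns zero mismatches and overlapping copies corresponds to the "ideal tandem" case, where the offset `d` can serve as the unit size.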

3) Now region R_{2} looks as follows

C_{3}                                C_{4}
########-----------------------------########
| W_{1}| W_{2}| W_{3}| W_{4}| W_{...}| W_{n-1}| W_{n}|

For the region R_{2} perform the following test. Divide the region into a set of windows W_{1}, …, W_{n}, each of size U. Sequentially compare the mono- (or di-) plet composition of window W_{1} with that of each window W_{i}. If the difference in composition between W_{1} and some window W_{i} exceeds a predefined threshold, stop: the test is not passed; jump to step 1 to consider the next pair of l-plets. If the difference is low for all windows W_{2}, …, W_{n}, the test is passed and at least the fragment R_{2} can be declared a tandem region.

Since the window size at which the test described above can be passed is not known in advance, the test is performed for window sizes U = 2, …, L_{2}/2.

Remember the lowest U at which the test is passed. Denote it U_{1}.
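
The window composition test of step 3 can be sketched as follows. The text does not define the difference measure or the threshold, so the sketch assumes a normalized sum of absolute count differences (0 for identical compositions, 1 for disjoint ones), an illustrative threshold of 0.2, and that a trailing partial window is ignored. All function names are hypothetical.

```python
from collections import Counter

def composition(window, k=1):
    """Counts of k-plets in a window (k=1: mono-plets, k=2: di-plets)."""
    return Counter(window[p:p + k] for p in range(len(window) - k + 1))

def composition_diff(a, b, k=1):
    """Assumed difference measure: 0.0 for identical compositions,
    1.0 for completely disjoint ones."""
    ca, cb = composition(a, k), composition(b, k)
    total = sum(ca.values()) + sum(cb.values())
    diff = sum(abs(ca[x] - cb[x]) for x in set(ca) | set(cb))
    return diff / total if total else 0.0

def passes_window_test(region, u, k=1, threshold=0.2):
    """Compare W_1 with every other full window of size u."""
    windows = [region[p:p + u] for p in range(0, len(region) - u + 1, u)]
    return len(windows) >= 2 and all(
        composition_diff(windows[0], w, k) <= threshold for w in windows[1:]
    )

def lowest_passing_u(region, k=1, threshold=0.2):
    """Return U_1, the lowest window size in 2..L/2 that passes, or None."""
    for u in range(2, len(region) // 2 + 1):
        if passes_window_test(region, u, k, threshold):
            return u
    return None
```

For the region `EKPEKPEKPEKPEKPEKP` the test fails at U = 2 (windows `EK` and `PE` differ) and passes at U = 3, so U_{1} = 3.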

3a) Since uniform mono- (or di-) plet composition does not guarantee homology between windows W_{1} and W_{i}, at this step the identity calculated by the cycled Smith-Waterman algorithm is used for additional filtering. If this identity does not exceed a predefined threshold, the calculation is stopped for this C_{1} and C_{2} pair.
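
To give an idea of the identity filter in step 3a, the sketch below computes the identity of the best local alignment of two windows with a plain Smith-Waterman algorithm. This is a simplification: the cycled variant used by the program is not described here and is not reproduced. The scoring values and the definition of identity (identical columns divided by all alignment columns) are assumptions.

```python
def sw_identity(a, b, match=2, mismatch=-1, gap=-2):
    """Identity of the best local alignment of a and b
    (plain Smith-Waterman with linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            s = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            score[i][j] = s
            if s > best:
                best, best_pos = s, (i, j)
    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    identical = columns = 0
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            identical += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1          # gap in b
        else:
            j -= 1          # gap in a
        columns += 1
    return identical / columns if columns else 0.0
```

A window pair would then be kept only if `sw_identity(W_1, W_i)` exceeds the chosen threshold.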

4) Calculate the unit size U_{opt} of the tandem more precisely using two small windows sliding synchronously at a distance U from one another; U varies from U_{1} to L_{2}/2.
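
One possible reading of step 4 is sketched below: for each candidate U, two windows of width `w` placed U residues apart are slid along the region, the fraction of matching residues is accumulated, and the U with the highest fraction is taken as U_{opt}. The window width, the scoring, and the tie-breaking rule (the smallest U wins) are assumptions, as are the function names.

```python
def offset_identity(region, u, w=4):
    """Fraction of matching residues between windows at p and p + u,
    accumulated over all positions p where both windows fit."""
    matches = total = 0
    for p in range(0, len(region) - u - w + 1):
        a, b = region[p:p + w], region[p + u:p + u + w]
        matches += sum(x == y for x, y in zip(a, b))
        total += w
    return matches / total if total else 0.0

def best_unit_size(region, u_min, w=4):
    """Scan U = u_min .. L/2 and return (U_opt, score); ties keep the smallest U."""
    best_u, best_score = None, -1.0
    for u in range(u_min, len(region) // 2 + 1):
        s = offset_identity(region, u, w)
        if s > best_score:
            best_u, best_score = u, s
    return best_u, best_score
```

For an ideal repeat such as `EKPEKPEKPEKPEKPEKP`, the offsets 3, 6 and 9 all give a score of 1.0, and the smallest one, 3, is reported.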

5) Using the U_{opt} calculated at the previous step, find the precise margins of the tandem, again using two small synchronously sliding windows.
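
The margin refinement of step 5 could be sketched like this: a pair of windows of width `w`, one unit apart, is slid outward from a region known to lie inside the tandem for as long as the two windows agree. The acceptance rule (`min_identity`), the window width, the 0-based half-open coordinates and the function name are assumptions. With `min_identity` below 1.0 the reported margins may include a few mismatching residues.

```python
def refine_margins(seq, start, end, u, w=4, min_identity=1.0):
    """Grow the half-open interval [start, end) outward while a w-wide
    window matches the window u residues to its right."""
    if end - start < u + w:
        raise ValueError("initial region must hold at least one window pair")

    def identity(p):
        a, b = seq[p:p + w], seq[p + u:p + u + w]
        return sum(x == y for x, y in zip(a, b)) / w

    # Slide right: p is the start of the left window of the pair.
    p = end - u - w
    while p + 1 + u + w <= len(seq) and identity(p + 1) >= min_identity:
        p += 1
    new_end = p + u + w

    # Slide left.
    q = start
    while q - 1 >= 0 and identity(q - 1) >= min_identity:
        q -= 1
    return q, new_end
```

For `"HIGDL" + "EKP" * 6 + "AAQW"` with an initial region `[8, 17)` and `u = 3`, the margins grow to `[5, 23)`, which is exactly the `EKP` repeat.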

This procedure is carried out for all possible pairs C_{1} and C_{2} in the sequence. The final map of the tandems is the intersection of the tandems found for all l-plet pairs.

>EXAMPLE SEQ
Masked regions:
p1: 81 p2: 120 l: 40 chain(+) [Tandem Repeat]
p1: 191 p2: 208 l: 18 chain(+) [Tandem Repeat]

>EXAMPLE SEQ
ASFDPHEKQLIGDLWHKVDVAHCGGEALSRMLIVYPWKRRYFENFGDISNAQAIMHNEKVQAHGKKVLASFGEAVCHLDG
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXIRAHFANLSKLHCEKLHVDPENFKLLGDIIIIVLAAHYPK
DFGLECHAAYQKLVRQVAAALAAEYHIGDLXXXXXXXXXXXXXXXXXX

>EXAMPLE SEQ
asfdphekqligdlwhkvdvahcggealsrmlivypwkrryfenfgdisnaqaimhnekvqahgkkvlasfgeavchldg
EKEKEKEKEKEKEKEKEKEKEEKEKEKEKEKEKEKEKEKEirahfanlsklhceklhvdpenfkllgdiiiivlaahypk
dfglechaayqklvrqvaaalaaeyhigdlEKPEKPEKPEKPEKPEKP

>seq:1 beg:81 len:40
EK EK EK EK EK EK EK EK EK EK EE KE KE KE KE KE KE KE KE KE
>seq:1 beg:191 len:18
EKP EKP EKP EKP EKP EKP....
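
The relation between the reported coordinates and the two sequence renderings in the example can be illustrated with a short sketch. It assumes only what the example shows: p1 and p2 are 1-based, inclusive positions, masked residues are replaced by X, and in the second rendering repeat regions are upper case while the rest is lower case. The function names are hypothetical and are not part of the program.

```python
def mask_regions(seq, regions, mask_char="X"):
    """Replace residues p1..p2 (1-based, inclusive) of each region with mask_char."""
    out = list(seq)
    for p1, p2 in regions:
        out[p1 - 1:p2] = mask_char * len(out[p1 - 1:p2])
    return "".join(out)

def case_mark_regions(seq, regions):
    """Lower-case the whole sequence, upper-case residues inside the regions."""
    out = list(seq.lower())
    for p1, p2 in regions:
        out[p1 - 1:p2] = seq[p1 - 1:p2].upper()
    return "".join(out)
```

Applied to the example sequence, the regions would be passed as `[(81, 120), (191, 208)]`.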