Short description of the computation of nucleotide frequency matrices for
various promoter elements

To obtain a set of unrelated promoters, the region [-50:+1] of 586 plant
promoters (including 305 entries from the first release of the DB) was compared
pairwise, and one promoter from each pair showing more than 90% homology was
excluded from the initial collection. As a result, 10 promoters were removed,
leaving 576 unrelated promoter sequences.
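
The filtering step can be sketched as follows. This is only an illustration: the text does not say how homology was measured, so the sketch assumes ungapped percent identity over equal-length regions and a greedy order of exclusion. The names `percent_identity` and `filter_unrelated` are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Ungapped identity between two equal-length sequences, in percent.

    Assumption: 'homology' is approximated by position-wise identity.
    """
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def filter_unrelated(regions: dict, threshold: float = 90.0) -> list:
    """Greedily keep promoters that are not more than `threshold` percent
    identical to any promoter already kept.

    `regions` maps a promoter name to its [-50:+1] sequence.
    """
    kept = []
    for name, seq in regions.items():
        if all(percent_identity(seq, regions[k]) <= threshold for k in kept):
            kept.append(name)
    return kept
```

With this greedy scheme, which member of a similar pair survives depends on input order; the original procedure may have used a different rule.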

In a simple implementation of the Expectation Maximization (EM) algorithm, we
considered a motif sequence X=(x_{1},x_{2},...,x_{l}), where l is the motif
length. If P^{i}(x_{i}) is the empirical frequency of the nucleotide x_{i} in
position i (computed on the previous iteration), then the weight of this motif
is computed as

W(X) = log ∏_{i=1}^{l} [ P^{i}(x_{i}) / 0.25 ]
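
The weight can be read as a sum of per-position log-odds against a uniform background of 0.25. A minimal sketch, assuming the natural logarithm (the base is not stated in the text) and an aligned set of equal-length motifs from the previous iteration; `frequency_matrix`, `motif_weight` and the `pseudocount` parameter are hypothetical names introduced for illustration:

```python
import math
from collections import Counter

BACKGROUND = 0.25  # uniform nucleotide background used in the weight formula
ALPHABET = "ACGT"


def frequency_matrix(motifs, pseudocount=0.0):
    """Per-position nucleotide frequencies from equal-length aligned motifs.

    `pseudocount` is an assumed smoothing option, not part of the source text;
    with the default of 0.0 the result is the plain empirical frequency.
    """
    if not motifs:
        raise ValueError("need at least one motif")
    length = len(motifs[0])
    if any(len(m) != length for m in motifs):
        raise ValueError("all motifs must have the same length")
    total = len(motifs) + 4 * pseudocount
    matrix = []
    for i in range(length):
        counts = Counter(m[i] for m in motifs)
        matrix.append({nt: (counts[nt] + pseudocount) / total for nt in ALPHABET})
    return matrix


def motif_weight(motif, matrix):
    """W(X) as the sum over positions i of log(P^i(x_i) / 0.25)."""
    if len(motif) != len(matrix):
        raise ValueError("motif and matrix lengths differ")
    return sum(math.log(matrix[i][nt] / BACKGROUND) for i, nt in enumerate(motif))
```

A nucleotide with zero frequency at some position makes `math.log` raise `ValueError`; a positive `pseudocount` is one way to avoid that, though the text does not say how the authors handled this case.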

Using the EM procedure for 10 iterations, the initial collection of 576
unrelated promoters was divided into two classes: 345 TATA and 231 TATA-less
promoters. In calculating the TATA matrices, the distance between the right
boundary of the TATA-core box and the TSS was allowed to vary within 18-40 bp,
and only the **TATAWAWA** core was used for calculating the weight. The TATA
matrix computed for 171 plant promoters from the first release of PlantProm DB
(http://mendel.cs.rhul.ac.uk/) was used as the initial TATA-box matrix.
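
The distance constraint can be illustrated by scanning only those windows whose right boundary lies 18-40 bp upstream of the TSS and keeping the best-scoring one. This is a sketch under assumptions: `best_window` is a hypothetical helper, `tss` is a 0-based index of the TSS nucleotide, the distance is counted as the index difference between the TSS and the last base of the window, and sequences contain only A, C, G and T. The rule used to assign a promoter to the TATA or TATA-less class is not given in the text and is not modelled here.

```python
import math


def best_window(seq, matrix, tss, min_dist=18, max_dist=40):
    """Return (weight, start) of the best-scoring window whose last base lies
    min_dist..max_dist positions upstream of `tss`, or None if none fits.

    `matrix` is a list of per-position dicts mapping nucleotide to frequency.
    """
    width = len(matrix)
    best = None
    for dist in range(min_dist, max_dist + 1):
        end = tss - dist           # index of the last base of the window
        start = end - width + 1    # index of the first base of the window
        if start < 0:
            continue
        window = seq[start:end + 1]
        # Log-odds weight against a uniform 0.25 background.
        score = sum(math.log(matrix[i][nt] / 0.25) for i, nt in enumerate(window))
        if best is None or score > best[0]:
            best = (score, start)
    return best
```

The same function applies to any matrix width, so an 8-position matrix for the TATAWAWA core would be passed in the same way as the toy matrices one might use for testing.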

For the computation of the CCAAT-box matrix, we allowed the distance between
the right boundary of the CCAAT core and the TSS to vary within 50-100 bp. The
CCAAT matrix computed for 131 plant promoters from the first release of
PlantProm DB (http://mendel.cs.rhul.ac.uk/) was used as the initial CCAAT-box
matrix; in accordance with the available literature data, CCAAT boxes were
identified on both DNA strands.
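
Searching both strands amounts to scoring each candidate window and its reverse complement against the same matrix. A minimal sketch, assuming sequences over A, C, G and T only; `reverse_complement`, `log_odds` and `best_strand` are hypothetical names, and how the 50-100 bp distance is counted for a reverse-strand match is not specified in the text and is left out:

```python
import math

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(_COMPLEMENT)[::-1]


def log_odds(window, matrix):
    """Sum of per-position log-odds against a uniform 0.25 background."""
    return sum(math.log(matrix[i][nt] / 0.25) for i, nt in enumerate(window))


def best_strand(window, matrix):
    """Score a window in both orientations; return (strand, weight) of the better one."""
    forward = log_odds(window, matrix)
    reverse = log_odds(reverse_complement(window), matrix)
    return ("+", forward) if forward >= reverse else ("-", reverse)
```

For example, a window reading ATTGG on the given strand corresponds to CCAAT on the opposite strand, so it would be reported with strand `-` by a matrix whose consensus is CCAAT.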

A TSS-motif matrix 5 bp in length was computed, in which the 3^{rd} nucleotide
was the annotated TSS (anTSS). No strong consensus was revealed. When the EM
approach was used to analyze all possible pentanucleotides with an assumed TSS
(asTSS) located in the range [anTSS-2:anTSS+2], the composition of asTSS motifs
was found to differ between dicot and monocot plants, as well as between TATA
and TATA-less promoters.
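
The candidate pentanucleotides for one promoter can be enumerated as below. This is an illustrative sketch: `candidate_tss_pentamers` is a hypothetical helper, `an_tss` is taken to be a 0-based index of the annotated TSS, and each pentanucleotide is centred on its assumed TSS so that the TSS is the 3rd nucleotide.

```python
def candidate_tss_pentamers(seq, an_tss):
    """List (as_tss, pentamer) pairs for assumed TSS positions an_tss-2..an_tss+2.

    Candidates whose pentamer would run past either end of `seq` are skipped.
    """
    candidates = []
    for shift in range(-2, 3):
        as_tss = an_tss + shift
        start = as_tss - 2          # the assumed TSS is the 3rd base of the pentamer
        end = as_tss + 3
        if start < 0 or end > len(seq):
            continue
        candidates.append((as_tss, seq[start:end]))
    return candidates
```

In an EM setting, each of the up to five candidates would be weighted with the current TSS matrix and the best one taken as the asTSS for the next iteration; that selection rule is an assumption, since the text does not spell it out.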

To search for statistically significant motifs of 1577 known plant regulatory
elements, the nsite program (http://linux1.softberry.com/berry.phtml) was
applied.