PROSITE Patterns

Protein patterns are short motifs, written as regular expressions, that describe a specific amino acid sequence. A motif can allow one or several amino acids at a given position, and can include a fixed or a variable number of unconstrained positions.

If a motif is specific enough, it can be used to search for new members of a protein family. The largest collection of motifs is available from the PROSITE database. In PROSITE, every motif is described by an expression that indicates which amino acids are allowed at each position of the motif. An example is shown below.

DE Signal peptidases I Serine active site (PS00501).

PA [GS]-x-S-M-x-P-[AT]-[LF]

NR   /TOTAL=34(34); /POSITIVE=19(19); /UNKNOWN=0(0); /FALSE_POS=15(15); /FALSE_NEG=0

The motif consists of a glycine (G) or a serine (S), followed by any amino acid (x), a serine (S), a methionine (M), another arbitrary amino acid, a proline (P), an alanine (A) or threonine (T), and a leucine (L) or phenylalanine (F). In a recent version of SwissProt there are 19 known "Signal peptidases I serine active site" proteins and 15 other proteins that contain this motif, so 15 of the 34 hits are false positives. PROSITE contains many hundreds of different patterns.
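Translating a PROSITE pattern into a conventional regular expression is largely mechanical. The following is a minimal sketch in Python, not an official PROSITE tool: the function name prosite_to_regex and the test sequence are assumptions made for illustration, and only the common syntax elements are handled (x, [..], {..}, repeat counts, and the < and > anchors outside brackets).

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE pattern (PA line) into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")  # PA lines may end with a period
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        # '<' anchors the pattern at the N-terminus, '>' at the C-terminus.
        if element.startswith("<"):
            prefix, element = "^", element[1:]
        if element.endswith(">"):
            suffix, element = "$", element[:-1]
        # Repetition: x(3) means exactly 3, x(2,4) means 2 to 4.
        repeat = ""
        m = re.fullmatch(r"(.+?)\((\d+)(?:,(\d+))?\)", element)
        if m:
            element = m.group(1)
            upper = "," + m.group(3) if m.group(3) else ""
            repeat = "{" + m.group(2) + upper + "}"
        if element == "x":
            core = "."                          # any amino acid
        elif element.startswith("{") and element.endswith("}"):
            core = "[^" + element[1:-1] + "]"   # forbidden residues
        else:
            core = element                      # single letter or [..] class
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

# Hypothetical test sequence, invented for this example.
regex = prosite_to_regex("[GS]-x-S-M-x-P-[AT]-[LF]")
match = re.search(regex, "MKGLSMAPALQ")
print(regex, match.start(), match.group())
# prints: [GS].SM.P[AT][LF] 2 GLSMAPAL
```

The reported position is 0-based, so the motif starts at the third residue of the test sequence.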

The PROSITE Database

In the PROSITE database, a great number of patterns have been defined and used to identify related proteins. The main problem with this method is that it can be difficult to construct a pattern that is unique to the protein family being searched for. A search of all protein sequences in the SwissProt database for the CXXCH pattern of cytochrome c gives 416 hits that are clearly the right type of protein, but also 441 hits to proteins that are not cytochromes. These false positives are unrelated proteins that do not bind heme groups and have a different conformation.
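To make the numbers above concrete, here is a short Python sketch that scans a sequence for CXXCH and computes the fraction of hits that are true positives. The helper names and the test sequence are assumptions for illustration; the counts 416 and 441 are the ones quoted in the text.

```python
import re

# CXXCH as a regular expression. The lookahead lets overlapping hits be reported.
CXXCH = re.compile(r"(?=(C..CH))")

def find_hits(sequence):
    """Return (0-based start, matched text) for every CXXCH occurrence."""
    return [(m.start(), m.group(1)) for m in CXXCH.finditer(sequence)]

def precision(true_pos, false_pos):
    """Fraction of all hits that are true positives."""
    return true_pos / (true_pos + false_pos)

# Illustrative test sequence (an assumption, not taken from the text).
hits = find_hits("MGDVEKGKKIFVQKCAQCHTVEK")
print(hits)

# Counts quoted above: 416 cytochromes and 441 unrelated proteins.
print(round(precision(416, 441), 3))
```

With the quoted counts, fewer than half of the CXXCH hits are real cytochromes, which is why such a short pattern is a poor search tool on its own.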

Another well-known pattern is found in nucleotide-binding so-called P-loops with the sequence (AG)XXXXGK(ST), where the letters within parentheses indicate the observed alternatives at a certain position (written with square brackets in PROSITE notation). This pattern is found in a large number of proteins that share the property of binding ATP or GTP, but also in many unrelated proteins. Therefore, the pattern by itself is not useful for searches. Subgroups of proteins carrying this pattern can instead be identified with more specific patterns. One example is the bacterial recA proteins, which belong to this large group of ATP-binding proteins. The nonapeptide A-L-(KR)-(IF)-(FY)-(STA)-(STAD)-(LIVMQ)-R, taken from another part of the sequence and with only three completely conserved residues, finds 93 recA proteins with no false positives and misses only one recA protein.
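The two patterns discussed above can be written directly as regular expressions and applied side by side. This is a rough sketch only: the classify function, its labels, and the three test sequences are hypothetical and chosen just to exercise each branch.

```python
import re

# (AG)XXXXGK(ST): the general P-loop pattern, which is not specific on its own.
P_LOOP = re.compile(r"[AG].{4}GK[ST]")
# A-L-(KR)-(IF)-(FY)-(STA)-(STAD)-(LIVMQ)-R: the more specific recA nonapeptide.
RECA = re.compile(r"AL[KR][IF][FY][STA][STAD][LIVMQ]R")

def classify(sequence):
    """Label a sequence by which of the two patterns it contains."""
    if RECA.search(sequence):
        return "recA candidate"
    if P_LOOP.search(sequence):
        return "P-loop only"
    return "no match"

# Hypothetical test sequences, invented for this example.
for seq in ("MTGPESSGKTTLAQALKFYASVRLD", "MSGAGKSGKSTLLN", "MKVLLDEFW"):
    print(seq, "->", classify(seq))
```

The specific pattern is tested first because a recA protein is expected to contain both motifs, and only the nonapeptide separates it from the other nucleotide-binding proteins.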

Extending the use of patterns

When many homologous sequences are known, the search can be weighted according to the likelihood of finding a certain amino acid at a certain position. Patterns with several conserved residues are more likely to generate only true hits. One way to use this information is in a profile; alternatively, a neural network can be used.
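The idea of weighting each position can be sketched as a simple log-odds profile built from aligned motif instances. This is one possible formulation, not the specific method used by PROSITE profiles: the uniform background frequency of 1/20, the pseudocount of 1, the function names, and the three aligned example motifs are all assumptions for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1 / len(AMINO_ACIDS)  # assumed uniform background frequency

def build_profile(aligned, pseudocount=1.0):
    """Build one log-odds column per position from equal-length motif instances."""
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for i in range(len(aligned[0])):
        counts = Counter(seq[i] for seq in aligned)
        column = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / BACKGROUND)
            for aa in AMINO_ACIDS
        }
        profile.append(column)
    return profile

def score(profile, window):
    """Sum the log-odds scores of a window with the same length as the profile."""
    return sum(column[aa] for column, aa in zip(profile, window))

def best_window(profile, sequence):
    """Return (score, 0-based start) of the highest-scoring window."""
    width = len(profile)
    return max(
        (score(profile, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )

# Hypothetical aligned instances of an 8-residue motif.
instances = ["GASMAPAL", "SKSMLPTF", "GLSMQPAL"]
profile = build_profile(instances)
print(best_window(profile, "MKGLSMAPALQ"))
```

Unlike a pattern, which either matches or does not, the profile gives every window a score, so a threshold can be tuned to trade missed family members against false positives. The sketch assumes sequences contain only the 20 standard amino acid letters.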


Arne Elofsson
Last modified: Mon May 24 14:08:04 CEST 2004