Signals in protein sequences regulate much of the chemistry in a cell. For instance, there are signals that determine whether a protein can be cleaved at a particular position, whether it can be glycosylated, and so on. There are also signals that determine whether a protein is exported from the cell. Some of these signals are easy to detect; others are very subtle.
It is often not possible to detect a signal until a substantial number of proteins belonging to the same family have been sequenced and annotated. When the proteins of a family are aligned correctly, it is possible to calculate the bias for a certain residue in a certain position. If such a bias exists, a sequence logo is an efficient way to visualize it.
If a motif is specific enough, it can be used to search for new members of the family. The largest collection of motifs is available from the PROSITE database. In PROSITE every motif is described by a pattern that indicates which amino acids are allowed in each position of the motif. An example is shown below.
DE Signal peptidases I serine active site (PS00501).
PA [GS]-x-S-M-x-P-[AT]-[LF]
NR /TOTAL=34(34); /POSITIVE=19(19); /UNKNOWN=0(0); /FALSE_POS=15(15); /FALSE_NEG=0
The motif consists of a glycine (G) or a serine (S), followed by an arbitrary amino acid, a serine (S), a methionine (M), another arbitrary amino acid, a proline (P), an alanine (A) or threonine (T), and a leucine (L) or phenylalanine (F). In a recent version of Swiss-Prot there are 19 known "Signal peptidases I serine active site" proteins and 15 other proteins that contain this motif, i.e. false positives. PROSITE contains many hundreds of different patterns.
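A pattern of this kind maps almost directly onto a regular expression. The following is a minimal sketch, not PROSITE's own tooling: the helper name `prosite_to_regex` and the toy sequence are made up for illustration, and only the syntax used in the pattern above (single residues, `[..]` sets and `x`) is handled.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE pattern into a Python regular expression.

    Only single residues, [..] residue sets and x (any residue) are
    supported; repeats such as x(2,4) and exclusions such as {P} are not.
    """
    parts = []
    for element in pattern.split("-"):
        if element == "x":
            parts.append(".")      # x matches any amino acid
        else:
            parts.append(element)  # a residue or a [..] set is valid regex as is
    return "".join(parts)

regex = prosite_to_regex("[GS]-x-S-M-x-P-[AT]-[LF]")

# Hypothetical toy sequence, not a real signal peptidase.
toy_sequence = "MKKGTSMEPTLAAA"
match = re.search(regex, toy_sequence)
if match:
    print(match.start(), match.group())
```

Because a pattern only says which residues are allowed, every match counts equally; this is the limitation that weight matrices address.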
An extension of PROSITE patterns is the weight (or profile) matrix. A weight matrix does not state which amino acids are allowed in a certain position, but rather how probable it is to find a certain amino acid in that position.
To create a profile matrix you start from a multiple sequence alignment and calculate the frequency of each nucleotide or amino acid in each position.
Nucleotide/position | 1 | 2 | 3 | 4 | 5 |
A | 0.21 | 0.86 | 0.10 | 0.07 | 0.13 |
C | 0.02 | 0.12 | 0.75 | 0.05 | 0.10 |
G | 0.54 | 0.01 | 0.11 | 0.64 | 0.08 |
T | 0.22 | 0.01 | 0.04 | 0.14 | 0.69 |
Example of a frequency table for a 5-nucleotide sequence.
To make the information easier to use, the frequencies are often transformed to log(p) scores, i.e. the logarithm of the frequency divided by the probability of finding this nucleotide or residue in this position by chance. The advantage is that the score of a sequence against the log(p) matrix is then simply the sum of the scores for its individual positions.
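Counting per column can be sketched as follows. The four aligned sequences are a made-up toy alignment, so they do not reproduce the table above, and the function name `column_frequencies` is hypothetical.

```python
from collections import Counter

def column_frequencies(alignment, alphabet="ACGT"):
    """Return {letter: [frequency in column 1, column 2, ...]}.

    Assumes all sequences have the same length and contain no gaps.
    """
    n_sequences = len(alignment)
    length = len(alignment[0])
    frequencies = {letter: [] for letter in alphabet}
    for position in range(length):
        counts = Counter(sequence[position] for sequence in alignment)
        for letter in alphabet:
            frequencies[letter].append(counts[letter] / n_sequences)
    return frequencies

# Hypothetical toy alignment for illustration only.
toy_alignment = ["GACGT", "GACGT", "AACGT", "TACAT"]
frequencies = column_frequencies(toy_alignment)
print(frequencies["G"])
```

With so few sequences many frequencies are exactly zero, which cannot be log-transformed; in practice a small pseudocount is usually added to each count before normalizing.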
Nucleotide/position | 1 | 2 | 3 | 4 | 5 |
A | -0.17 | 1.24 | -0.92 | -1.27 | -0.65 |
C | -2.53 | -0.73 | 1.10 | -1.61 | -0.92 |
G | 0.77 | -3.22 | -0.82 | 0.94 | -1.14 |
T | -0.13 | -3.22 | -1.83 | -0.58 | 1.02 |
The log(p) matrix for the frequency table above. The values correspond to the natural logarithm with a background probability of 0.25 for each nucleotide.
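The transformation can be checked with a few lines of code. This sketch assumes the natural logarithm and a uniform background probability of 0.25 per nucleotide, which is what reproduces the values in the table; the variable names are illustrative.

```python
import math

# Frequency table from the text: one row per nucleotide, one entry per position.
FREQUENCIES = {
    "A": [0.21, 0.86, 0.10, 0.07, 0.13],
    "C": [0.02, 0.12, 0.75, 0.05, 0.10],
    "G": [0.54, 0.01, 0.11, 0.64, 0.08],
    "T": [0.22, 0.01, 0.04, 0.14, 0.69],
}

BACKGROUND = 0.25  # assumed: all four nucleotides equally likely by chance

# Log-odds score: ln(observed frequency / background probability).
log_odds = {
    nucleotide: [round(math.log(f / BACKGROUND), 2) for f in row]
    for nucleotide, row in FREQUENCIES.items()
}

for nucleotide, row in log_odds.items():
    print(nucleotide, row)
```

Positive scores mark nucleotides that are more common in a position than expected by chance, negative scores those that are rarer.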
Now we will score a sequence against this matrix. For instance, the sequence AACGG gets the score -0.17+1.24+1.10+0.94-1.14=1.97. For a longer sequence we can slide the matrix along the sequence, score every shift, and see whether some shift fits better than random. The scores for each shift are shown below.
query: AACGGTGACGTGAAGTGC
results: 1.97; 0.24; -8.29; -3.99; -5.46; -7.02; 5.07; -3.44; ....
The seventh shift clearly fits best, i.e. the sequence GACGT fits this matrix quite well.
With the increasing amount of sequence data being produced, it has become even more important to increase the sensitivity of the methods used. One way to do this is to use a neural network or another machine learning approach.
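The scan over all shifts can be sketched as below, using the log(p) matrix from the text. The function names `score_window` and `scan` are illustrative, not taken from any particular package.

```python
# log(p) matrix from the text: one row per nucleotide, one entry per position.
LOG_ODDS = {
    "A": [-0.17, 1.24, -0.92, -1.27, -0.65],
    "C": [-2.53, -0.73, 1.10, -1.61, -0.92],
    "G": [0.77, -3.22, -0.82, 0.94, -1.14],
    "T": [-0.13, -3.22, -1.83, -0.58, 1.02],
}
WIDTH = 5  # number of positions in the matrix

def score_window(window):
    """Sum the matrix entries selected by each nucleotide in the window."""
    return sum(LOG_ODDS[nucleotide][i] for i, nucleotide in enumerate(window))

def scan(sequence):
    """Score every shift of the matrix along the sequence."""
    return [
        round(score_window(sequence[start:start + WIDTH]), 2)
        for start in range(len(sequence) - WIDTH + 1)
    ]

query = "AACGGTGACGTGAAGTGC"
scores = scan(query)
best = max(range(len(scores)), key=scores.__getitem__)  # 0-based shift

print(scores[:8])  # [1.97, 0.24, -8.29, -3.99, -5.46, -7.02, 5.07, -3.44]
print(best + 1, query[best:best + WIDTH])  # 7 GACGT
```

GACGT picks the highest-scoring nucleotide in every column, so 5.07 is also the maximum score this matrix can give.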