Weight (or profile) matrices are an extension of PROSITE patterns. A weight matrix does not specify which amino acids are allowed at a certain position, but rather how probable it is to find a certain amino acid at that position.
In profile methods, the scoring is in some way based on the probability of finding a certain residue at a given position. In the simplest approach, the score for placing a residue at a position is based on the pairwise scores between this residue and all the residues observed at that position. This position-specific score takes into account the distribution of residues found at the position. Even if a residue is highly conserved, however, a mismatch at such a position does not necessarily give a lower score than a mismatch at a variable position.
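The simplest approach can be sketched as an average of pairwise scores. The sketch below is only an illustration: the +1/-1 scoring function and the two example columns are assumptions made up for this example, standing in for a real substitution matrix and a real alignment.

```python
def pairwise_score(a, b):
    """Toy substitution score (an assumption): +1 for identity, -1 otherwise."""
    return 1 if a == b else -1


def average_score(residue, column, score=pairwise_score):
    """Mean pairwise score of `residue` against every residue observed in `column`."""
    return sum(score(residue, observed) for observed in column) / len(column)


# Hypothetical alignment columns, invented for illustration.
conserved_column = "LLLLLLLL"   # the same residue in every sequence
variable_column = "LIVMFAGS"    # a different residue in every sequence

match_conserved = average_score("L", conserved_column)
match_variable = average_score("L", variable_column)
mismatch_conserved = average_score("W", conserved_column)
mismatch_variable = average_score("W", variable_column)

print(match_conserved, match_variable, mismatch_conserved, mismatch_variable)
```

With this toy scoring, a mismatch gets the same average score at the conserved column as at the variable one, which illustrates the limitation mentioned above.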
To create a profile matrix, you start from a multiple sequence alignment and calculate the frequency of each nucleotide or amino acid at each position.
Nucleotide/position | 1 | 2 | 3 | 4 | 5 |
A | 0.21 | 0.86 | 0.10 | 0.07 | 0.13 |
C | 0.02 | 0.12 | 0.75 | 0.05 | 0.10 |
G | 0.54 | 0.01 | 0.11 | 0.64 | 0.08 |
T | 0.22 | 0.01 | 0.04 | 0.14 | 0.69 |
Example of a frequency table for a five-nucleotide sequence.
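As a minimal sketch of how such a table could be computed, the code below counts the symbols in each column of an ungapped alignment. The function name `column_frequencies` and the four-sequence alignment are hypothetical and are not the data behind the table above.

```python
from collections import Counter


def column_frequencies(alignment, alphabet="ACGT"):
    """Return {symbol: [frequency at position 1, 2, ...]} for an ungapped alignment."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    freqs = {symbol: [] for symbol in alphabet}
    # zip(*alignment) walks through the alignment column by column.
    for column in zip(*alignment):
        counts = Counter(column)
        for symbol in alphabet:
            freqs[symbol].append(counts[symbol] / len(alignment))
    return freqs


# Toy alignment, invented for illustration only.
toy_alignment = ["GACGT", "GACGT", "AACGT", "TACGA"]
toy_freqs = column_frequencies(toy_alignment)

for symbol, row in toy_freqs.items():
    print(symbol, " | ".join(f"{value:.2f}" for value in row))
```

With so few sequences, many frequencies come out as exactly zero. A zero cannot be log-transformed, so in practice a small pseudocount is commonly added to every count before the frequencies are calculated.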
To make the information easier to use, the frequencies are often transformed to log(p), i.e. the logarithm of the observed frequency divided by the probability of finding this nucleotide/residue at this position by chance. The advantage is that a sequence can then be scored against the log(p) matrix by simply adding the scores of its individual positions. The values in the table below correspond to the natural logarithm and a chance probability of 0.25 for each nucleotide.
Nucleotide/position | 1 | 2 | 3 | 4 | 5 |
A | -0.17 | 1.24 | -0.92 | -1.27 | -0.65 |
C | -2.53 | -0.73 | 1.10 | -1.61 | -0.92 |
G | 0.77 | -3.22 | -0.82 | 0.94 | -1.14 |
T | -0.13 | -3.22 | -1.83 | -0.58 | 1.02 |
Here we have the log(p) matrix for the frequency table above.
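The transformation can be sketched as follows. The frequencies are copied from the frequency table above; the natural logarithm and the uniform background of 0.25 are inferred from the numbers, since they reproduce the log(p) table to two decimals. The names `FREQS`, `BACKGROUND` and `log_odds` are chosen for this example only.

```python
import math

# Frequencies copied from the frequency table above (positions 1 to 5).
FREQS = {
    "A": [0.21, 0.86, 0.10, 0.07, 0.13],
    "C": [0.02, 0.12, 0.75, 0.05, 0.10],
    "G": [0.54, 0.01, 0.11, 0.64, 0.08],
    "T": [0.22, 0.01, 0.04, 0.14, 0.69],
}

# Assumed chance probability: all four nucleotides equally likely.
BACKGROUND = 0.25


def log_odds(freqs, background=BACKGROUND):
    """Natural log of (observed frequency / background probability) per cell."""
    return {
        symbol: [math.log(f / background) for f in row]
        for symbol, row in freqs.items()
    }


LOG_ODDS = log_odds(FREQS)
for symbol, row in LOG_ODDS.items():
    print(symbol, " | ".join(f"{value:5.2f}" for value in row))
```

A positive value means the nucleotide is more frequent at that position than expected by chance, and a negative value means it is less frequent.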
Now we will score a sequence against this matrix. For instance, the sequence AACGG gets the score -0.17+1.24+1.10+0.94-1.14=1.97. For a longer sequence we can try each shift of the sequence along the matrix and see whether some shift fits better than random. Below we show such a fit.
query: AACGGTGACGTGAAGTGC
results: 1.97; 0.24; -8.29; -3.99; -5.46; -7.02; 5.07; -3.44; ....
Clearly the seventh shift fits best, i.e. the subsequence GACGT fits this matrix quite well.
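The scan over all shifts can be sketched in a few lines. The matrix values are copied from the log(p) table above; the helper names `score_window` and `scan` are hypothetical and chosen for this example.

```python
# log(p) values copied from the table above (positions 1 to 5).
LOG_ODDS = {
    "A": [-0.17, 1.24, -0.92, -1.27, -0.65],
    "C": [-2.53, -0.73, 1.10, -1.61, -0.92],
    "G": [0.77, -3.22, -0.82, 0.94, -1.14],
    "T": [-0.13, -3.22, -1.83, -0.58, 1.02],
}


def score_window(window, matrix=LOG_ODDS):
    """Add up the matrix entries for each symbol at its position in the window."""
    return sum(matrix[symbol][i] for i, symbol in enumerate(window))


def scan(query, matrix=LOG_ODDS):
    """Score every window of matrix width along the query, one shift at a time."""
    width = len(next(iter(matrix.values())))
    return [
        score_window(query[start:start + width], matrix)
        for start in range(len(query) - width + 1)
    ]


QUERY = "AACGGTGACGTGAAGTGC"
scores = scan(QUERY)

# Shifts are reported 1-based, to match the text.
best_shift = max(range(len(scores)), key=scores.__getitem__) + 1
best_window = QUERY[best_shift - 1:best_shift + 4]

print("; ".join(f"{s:.2f}" for s in scores))
print(best_shift, best_window)
```

This sketch only handles ungapped windows of the same width as the matrix; allowing gaps requires a dynamic programming alignment instead of a simple sum.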
There is a noticeable difference in how PROSITE patterns and profiles can be used. Both methods can be used to detect short signals, but profiles can also be used to detect relationships over much longer regions if you allow for gaps. The best examples of this are how profiles are used in PSI-BLAST and in the detection of protein families.