Why make sequence alignments?

There are several reasons for comparing and aligning protein sequences. One is to obtain an accurate alignment in order to study the relationship between two proteins. Another is to scan a database with a newly determined protein sequence and identify possible functions for the protein by analogy with characterized proteins.

Before starting this analysis, it is important to consider the questions we might be asking in sequence comparisons. If we find that many characters in one sequence are the same as in the other sequence, then we say that the sequences are similar. Later, we will calculate a similarity score, which indicates how likely it is that the sequences are related. The following may be true for similar sequences:

  1. The sequences may share a common origin. If we have additional evidence for an evolutionary relationship, then we say that the sequences are homologous.
  2. The sequences may have the same or related structure or function.
  3. The proteins may have a similar three-dimensional structure.

The stronger the alignment, that is, the higher the similarity between the sequences, the more likely they are to be related. Very similar sequences that are almost identical along their lengths almost certainly have the same function. Sequences that are only weakly similar may or may not be related, and no firm conclusion can be drawn about their relationship. Since the discovery that the three-dimensional structures of the myoglobins are very similar though their sequences are not, it has been apparent that comparing structures is a more powerful, if less convenient, way to recognize distant evolutionary relationships than comparing sequences.

Percentage identity is a frequently quoted statistic for an alignment of two sequences. However, the percentage identity expected by chance, which depends on the alignment length, may be overlooked. Clearly, an alignment of length 200 showing 30% identity is more significant than an alignment of length 50 with the same identity. Sander and Schneider used protein structures to evaluate sequence comparison. Their work focused on determining a length-dependent threshold of percentage identity, above which all proteins would be of similar structure. A result of this analysis was the HSSP equation, which states that proteins with 25% identity over at least 80 residues will have similar structure, whereas shorter alignments require higher identity.

Although percentage identity is extremely intuitive, it has been shown to be far from ideal as a measure of similarity. A much better measure is the E-value or P-value, obtained from a statistical analysis of the alignment scores and the alignment length.
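
The length dependence of percentage identity discussed above can be illustrated with a short Python sketch. The function names are hypothetical and chosen for this example only. Two assumptions are made that the text does not state: percentage identity is counted over aligned residue pairs (columns containing a gap are skipped; other conventions exist), and the threshold uses the commonly cited form of the Sander and Schneider curve, t(L) = 290.15 * L^(-0.562) for alignment lengths of 10 to 80 residues and 24.8% for longer alignments.

```python
from typing import Tuple


def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> Tuple[float, int]:
    """Return (percentage identity, number of aligned residue pairs).

    Assumption: identity is counted over columns where neither sequence
    has a gap. Other denominators are also in use.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only the columns in which both sequences have a residue.
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap]
    if not pairs:
        return 0.0, 0
    identical = sum(1 for x, y in pairs if x == y)
    return 100.0 * identical / len(pairs), len(pairs)


def hssp_threshold(length: int) -> float:
    """Length-dependent identity threshold in percent.

    Assumption: the commonly cited form of the Sander and Schneider curve,
    which is only defined for alignments of at least 10 residues.
    """
    if length < 10:
        raise ValueError("threshold is not defined for alignments shorter than 10")
    if length > 80:
        return 24.8
    return 290.15 * length ** -0.562


def above_threshold(identity: float, length: int) -> bool:
    """True if the identity exceeds the threshold for this alignment length."""
    return identity > hssp_threshold(length)


if __name__ == "__main__":
    for length in (50, 200):
        print(length, round(hssp_threshold(length), 1), above_threshold(30.0, length))
```

Under these assumptions the threshold at length 50 is about 32%, so 30% identity over 50 residues falls below the curve, while the same identity over 200 residues lies above the 24.8% plateau. This matches the example in the text: the longer alignment is the more significant one.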


Copyright © 2000 Arne Elofsson and Lars Liljas
Arne Elofsson
Last modified: Thu Feb 6 15:12:19 CET 2003