Multiple Sequence Alignments

As we have now learned, there are two main reasons to align two proteins: (a) to detect related proteins and (b) to study the relationship between the two sequences. Studying more than two proteins helps with both problems. It has been clearly shown that using multiple sequence alignments improves the detection of distantly related homologous proteins. New sequences can also be aligned more accurately when the alignment is based on the pattern of conservation in already aligned sequences. The alignment of many homologous sequences can also provide more information about the relationships between these proteins and about their functions. Phylogenetic analysis of sequence data depends on multiple alignments. The pattern of conserved residues is also important for the functional characterization of a protein.

In this section we will address three questions in turn. First we will discuss how multiple sequence alignments can be obtained, second how they can be used to detect remote homologs, and finally how a multiple sequence alignment can be used to increase the understanding of a protein family.

Obtaining a multiple sequence alignment

The dynamic programming algorithm used for pairwise alignments can in principle be extended to align multiple sequences. However, the computing time needed for this approach grows exponentially with the number of sequences, so the procedure can only be used for a very limited number of sequences. Heuristic methods exist that produce good, but not necessarily optimal, multiple sequence alignments.
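To get a feeling for how quickly the exact approach becomes impractical, the size of the dynamic programming table can be counted directly. The sketch below assumes that all sequences have the same length; the function names are chosen here for illustration only.

```python
def dp_cells(seq_len: int, n_seqs: int) -> int:
    """Number of cells in the full dynamic programming table
    for n_seqs sequences that all have length seq_len."""
    return (seq_len + 1) ** n_seqs


def predecessors_per_cell(n_seqs: int) -> int:
    """Number of neighbouring cells that must be examined for each cell:
    every non-empty subset of the sequences may advance by one residue."""
    return 2 ** n_seqs - 1


if __name__ == "__main__":
    # Table size and work per cell for sequences of length 100
    for n in (2, 3, 5, 10):
        print(n, dp_cells(100, n), predecessors_per_cell(n))
```

For two sequences of 100 residues the table has about ten thousand cells, while for five such sequences it already has about ten billion, which is why heuristics are needed.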

Multiple pairwise alignments

The simplest method to obtain a multiple sequence alignment is to use one sequence as the base for the alignment. All other sequences are then aligned pairwise to this sequence. This approach is used, for instance, in PSI-BLAST.
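The idea of stacking pairwise alignments on one base sequence can be sketched as follows. This is only an illustration under simplifying assumptions: it uses a plain global alignment with a linear gap penalty and an identity score, whereas PSI-BLAST works with local BLAST alignments and a substitution matrix. The function names and scoring values are hypothetical.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment (dynamic programming, linear gap penalty).
    Returns the two sequences with '-' inserted at gap positions."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal path
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append('-'); i -= 1
        else:
            out_a.append('-'); out_b.append(b[j - 1]); j -= 1
    return ''.join(reversed(out_a)), ''.join(reversed(out_b))


def base_anchored_alignment(base, others):
    """Align every sequence to the base and keep only the columns in which
    the base has a residue, so all rows get the length of the base."""
    rows = [base]
    for seq in others:
        gapped_base, gapped_seq = global_align(base, seq)
        rows.append(''.join(s for q, s in zip(gapped_base, gapped_seq) if q != '-'))
    return rows


if __name__ == "__main__":
    for row in base_anchored_alignment("HEAGAWGHEE", ["PAWHEAE", "HEAGAWGHE"]):
        print(row)
```

Note the design choice in this sketch: residues that a sequence inserts relative to the base are simply dropped, so the result describes every sequence only in terms of the positions of the base sequence.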

ClustalW

The most common procedures for multiple sequence alignment use hierarchical methods. In these methods, alignments of all pairs of sequences are first made using the dynamic programming algorithm. The sequences are then grouped according to their similarities into a tree (hierarchical cluster analysis). Finally, starting with the most similar pairs, all the sequences are aligned stepwise to each other using the dynamic programming method. This is the procedure used in the most popular program, CLUSTALW. The program outputs the aligned sequences as well as the cluster analysis, but these procedures normally do not include any statistical analysis of the significance of the alignment.
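The grouping step, which decides in what order the sequences are joined, can be sketched as below. The sketch assumes that the pairwise distances have already been computed from the pairwise alignments (the values in the example are made up), and it uses simple average-linkage clustering as a stand-in for the tree building. It is not claimed to be the exact method CLUSTALW uses.

```python
from itertools import combinations


def merge_order(names, dist):
    """Average-linkage hierarchical clustering.

    names: list of sequence names
    dist:  dict mapping frozenset({a, b}) to the distance between a and b
    Returns the list of merges (cluster_1, cluster_2), most similar first;
    each cluster is a tuple of sequence names.
    """
    def average_distance(c1, c2):
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest average distance
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        merges.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


if __name__ == "__main__":
    # Hypothetical distances (e.g. 1 - fraction of identical residues)
    d = {frozenset(p): v for p, v in {
        ("A", "B"): 0.10, ("C", "D"): 0.20,
        ("A", "C"): 0.60, ("A", "D"): 0.70,
        ("B", "C"): 0.65, ("B", "D"): 0.75,
    }.items()}
    for left, right in merge_order(["A", "B", "C", "D"], d):
        print(left, "+", right)
```

In a progressive alignment, each merge in this list would correspond to one alignment step: first two single sequences, later a sequence against an existing alignment or two alignments against each other.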

Using multiple sequence alignment to detect distantly related homologs

As mentioned above, multiple sequence alignments can be used to detect more distantly related homologs than single sequences can. Historically, this started with the observation that some amino acids are conserved in evolutionarily related proteins or in proteins that perform a similar function. The first methods to detect these patterns were so-called regular expressions. Such expressions can only describe which residues are allowed in a certain position, while experience from studying the evolution of protein families has taught us that almost all residues are often allowed in all positions, but with different probabilities in different positions. During the 1980s Gribskov and others developed profile methods that take this into account. An extension of the idea of profiles, taken from the field of speech recognition, is Hidden Markov Models (HMMs). One advantage of HMMs is that, in theory, a better description of gaps can be obtained. However, so far most careful benchmarks have not shown any significant difference in performance between different HMMs and profiles. In 1997 Altschul and coworkers introduced PSI-BLAST, an easy method to create profiles using a fast algorithm (BLAST). In most cases the performance gained by using slower methods, such as other profiles, HMMs, or specific multiple sequence alignment programs, is very small (if any). Therefore, PSI-BLAST should be the first approach to try when searching with a multiple sequence alignment.
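The difference between a regular expression and a profile can be illustrated with a small example. The four aligned segments and the pattern below are invented for illustration, and the profile is a simplified log-odds table with pseudocounts and a uniform background. It is not the original scoring scheme of Gribskov and coworkers, nor the one used by PSI-BLAST.

```python
import math
import re
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_profile(alignment, pseudocount=1.0):
    """Return one dict per alignment column, mapping each amino acid to a
    log2-odds score relative to a uniform background frequency."""
    background = 1.0 / len(ALPHABET)
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != '-')
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile


def profile_score(profile, segment):
    """Sum the column scores for an ungapped segment of the profile's length."""
    return sum(column[aa] for column, aa in zip(profile, segment))


if __name__ == "__main__":
    alignment = ["CAHC", "CGHC", "CAHC", "CSHC"]  # hypothetical family
    pattern = re.compile("C[AGS]HC")              # residues seen so far
    profile = build_profile(alignment)
    for candidate in ("CAHC", "CTHC", "WWWW"):
        matches = pattern.fullmatch(candidate) is not None
        print(candidate, matches, round(profile_score(profile, candidate), 2))
```

The segment CTHC is rejected outright by the regular expression because T was never observed in the second position, while the profile still scores it well above an unrelated segment. This graded behaviour is what makes profiles useful for distant homologs.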

Analysis of protein families

Besides enabling better database searches, multiple sequence alignments can be used to analyze a protein family. A multiple sequence alignment of a protein family can be used to analyze the evolution of the family (see the phylogeny methods section) as well as functional questions about the family. An example of the latter is shown here.

Last modified: Thu Jan 23 09:45:23 CET 2003