Introduction to sequence alignments

Nucleotide and protein sequences in organisms are inherited from their ancestors. During this process, gene duplications, point mutations, and other events change the sequences. Related sequences in different organisms will therefore generally not be identical.

Accurate sequence alignments are needed for many types of analysis. Aligned sequences are the basis of phylogenetic analysis and of modeling protein conformation, and they can be used to identify the functions of genes and proteins. Alignment methods are also used to search for similarities between new sequences and sequences in databases, which is the basic way of detecting related proteins or genes in a database. Depending on the purpose, different properties of the alignment algorithm are important: searches in extensive databases require speed, while algorithms for aligning homologous sequences can be optimized to use all available information to produce the most reliable alignment. Alignments can be made in many different ways, using many different types of information. Here we will describe dynamic programming, a computational method for finding the optimal pairwise alignment between two proteins under a given scoring scheme.
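As a minimal sketch of the dynamic programming idea, the table below stores, for every pair of prefixes, the best score for aligning them end to end. Each cell is derived from three neighbouring cells: pair the two current characters, or place one of them opposite a gap. The function name and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration only. Real protein alignments normally use a substitution matrix and more elaborate gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score for aligning all of a against all of b.

    The scoring values are illustrative assumptions, not a recommended scheme.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # a[i-1] in the same column as b[j-1]
                score[i - 1][j] + gap,       # a[i-1] opposite a gap
                score[i][j - 1] + gap,       # b[j-1] opposite a gap
            )
    return score[-1][-1]


print(global_alignment_score("GATTACA", "GATCA"))  # prints 3
```

With this scoring, the best alignment of GATTACA and GATCA has five matches and two gaps, which gives the score 3. Filling the table takes time proportional to the product of the two sequence lengths, which is one reason why searches in large databases usually rely on faster, approximate methods.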

Definition of an alignment

An alignment refers to the procedure of comparing two or more sequences by looking for a series of individual characters or character patterns that are in the same order in the sequences. Identical or similar characters are placed in the same column, and non-identical characters can either be placed in the same column as a mismatch or opposite a gap in one of the other sequences. In an optimal alignment, non-identical characters and gaps are placed so as to bring as many identical or similar characters as possible into vertical register. Two main types of sequence alignment are recognized: global and local. A global alignment optimizes the alignment over the full length of the sequences. In a local alignment, stretches of sequence with the highest density of matches are given the highest priority. The following is an example of a global and a local alignment.

Global alignment:

        LGPSTKDFGKISESREFDN
        |      ||||    |
        LNQLERSFGKINMRLEDA

The alignment is stretched over the entire sequence lengths to include as many matching amino acids as possible, up to and including the sequence ends. Although there is an obvious region of identity in this example (the sequence FGKI), a global alignment may fail to align such a region if doing so allows more amino acids to be matched along the entire sequence length.
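The match line in the global alignment example can be reproduced with a few lines of code, which also makes the column-by-column definition of an alignment concrete. This is only a sketch: the function name is hypothetical, and padding the shorter row with a trailing gap character is an assumption made so that both rows have the same number of columns.

```python
def match_line(top, bottom, gap="-"):
    """Return a marker line with '|' under identical residues and the identity count."""
    width = max(len(top), len(bottom))
    # Assumption: pad the shorter row with trailing gaps so the columns line up.
    top, bottom = top.ljust(width, gap), bottom.ljust(width, gap)
    marks = "".join(
        "|" if x == y and x != gap else " " for x, y in zip(top, bottom)
    )
    return marks, marks.count("|")


top = "LGPSTKDFGKISESREFDN"
bottom = "LNQLERSFGKINMRLEDA"
marks, identities = match_line(top, bottom)
print(top)
print(marks)
print(bottom)
print(identities, "identical columns")  # prints: 6 identical columns
```

Four of the six identical columns belong to the FGKI region. The other two are isolated matches, which is typical when an alignment is forced to span both sequences from end to end.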

Local alignment:

        ----------FGKI----------
                  ||||
        ----------FGKI----------

This is a local alignment of the same sequences as above. In this case, the alignment tends to stop at the ends of regions of identity or strong similarity. A much higher priority is given to finding these local regions than to extending the alignment to include more neighboring amino acid pairs. Dashes indicate parts of the sequences that are not included in the alignment. This type of alignment favors finding conserved amino acid motifs in related protein sequences.
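A local alignment of this kind can be sketched with the Smith-Waterman variant of dynamic programming. A cell score is never allowed to drop below zero, so an alignment can start anywhere, and the traceback begins at the highest-scoring cell and stops when the score reaches zero. The function name and the scoring values (match +1, mismatch -1, gap -1) are assumptions for illustration. They are not the scheme used to produce the figure above.

```python
def local_align(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment with illustrative scoring.

    Returns (score, aligned part of a, aligned part of b).
    """
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets a new local alignment start at any cell.
            h[i][j] = max(0, h[i - 1][j - 1] + pair, h[i - 1][j] + gap, h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and h[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + pair:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, part_a, part_b = local_align("LGPSTKDFGKISESREFDN", "LNQLERSFGKINMRLEDA")
print(score, part_a, part_b)  # prints: 4 FGKI FGKI
```

With this simple scoring, extending the alignment beyond FGKI in either direction only adds mismatches or gaps, so the reported alignment ends at the boundaries of the region of identity.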

Arne Elofsson

Last modified: Tue Jan 14 09:56:33 CET 2003