A direct way of comparing sequences is the dotplot analysis. The two sequences to be compared form the rows and columns of a matrix, and in the simplest case a dot is plotted in a graphical representation of the matrix wherever the sequences are identical. In practice, the comparison is most often done in a window of specified length, and a dot is put in the graph whenever the score within this window is above a certain specified threshold. The score might be calculated as the number of identities or as a similarity based on a scoring matrix. Sequence similarities will be obvious as diagonals in the plot. The method does not align sequences but it is a quick way of spotting similarities. It has the additional advantage over other alignment methods that it can detect repetitions (parallel diagonals) and permutations in the sequences.
A dotplot analysis of the sequences of human hemoglobin a chain and Erythrocruorin from Chironomus.
Dotplots are useful to visualize potential similarities between proteins, however they are not really useful for large scale comparisons as they need manual inspection.
Global alignments aim at optimally aligning two or more sequences. The most used methods for global alignments are based on algorithms originally developed by Needleman and Wunch and modified by Sellers . This procedure (the NWS method) is using a dynamic programming algorithm that simplifies the enormous task of calculate a score for all possible alignments of two sequences with gaps of any lengths. The sequences to be aligned are arranged as rows and columns of a rectangular matrix. A score is calculated for each position of the matrix according to three possible events: replacement (or conservation) of a residue, insertion in sequence A or insertion in sequence B. The matrix elements Di,j are filled with numbers according the rule.
Here the first alternative corresponds to the case of no insertions or deletions: to the previous diagonal element is added the score b for the current pair of residues (Ai, Bj). The other two alternatives correspond to insertions in A or B: to the element above or to the left is added the score for a deletion of length k. The scoring begins at the first element of the matrix, corresponding to one end of the sequences. An example using a short nucleotide sequence is shown in the figure below. The total score is found as the last element in the matrix. The actual sequence alignment is obtained from the path through which this numbers is obtained.
|
D |
G |
C |
C |
A |
A |
G |
T |
A |
G |
G |
D |
0 |
-4 |
-5 |
-6 |
-7 |
-8 |
-9 |
-10 |
-11 |
-12 |
-13 |
A |
-4 |
-1 |
-5 |
-6 |
-3 |
-4 |
-8 |
-9 |
-7 |
-11 |
-12 |
C |
-5 |
-5 |
2 |
-2 |
-3 |
-4 |
-5 |
-6 |
-7 |
-8 |
-9 |
G |
-6 |
-2 |
-2 |
1 |
-3 |
-4 |
-1 |
-5 |
-6 |
-4 |
-5 |
A |
-7 |
-6 |
-3 |
-3 |
4 |
0 |
-1 |
-2 |
-2 |
-4 |
-5 |
G |
-8 |
-4 |
-4 |
-4 |
0 |
3 |
3 |
-1 |
-2 |
1 |
-1 |
C |
-9 |
-8 |
-1 |
-1 |
-1 |
-1 |
2 |
2 |
-2 |
-3 |
0 |
G |
-10 |
-6 |
-5 |
-2 |
-2 |
-2 |
2 |
1 |
1 |
1 |
0 |
T |
-11 |
-10 |
-6 |
-6 |
-3 |
-3 |
-2 |
5 |
1 |
0 |
0 |
A |
-12 |
-11 |
-7 |
-7 |
-3 |
0 |
-3 |
1 |
8 |
4 |
3 |
T |
-13 |
-12 |
-8 |
-8 |
-5 |
-4 |
-1 |
0 |
4 |
7 |
3 |
G |
-14 |
-10 |
-9 |
-9 |
-6 |
-5 |
-1 |
-1 |
3 |
7 |
10 |
A |
-15 |
-14 |
-10 |
-10 |
-7 |
-3 |
-5 |
-2 |
2 |
3 |
6 |
Alignment of two nucleotide sequences. Here a match has a score of +3, a mismatch a score of -1. Gaps are scored by -3 for insertion of a gap and -1 for each inserted residue (w=-3-k). The scores are calculated from the 5' end. Bold numbers show the path for reaching the highest number.
The obtained alignment looks like this:
GCCAA-GTAGG- ACGAGCGTATGA
So how was this matrix obtained ? To symbolize this we write all the D values for the first four nucleotides:
X |
G |
C |
C |
A |
|||||
Y |
0 |
-4 |
-5 |
-6 |
-7 |
||||
A |
-4 |
-1 -8 -8 |
-1 |
-5 -9 -5 |
-5 |
-6 -10 -6 |
-6 |
-3 -11 -7 |
-3 |
C |
-5 |
-5 -5 -9 |
-5 |
2 -9 -9 |
2 |
-2 -10 -2 |
-2 |
-7 -7 -3 |
-3 |
G |
-6 |
-2 -6 -10 |
-2 |
-6 -2 -6 |
-2 |
1 -6 -6 |
1 |
-3 -7 -3 |
-3 |
A |
-7 |
-7 -6 -11 |
-6 |
-3 -3 -10 |
-3 |
-3 -3 -7 |
-3 |
4 -7 -7 |
4 |
At one position of this example, the alignment is ambiguous. In this particular case, it is the trivial difference of where to put the single gap in the top sequence. It is important to note that any sequences can be aligned using these procedures, irrespective of their possible homology. The standard implementation of the DP algorithm present one solution and do not show alternative alignments with the same score.
In a global alignment, gaps at the beginning and end of the sequences have the same gap penalties as gaps in other parts of the sequences. This has the effect that the algorithm will try to align all of the two sequences. If the sequences are of different lengths or if only parts of the sequences are similar this might have negative effects. One way to handle such cases is not to penalize gaps at the ends of the sequences. The figure below shows the same sequences aligned under this condition. The optimal alignment is chosen as the one giving the highest number in the last row or column. In this case, the alignment of the same nucleotide sequence is different, but the same number of identities has been obtained.
|
D |
G |
C |
C |
A |
A |
G |
T |
A |
G |
G |
D |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
A |
0 |
-1 |
-1 |
-1 |
3 |
3 |
-1 |
-1 |
3 |
-1 |
-1 |
C |
0 |
-1 |
2 |
2 |
-1 |
2 |
2 |
-2 |
-1 |
2 |
-2 |
G |
0 |
3 |
-1 |
1 |
1 |
-2 |
5 |
1 |
0 |
2 |
5 |
A |
0 |
-1 |
2 |
-2 |
4 |
4 |
1 |
4 |
4 |
0 |
1 |
G |
0 |
3 |
-1 |
1 |
0 |
3 |
7 |
3 |
3 |
7 |
3 |
C |
0 |
-1 |
6 |
2 |
1 |
0 |
3 |
6 |
2 |
3 |
6 |
G |
0 |
3 |
2 |
5 |
1 |
0 |
3 |
2 |
5 |
5 |
6 |
T |
0 |
-1 |
2 |
1 |
4 |
0 |
1 |
6 |
2 |
4 |
4 |
A |
0 |
-1 |
-2 |
1 |
4 |
7 |
3 |
2 |
9 |
5 |
4 |
T |
0 |
-1 |
-2 |
-3 |
0 |
3 |
6 |
6 |
5 |
8 |
4 |
G |
0 |
3 |
-1 |
-3 |
-1 |
2 |
6 |
5 |
4 |
8 |
11 |
A |
0 |
-1 |
2 |
-2 |
0 |
2 |
2 |
5 |
8 |
4 |
7 |
Alignment with no gap penalty at ends (first row and column initiated with 0), using the same scoring parameters. The alignment is different from that of the table above.
Obtained alignment
----GCCAAGTAGG- ACGAGC---GTATGA
The dynamic programming algorithm can also be used for finding local sequence similarities. The Smith and Waterman algorithm is very similar to the NWS method except that a calculated negative number for a matrix position is replaced by zero, indicating that no sequence similarity has been detected up to that point. When all the matrix element have been calculated, the maximum number in the matrix is located, and the alignment is traced back from this point until the first positive number. The same sequences as above aligned with these assumptions are shown in Fig. xii. Once more, a different alignment is obtained with the simple test sequence. Only the aligned parts of the sequences are shown as output of this procedure.
|
D |
G |
C |
C |
A |
A |
G |
T |
A |
G |
G |
D |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
A |
0 |
0 |
0 |
0 |
3 |
3 |
0 |
0 |
3 |
0 |
0 |
C |
0 |
0 |
3 |
3 |
0 |
2 |
2 |
0 |
0 |
2 |
0 |
G |
0 |
3 |
0 |
2 |
2 |
0 |
5 |
1 |
0 |
3 |
5 |
A |
0 |
0 |
2 |
0 |
5 |
5 |
1 |
4 |
4 |
0 |
1 |
G |
0 |
3 |
0 |
1 |
1 |
4 |
8 |
4 |
3 |
7 |
3 |
C |
0 |
0 |
6 |
3 |
0 |
0 |
4 |
7 |
3 |
3 |
6 |
G |
0 |
3 |
2 |
5 |
2 |
0 |
3 |
3 |
6 |
6 |
6 |
T |
0 |
0 |
2 |
1 |
4 |
0 |
2 |
6 |
2 |
5 |
5 |
A |
0 |
0 |
0 |
1 |
4 |
7 |
3 |
2 |
9 |
5 |
4 |
T |
0 |
0 |
0 |
0 |
0 |
3 |
6 |
6 |
5 |
8 |
4 |
G |
0 |
3 |
0 |
0 |
0 |
2 |
6 |
5 |
5 |
8 |
11 |
A |
0 |
0 |
2 |
0 |
3 |
3 |
2 |
5 |
8 |
4 |
7 |
Fig. Xii. Local alignment with no negatives. The alignment is different from the above examples and corresponds only to the positive numbers in the matrix ending up with the maximum number.
-CAAG--TAGG- -CGAGCGTATG-
It is of course also possible to use local alignments for protein sequences as exemplified below:
|
D |
C |
A |
V |
D |
N |
T |
A |
H |
L |
M |
Q |
D |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
A |
0 |
0 |
4 |
0 |
0 |
0 |
0 |
4 |
0 |
0 |
0 |
0 |
R |
0 |
0 |
0 |
1 |
0 |
0 |
0 |
0 |
4 |
0 |
0 |
1 |
G |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
I |
0 |
0 |
0 |
3 |
0 |
0 |
0 |
0 |
0 |
2 |
1 |
0 |
D |
0 |
0 |
0 |
0 |
9 |
1 |
0 |
0 |
0 |
0 |
0 |
1 |
Q |
0 |
0 |
0 |
0 |
0 |
9 |
0 |
0 |
0 |
0 |
0 |
5 |
T |
0 |
0 |
0 |
0 |
0 |
0 |
14 |
0 |
0 |
0 |
0 |
0 |
V |
0 |
0 |
0 |
4 |
0 |
0 |
0 |
14 |
0 |
1 |
1 |
0 |
N |
0 |
0 |
0 |
0 |
1 |
6 |
0 |
0 |
15 |
0 |
0 |
1 |
T |
0 |
0 |
0 |
0 |
0 |
1 |
11 |
0 |
0 |
14 |
0 |
0 |
G |
0 |
0 |
4 |
0 |
0 |
0 |
0 |
11 |
1 |
0 |
11 |
0 |
T |
0 |
4 |
0 |
0 |
1 |
0 |
5 |
0 |
9 |
0 |
0 |
10 |
I |
0 |
0 |
0 |
3 |
0 |
0 |
0 |
4 |
0 |
11 |
1 |
0 |
C |
0 |
9 |
0 |
0 |
0 |
0 |
0 |
0 |
1 |
0 |
10 |
0 |
K |
0 |
0 |
8 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
11 |
Fig Xiii shows the alignment of two short segments of protein sequence. A segment of six amino acids give a total maximum score of 15 using the BLOSUM62 matrix and is presented as the output of the local alignment.
Alignment
VDNTAH IDQTVN