Scoring Matrices and gap penalties

Introduction

So far we have only discussed the computational aspects of alignments and not the biological aspects. However, to obtain as much information as possible it is necessary to try to include as much biological information as possible. Some important observations from how genes evolve are:

  1. A gap and its length are distinct quantities. Different weights should be applied to each. The reason for this is that it if a gap is introduces it is much more likely that it is prolonged than that another gap is formed.
  2. Weights for different mismatches should be permitted. A transition is more likely than a transversion; Ile-Val more likely than Ile-Arg change.
  3. As a large part of the evolutionary process does work on the domain level, and not on the complete sequence level. Therefore, unless two sequences are known to be homologous over their entire length, a local alignment is usually preferred to global alignment.
  4. An alignment demonstrates similarity, not necessarily, homology. Homology is an evolutionary inference based on examination of the similarity and its biological meaning. Sequence similarity may result from homology but it may also result from chance or analogy.

Statistical model

When we align two sequences (x and y with an alignment XY), we are actually testing two different hypothesis. Either that the alignment XY occurred by chance or that the two sequences are homologous.

Scoring matrices

All alignment methods need some sort of scoring for matches and mismatches. For a certain alignment, a number is assigned to each position in the sequence depending on the match at that position. The scores for all position in the alignment are then added to calculate a total score, which is used to select the optimal alignment among alternative alignments. The simplest way of scoring is to assign one number for a match and another number for a mismatch. Such a matrix is ofter referred to as a unitary matrix.

For nucleic acid alignments, a unitary scoring matrices might be adequate. Changes in amino acid sequences are generally more informative than changes in the base sequence, since protein function and thereby the possibilities of survival is directly related to the nature of the residue. A change from a Valine to an isoleucine (for example) is more likely to be found than a Valine to an Aspartate change. For amino acid sequences, the alignment programs therefore utilize scoring matrices, which in a direct or indirect way contain information about the likelihood of a certain change. For closely related proteins, however, a matrix scoring only for identities will mostly give the same alignment.

Many alternatives to the unitary scoring matrix have been suggested. One of the earliest suggestions was scoring matrix based on the minimum number of bases that must be changed to convert a codon for one amino acid into a codon for a second amino acid. This matrix, known as the minimum mutation distance matrix, has succeeded in identifying more distant relationships among protein sequences than the unitary matrix approach. The minimal mutation distance matrix is an improvement because it incorporates knowledge about the process of generating mutations from one amino acid into another. However it still ignores the processes of selection that determine which mutations will survive in a population.

Another improvement is a scoring matrix based on selected physical, chemical, or structural properties shared and not shared by the 220 pairs of amino acids. Specific instances of this approach work well for sequences that have been strongly conserved during the evolution. The intuitive scheme developed by McLachlan classified amino acid on the basis of polar or non-polar character, size, shape, and charge, and gives a score of six to interconversions between identical rare amino acids (e.g. F, F) reducing to zero for substitutions between amino acids of quite different character (e. g. F, E) . Another approach similar to McLachlan is by combining information from the structural features of amino acids and genetic code. However this approach suffers from problems with balancing the contributions of the different properties to the positive selection of mutations and from ignoring the different rates at which different mutations are generated. As we can see a metric of similarity between amino acid pairs is important, and not only the choice of which scoring matrix is important. It is also important that all subsequent results depend critically on just how this is done and what model lies at the basis for the construction of a specific scoring matrix.

The most common scoring matrices are developed from a statistical theory

There are some reasons why you should care about scoring matrices:

PAM

The most important improvement achieved over the unitary matrix was based on evolutionary distances. Margaret Dayhoff pioneered this approach in the 1970's. She made an extensive study of the frequencies in which amino acids substitute for each other during evolution. The studies involved carefully aligning all of the proteins in several families of proteins and then constructing phylogenetic trees for each family. This lead to a table of relative frequencies with which amino acids replace each other over an evolutionary period. This table and the relative frequency of occurrence of amino acids in the proteins studied were combined in computing the PAM (point Accepted Mutation) family of scoring matrices.

The PAM series are based on estimated mutation rates (Percent Accepted Mutations) from closely related proteins and will therefore be dominated by amino acid mutations caused by single base changes. It is also called a log-odds matrix, since the numbers are proportional to the logarithm of the "odds" for the replacement not being a random change. PAM 1 stands for 1 % accepted mutations. PAM matrices for less similar sequences are obtained by extrapolations. The PAM100 matrix corresponds to 100 accepted mutations per 100 residues, but since the same residue might change more than once, two sequences with this level of mutations will have about 50 % identities. In the same way, the PAM250 matrix corresponds to a level of about 20 % identical residues.

The PAM matrices have theoretical advantages over alternative methods of scoring alignments. From biological point of view PAM matrices are based on observed mutations. Thus they contain information about the processes that generate mutations as well as the criteria that are important in selection and fixing mutation with a population. Another advance of this type of scoring matrix is from the statistical point of view. PAM matrices contain accurate description of the changes in amino acid composition that are expected after a given number of mutations that can be derived from the data used in creating the matrices. Thus the highest scoring alignment is statistically most likely to have been generating by evolution rather than by chance. The PAM matrices and other substitution matrices discussed below are generally presented as log-odds matrices. That is each score in the matrix is the logarithm of an odds ratio. The odds ratio used is the ratio of the number of times residue "A" is observed to replace residue "B" divided by the number of times residue "A" would be expected to replace residue "B" if the replacement occurred at random. Thus positive scores in the matrix designate a pair of residues that replace each other more often than expected by chance. Negative scores in the matrix designate pairs of residues that replace each other less often than would be expected by chance and are evidence against the sequences being homologous. One can use the matrix to objectively select groups of amino acids that represent conservative substitutions in proteins, because it summarizes the observed replacement that have taken place while conserving the structural and functional properties of proteins. Finally the PAM matrix provides an empirical, experimental determination of conserved replacement and it is more suitable for scoring an alignment than matrices mentioned above.

BLOSUM

One assumption inherent in the Dayhoff model is that the evolutionary rates are uniform over the whole of protein sequence. That is likely not to be true, because the evolutionary rates are obviously lower in conserved regions and higher in non-conserved regions of proteins. Henikoff and Henikoff [8] have approached this in a different way. Their idea was to get a better measure of differences between two proteins specifically for more distantly related proteins. They used the BLOCKS database to search for differences among sequences but only among the very conserved regions of a protein family. Hence the term BLOSUM is from BLOcks SUbstitution Matrix. They first collect all of the sequences in the BLOCKS database and then for each one they sum the number of amino acids in each site to get a frequency table of how often different pairs of amino acids are found together in these conserved regions. These frequencies can be written down in a frequency table, and the odds for relatedness is calculated in a similar way as for Dayhoff matrix.

Different evolutionary distances are incorporated into this scheme with a clustering procedure: two sequences that are identical to each other for more than a certain threshold of positions are clustered. More sequences are added to the cluster if they are identical to any sequence already in the cluster at the same level. This lead to a series of matrices. The matrices are called BLOSUM matrices, with an index denoting the level of clustering, for example BLOSUM62 is derived from sequence blocks clustered at the 62% identity level.

The BLOSUM series are calculated from blocks of aligned sequences from homologous proteins with a certain level of identity. In this way, the pattern of changes observed at a certain level of identity is used, instead of extrapolations from patterns for closely related proteins. The figure below shows the BLOSUM 62 matrix. The score for a certain pair of aligned residues is found at the intersection of the row and column corresponding to the two amino acids. The numbers on the diagonal show the score for residues that have not changed, and one can note that there is a higher score for retaining a tryptophan (W) than a valine (V). Most changes correspond to negative numbers, but some conservative changes also have positive scores.

   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4

Example of an alignment scoring matrix (The BLOSUM 62 matrix).

Gaps

The score from all pairs of aligned residues are combined with suitable penalties for introducing gaps to calculate a total score, which is used to select the optimal alignment. The gap penalties are normally determined by two parameters, one for opening of a gap and one that gives a penalty proportional to the length of the gap. Most programs allow the user to choose these parameters, which might have different optima for different systems. They also depend on the scoring matrix used.

The significance of an alignment score can be estimated by repeating the alignment with randomized sequences of the same composition. A number of such alignments will give a mean score and an estimate of the expected variation of the score for random sequences similar to the aligned sequences. From this, the obtained score for the alignment can be expressed as number of standard deviations above the mean.


Copyright © 2000 Lars Liljas Arne Elofsson
Arne Elofsson
Last modified: Tue Oct 30 09:28:13 CET 2001