BLAST

BLAST: A simplification of Smith-Waterman

The BLAST algorithm (20) uses a word based heuristic similar to that of FASTA to approximate a simplification of the Smith-Waterman algorithm known as the maximal segment pairs algorithm. Maximal segment pairs alignments do not allow gaps and are specified by an equation that includes only the first and fourth terms of the Smith-Waterman equation presented above. Maximal segment pair alignments have the very valuable property that their statistics are well understood (15). Thus, we can readily compute a significance probability for a maximal segment pair alignment. Recent advances in maximal segment pairs statistics allow the use of several independent segment alignments to be used in evaluating the significance probability.

The price for being able to readily compute a significance probability is that the alignments can not have gaps (15). Thus, the evolutionary model requires that there be a fairly long stretch of sequence that has evolved without insertions or deletions, or at least with a complimentary pattern of insertions and deletions that do not significantly disrupt the alignment.

The BLAST algorithm is less sensitive than Smith-Waterman and in appropriate circumstances more selectivity. For proteins the BLAST word based heuristic is more sensitive than the FASTA heuristic even though BLAST uses a word size of three for proteins while FASTA uses a word size of two. However, BLAST uses a very long word size, eleven, for nucleic acid sequences. Also the modifications to the heuristic that improve the sensitivity for protein sequences do not work as well for nucleic acid sequences because have only a four letter alphabet and the similarity scores are usually calculated with equal rates of replacement for all of the nucleotides. Thus, FASTA is more sensitive than BLAST for nucleic acid sequences and should used instead of BLAST.

The BLAST word based heuristic uses a default word size of three for proteins and eleven for nucleic acid sequences. The tables below illustrate how the BLAST heuristic differs from the FASTA heuristic using a word size of two applied to a short protein query sequence. A word size of two is used in the example to keep it to a managable size.

Creating a Word List for BLAST, Word Size = 2

Adipokinetic hormone II - Migratory locust

 q l n f s a g w

 q l
   l n
     n f
       f s
         s a
           a g
             g w

In the BLAST heuristic this list of words is expanded in order to recover the sensitivity lost by matching only identical words. Any word that scores at least a minimum threshold when aligned with any of the initial list of words is added to the list. BLAST then examines the database sequences for words that exactly match any of the expanded list. The expanded below to a total list of 47 words derived from the initial 7. The expanded list contains any word that scores at least eight when aligned with the initial word and scored with the PAM 120 similarity table.

Initial
  Word    Expanded List
    ql:   ql, qm, hl, zl

    ln:   ln, lb
    nf:   nf, af, ny, df, qf, ef, gf, hf, kf, sf, tf, bf, zf

    fs:   fs, fa, fn, fd, fg, fp, ft, fb, ys
    sa:  no words score 8 or more, including the initial word sa

    ag:   ag

    gw:   gw, aw, rw, nw, dw, qw, ew, hw, iw, kw, mw, pw, sw,
          tw, vw,bw, zw, xw

One noteworthy feature of the table above is that there is no word that scores eight or more when aligned with the initial word "sa". This situation does occur in actual BLAST searches. The user has the option to force the initial word into the final list. The default is not to include such low scoring words because they contain so little information that they are unlikely to be critical in finding the maximal segment pair alignment. In practice, BLAST has used a word length of 3 for protein database searches with a required threshold score of 13 using the Blosum62 similarity scoring matrix.

Recent Developments in BLAST: More Sensitivity and Gaps

The inventors of the BLAST algorithm have continued to look for ways to improve both the sensitivity and the speed of their algorithm (23). As we will see later, the growth of the sequence databases has raised the miniimum score and hence the length of alignment that must be found by BLAST for a match to be significant. For these longer alignments both the speed and the sensitivity of BLAST can be improved by requiring that the algorithm find two matches above some (lower) threshold rather than one match. Both matches must be on the same diagonal. The increase in speed results from improved specificity so that fewer sequences are completely evaluated In practice BLAST now looks for two words of length 3 that each score at least 11 using the Blosum62 similarity scoring matrix. The two matches must be within 40 amino acids of each other on the same diagonal.

Another recent improvement in the BLAST algorithm is to provide a rigorous gapped alignment as part of the results. The first steps that BLAST made towards providing gapped alignments was to find a number of high scoring segments in addition to the maximal scoring segment. A later version used a strategy similar to that found in FASTA. In this strategy the maximal scoring segment is used to define a band in the path graph and the Smith-Waterman (17) algorithm is used to find a gapped alignment that lays entirely within the band. However, as noted in the discussion of FASTA, this strategy can produce a less than optimal Smith-Waterman alignment if the number of gaps needed in the best gapped alignment is greater than the width of the band.

The new gapped BLAST gets around this problem and still avoids the high computational cost of a full Smith-Waterman alignment on the pair of sequences by building the alignment out from a central high scoring pair of aligned amino acids in a way analogous to BLAST extends the initial maximal segment pair alignment. The initial pair of aligned amino acids is chosen as the middle pair of the highest scoring window of 11 amino acids in the high scoring segment pair alignments. The Smith-Waterman alignment is then extended in all directions in the path graph until it falls below a fixed percentage of the highest score yet computed in the Smith-Waterman phase.

This method of computing the Smith-Waterman gapped alignment will always find the best scoring Smith-Waterman alignment if two conditions are met. First, the calculation is extended until a lower bounds of a score of zero is reached. Using a higher threshold for stopping the calculation accepts a very small risk of not finding the complete alignment in return for a very large savings in computer resources. The second criterion that must be met is that the initial pair of amino acids selected as the midpoint from which to extend the alignment must actually be part of the alignment that would be reported as the best alignment by a complete Smith-Waterman alignment of the pair of sequences. Again the selection criteria used by gapped BLAST appear to be very effective in selecting an appropriate pair of amino acids from which to extend the alignment.

References

Arne Elofsson

Last modified: Tue Oct 9 19:12:29 CEST 2001