FASTA a fast approximation to Smith-Waterman

FASTA: A fast approximation to Smith-Waterman

FASTA is a two step algorithm (19). The first step is a search for highly similar seqments in the two sequences. In this search a word with a specific word size is used to find regions in a two-dimensional table table similar to that shown for the Smith-Waterman algorithm. . These regions are a diagonal or a few closely spaced diagonals in the table which have a high number of identical word matches between the sequences. The second step is a Smith-Waterman alignment centered on the diagonals that correspond to the alignment of the highly similar sequence segments. The region for the Smith-Waterman alignment is bounded by a window size. The window size limits the number of insertions or deletions one sequence can accumulate with respect to the other sequence in the alignment. Thus, the significant speedups in observed in a FASTA search relative to a full Smith-Waterman search is due to the prior restriction in alignment space as well as skipping the Smith-Waterman step when no diagonals are found that correspond to aolignments between highly similar sequence segments.

The FASTA algorithm is a heuristic approximation to the Smith-Waterman algorithm. The heuristics used by FASTA allow it to run much faster that the Smith-Waterman algorithm but at the cost of some sensitivity. Two heuristics are employed. Both can be interpreted as restrictions on the model of sequence evolution that is used in comparing the sequences. The first restriction is implemented by the word size parameter, usually two for proteins and six for nucleic acids. This means that FASTA constrains the evolution between a pair of sequences to preserve a number of unchanged dipeptides or hexanucleotides. The effect of this word size restriction can be seen in the dot plots shown below.

In the dot plots below each asterisks marks the beginning of a string of matches in the two short nucleotide sequence shown at the edges of the dot plot. The length of the matches thus varies between one and three nucleotides. As can be seen with a word size of one the dot plot is so noisy that it is difficult to pick out any region that might correspond to an alignment that accurately reflects homology between the sequences. However, when we shift to a word size of two the region more or less in the center of the dot plot stands out as offering the most possibilities for a long alignment with a high degree of sequence identity. In cases like this the loss of sensitivity resulting from increasing the word size has also resulted in a worthwhile increase in selectivity. In the final dot plot, with a word size of three, not only has most of the noise in the dot plot been removed but so has most of the information indicating the regions involved in the best alignment between the two sequences. Given the same scoring parameters the best alignment should be the same alignment found by the Smith-Waterman algorithm and shown above.

Dot Plot: Word Size = 1

    g  c  t  g  g  a  a  g  g  c  a  t
g   *     
  *  *        *  *
c      *                       *
a                  *
 *           *
g   *        *  *        *  *
a                  *  *    
      *
g   *        *  *        *  *
c      *                       *
a
                 *  *           *
c      *                       *
t    
    *                          *

Dot Plot: Word Size = 2

    g  c  t  g  g  a  a  g  g  c  a  t
g   *     
                 *
c                              *
a                   
 *
g               *
a                     *
g   *                  
    *
c                              *
a
c      *
t

Dot Plot: Word Size = 3

    g  c  t  g  g  a  a  g  g  c  a  t
g         
                 *
c
a
g
a
g                           *
c
a
c
t

The first step in the FASTA algorithm is to divide the query sequence into its constituent overlapping words of length two for proteins or six for nucleic acids. This process is shown in the table below. Then as each sequence is read from the database it is also divided into its constituent overlapping words. These two list of words are compared to find the identical words in both sequences. An initial score is computed based on the number of identities concentrated within small regions of the dot plot. If this initial score is high enough a second score is computed by evaluating which of the initial identities can be joined into a consistent alignment using only gaps of less than some maximum length. This maximum lengths for gaps is set by the window size parameter. Finally, if the secondary score is high enough then a Smith-Waterman alignment is performed within the region of the dot plot defined by the concentrated identities and the window size. This third score is reported as the optimal score.

Creating a Word List for FASTA, Word Size = 6


    g  c  t  g  g  a  a  g  g  c  a  t
    g  c  t  g  g  a
       c  t  g  g  a  a

          t  g  g  a  a  g
             g  g  a  a  g  g
                g  a  a  g  g  c

                   a  a  g  g  c  a
                      a  g  g  c  a  t

The weaknesses in the FASTA approach can be shown in two pathological examples. The first example would have two proteins that share 50% identity - but the proper alignment consists of alternating match and mismatches. With a word size of two, there would be no word matches along the main diagonal of the dot plot for the sequences (although there will potentially be random or spurious word matches on the off-diagonals) and the proper alignment would probably not be found. The second case consists of two proteins that are almost identical, except the second protein has a 20 residue insertion into the middle of the sequence. If the window size is 15, then the Smith-Waterman alignment phase of FASTA will not have enough alignment space to jump the 20 residue insertion. Thus, the resulting alignment will be either the sequence prior to or after the insertion (whichever had the higher diagonal scores) and the fact that the proteins were basically identical (with only one long insertion) will be missed.

The window size discussed in the previous paragraph is the second heuristic used by FASTA. Its effect is more variable than that of the word size. If the best alignment lies entirely within the window defined by the window size and the concentrated identities there is no effect. However if the best alignment, as defined by a full Smith-Waterman analysis, goes outside the window then a lower scoring alignment will be found by FASTA. This can lead the user to conclude that the sequences are not homologous when in fact they are and homology could have been inferred from a full Smith-Waterman alignment.

In practice these pathological cases are very unlikely. However, similar cases do occur and the loss of sensitivity caused by the use of these heuristics will be seen in the examples presented at the end of the tutorial.

References

Description of FASTA

Arne Elofsson

Last modified: Tue Oct 9 19:15:02 CEST 2001