FASTA Manual

NAME

fasta - scan a protein or DNA sequence library for similar sequences

tfasta - compare a protein sequence to a DNA sequence library, translating the DNA sequence library `on-the-fly'.

lfasta - compare two protein or DNA sequences for local similarity and show the local sequence alignments

plfasta - compare two sequences for local similarity and plot the local sequence alignments


SYNOPSIS

fasta [ -a -b # -c # -d # -f # -g # -l FASTLIBS -r STATFILE -m # -o -p # -Q -s SMATRIX -w # -x "# #" -y # -z -1 ] query-sequence-file library-file [ ktup ]

fasta [-Qabcdfghiklmnoprswxyz] query-file @library-name-file

fasta [-Qabcdfghiklmnoprswxyz] query-file "%PRMVI"

fasta [-abcdglmnoprswxy] - interactive mode

tfasta [-abcdfgkmoprsw3] protein-query-file DNA-library [ ktup ]

lfasta [-afgmnpswx] sequence-file-1 sequence-file-2 [ ktup ]

plfasta [-afgmnpsxv] sequence-file-1 sequence-file-2 [ ktup ]


DESCRIPTION

fasta compares a protein or DNA sequence to all of the entries in a sequence library. For example, fasta can compare a protein sequence to all of the sequences in the NBRF PIR protein sequence database. fasta automatically decides whether the query sequence is DNA or protein by reading the query sequence as protein and determining whether the `amino-acid composition' is more than 85% A+C+G+T. fasta uses an improved version of the rapid sequence comparison algorithm described by Lipman and Pearson (Science, (1985) 227:1427); the improved version is described in Pearson and Lipman, Proc. Natl. Acad. Sci. USA, (1988) 85:2444.

The program can be invoked either with command line arguments or in interactive mode. The optional third argument, ktup, sets the sensitivity and speed of the search. If ktup=2, similar regions in the two sequences being compared are found by looking at pairs of aligned residues; if ktup=1, single aligned amino acids are examined. ktup can be set to 2 or 1 for protein sequences, or from 1 to 6 for DNA sequences. If ktup is not specified, the default is 2 for proteins and 6 for DNA.
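The automatic DNA/protein decision described above can be sketched in a few lines of Python. This is an illustration of the 85% A+C+G+T rule only, not fasta's actual code; the function name looks_like_dna and the choice to count letters only are assumptions made for this example.

```python
def looks_like_dna(sequence: str, threshold: float = 0.85) -> bool:
    """Guess whether a sequence is DNA from its A+C+G+T fraction.

    Hypothetical helper: mirrors the "more than 85% A+C+G+T" rule
    from the text, not the real fasta implementation.
    """
    # Assumption: only letters count as residues; blanks and tabs are skipped.
    residues = [ch for ch in sequence.upper() if ch.isalpha()]
    if not residues:
        raise ValueError("empty sequence")
    acgt = sum(ch in "ACGT" for ch in residues)
    # "more than 85%" is read as a strict comparison.
    return acgt / len(residues) > threshold
```

A short peptide such as MKTAYIAKQR contains only three A/C/G/T letters out of ten, so it is classified as protein under this rule.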

fasta compares a query sequence to a sequence library, which consists of sequence data interspersed with comments (see below). Normally fasta and tfasta search the libraries listed in the file pointed to by the environment variable FASTLIBS. The format of this file is described in the file FASTA.DOC. tfasta compares a protein sequence to a DNA sequence database, translating the DNA sequence library in 6 frames `on-the-fly' (3 frames with the -3 option). The search uses the standard BLOSUM50 scoring matrix and ktup=2 by default. tfasta searches a DNA sequence database in the standard text format described below.
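To make the idea of six-frame translation concrete, the following Python sketch translates a DNA string in three forward frames and, optionally, three reverse-complement frames. It uses the standard genetic code and is an illustration only; the names translate and translate_frames are hypothetical, and nothing here reflects how tfasta is actually implemented.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def translate_frames(dna: str, forward_only: bool = False) -> list:
    """Return 3 forward frames, plus 3 reverse-complement frames by default."""
    dna = dna.upper()
    strands = [dna]
    if not forward_only:
        # Reverse complement of the input strand.
        strands.append(dna.translate(COMPLEMENT)[::-1])
    return [translate(s[offset:]) for s in strands for offset in range(3)]
```

Passing forward_only=True corresponds to the three-frame behaviour the text attributes to the -3 option.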

The lfasta and plfasta programs compare two sequences looking for local sequence similarities. While fasta and tfasta report only the best alignment between the query sequence and the library sequence, lfasta and plfasta report all of the alignments between the two sequences with scores greater than a cut-off value. lfasta shows the actual local alignments between the two sequences and their scores, while plfasta produces a plot of the alignments that looks similar to a `dot-matrix' homology plot. On Unix systems, plfasta generates tektronix output that can either be displayed on a tektronix terminal or piped through the tek2ps program for output on a laser printer. On MS-DOS systems, plfasta uses the graphics capabilities of the computer screen together with the *.BGI graphics device drivers supplied by Borland with Turbo `C'.

The fasta programs use a standard text format sequence file. Lines beginning with '>' or ';' are considered comments and ignored. Sequences can be upper or lower case; blanks, tabs and unrecognizable characters are ignored. fasta expects sequences to use the single letter amino acid codes, see protcodes(5). Library files for fasta should have the form shown below.
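As an illustration of these format rules, here is a minimal Python reader for a text-format library. It is a sketch under stated assumptions, not the parser used by fasta: the name read_library is hypothetical, each entry is assumed to start with a '>' comment line, and only letters are kept as residues.

```python
def read_library(lines):
    """Yield (comment, sequence) pairs from text-format library lines.

    Hypothetical helper: assumes every entry begins with a '>' line.
    """
    comment, residues = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new '>' line closes the previous entry, if any.
            if comment is not None:
                yield comment, "".join(residues)
            comment, residues = line[1:].strip(), []
        elif line.startswith(";"):
            continue  # ';' comment lines carry no sequence data
        elif comment is not None:
            # Assumption: keep letters only; blanks, tabs and other
            # characters are dropped. Case is preserved.
            residues.append("".join(ch for ch in line if ch.isalpha()))
    if comment is not None:
        yield comment, "".join(residues)
```

The function accepts any iterable of lines, so an open file object can be passed directly.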


OPTIONS

fasta and the other programs can be directed to change the scoring matrix, search parameters, output format, and default search directories by entering options on the command line (preceded by a `-', or by a `/' for MS-DOS). All of the options should precede the file name and ktup arguments. Alternatively, these options can be changed by setting environment variables. The options and environment variables are:

-1
Normally, the top scoring sequences are ranked by their initn score. By using the -1 option, sequences are ranked by their init1 score.

-a
(SHOWALL) Modifies the display of the two sequences in alignments. Normally, both sequences are shown only where they overlap (SHOWALL=0); if -a is given or the environment variable SHOWALL=1, both sequences are shown in their entirety.

-b #
The number of similarity scores to be shown when the -Q option is used. This value is usually calculated based on the actual scores.

-c #
(OPTCUT) The threshold for optimization with the -o option. The OPTCUT value is normally calculated based on sequence length.

-d #
The number of alignments to be shown. Normally, fasta shows the same number of alignments as similarity scores. By using fasta -Q -b 200 -d 50, one would see the top scoring 200 sequences and alignments for the 50 best scores.

-f #
Penalty for the first residue in a gap (-12 by default).

-g #
Penalty for additional residues in a gap (-2 by default).

-h
Do not display histogram of similarity scores.

-k #
(GAPCUT) Sets the threshold for joining the initial regions for calculating the initn score.

-l str
(FASTLIBS) The name of the library menu file. Normally this is determined by the environment variable FASTLIBS. However, a library menu file can also be specified with -l.

-m #
(MARKX) =0,1,2,3,4. Alternate display of matches and mismatches in alignments. MARKX=0 uses ":", ".", and " " for identities, conservative replacements, and non-conservative replacements, respectively. MARKX=1 uses " ", "x", and "X". MARKX=2 does not show the second sequence, but uses the second alignment line to display matches with a "." for identity, or with the mismatched residue for mismatches. MARKX=2 is useful for aligning large numbers of similar sequences. MARKX=3 writes out a file of library sequences in FASTA format. MARKX=3 should always be used with the "SHOWALL" (-a) option, but this does not completely ensure that all of the sequences output will be aligned. MARKX=4 displays a graph of the alignment of the library sequence with respect to the query sequence, so that one can identify the regions of the query sequence that are conserved.

-n
Forces the query sequence to be treated as a DNA sequence.

-o
Causes fasta to perform a limited optimization on all of the sequences in the library with initn scores greater than OPTCUT. This slows the program down about 5-fold, but, when combined with ktup=1, provides an extremely sensitive sequence comparison.

-Q
Quiet option. This allows fasta and tfasta to search a database and report the results without asking any questions. fasta -Q file library > output can be put in the background or run at a later time with the Unix 'at' command. The number of similarity scores and alignments displayed with the -Q option can be modified with the -b (scores) and -d (alignments) options.

-r str
(STATFILE) Causes fasta to write out the sequence identifier, superfamily number (if available), and similarity scores to STATFILE for every sequence in the library. These results are not sorted.

-s str
(SMATRIX) The filename of an alternative scoring matrix file. For protein sequences, BLOSUM50 is used by default; PAM250 can be used with the command line option -s 250.

-v str
(LINEVAL) (plfasta only) plfasta and pclfasta can use up to 4 different line styles to denote the scores of local alignments. The scores that correspond to these line styles can be specified with the environment variable LINEVAL or with the -v option. In either case, give a string of three numbers separated by spaces, surrounded by double quotation marks. For example, LINEVAL="200 100 50" tells plfasta to use solid lines for local alignments with scores greater than 200, long dashed lines for scores between 100 and 200, short dashed lines for scores between 50 and 100, and dotted lines for scores less than 50. The equivalent command line specification is plfasta -v "200 100 50". Normally, the values are 200, 100, and 50 for protein sequence comparisons and 400, 200, and 100 for DNA sequence comparisons.

-w #
(LINLEN) Output line length for sequence alignments (normally 60; can be set up to 200).

-x "offset1 offset2"
Causes fasta/lfasta/plfasta to number the aligned sequences starting with offset1 and offset2, rather than 1 and 1. This is particularly useful for showing alignments of promoter regions.

-y #
Sets the bandwidth used for optimization. -y 16 is the default for protein when ktup=2 and for all DNA alignments. -y 32 is used for protein when ktup=1. For proteins, optimization slows comparison 2-fold and is highly recommended.

-z
Do not do statistical significance calculation.

-3
tfasta only. Normally tfasta translates sequences in the DNA sequence library in all six frames. With the -3 option, only the three forward frames are searched.
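The mapping from a LINEVAL string to plfasta line styles, as described under the -v option above, can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and because the text does not say how a score exactly equal to a threshold is drawn, the sketch assumes a strict "greater than" test at each threshold.

```python
# Line styles in the order given in the -v description.
STYLES = ("solid", "long dashed", "short dashed", "dotted")


def parse_lineval(value: str) -> tuple:
    """Parse a LINEVAL-style string such as '200 100 50' into three ints."""
    thresholds = tuple(int(token) for token in value.split())
    if len(thresholds) != 3:
        raise ValueError("expected three numbers, e.g. '200 100 50'")
    return thresholds


def line_style(score: int, thresholds=(200, 100, 50)) -> str:
    """Pick a line style for an alignment score.

    Assumption: a score must be strictly greater than a threshold to
    reach that threshold's style; the source leaves ties unspecified.
    """
    for cutoff, style in zip(thresholds, STYLES):
        if score > cutoff:
            return style
    return STYLES[-1]
```

For DNA comparisons, the defaults given in the text would be passed as line_style(score, parse_lineval("400 200 100")).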

EXAMPLES

(1) fasta musplfm.aa $AABANK

Compare the amino acid sequence in the file musplfm.aa with the complete PIR protein sequence library using ktup=2. Each "library" sequence (there need only be one) should start with a comment line which starts with a '>', e.g.

     >LCBO bovine preprolactin
     WILLLSQ ...
     >LCHU human ...
     ...

(2) fasta -a -w 80 musplfm.aa lcbo.aa 1

Compare the amino acid sequence in the file musplfm.aa with the sequences in the file lcbo.aa using ktup=1. Show both sequences in their entirety, with 80 residues on each output line.

(3) fasta

Run the fasta program in interactive mode. The program will prompt for the file name for the query sequence, list alternative libraries to be searched (if FASTLIBS is set), and prompt for the ktup.
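For scripted, non-interactive searches, a command line like the ones above can be assembled programmatically. The Python sketch below builds an argument list using only options documented in this manual (-Q, -b, -d) and checks ktup against the ranges given in DESCRIPTION. The function name fasta_argv and its parameters are hypothetical, and the sketch only builds the list; it does not run fasta.

```python
def fasta_argv(query, library, ktup=None, is_dna=False,
               scores=None, alignments=None):
    """Build an argument list for a quiet (-Q) fasta search.

    Hypothetical helper: is_dna is used only to validate ktup
    (1-2 for protein, 1-6 for DNA, per the DESCRIPTION section).
    """
    max_ktup = 6 if is_dna else 2
    if ktup is not None and not 1 <= ktup <= max_ktup:
        raise ValueError(f"ktup must be between 1 and {max_ktup}")
    argv = ["fasta", "-Q"]
    if scores is not None:
        argv += ["-b", str(scores)]      # number of similarity scores
    if alignments is not None:
        argv += ["-d", str(alignments)]  # number of alignments
    # Options precede the file names; ktup, if given, comes last.
    argv += [query, library]
    if ktup is not None:
        argv.append(str(ktup))
    return argv
```

On a system where fasta is installed, the resulting list could be handed to subprocess.run with stdout redirected to a file, which matches the fasta -Q file library > output pattern described under -Q.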


FILES

This version of fasta prompts for the library file to be searched from a list of file names that are saved in the file pointed to by the environment variable FASTLIBS. If FASTLIBS = fastgb.list, then the file fastgb.list might have the entries:

     NBRF Protein$0P/u/lib/aabank.lib 0
     GB Primate$1P@/u/lib/gpri.nam
     GB Rodent$1R@/u/lib/grod.nam
     GB Mammal$1M@/u/lib/gmammal.nam

Each line in this file has 4 fields: (1) the library name, separated from the remaining fields by a '$'; (2) a 0 or a 1, indicating a protein or DNA library respectively; (3) a single letter that will be used to choose the library; (4) the location of the library file itself. The library file name can contain an optional library format specifier. fasta recognizes the following library formats:

Note that this fourth field can contain an '@' character, which indicates that the library file is an indirect library file containing a list of library files, one per line. An indirect library file can also contain a line beginning with the symbol '<', followed by the directory where the library files may be found, and a line beginning with a '>', indicating the name of the index file (GENBANK compressed floppy format files only). An indirect library file might have the lines:

     </usr/slib/genbank  (the directory for the library files)
     >glocus.idx         (index file for GENBANK binary files)
     gpri1.seq 9
     gpri2.seq 9
     gpri3.seq 9
     ...
     grod1.seq 9
     ...
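The FASTLIBS menu entries and indirect library files described above can be parsed with a short Python sketch. The names LibraryEntry, parse_fastlibs_line and parse_indirect are hypothetical. The sketch also assumes that the optional format specifier is a trailing number separated from the file name by a space (as in "aabank.lib 0" and "gpri1.seq 9"), and that the parenthesized annotations in the example above are explanatory and do not appear in a real file.

```python
from typing import NamedTuple, Optional


class LibraryEntry(NamedTuple):
    name: str                   # field 1: text before '$'
    is_dna: bool                # field 2: '0' protein, '1' DNA
    selector: str               # field 3: single choice letter
    path: str                   # field 4: location, without '@' or format
    indirect: bool              # True if the location started with '@'
    format_code: Optional[int]  # assumed trailing format number, if any


def _split_format(text: str):
    """Split 'file 9' into ('file', 9); the format code is optional."""
    path, _, fmt = text.strip().partition(" ")
    return path, int(fmt) if fmt.strip() else None


def parse_fastlibs_line(line: str) -> LibraryEntry:
    name, sep, rest = line.strip().partition("$")
    if not sep or len(rest) < 3 or rest[0] not in "01":
        raise ValueError(f"malformed FASTLIBS entry: {line!r}")
    location = rest[2:]
    indirect = location.startswith("@")
    path, fmt = _split_format(location[1:] if indirect else location)
    return LibraryEntry(name, rest[0] == "1", rest[1], path, indirect, fmt)


def parse_indirect(lines):
    """Return (directory, index_file, [(file, format_code), ...])."""
    directory, index_file, files = None, None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("<"):
            directory = line[1:].strip()   # where the library files live
        elif line.startswith(">"):
            index_file = line[1:].strip()  # GENBANK binary index file
        else:
            files.append(_split_format(line))
    return directory, index_file, files
```

Applied to the first menu entry shown earlier, parse_fastlibs_line yields a protein library named "NBRF Protein" with selector letter P, path /u/lib/aabank.lib, and format code 0.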

This version of fasta can also distinguish between normal text library files (as shown above in EXAMPLE (1)) and DNA libraries in the GENBANK compressed floppy disk format. These latter files are binary files that are distributed by Intelligenetics on floppy disks. Earlier versions of fasta (and fastn before it) used different programs to read the text library files (old fasta or ifastn) and the compressed files (old fastgb and gfastn). These routines have been combined in the current fasta.

You can use your own sequence files for fasta; just be certain to put a '>' and comment as the first line before the sequence. Only one library file type, the standard NBRF library format, is supported by the VAX/VMS programs. lfasta and plfasta do not require the '>' and comment line; fasta does.


SEE ALSO

rdf2(1), protcodes(5), dnacodes(5)


AUTHOR

Bill Pearson
wrp@virginia.EDU

Created by Tod M. Klingler, klingler@cmgm.stanford.edu