Many functionally and evolutionarily important protein similarities are recognizable only through comparison of three-dimensional structures [1,2]. When such structures are not available, patterns of conservation identified from the alignment of related sequences can aid the recognition of distant similarities. There is a large literature on the definition and construction of these patterns, which have been variously called motifs, profiles, position-specific score matrices, and Hidden Markov Models [3-11]. In essence, for each position in the derived pattern, every amino acid is assigned a score. If a residue is highly conserved at a particular position, that residue is assigned a high positive score, and others are assigned high negative scores. At weakly conserved positions, all residues receive scores near zero. Position-specific scores can also be assigned to potential insertions and deletions [4,9,11].
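The position-specific scoring idea described above can be made concrete with a small sketch. The code below is an illustration only, not the weighting scheme used by any particular program: it assumes a uniform background frequency for the 20 amino acids and simple add-one pseudocounts, and the toy alignment and the function names `build_pssm` and `score_sequence` are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log-odds score.

    Assumptions (simplifications for illustration): uniform background
    frequencies, add-one pseudocounts, and gap characters simply ignored.
    """
    background = 1.0 / len(AMINO_ACIDS)
    n_cols = len(alignment[0])
    pssm = []
    for col in range(n_cols):
        # Collect the residues observed in this column, skipping gaps.
        column = [seq[col] for seq in alignment if seq[col] in AMINO_ACIDS]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            # Positive when the residue is over-represented, negative otherwise.
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_sequence(pssm, seq):
    """Sum the position-specific scores of an ungapped sequence segment."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


# Toy alignment: columns 1 and 3 are fully conserved, column 4 is variable.
toy_alignment = ["KDGA", "KEGS", "KDGT", "KNGV"]
pssm = build_pssm(toy_alignment)
```

In this toy profile the conserved lysine in the first column receives a clearly positive score and every other residue there a negative one, whereas the scores in the variable fourth column lie closer to zero.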
The power of profile methods can be further enhanced through iteration of the search procedure [6-8,10]. After a profile is run against a database, new similar sequences can be detected. A new multiple alignment, which includes these sequences, can be constructed, a new profile abstracted, and a new database search performed. The procedure can be iterated as often as desired or until convergence, when no new statistically significant sequences are detected.
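The iteration scheme just described can be sketched as a simple loop. This is a schematic outline under stated assumptions, not the code of any existing program: `search` and `build_profile` are hypothetical callables supplied by the caller, and the toy "database" below is an invented similarity table in which each sequence is detectable only from its immediate neighbours.

```python
def iterate_until_convergence(query, search, build_profile, max_iterations=10):
    """Repeat profile construction and search until no new sequences appear.

    Returns (included sequences, iterations run, converged flag).
    """
    included = {query}
    for iteration in range(1, max_iterations + 1):
        profile = build_profile(included)   # abstract a profile from current members
        hits = search(profile)              # run the profile against the database
        new_hits = hits - included
        if not new_hits:                    # convergence: nothing new was detected
            return included, iteration, True
        included |= new_hits                # recruit new sequences for the next round
    return included, max_iterations, False


# Hypothetical toy data: which sequences are "significantly similar" to which.
SIMILAR = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}


def toy_build_profile(members):
    # Stand-in for multiple alignment and profile construction.
    return frozenset(members)


def toy_search(profile):
    # Stand-in for a database search: anything similar to any profile member.
    hits = set()
    for member in profile:
        hits |= SIMILAR[member]
    return hits


members, rounds, converged = iterate_until_convergence(
    "A", toy_search, toy_build_profile
)
```

Starting from sequence A, each round recruits one more distant relative, which is the behaviour that makes iteration more sensitive than a single pass.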
Iterated profile search methods have led to biologically important observations but, for many years, were quite slow and generally did not provide precise means for evaluating the significance of their results. This limited their utility for systematic mining of the protein databases. The principal design goals in developing the Position-Specific Iterated BLAST (PSI-BLAST) program [10] were speed, simplicity and automatic operation. The procedure PSI-BLAST uses can be summarized in five steps:
Profile-alignment statistics allow PSI-BLAST to proceed as a natural extension of BLAST; the results produced in iterative search steps are comparable to those produced from the first pass. Unlike most profile-based search methods, PSI-BLAST runs as one program, starting with a single protein sequence, and the intermediate steps of multiple alignment and profile construction are invisible to the user.
PSI-BLAST uncovers many protein relationships missed by single-pass database-search methods and has identified relationships that were previously detectable only from information about the three-dimensional structure of the proteins [10,15,16]. Here, we illustrate how to operate PSI-BLAST by using a comparison of proteins from thermophilic archaea and bacteria as an example [17]. We employ the WWW version of PSI-BLAST.
Use Entrez to find the sequence of the uncharacterized protein MJ0414 from Methanococcus jannaschii [18] in FASTA format, and paste it into the PSI-BLAST Web page. At this point, you may immediately press the Submit Query button or, instead, first tailor the search. For example, you may change the substitution and gap costs, or the cutoff E-value that PSI-BLAST uses when deciding which alignments to include in the profile for the next iteration. The default for this cutoff is a rather conservative 0.001; for this example, change it to 0.01.
Examine the results of the program's initial gapped BLAST search. The only significant hits are very strong ones to the query sequence itself, and to uncharacterized proteins from three other archaea and the thermophilic bacterium Aquifex aeolicus. However, iterating the search by using the derived profile uncovers yeast DNA ligase II [19] with an E-value of 0.005, which is moderately significant. If you have used 0.01 as the cutoff E-value for recruitment of alignments into successive profiles, the ligase sequence is included at this stage. If you left the cutoff E-value at 0.001, PSI-BLAST reports convergence because no new sequences have alignments that pass this threshold. Nevertheless, by checking the box next to the yeast DNA ligase, you can force its inclusion in the construction of a PSI-BLAST profile, and run another iteration. Because a ligase has been used in constructing the profile, the next iteration produces many highly significant alignments that involve other DNA ligases.
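The effect of the inclusion cutoff in this example can be mimicked with a few lines of code. This is a minimal sketch of the selection logic only: the function name `select_for_profile` is hypothetical, the E-value for the self hit is a placeholder rather than a reported result, and treating the cutoff as inclusive is an assumption.

```python
def select_for_profile(hits, cutoff=0.001, forced=()):
    """Return names of hits recruited into the next profile.

    A hit is recruited if its E-value is at or below the cutoff
    (inclusive boundary is an assumption) or if the user has
    selected it by hand, as with the check boxes on the Web page.
    """
    return [name for name, e_value in hits if e_value <= cutoff or name in forced]


hits = [
    ("MJ0414 (self hit)", 1e-50),     # placeholder E-value, for illustration only
    ("yeast DNA ligase II", 0.005),   # E-value quoted in the text
]

default_run = select_for_profile(hits)                   # cutoff 0.001
relaxed_run = select_for_profile(hits, cutoff=0.01)      # cutoff 0.01
forced_run = select_for_profile(hits, forced={"yeast DNA ligase II"})
```

With the default cutoff the ligase is left out, whereas either relaxing the cutoff to 0.01 or selecting the ligase by hand brings it into the next profile.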
How do we interpret these results? Once a single sequence from a highly conserved family (here, the DNA ligases) is used in constructing a profile, the rest of the family will almost certainly be retrieved (and have E-values of high significance) in subsequent iterations. Impressive E-values for sequences retrieved in later iterations depend upon the validity of earlier inferences and therefore should not be taken as automatic proof of homology. In the example considered here, the best evidence for a possible relationship between the thermophile protein family and DNA ligases is the alignment produced in the first PSI-BLAST iteration (E = 0.005). This should be taken as a hint that requires corroboration. Fortunately, the PSI-BLAST alignment of our uncharacterized protein and yeast DNA ligase here provides such corroboration (Fig. 1). The best-conserved portions of the alignment correspond perfectly to the set of conserved motifs identified in ATP-dependent DNA ligases [20], including the catalytic lysine residue that forms a covalent adduct with AMP (Fig. 1). Although the E-values reported for the other ligase alignments do nothing to confirm the relationship, the alignments themselves conform to the conservation pattern shown in Fig. 1. Thus, we can conclude that the uncharacterized archaeal and A. aeolicus proteins probably comprise a new family of ATP-dependent DNA ligases. This finding is interesting both in itself and in the context of the apparently massive horizontal gene exchange between thermophilic archaea and bacteria [17].
The WWW version of PSI-BLAST requires the user to decide after each iteration whether to continue. In some respects this is a limitation, but it has the advantage that the user can hand-pick the sequences used for each profile construction, regardless of E-value, by checking boxes next to the sequences' descriptions. A stand-alone version of PSI-BLAST (obtainable from NCBI by anonymous FTP at ftp://ncbi.nlm.nih.gov/blast/executables/) allows the user to run the program for a chosen number of iterations or until convergence; it also allows the user to save the profile produced and use it subsequently to search another database.
PSI-BLAST is a powerful tool, but its use requires caution. The sources of error are the same as for standard BLAST but are easily amplified by iteration. The major source of deceptive alignments is the presence within proteins of regions with highly biased amino acid composition [21]. If such a region is included during production of a profile, otherwise unrelated sequences containing similarly biased regions will probably creep in during subsequent iterations, rendering the search nearly worthless. PSI-BLAST filters out biased regions of query sequences by default, using the SEG program [21]. Because the SEG parameters have been set to avoid masking potentially important regions, some bias may persist; PSI-BLAST can thus still generate compositionally rooted artifacts. These cases usually can be identified by inspection, especially when sequences that have a known bias, such as myosins or collagens, are retrieved. SEG (ftp://ncbi.nlm.nih.gov/pub/seg/seg/) can be used with parameters that eliminate nearly all biased regions [21], and the user can apply other filtering procedures locally, such as COILS [22] (which detects coiled-coil regions), before submitting the appropriately masked sequence to PSI-BLAST.
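To give a feel for what compositional bias means in practice, the sketch below flags windows of low Shannon entropy in a sequence. It is deliberately not the SEG algorithm, only a simplified stand-in: the window length, the entropy threshold, the function names and the test sequence are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_regions(seq, window=12, max_entropy=2.2):
    """Return half-open (start, end) intervals covered by low-entropy windows.

    A simplified stand-in for compositional filtering; the default
    window and threshold are illustrative assumptions.
    """
    flagged = [False] * len(seq)
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) <= max_entropy:
            for i in range(start, start + window):
                flagged[i] = True

    # Collapse the per-residue flags into contiguous intervals.
    regions = []
    region_start = None
    for i, is_flagged in enumerate(flagged):
        if is_flagged and region_start is None:
            region_start = i
        elif not is_flagged and region_start is not None:
            regions.append((region_start, i))
            region_start = None
    if region_start is not None:
        regions.append((region_start, len(seq)))
    return regions


# Invented test sequence: a 15-residue glutamine run between diverse flanks.
biased = "MKTAYIHLRDEG" + "Q" * 15 + "LVHGWESPCNTF"
diverse = "ACDEFGHIKLMNPQRSTVWY"

biased_regions = low_complexity_regions(biased)
diverse_regions = low_complexity_regions(diverse)
```

A stricter threshold flags more of the sequence, which mirrors the trade-off noted above: aggressive filtering removes nearly all biased regions at the risk of masking segments that matter.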
Use Entrez to find the C-terminal region (approximately 215 residues) of human BRCA1 (SWISS-PROT accession number P38398) [23]. Search the NR protein database with this sequence using PSI-BLAST. What do the Xs in some alignments represent? Can the search be modified so that they do not appear? How many PSI-BLAST iterations can be performed before convergence? If dubious similarities pass the threshold for inclusion in profile construction during a given iteration, try removing them and check whether they reappear with significant similarity in the subsequent iteration. For published analyses of some of these similarities, see [10,24-26].