PSI-BLAST

Introduction

Many functionally and evolutionarily important protein similarities are recognizable only through comparison of three-dimensional structures [1,2]. When such structures are not available, patterns of conservation identified from the alignment of related sequences can aid the recognition of distant similarities. There is a large literature on the definition and construction of these patterns, which have been variously called motifs, profiles, position-specific score matrices, and Hidden Markov Models [3-11]. In essence, for each position in the derived pattern, every amino acid is assigned a score. If a residue is highly conserved at a particular position, that residue is assigned a high positive score, and others are assigned high negative scores. At weakly conserved positions, all residues receive scores near zero. Position-specific scores can also be assigned to potential insertions and deletions [4,9,11].
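The scoring idea can be illustrated with a minimal Python sketch that derives log-odds scores for a single alignment column. The uniform background frequency of 1/20, the pseudocount, and the function name `column_scores` are assumptions made for illustration; published profile methods use observed background frequencies and more careful sequence weighting.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_scores(column, pseudocount=1.0):
    """Return log-odds scores (in bits) for every amino acid in one alignment column.

    Assumes a uniform background of 1/20 per residue, which is a simplification.
    """
    background = 1.0 / len(AMINO_ACIDS)
    # The pseudocount keeps unobserved residues from receiving a score of minus infinity.
    total = len(column) + pseudocount * len(AMINO_ACIDS)
    scores = {}
    for aa in AMINO_ACIDS:
        freq = (column.count(aa) + pseudocount) / total
        scores[aa] = math.log2(freq / background)
    return scores

conserved = column_scores("KKKKKKKKKK")  # lysine in all ten aligned sequences
variable = column_scores("ACDEFGHIKL")   # ten different residues
```

In the conserved column, lysine receives a clearly positive score and every other residue a negative one; in the variable column, all scores stay within one bit of zero.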

The power of profile methods can be further enhanced through iteration of the search procedure [6-8,10]. After a profile is run against a database, new similar sequences can be detected. A new multiple alignment, which includes these sequences, can be constructed, a new profile abstracted, and a new database search performed. The procedure can be iterated as often as desired or until convergence, when no new statistically significant sequences are detected.

The design of PSI-BLAST

Iterated profile search methods have led to biologically important observations but, for many years, were quite slow and generally did not provide precise means for evaluating the significance of their results. This limited their utility for systematic mining of the protein databases. The principal design goals in developing the Position-Specific Iterated BLAST (PSI-BLAST) program [10] were speed, simplicity and automatic operation. The procedure PSI-BLAST uses can be summarized in five steps:

(1) PSI-BLAST takes as input a single protein sequence and compares it to a protein database, using the gapped BLAST program [10].
(2) The program constructs a multiple alignment, and then a profile, from any significant local alignments found. The original query sequence serves as a template for the multiple alignment and profile, whose lengths are identical to that of the query. Different numbers of sequences can be aligned in different template positions.
(3) The profile is compared to the protein database, again seeking local alignments. After a few minor modifications, the BLAST algorithm [10,12] can be used for this directly.
(4) PSI-BLAST estimates the statistical significance of the local alignments found. Because profile substitution scores are constructed to a fixed scale [13], and gap scores remain independent of position, the statistical theory and parameters for gapped BLAST alignments [14] remain applicable to profile alignments [10].
(5) Finally, PSI-BLAST iterates, by returning to step (2), an arbitrary number of times or until convergence.
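
The control flow of these steps can be sketched in Python as follows. This is not PSI-BLAST's implementation: `search` and `build_profile` are hypothetical callables standing in for the database search and the profile construction, and the toy database at the end is invented solely to exercise the loop.

```python
def iterate_search(query, search, build_profile, inclusion_evalue=0.001, max_iterations=5):
    """Repeat search -> profile -> search until no new sequence passes the threshold.

    search(model) is a hypothetical callable returning {sequence_id: e_value};
    build_profile(query, ids) is a hypothetical callable returning the next model.
    Returns (included ids, iterations run, converged flag).
    """
    previous = set()
    model = query  # the first pass uses the single query sequence
    for iteration in range(1, max_iterations + 1):
        hits = search(model)
        # Keep only alignments significant enough to enter the next profile.
        significant = {sid for sid, e in hits.items() if e <= inclusion_evalue}
        if significant <= previous:
            return significant, iteration, True  # nothing new: converged
        previous = significant
        model = build_profile(query, significant)
    return previous, max_iterations, False

# Toy stand-ins: the distant sequence becomes significant only after
# the close relative has been folded into the profile.
def toy_search(model):
    if "relative" in model:
        return {"query": 1e-50, "relative": 1e-20, "distant": 1e-4}
    return {"query": 1e-50, "relative": 1e-20, "distant": 5.0}

def toy_profile(query, ids):
    return frozenset(query) | frozenset(ids)

result = iterate_search(frozenset({"query"}), toy_search, toy_profile)
```

With the toy database, the distant sequence is recruited in the second pass and the third pass reports convergence.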

Profile-alignment statistics allow PSI-BLAST to proceed as a natural extension of BLAST; the results produced in iterative search steps are comparable to those produced from the first pass. Unlike most profile-based search methods, PSI-BLAST runs as one program, starting with a single protein sequence, and the intermediate steps of multiple alignment and profile construction are invisible to the user.
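
Because the same statistics apply to sequence and profile alignments, an E-value is computed in the same way in every pass. The sketch below uses the standard Karlin-Altschul forms, E = K m n exp(-lambda S) and the normalized bit score S' = (lambda S - ln K) / ln 2. The numerical values of lambda and K, and the lengths, are illustrative placeholders rather than parameters taken from this text, and the length corrections applied by BLAST itself are omitted.

```python
import math

def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score to bits."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)

def e_value_from_bits(bits, query_len, db_len):
    """Same quantity, computed from the bit score alone."""
    return query_len * db_len * 2.0 ** (-bits)

# Illustrative placeholder parameters (assumptions, not values from the text).
LAM, K = 0.267, 0.041
e_raw = e_value(100, 250, 1e8, LAM, K)
e_bits = e_value_from_bits(bit_score(100, LAM, K), 250, 1e8)
```

The two routes give the same E-value, and each additional bit of score halves it.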

An example

PSI-BLAST uncovers many protein relationships missed by single-pass database-search methods and has identified relationships that were previously detectable only from information about the three-dimensional structure of the proteins [10,15,16]. Here, we illustrate how to operate PSI-BLAST by using a comparison of proteins from thermophilic archaea and bacteria as an example [17]. We employ the WWW version of PSI-BLAST.

Use Entrez to find the sequence of the uncharacterized protein MJ0414 from Methanococcus jannaschii [18] in FASTA format, and paste it into the PSI-BLAST Web page. At this point, you may immediately press the Submit Query button or, instead, first tailor the search. For example, you may change the substitution and gap costs, or the cutoff E-value that PSI-BLAST uses when constructing a profile for the next iteration. This default E-value is the rather conservative 0.001. Change it here to 0.01.

Examine the results of the program's initial gapped BLAST search. The only significant hits are very strong ones to the query sequence itself, and to uncharacterized proteins from three other archaea and the thermophilic bacterium Aquifex aeolicus. However, iterating the search by using the derived profile uncovers yeast DNA ligase II [19] with an E-value of 0.005, which is moderately significant. If you have used 0.01 as the cutoff E-value for recruitment of alignments into successive profiles, the ligase sequence is included at this stage. If you left the cutoff E-value at 0.001, PSI-BLAST reports convergence because no new sequences have alignments that pass this threshold. Nevertheless, by checking the box next to the yeast DNA ligase, you can force its inclusion in the construction of a PSI-BLAST profile, and run another iteration. Because a ligase has been used in constructing the profile, the next iteration produces many highly significant alignments that involve other DNA ligases.

How do we interpret these results? Once a single sequence from a highly conserved family (here, the DNA ligases) is used in constructing a profile, the rest of the family will almost certainly be retrieved (and have E-values of high significance) in subsequent iterations. Impressive E-values for sequences retrieved in later iterations depend upon the validity of earlier inferences and therefore should not be taken as automatic proof of homology. In the example considered here, the best evidence for a possible relationship between the thermophile protein family and DNA ligases is the alignment produced in the first PSI-BLAST iteration (E = 0.005). This should be taken as a hint that requires corroboration. Fortunately, the PSI-BLAST alignment of our uncharacterized protein and yeast DNA ligase here provides such corroboration (Fig. 1). The best-conserved portions of the alignment correspond perfectly to the set of conserved motifs identified in ATP-dependent DNA ligases [20], including the catalytic lysine residue that forms a covalent adduct with AMP (Fig. 1). Although the E-values reported for the other ligase alignments do nothing to confirm the relationship, the alignments themselves conform to the conservation pattern shown in Fig. 1. Thus, we can conclude that the uncharacterized archaeal and A. aeolicus proteins probably comprise a new family of ATP-dependent DNA ligases. This finding is interesting both in itself and in the context of the apparently massive horizontal gene exchange between thermophilic archaea and bacteria [17].

Notes on using PSI-BLAST

The WWW version of PSI-BLAST requires the user to decide after each iteration whether to continue. In some respects this is a limitation, but it has the advantage that the user can hand-pick the sequences used for each profile construction, regardless of E-value, by checking boxes next to the sequences' descriptions. A stand-alone version of PSI-BLAST (obtainable from NCBI by anonymous FTP at ftp://ncbi.nlm.nih.gov/blast/executables/) allows the user to run the program for a chosen number of iterations or until convergence; it also allows the user to save the profile produced and use it subsequently to search another database.
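
As a hedged illustration of a stand-alone run, the Python sketch below assembles command lines for the `psiblast` executable of the current NCBI BLAST+ suite, which is an assumption: it is a later program than the stand-alone version available when this text was written. The flags `-query`, `-db`, `-num_iterations`, `-inclusion_ethresh`, `-out_pssm` and `-in_pssm` are BLAST+ options; the file and database names are hypothetical.

```python
def psiblast_command(query_fasta, database, iterations=5,
                     inclusion_evalue=0.001, pssm_out=None):
    """Build an argument list for a BLAST+ psiblast run (assumed to be installed)."""
    cmd = ["psiblast",
           "-query", query_fasta,
           "-db", database,
           "-num_iterations", str(iterations),
           "-inclusion_ethresh", str(inclusion_evalue)]
    if pssm_out is not None:
        cmd += ["-out_pssm", pssm_out]  # save the profile for later reuse
    return cmd

def psiblast_restart_command(pssm_in, database):
    """Search another database with a previously saved profile instead of a query."""
    return ["psiblast", "-in_pssm", pssm_in, "-db", database]

# Hypothetical file and database names, for illustration only.
first_run = psiblast_command("MJ0414.fasta", "nr", iterations=3,
                             inclusion_evalue=0.01, pssm_out="MJ0414.pssm")
second_run = psiblast_restart_command("MJ0414.pssm", "other_db")
```

If BLAST+ is installed, either list can be passed to `subprocess.run(cmd, check=True)`. In recent BLAST+ releases, setting `-num_iterations` to 0 requests iteration until convergence.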

PSI-BLAST is a powerful tool, but its use requires caution. The sources of error are the same as for standard BLAST but are easily amplified by iteration. The major source of deceptive alignments is the presence within proteins of regions with highly biased amino acid composition [21]. If such a region is included during production of a profile, otherwise unrelated sequences containing similarly biased regions will probably creep in during subsequent iterations, rendering the search nearly worthless. PSI-BLAST filters out biased regions of query sequences by default, using the SEG program [21]. Because the SEG parameters have been set to avoid masking potentially important regions, some bias may persist; PSI-BLAST can thus still generate compositionally rooted artifacts. These cases usually can be identified by inspection - especially when sequences that have a known bias, such as myosins or collagens, are retrieved. SEG (ftp://ncbi.nlm.nih.gov/pub/seg/seg/) can be used with parameters that eliminate nearly all biased regions [21], and the user can apply locally other filtering procedures, such as COILS [22] (which detects coiled-coil regions), before submitting the appropriately masked sequence to PSI-BLAST.
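
To make the notion of compositional bias concrete, the Python sketch below masks windows of low Shannon entropy with X. It is a simplified stand-in and not the SEG algorithm; the window length, the entropy threshold, and the function names are assumptions chosen for illustration.

```python
import math
from collections import Counter

def shannon_entropy(segment):
    """Shannon entropy of the residue composition of a segment, in bits."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mask_low_complexity(sequence, window=12, threshold=2.2):
    """Replace every residue that lies in a low-entropy window with 'X'."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)

diverse = "ACDEFGHIKLMNPQRSTVWY"
biased = diverse + "Q" * 15 + diverse  # a glutamine run between two diverse regions
masked = mask_low_complexity(biased)
```

The diverse sequence passes through unchanged, whereas the glutamine run, together with a few flanking residues that share a window with it, is replaced by X before the sequence would be submitted to a search.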

Exercise

Use Entrez to find the C-terminal region (approximately 215 residues) of human BRCA1 (SWISS-PROT accession number P38398) [23]. Search the NR protein database with this sequence using PSI-BLAST. What do the Xs in some alignments represent? Can the search be modified so that they do not appear? How many PSI-BLAST iterations can be performed before convergence? If dubious similarities pass the threshold for inclusion in profile construction during a given iteration, try removing them and check whether they reappear with significant similarity in the subsequent iteration. For published analyses of some of these similarities, see [10,24-26].

Adapted from:

Altschul, S.F. & Koonin, E.V. (1998) "Iterated profile searches with PSI-BLAST - a tool for discovery in protein databases." Trends Biochem. Sci. 23, 444-447.

References

[1] Holm, L. & Sander, C. (1997) "New structure - novel fold?" Structure 5:165-171. (PubMed)

[2] Brenner, S.E., Chothia, C. & Hubbard, T.J.P. (1998) "Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships." Proc. Natl. Acad. Sci. USA 95:6073-6078. (PubMed)

[3] Schneider, T.D., Stormo, G.D., Gold, L. & Ehrenfeucht, A. (1986) "Information content of binding sites on nucleotide sequences." J. Mol. Biol. 188:415-431. (PubMed)

[4] Gribskov, M., McLachlan, A.D. & Eisenberg, D. (1987) "Profile analysis: detection of distantly related proteins." Proc. Natl. Acad. Sci. USA 84:4355-4358. (PubMed)

[5] Staden, R. (1988) "Methods to define and locate patterns of motifs in sequences." Comput. Appl. Biosci. 4:53-60. (PubMed)

[6] Gribskov, M. (1992) "Translational initiation factor-IF-1 and factor-EIF-2-alpha share an RNA-binding motif with prokaryotic ribosomal protein-S1 and polynucleotide phosphorylase." Gene 119:107-111. (PubMed)

[7] Tatusov, R.L., Altschul, S.F. & Koonin, E.V. (1994) "Detection of conserved segments in proteins: Iterative scanning of sequence databases with alignment blocks." Proc. Natl. Acad. Sci. USA 91:12091-12095. (PubMed)

[8] Yi, T-M. & Lander, E.S. (1994) "Recognition of related proteins by iterative template refinement (ITR)." Prot. Sci. 3:1315-1328. (PubMed)

[9] Bucher, P., Karplus, K., Moeri, N. & Hofmann, K. (1996) "A flexible motif search technique based on generalized profiles." Comput. Chem. 20:3-23. (PubMed)

[10] Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res. 25:3389-3402. (PubMed)

[11] Durbin, R., Eddy, S., Krogh, A. & Mitchison, G. (1998) "Biological Sequence Analysis. Probabilistic Models of Proteins and Nucleic Acids." Cambridge University Press, Cambridge, UK.

[12] Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410. (PubMed)

[13] Karlin, S. & Altschul, S.F. (1990) "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes." Proc. Natl. Acad. Sci. USA 87:2264-2268. (PubMed)

[14] Altschul, S.F. & Gish, W. (1996) "Local alignment statistics." Meth. Enzymol. 266:460-480. (PubMed)

[15] Mushegian, A.R., Bassett, D.E. Jr., Boguski, M.S., Bork, P. & Koonin, E.V. (1997) "Positionally cloned human disease genes: patterns of evolutionary conservation and functional motifs." Proc. Natl. Acad. Sci. USA 94:5831-5836. (PubMed)

[16] Huynen, M., Doerks, T., Eisenhaber, F., Orengo, C., Sunyaev, S., Yuan, Y. & Bork, P. (1998) "Homology-based fold predictions for Mycoplasma genitalium proteins." J. Mol. Biol. 280:323-326. (PubMed)

[17] Aravind, L., Tatusov, R.L., Wolf, Y.I., Walker, D.R. & Koonin, E.V. (1998) "Evidence for massive gene exchange between archaeal and bacterial hyperthermophiles." Trends Genet. 14:442-444. (PubMed)

[18] Bult, C.J., White, O., Olsen, G.J., Zhou, L., Fleischmann, R.D., Sutton, G.G., Blake, J.A., FitzGerald, L.M., Clayton, R.A., Gocayne, J.D., Kerlavage, A.R., Dougherty, B.A., Tomb, J.F., Adams, M.D., Reich, C.I., Overbeek, R., Kirkness, E.F., Weinstock, K.G., Merrick, J.M., Glodek, A., Scott, J.L., Geoghagen, N.S.M. & Venter, J.C. (1996) "Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii." Science 273:1058-1073. (PubMed)

[19] Sterky, F., Holmberg, A., Pettersson, B. & Uhlen, M. (1996) "The sequence of a 30 kb fragment on the left arm of chromosome XV from Saccharomyces cerevisiae reveals 15 open reading frames, five of which correspond to previously identified genes." Yeast 12:1091-1095. (PubMed)

[20] Shuman, S. & Schwer, B. (1995) "RNA capping enzyme and DNA ligase: a superfamily of covalent nucleotidyl transferases." Mol. Microbiol. 17:405-410. (PubMed)

[21] Wootton, J.C. & Federhen, S. (1996) "Analysis of compositionally biased regions in sequence databases." Methods Enzymol. 266:554-571. (PubMed)

[22] Lupas, A. (1996) "Prediction and analysis of coiled-coil structures." Methods Enzymol. 266:513-525. (PubMed)

[23] Miki, Y., Swensen, J., Shattuck-Eidens, D., Futreal, P.A., Harshman, K., Tavtigian, S., Liu, Q., Cochran, C., Bennett, L.M., Ding, W., Bell, R., Rosenthal, J., Hussey, C., Tran, T., McClure, M., Frye, C., Hattier, T., Phelps, R., Haugen-Strano, A., Katcher, H., Yakumo, K., Gholami, Z., Shaffer, D., Stone, S., Bayer, S., Wray, C., Bogden, R., Dayananth, P., Ward, J., Tonin, P., Narod, S., Bristow, P.K., Norris, F.H., Helvering, L., Morrison, P., Rosteck, P., Lai, M., Barrett, J.C., Lewis, C., Neuhausen, S., Cannon-Albright, L., Goldgar, D., Wiseman, R., Kamb, A. & Skolnick, M.H. (1994) "A strong candidate for the breast and ovarian cancer susceptibility gene BRCA1." Science 266:66-71. (PubMed)

[24] Koonin, E.V., Altschul, S.F. & Bork, P. (1996) "BRCA1 protein products: Functional motifs." Nature Genet. 13:266-268. (PubMed)

[25] Bork, P., Hofmann, K., Bucher, P., Neuwald, A.F., Altschul, S.F. & Koonin, E.V. (1997) "A superfamily of conserved domains in DNA damage-responsive cell cycle checkpoint proteins." FASEB J. 11:68-76. (PubMed)

[26] Callebaut, I. & Mornon, J.P. (1997) "From BRCA1 to RAP1: a widespread BRCT module closely associated with DNA repair." FEBS Lett. 400:25-30. (PubMed)

This material was stolen from http://www.ncbi.nlm.nih.gov/BLAST/tutorial/

Arne Elofsson

Last modified: Wed Mar 7 12:49:53 CET 2001