Database Searching: An Overview

Database searching is different from a sequence alignment.

Although a database search typically yields several alignments as part of its results these alignments are not really our goal. Our goal is the scores accompanying these alignments. We want to use these scores to distinguish sequences related to our query sequence from sequences that are not related to our query sequence. Because of this we will select parameters for database searching that give us the scores that best distinguish the related from the unrelated sequences in the database. Generally, these are not the parameters that will give us the most accurate alignment, the alignment that is our best estimate of the actual history of substitutions, insertions, and deletions in the two sequences.

The goal of a database search is almost always to find homologous genes or proteins and NOT FIND non-homologous ones. Homologous genes are genes that have a common ancestor. Evolution is a conservative process, therefore although the sequence residues may change, the chemical and physical properties necessary to maintain very similar biochemical and physiological processes are conserved. Thus, if we are able to discover an already characterized sequence in the database that is homologous to our newly determined gene or protein it is quite likely that our newly determined sequence carries out biochemical and physiological processes very much like those of the well characterized sequence in the database. It should also be taken into account that that paralogous genes, i.e. genes that have appeared after a gene duplication event, often have evolved to perform different functions although they are homologous.

To assess whether a given alignment constitutes evidence for homology, it helps to know how strong an alignment can be expected from chance alone. In this context, "chance" can mean the comparison of (i) real but non-homologous sequences; (ii) real sequences that are shuffled to preserve compositional properties or (iii) sequences that are generated randomly based upon a DNA or protein sequence model. Analytic statistical results invariably use the last of these definitions of chance, while empirical results based on simulation and curve-fitting may use any of the definitions.

One of the aims of sequence alignments is to determine whether proteins are homologous or not. The alignment score is an estimate of the similarity between two sequences, but it will depend on the scoring matrix and the gap penalties as well as on the composition of the sequences. Alignments of many randomized sequences with the same compositions will give a distribution of scores, and from this distribution one can by normal statistical methods estimate how likely it is that the observed score could be obtained by chance. This assumes that truly unrelated sequences have no systematic similarities. This assumption might not be completely correct, since in b sheets or helices at the surface of proteins, there is a tendency for certain regularity in the distribution of side chains due to the packing of non-polar side chains in the hydrophobic core. Since there are proteins with the same fold but with no other evidence of a common ancestry (structural convergence), there is a possibility that also sequence similarity could be the result of convergence from unrelated ancestors. Although there have been examples of convergent adaptive changes of a small number of residues, sequence convergence for whole proteins and domains has not been observed. A significant sequence similarity for a domain can therefore always be used to prove homology.

Comparisons of many sequences and structures has led to a rule of thumb that a 25 % sequence identity between two sequences can be regarded as a reasonable evidence of homology, at least for sequences of more than 200 residues. For shorter sequences, a still higher level of identity is needed for a conclusive decision of homology . Below that number, there is a gray zone between 15 and 25 % identity where the sequence similarity still might be an indication of a relation, but a random similarity is also possible. However recent studies have clearly showed that fraction identity is non optimal measure for judging homology. Much better results are obtained if a proper statistical model is used, however this (bad) habit of reporting fraction identical residues still exist. In the next session we will describe such a theory.

Parameters to consider when we search a database

The similarity scores used to assess sequence likeness are the most important source of previous knowledge in database searching.

These include the widely used PAM and BLOSUM similarity matrices used in protein database searches as well as PAM matrices and other scores used to search nucleic acid databases. These matrices incorporate many observations of which amino acids have replaced each other while the proteins were evolving in different species but still maintaining the same biochemical and physiological functions. They rescue us from the ignorance of having to assume that all amino acid changes are equally likely and equally harmful. Different similarity matrices are appropriate for different degrees of evolutionary divergence. Any matrix is most likely to find good matches with other sequences that have diverged from your query sequence to the extent for which the matrix is suited (4). Similar matrices are available, if not widely used, for DNA. The DNA matrices can incorporate knowledge about differential rates of transitions and transversions in the same way that some substitutions are judged more favorable than others in protein similarity matrices.

The choice of substitution matrix

The results a local alignment program produces depend strongly upon the scores it uses. No single scoring scheme is best for all purposes, and an understanding of the basic theory of local alignment scores can improve the sensitivity of one's sequence analysis. As before, the theory is fully developed only for scores used to find ungapped local alignments, so we start with that case.

A large number of different amino acid substitution scores, based upon a variety of rationales, have been described [23-36]. However the scores of any substitution matrix with negative expected score can be written uniquely in the form.

where the q_ij, called target frequencies, are positive numbers that sum to 1, the p_i are background frequencies for the various residues, and lambda is a positive constant [10,31]. The lambda here is identical to the lambda of equation (1).

Multiplying all the scores in a substitution matrix by a positive constant does not change their essence: an alignment that was optimal using the original scores remains optimal. Such multiplication alters the parameter lambda but not the target frequencies q_ij. Thus, up to a constant scaling factor, every substitution matrix is uniquely determined by its target frequencies. These frequencies have a special significance [10,31]:

A given class of alignments is best distinguished from chance by the substitution matrix whose target frequencies characterize the class.

To elaborate, one may characterize a set of alignments representing homologous protein regions by the frequency with which each possible pair of residues is aligned. If valine in the first sequence and leucine in the second appear in 1% of all alignment positions, the target frequency for (valine, leucine) is 0.01. The most direct way to construct appropriate substitution matrices for local sequence comparison is to estimate target and background frequencies, and calculate the corresponding log-odds scores of formula (6). These frequencies in general can not be derived from first principles, and their estimation requires empirical input.

The PAM and BLOSUM amino acid substitution matrices

While all substitution matrices are implicitly of log-odds form, the first explicit construction using formula (6) was by Dayhoff and coworkers [24,25]. From a study of observed residue replacements in closely related proteins, they constructed the PAM (for "point accepted mutation") model of molecular evolution. One "PAM" corresponds to an average change in 1% of all amino acid positions. After 100 PAMs of evolution, not every residue will have changed: some will have mutated several times, perhaps returning to their original state, and others not at all. Thus it is possible to recognize as homologous proteins separated by much more than 100 PAMs. Note that there is no general correspondence between PAM distance and evolutionary time, as different protein families evolve at different rates.

Using the PAM model, the target frequencies and the corresponding substitution matrix may be calculated for any given evolutionary distance. When two sequences are compared, it is not generally known a priori what evolutionary distance will best characterize any similarity they may share. Closely related sequences, however, are relatively easy to find even will non-optimal matrices, so the tendency has been to use matrices tailored for fairly distant similarities. For many years, the most widely used matrix was PAM-250, because it was the only one originally published by Dayhoff.

Dayhoff's formalism for calculating target frequencies has been criticized [27], and there have been several efforts to update her numbers using the vast quantities of derived protein sequence data generated since her work [33,35]. These newer PAM matrices do not differ greatly from the original ones [37].

An alternative approach to estimating target frequencies, and the corresponding log-odds matrices, has been advanced by Henikoff & Henikoff [34]. They examine multiple alignments of distantly related protein regions directly, rather than extrapolate from closely related sequences. An advantage of this approach is that it cleaves closer to observation; a disadvantage is that it yields no evolutionary model. A number of tests [13,37] suggest that the "BLOSUM" matrices produced by this method generally are superior to the PAM matrices for detecting biological relationships.

DNA substitution matrices

While we have discussed substitution matrices only in the context of protein sequence comparison, all the main issues carry over to DNA sequence comparison. One warning is that when the sequences of interest code for protein, it is almost always better to compare the protein translations than to compare the DNA sequences directly. The reason is that after only a small amount of evolutionary change, the DNA sequences, when compared using simple nucleotide substitution scores, contain less information with which to deduce homology than do the encoded protein sequences [32].

Sometimes, however, one may wish to compare non-coding DNA sequences, at which point the same log-odds approach as before applies. An evolutionary model in which all nucleotides are equally common and all substitution mutations are equally likely yields different scores only for matches and mismatches [32]. A more complex model, in which transitions are more likely than transversions, yields different "mismatch" scores for transitions and transversions [32]. The best scores to use will depend upon whether one is seeking relatively diverged or closely related sequences [32].

Gap scores

Our theoretical development concerning the optimality of matrices constructed using equation (6) unfortunately is invalid as soon as gaps and associated gap scores are introduced, and no more general theory is available to take its place. However, if the gap scores employed are sufficiently large, one can expect that the optimal substitution scores for a given application will not change substantially. In practice, the same substitution scores have been applied fruitfully to local alignments both with and without gaps. Appropriate gap scores have been selected over the years by trial and error [13], and most alignment programs will have a default set of gap scores to go with a default set of substitution scores. If the user wishes to employ a different set of substitution scores, there is no guarantee that the same gap scores will remain appropriate. No clear theoretical guidance can be given, but "affine gap scores" [38-41], with a large penalty for opening a gap and a much smaller one for extending it, have generally proved among the most effective.

Low complexity sequence regions

There is one frequent case where the random models and therefore the statistics discussed here break down. As many as one fourth of all residues in protein sequences occur within regions with highly biased amino acid composition. Alignments of two regions with similarly biased composition may achieve very high scores that owe virtually nothing to residue order but are due instead to segment composition. Alignments of such "low complexity" regions have little meaning in any case: since these regions most likely arise by gene slippage, the one-to-one residue correspondence imposed by alignment is not valid. While it is worth noting that two proteins contain similar low complexity regions, they are best excluded when constructing alignments [42-44]. The BLAST programs employ the SEG algorithm [43] to filter low complexity regions from proteins before executing a database search.

Choice of algorithm

The different algorithms, Smith-Waterman, FASTA, or BLAST add different restrictions to the simple model of sequence evolution on which database searching is based. Smith-Waterman is the most rigorous algorithm and does not place any heuristic restrictions on the evolutionary model. Smith-Waterman is both the most sensitive and the least selective algorithm. The actual pattern of evolutionary changes between your newly determined query sequence and any homologue in the database can be incompatible with the heuristic restrictions imposed by either BLAST or FASTA. Alternatively the additional selectivity that results from these restrictions can sometimes be an advantage. There is no one program that is always best at finding distantly related sequences for all gene and protein families, although the Smith-Waterman algorithm is most often the best (2,3).

Finally, the sequence database itself represents a large store of previously acquired knowledge.

Making the best use of this knowledge can save us many months of expensive laboratory experimentation and allow us to use our limited resources to acquire truly new knowledge. The size of this potential gain is the determining factor in deciding how much effort to devote to any particular database search.

The content of this page was mainly stolen from http://www.psc.edu/biomed/dist-ed/ and http://www.ncbi.nlm.nih.gov/BLAST/tutorial/

Arne Elofsson

Last modified: Wed Nov 6 15:16:08 CET 2002