

Profiles-3D

2       Theory


Overview of 3D Profile Analysis

There is to date no known solution to the protein folding problem (that is, the prediction of protein tertiary structure from an amino acid sequence). There are, however, hundreds of experimentally determined protein structures. These can be grouped into structurally related families, where members of each group have the same basic fold. An important principle to emerge from the study of such families is that protein folding is highly degenerate: a given fold can be encoded by many different sequences, some of which may show no significant sequence homology to other members of the family. This observation raises the intriguing possibility that many of the proteins for which structures are not known, and for which there is no detectable sequence homology with proteins of known structure, may nevertheless have essentially the same fold as a known structural family. In such a case the problem that must be solved is the inverse of the protein folding problem: given a protein structure, what amino acid sequences are likely to fold into that structure?

Molecular Simulations' Profiles-3D product provides tools for answering this question. It is based largely on algorithms developed in the laboratory of Dr. David Eisenberg at the University of California, Los Angeles. The 3D profile method measures the compatibility of an amino acid sequence with a three-dimensional protein structure. It does this by reducing the three-dimensional structure to a simplified one-dimensional representation called an environment string, which is then compared with the one-dimensional amino acid sequence. The method is highly sensitive to distant relationships that cannot be detected by sequence similarity alone. It can also be used to check the validity of a hypothetical protein structure by measuring the compatibility of that structure with the protein's own sequence.

Regardless of the application, the method involves three basic operations:

1.   Reduction of the three-dimensional structure to a one-dimensional string of residue environments. These environments are categorized according to the area of the side chain that is buried in the protein, the fraction of the side chain area that is exposed to polar atoms, and the local secondary structure.

2.   Generation of a position-dependent comparison matrix known as the 3D profile. This is calculated from the string of residue environments, heuristic gap penalties and a precalculated scoring matrix. The gap penalties may vary with position in the profile. The scoring matrix is calculated from the probabilities of finding each of the twenty amino acids in each of the environment classes as observed in a database of known structures and related sequences.

3.   Alignment of a sequence with the 3D profile. The resulting alignment score is a measure of the compatibility of the sequence with the structure described by the 3D profile.

The following sections explain the details of these operations and the various ways in which the results can be used.


Determining the Residue Environment

The environment of each residue in the three-dimensional structure is first classified according to the area of the side chain that is buried in the protein. A residue is considered exposed to solvent (environment class E) if the area buried is less than 40 Å². It is considered partially buried (class P) if the area buried is between 40 and 114 Å². It is considered buried (class B) if the area buried is greater than 114 Å². The buried and partially buried classes are further subdivided according to the fraction of the side chain area that is exposed to polar atoms ("fraction polar", denoted f). For this purpose polar atoms are defined as those of the solvent and the oxygen and nitrogen atoms of the protein. The buried class is subdivided into classes B1 (f < 0.45), B2 (0.45 ≤ f < 0.58) and B3 (f ≥ 0.58). The partially buried class is subdivided into classes P1 (f < 0.67) and P2 (f ≥ 0.67). These six basic environment classes (E, P1, P2, B1, B2 and B3) are summarized in Figure 1. The determination of the boundaries between them is explained later in this chapter (see The 3D-1D Scoring Matrix).

Figure 1. Environment Classes

The six basic environment classes determined by the area of the side chain that is buried in the protein and by the fraction of the side chain area that is exposed to polar atoms (after figure 4 of Bowie et al., 1991).  

Finally, each of the six basic environment classes is subdivided into three classes according to the local secondary structure: alpha helix, beta sheet, or other. The result is a total of eighteen distinct environment classes.
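The classification just described can be summarized in a short Python sketch. The function name, the class labels and the string codes for secondary structure are illustrative choices, not part of the Profiles-3D program; only the numerical thresholds are taken from the text above.

def environment_class(area_buried, fraction_polar, secondary_structure):
    """Assign one of the 18 environment classes, e.g. 'B1-helix'.

    area_buried         -- side chain area buried in the protein (square Angstrom)
    fraction_polar      -- fraction of the side chain area exposed to polar atoms
    secondary_structure -- 'helix', 'sheet' or 'other'
    """
    if area_buried < 40.0:
        burial = "E"                                       # exposed to solvent
    elif area_buried <= 114.0:
        burial = "P1" if fraction_polar < 0.67 else "P2"   # partially buried
    elif fraction_polar < 0.45:
        burial = "B1"                                      # buried, apolar surroundings
    elif fraction_polar < 0.58:
        burial = "B2"
    else:
        burial = "B3"                                      # buried, polar surroundings
    # Each of the six basic classes is split by local secondary structure,
    # giving 6 x 3 = 18 classes in total.
    return burial + "-" + secondary_structure

print(environment_class(120.0, 0.30, "helix"))             # -> B1-helix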


Calculation of Area Buried and Fraction Polar

The solvent-accessible area of each atom in a side chain is measured by first placing an imaginary solvent sphere around the atom, with radius equal to the sum of the van der Waals radius of the atom (Richmond and Richards, 1978) and the radius of a water molecule. Sample points are placed on this sphere every 0.75 Å. If a sample point is not within the solvent sphere of any other protein atom, then that point is considered accessible to the solvent; otherwise it is considered buried. The solvent-accessible surface area of the atom, denoted Aa, is then calculated thus:

Eq. 1        Aa = (na / nt) × At

where na is the number of sample points accessible to solvent, nt is the total number of sample points on the solvent sphere, and At is the total surface area of the solvent sphere. The solvent-accessible area of the side chain is then calculated as the sum of the solvent-accessible areas of its constituent atoms, including the alpha carbon. The area of the side chain that is buried in the protein is defined as the difference between the solvent-accessible area of the side chain in the protein and in a GLY-X-GLY tripeptide. The latter value has been tabulated for each of the 20 amino acids by Eisenberg et al. (1989).

The fraction of the side chain area covered by polar atoms is calculated thus:

Eq. 2        f = Np / Nt

where Np is the number of sample points covered by polar atoms (nitrogen, oxygen or solvent) for all atoms in the side chain, and Nt is the total number of sample points for all atoms in the side chain. Sample points covered by atoms of the side chain itself are excluded from both of the counts Np and Nt. If a sample point lies within the solvent sphere of both a polar and a nonpolar atom, then that point is counted in Np if and only if the polar atom is the closer of the two.
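As an illustration of Eq. 1, the following Python sketch approximates the accessible area of a single atom. For brevity it samples the solvent sphere with random points rather than the regular 0.75 Å grid described above, and the probe radius of 1.4 Å is a common assumption; none of the names below come from the Profiles-3D implementation.

import numpy as np

R_WATER = 1.4   # assumed probe (water) radius in Angstrom

def accessible_area(atom_xyz, atom_vdw, other_xyz, other_vdw, n_samples=1000):
    """Approximate the solvent-accessible area Aa of one atom (Eq. 1)."""
    rng = np.random.default_rng(0)
    r = atom_vdw + R_WATER                       # radius of the solvent sphere
    # Random points on the surface of the solvent sphere
    v = rng.normal(size=(n_samples, 3))
    points = np.asarray(atom_xyz) + r * v / np.linalg.norm(v, axis=1, keepdims=True)
    # A point is buried if it lies inside the solvent sphere of any other atom
    other_xyz = np.asarray(other_xyz, dtype=float)
    other_r = np.asarray(other_vdw, dtype=float) + R_WATER
    d = np.linalg.norm(points[:, None, :] - other_xyz[None, :, :], axis=2)
    accessible = np.all(d >= other_r[None, :], axis=1)
    area_total = 4.0 * np.pi * r ** 2            # At, total area of the sphere
    return accessible.sum() / n_samples * area_total   # Eq. 1: (na/nt) * At

The fraction polar of Eq. 2 is obtained in the same spirit by labeling each covered sample point with its nearest covering atom and counting the points covered by nitrogen, oxygen or solvent.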


The 3D-1D Scoring Matrix

Although the environment string is one-dimensional, like an amino acid sequence, it cannot be compared to or aligned with a sequence without some measure of the compatibility of each of the twenty amino acids with each of the eighteen environment classes. The 3D-1D scoring matrix is a 20 × 18 matrix that contains exactly this information. Each element si,j in the matrix is a score that indicates the compatibility of residue i with environment j. The greater the value, the greater the compatibility. Negative values indicate poor compatibility. For a complete listing of the matrix, see 3D-1D Scoring Table in File Formats.

The individual matrix elements are information values (Fano, 1961) and are calculated thus:

Eq. 3        si,j = ln [ P(i:j) / Pi ]

where P(i:j) is the probability of finding residue i in environment j, and Pi is the overall probability of finding residue i in any environment. For the 3D-1D scoring matrix, these probabilities were estimated from a database of 16 known protein structures and sets of homologous sequences aligned to the sequences of the 16 structures. The structures chosen and the methods for selecting and aligning the related sequences are described by Lüthy et al. (1991). For each residue position in each of the 16 sets of aligned sequences, the environment class was determined from the known structure.

The probability P(i:j) was then estimated thus:

Eq. 4        P(i:j) = Ni,j / Nj

where Ni,j is the number of positions in the 16 alignments at which one or more residues of type i were found to align with environment j, and Nj, the denominator, is the total number of residue replacements found for all environments of type j.

The probability Pi was estimated thus:

Eq. 5        Pi = Ni / N

where Ni is the number of positions in the 16 alignments at which one or more residues of type i occurred, and N, the denominator, is the total number of residue replacements in the database of alignments (8273 for the 16 alignments used to generate the 3D-1D scoring matrix).
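A compact Python sketch of Eqs. 3 through 5 is shown below. The counts array stands for the table of Ni,j values gathered from the alignments; the toy numbers are invented purely for illustration, and the real matrix is of course 20 × 18.

import numpy as np

def scoring_matrix(counts):
    """Build a 3D-1D scoring matrix from a table of counts Ni,j."""
    counts = np.asarray(counts, dtype=float)
    # Eq. 4: P(i:j) = Ni,j / Nj, normalising each environment column
    p_i_given_j = counts / counts.sum(axis=0, keepdims=True)
    # Eq. 5: Pi = Ni / N, the overall frequency of each residue type
    p_i = counts.sum(axis=1, keepdims=True) / counts.sum()
    # Eq. 3: si,j = ln( P(i:j) / Pi )
    return np.log(p_i_given_j / p_i)

# Toy example with 3 residue types and 2 environment classes
toy_counts = [[30,  5],
              [10, 20],
              [ 5, 25]]
print(np.round(scoring_matrix(toy_counts), 2))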

The boundaries of the environment classes (Figure 1) were then adjusted iteratively, the matrix being recalculated each time, so as to maximize St, the total 3D-1D score summed over all residue types and environment classes in the 16 alignments:

Eq. 6        St = Σi Σj Ni,j si,j

For further details see Bowie et al. (1991).


Alignment Using the Profile Method

It is possible simply to align an amino acid sequence to an environment string using the 3D-1D scoring matrix and any conventional sequence alignment algorithm. It is advantageous, however, not to do so at this stage, but instead first to construct from the environment string a 3D profile. The profile is a matrix of 22 columns and N rows, where N is the number of residue positions in the environment string. The first 20 elements of each row j are the compatibilities of each of the 20 amino acids with the environment at position j in the environment string, taken from the 3D-1D scoring matrix. The last two elements are the penalties for gap opening and gap extension, respectively, at position j. The amino acid sequence is aligned with the 3D profile using a dynamic programming algorithm (Smith and Waterman 1981).

An advantage of the profile method is that it facilitates the use of position-dependent gap penalties. This makes it possible, for example, to impose a higher penalty on gaps that occur within alpha helices or beta sheets (Lüthy et al. 1991). The rationale for this strategy is that insertions and deletions are more likely to occur in the random coil regions that separate regions of regular secondary structure.
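The profile construction and the use of position-dependent gap penalties can be sketched in Python as follows. The residue ordering, the default penalties and the factor by which gaps inside helices and sheets are stiffened are illustrative choices, not the values used by Profiles-3D.

import numpy as np

# Assumed residue ordering for the 20 score columns of the profile
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(env_classes, sec_struct, scoring, gap_open=5.0, gap_extend=0.5):
    """Build an N x 22 profile: 20 scores plus gap-open and gap-extend penalties.

    env_classes -- environment-class index for each of the N positions
    sec_struct  -- 'helix', 'sheet' or 'other' for each position
    scoring     -- 20 x 18 3D-1D scoring matrix
    """
    n = len(env_classes)
    profile = np.zeros((n, 22))
    for pos, (env, ss) in enumerate(zip(env_classes, sec_struct)):
        profile[pos, :20] = scoring[:, env]          # compatibility scores
        stiff = 2.0 if ss in ("helix", "sheet") else 1.0
        profile[pos, 20] = gap_open * stiff          # position-dependent penalties
        profile[pos, 21] = gap_extend * stiff
    return profile

def match_score(profile, pos, residue):
    """Score used by the alignment algorithm for placing a residue at position pos."""
    return profile[pos, AMINO_ACIDS.index(residue)]

During the dynamic programming alignment, match_score plays the role of the substitution score, and the gap penalties are read from the last two columns of the row at which the gap is opened or extended.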


Strategies for Using the 3D Profile Method

The alignment of an amino acid sequence with a 3D profile yields an overall 3D-1D score that is a measure of the compatibility of the sequence with the structure described by the profile. There are different strategies for choosing the environment(s) and sequence(s) to be aligned, and for using the compatibility measures, depending on the problem to be addressed.

1.   Given a sequence, find compatible structures.

This strategy is appropriate for the beginning of a Homology modeling project. The goal is to build a model structure of a protein for which only the sequence is known. The model structure must be derived from related proteins (called reference proteins) for which the structures are known. Traditionally these are found by searching a database of the sequences of proteins in the Brookhaven protein data bank. This is effective only if the sequence of the model protein shares sufficient sequence similarity with one or more of the proteins in the data bank.

The 3D profile method provides a more sensitive alternative. The strategy is to search a database of 3D profiles constructed from the known protein structures, using the model sequence as the probe. Those 3D profiles having the greatest compatibility with the probe sequence are candidates for use as reference proteins. A search for compatible structures using the 3D profile method may succeed in some cases for which a simple sequence similarity search would fail.

2.   Given a preliminary or model structure, test its validity.

This strategy is appropriate for the final phase of a Homology modeling project, or for testing a preliminary protein structure based on experimental data. It depends on the principle that a protein's structure must be compatible with its own sequence. The strategy is to construct the 3D profile of a protein and measure its compatibility with the sequence of the same protein. In this case the alignment is known, so only the calculation of the 3D-1D scores is necessary. It is possible to calculate local 3D-1D scores in a fixed-length sliding window (typically about 5 to 20 residues long). These can be plotted against residue position to reveal local regions of relatively high or low 3D-1D compatibility (Lüthy et al. 1992). Regions having unusually low scores are likely to be places where the backbone has been incorrectly threaded through the electron density data (if the structure is experimental), or where additional energy minimization or other structural refinements are necessary (if the structure is a model).

3.   Given a structure, find compatible sequences.

This approach is useful for analyzing evolutionary relationships among proteins, and for inferring possible structures and functions of some of the uncharacterized proteins in the database of sequences. The strategy is to search the database of sequences using a single 3D profile as the probe. Those sequences that produce high compatibility scores are likely to be structurally related to the probe (Bowie et al. 1991).


Normalization to Compensate for Length Effects

When searching a database of 3D profiles to find structures compatible with a probe sequence (strategy 1, above), some of the variation in compatibility scores results from variation in the lengths of the profiles. In general, higher scores are obtained with longer profiles. To compare the scores in a meaningful way requires that they be normalized to compensate for these length effects.

In a related problem (searching a database of sequences using a probe sequence profile), Gribskov et al. (1990) found that these length effects can be modeled by an equation of this form:

Eq. 7

where Sp is the compatibility score predicted for the alignment of a random sequence with an unrelated profile, L is the length of the sequence, and A, B, and C are empirically determined constants. More recently they have found that slightly better results are obtained using an equation of this form:

Eq. 8

Profiles-3D software uses an equation of this form to normalize the results of a search for compatible structures. For this application, L represents the length of a 3D profile. The appropriate choices for the constants A, B, and C depend on the probe sequence and on the contents of the database of profiles. The constants are estimated using the Levenberg-Marquardt method for nonlinear curve fitting (Press et al. 1988).

The curve-fitting is refined so as to eliminate extreme outlier points for which the score is unusually high. Typically these outliers are profiles that are closely related to the probe sequence. They must be eliminated from the curve-fitting that estimates these constants, because Eq. 8 is intended to model the variation in score that results purely from length effects, not from biological relatedness. To eliminate the outliers, the scores are sorted by the corresponding profile lengths and grouped into pools of equal size. Those pools for which the standard deviation of the score is unusually high (i.e., those likely to contain outliers) are not used in the estimation of A, B, and C. The pool size is 20 if the database contains 100 or more profiles. It is 10 if the number of profiles is between 50 and 100. If there are fewer than 50 profiles in the database to be searched, then the constants A, B, and C cannot be reliably estimated by curve-fitting. In such cases, A, B, and C are set to default values that appear to work well for a variety of protein families.
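The fitting and pooling procedure can be sketched in Python as follows. The functional form used here, Sp = A + B·L^C, is only an assumption chosen to make the example concrete (the actual form of Eq. 8 may differ), and the rule for discarding high-variance pools is likewise a simplification.

import numpy as np
from scipy.optimize import curve_fit

def predicted_score(L, A, B, C):
    # Assumed length model; stands in for Eq. 8 in this sketch
    return A + B * np.power(L, C)

def fit_length_model(lengths, scores, pool_size=20):
    """Estimate A, B, C from raw scores and profile lengths."""
    lengths = np.asarray(lengths, dtype=float)
    scores = np.asarray(scores, dtype=float)
    if len(lengths) < 3 * pool_size:
        raise ValueError("too few profiles for a reliable fit; use default A, B, C")
    order = np.argsort(lengths)
    L, S = lengths[order], scores[order]
    # Group scores into pools of equal size, ordered by profile length
    pools = [(L[k:k + pool_size], S[k:k + pool_size])
             for k in range(0, len(L) - pool_size + 1, pool_size)]
    spreads = np.array([ps.std() for _, ps in pools])
    # Drop pools with an unusually large spread: they likely contain profiles
    # genuinely related to the probe, which the length model is not meant to fit
    keep = spreads <= 2.0 * np.median(spreads)
    L_fit = np.concatenate([pl for (pl, _), k in zip(pools, keep) if k])
    S_fit = np.concatenate([ps for (_, ps), k in zip(pools, keep) if k])
    # Levenberg-Marquardt nonlinear least squares (Press et al. 1988)
    (A, B, C), _ = curve_fit(predicted_score, L_fit, S_fit,
                             p0=(0.0, 1.0, 0.5), method="lm")
    return A, B, C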

Once A, B, and C are known, the compatibility score between the probe sequence and each profile in the database is then normalized by dividing the raw score by the score predicted by Eq. 8:

Eq. 9        Sn = S / Sp

where Sn is the normalized score and S is the raw 3D-1D compatibility score. The normalized score will be near 1.0 for those profiles that are unrelated to the probe sequence.

To further simplify the analysis, the normalized scores are transformed to Z scores. In essence, this transforms the distribution of normalized scores to a distribution that has a mean of 0 and a standard deviation of 1. The Z score is calculated thus:

Eq. 10       Z = (Sn - m) / σ

where m and σ are the mean and standard deviation, respectively, of the normalized scores. Normalized scores that are much greater than 1 are excluded from the calculation of m and σ. These excluded outliers are from profiles that are probably related to the probe sequence.
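Put together, the normalization and Z-score steps of Eqs. 9 and 10 amount to the following sketch. The cutoff used here to exclude high outliers from the estimates of m and σ is an illustrative value, not the one used by Profiles-3D.

import numpy as np

def z_scores(raw_scores, predicted_scores, outlier_cutoff=1.5):
    """Eq. 9 and Eq. 10: normalise raw scores and convert them to Z scores."""
    s_n = np.asarray(raw_scores, float) / np.asarray(predicted_scores, float)  # Eq. 9
    background = s_n[s_n <= outlier_cutoff]    # drop likely true relatives
    m, sigma = background.mean(), background.std()
    return (s_n - m) / sigma                   # Eq. 10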

The Z score is the most useful indicator of the relative significance of the relationships between a sequence and various 3D profiles. The higher the Z score, the greater the compatibility between the sequence and the profile. To determine in a rigorous way the absolute statistical significance of such a Z score is as yet an unsolved problem. The Z score that results from a search of a database of 3D profiles is best used as a relative measure for choosing those profiles that are most compatible with the probe sequence.

When searching a database of sequences with a probe 3D profile (strategy 3, above), the resulting compatibility scores are normalized and transformed to Z scores in a manner that is exactly analogous to the procedure just described. The only difference is that different pool sizes are used in the curve fitting that estimates the constants A, B and C in Eq. 8. The pool size is 50 if the database contains 1500 or more sequences. It is 20 if the number of sequences is between 200 and 1500. If there are fewer than 200 sequences in the database to be searched, then the constants A, B, and C cannot be reliably estimated by curve-fitting. In such cases, A, B, and C are set to default values that appear to work well for a variety of protein families. Experience has shown that a Z score greater than 7 virtually guarantees that the sequence found adopts the same fold as the probe 3D profile (Bowie et al. 1991).


Interpreting the 3D-1D Score in Structure Verification

When the 3D profile method is used to verify a structure (strategy 2, above) the raw compatibility score alone is difficult to interpret. In this case it is necessary not only to compensate for length effects, but also to compare the score to those obtained using structures known to be correct. Lüthy et al. (1992) dealt with this problem by calculating the 3D-1D self-compatibility scores for all structures in the Brookhaven data bank determined at resolutions less than or equal to 2 Å and with R-factors less than 20%. They made a log-log scatterplot of these scores against sequence length and found that they fell approximately on a straight line. A linear least-squares fit to these data, transformed from the logarithmic coordinates, yielded this relation:

Eq. 11

where Scalc is the score expected for a correct structure having sequence length L. This equation is plotted as the solid curve in figure 2 of Lüthy et al. (1992). In the same figure these authors plotted the self-compatibility scores of several structures known to be incorrect. They found that severely misfolded structures typically had scores less than 0.45 Scalc. The self-compatibility scores of most structures in the Brookhaven protein data bank fall between Scalc and 0.45 Scalc. This score range provides a useful rule of thumb for interpreting the 3D-1D self-compatibility score of a hypothetical protein structure. A score below the lower end of this range indicates a structure that is almost certainly incorrect. A score above but near the lower end of the range indicates a structure that may be correct, but that is of questionable quality. A score near or above the high end indicates a reliable structure.

Lüthy et al. (1992) also point out several caveats that must be considered when interpreting a self-compatibility score. Correctly folded oligomeric proteins may have unusually low scores if the score is calculated for the individual protomers, rather than for the complete oligomer. This effect results from the incompatibility of the protein interface regions with exposed environments. The effect is negligible for large proteins but can be significant for small ones. Also, a model protein with only a few incorrect regions may have an overall self-compatibility score above 0.45 Scalc. Local errors of this kind can often be found, however, by analyzing a graph of the local 3D-1D score (calculated in a fixed-length sliding window) versus position in the sequence. The results of the sliding window analysis are sensitive to the window length. A smaller window gives greater resolution but may introduce spurious noise. A larger window tends to average out the insignificant noise, but at the cost of lower spatial resolution. Lüthy et al. (1992) found that a window length of 21 was useful for analyzing grossly misfolded structures. Finally, if a portion of a hypothetical structure is essentially unfolded (i.e., all side chains in that region are completely accessible to solvent), the local 3D-1D score may not detect the abnormal structure if most of the residues in that region are compatible with exposed environments.
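The sliding-window analysis mentioned above can be sketched in Python as follows. The array residue_scores is assumed to hold the 3D-1D score of each residue in its own environment; the window length of 21 follows Lüthy et al. (1992), and everything else is an illustrative choice.

import numpy as np

def local_profile_scores(residue_scores, window=21):
    """Average 3D-1D score in a window centred on each residue position."""
    s = np.asarray(residue_scores, dtype=float)
    half = window // 2
    local = np.full(len(s), np.nan)            # ends of the chain are left undefined
    for k in range(half, len(s) - half):
        local[k] = s[k - half:k + half + 1].mean()
    return local   # plot against residue number to locate low-scoring regions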




Last updated October 06, 1997 at 06:19PM PDT.
Copyright © 1997, Molecular Simulations, Inc. All rights reserved.