Profiles_3D
Molecular Simulations' Profiles-3D product provides tools for answering this question. It is based largely on algorithms developed in the laboratory of Dr. David Eisenberg at the University of California, Los Angeles. The 3D profile method measures the compatibility of an amino acid sequence with a three-dimensional protein structure. It does this by reducing the three-dimensional structure to a simplified one-dimensional representation called an environment string, which is then compared with the one-dimensional amino acid sequence. The method is highly sensitive to distant relationships that cannot be detected by sequence similarity alone. It can also be used to check the validity of a hypothetical protein structure by measuring the compatibility of that structure with the protein's own sequence.
Regardless of the application, the method involves three basic operations:
1. Reduction of the three-dimensional structure to a one-dimensional string of residue environments. These environments are categorized according to the area of the side chain that is buried in the protein, the fraction of the side chain area that is exposed to polar atoms, and the local secondary structure.
The following sections explain the details of these operations and the various ways in which the results can be used.
Determining the Residue Environment
The environment of each residue in the three-dimensional structure is first classified according to the area of the side chain that is buried in the protein. A residue is considered exposed to solvent (environment class E) if the area buried is less than 40 Å². It is considered partially buried (class P) if the area buried is between 40 and 114 Å². It is considered buried (class B) if the area buried is greater than 114 Å². The buried and partially buried classes are further subdivided according to the fraction of the side chain area that is exposed to polar atoms ("fraction polar", denoted f). For this purpose polar atoms are defined as those of the solvent and the oxygen and nitrogen atoms of the protein. The buried class is subdivided into classes B1 (f < 0.45), B2 (0.45 <= f < 0.58) and B3 (f >= 0.58). The partially buried class is subdivided into classes P1 (f < 0.67) and P2 (f >= 0.67). These six basic environment classes (E, P1, P2, B1, B2 and B3) are summarized in Figure 1. The determination of the boundaries between them is explained later in this chapter (see The 3D-1D Scoring Matrix).
Finally, each of the six basic environment classes is subdivided into three classes according to the local secondary structure: alpha helix, beta sheet, or other. The result is a total of eighteen distinct environment classes.
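The classification rules above can be sketched as a small function. This is a minimal illustration only: the function name, the label format (for example "B1-helix"), and the treatment of values lying exactly on a boundary are assumptions, not part of the Profiles-3D software.

```python
def environment_class(area_buried, fraction_polar, secondary_structure):
    """Return one of the 18 environment labels, e.g. 'B1-helix'.

    area_buried is in square angstroms; fraction_polar is f (0 to 1);
    secondary_structure is 'helix', 'sheet' or 'other'.
    The label format is hypothetical.
    """
    if secondary_structure not in ("helix", "sheet", "other"):
        raise ValueError("secondary_structure must be helix, sheet or other")

    if area_buried < 40.0:
        base = "E"                                  # exposed to solvent
    elif area_buried <= 114.0:                      # partially buried
        base = "P1" if fraction_polar < 0.67 else "P2"
    elif fraction_polar < 0.45:                     # buried, least polar
        base = "B1"
    elif fraction_polar < 0.58:
        base = "B2"
    else:
        base = "B3"
    return f"{base}-{secondary_structure}"


# An environment string is then just the sequence of labels along the chain.
residues = [(150.0, 0.30, "helix"), (80.0, 0.70, "sheet"), (10.0, 0.90, "other")]
environment_string = [environment_class(*r) for r in residues]
```

Three secondary-structure states times six basic classes gives the eighteen labels described in the text.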
Calculation of Area Buried and Fraction Polar
The solvent-accessible area of each atom in a side chain is measured by first placing an imaginary solvent sphere around the atom, with radius equal to the sum of the van der Waals radius of the atom (Richmond and Richards, 1978) and the radius of a water molecule. Sample points are placed on this sphere every 0.75 Å. If a sample point is not within the solvent sphere of any other protein atom, then that point is considered accessible to the solvent; otherwise it is considered buried. The solvent-accessible surface area of the atom, denoted Aa, is then calculated thus:
$$A_a = \frac{n_a}{n_t} A_t \qquad \text{(Eq. 1)}$$
where na is the number of sample points accessible to solvent, nt is the total number of sample points on the solvent sphere, and At is the total surface area of the solvent sphere. The solvent-accessible area of the side chain is then calculated as the sum of the solvent-accessible areas of its constituent atoms, including the alpha carbon. The area of the side chain that is buried in the protein is defined as the difference between the solvent-accessible area of the side chain in a GLY-X-GLY tripeptide and its solvent-accessible area in the protein. The tripeptide value has been tabulated for each of the 20 amino acids by Eisenberg et al. (1989).
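The sample-point estimate of the accessible area of a single atom can be sketched as follows. Several details are assumptions made for illustration: the points are placed with a Fibonacci lattice using a fixed count rather than the 0.75 Å spacing described above, the water radius is taken as 1.4 Å (the text does not give a value), and the function names are hypothetical.

```python
import math


def sphere_points(n):
    """Return n roughly evenly spaced unit vectors (Fibonacci lattice)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for k in range(n):
        z = 1.0 - (2.0 * k + 1.0) / n
        r = math.sqrt(1.0 - z * z)
        points.append((r * math.cos(golden_angle * k),
                       r * math.sin(golden_angle * k), z))
    return points


def accessible_area(index, atoms, probe_radius=1.4, n_points=200):
    """Estimate the accessible area of atoms[index] as (n_a / n_t) * A_t.

    atoms is a list of (x, y, z, vdw_radius) tuples in angstroms.
    """
    x, y, z, radius = atoms[index]
    sphere_radius = radius + probe_radius
    total_area = 4.0 * math.pi * sphere_radius ** 2          # A_t
    n_accessible = 0                                         # n_a
    for ux, uy, uz in sphere_points(n_points):
        px = x + sphere_radius * ux
        py = y + sphere_radius * uy
        pz = z + sphere_radius * uz
        # A point is buried if it lies inside any other atom's solvent sphere.
        buried = any(
            (px - ax) ** 2 + (py - ay) ** 2 + (pz - az) ** 2
            < (arad + probe_radius) ** 2
            for j, (ax, ay, az, arad) in enumerate(atoms) if j != index
        )
        if not buried:
            n_accessible += 1
    return n_accessible / n_points * total_area
```

Summing this quantity over the atoms of a side chain gives the side-chain accessible area; subtracting it from the tabulated GLY-X-GLY value gives the area buried.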
The fraction of the side chain area covered by polar atoms is calculated thus:
$$f = \frac{N_p}{N_t} \qquad \text{(Eq. 2)}$$
where Np is the number of sample points covered by polar atoms (nitrogen, oxygen or solvent) for all atoms in the side chain, and Nt is the total number of sample points for all atoms in the side chain. Sample points covered by atoms of the side chain itself are excluded from both of the counts Np and Nt. If a sample point lies within the solvent sphere of both a polar and a nonpolar atom, then that point is counted in Np if and only if the polar atom is the closer of the two.
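The counting rules for the fraction polar can be illustrated with a short sketch. It assumes each sample point comes with a list of the atoms whose solvent spheres contain it, given as (distance, kind) pairs. The names, the data layout, and the choice to exclude a point whenever any atom of the residue's own side chain covers it are assumptions made for illustration.

```python
def label_point(covering_atoms):
    """Classify one sample point.

    covering_atoms lists (distance, kind) for every atom whose solvent sphere
    contains the point, where kind is 'polar', 'nonpolar' or 'self'
    ('self' = an atom of the same side chain). An empty list means the point
    is accessible to solvent.
    """
    if not covering_atoms:
        return "solvent"
    if any(kind == "self" for _, kind in covering_atoms):
        return "self"
    # The closest covering atom decides between polar and nonpolar.
    closest = min(covering_atoms, key=lambda atom: atom[0])
    return closest[1]


def fraction_polar(labels):
    """Compute f = N_p / N_t from per-point labels."""
    counted = [lab for lab in labels if lab != "self"]        # N_t
    if not counted:
        raise ValueError("no sample points left after excluding 'self'")
    n_polar = sum(1 for lab in counted if lab in ("polar", "solvent"))  # N_p
    return n_polar / len(counted)
```

Solvent-accessible points count toward Np because the solvent is treated as polar in the definition above.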
The 3D-1D Scoring Matrix
Although the environment string is one-dimensional, like an amino acid sequence, it cannot be compared to or aligned with a sequence without some measure of the compatibility of each of the twenty amino acids with each of the eighteen environment classes. The 3D-1D scoring matrix is a 20 × 18 matrix that contains exactly this information. Each element si,j in the matrix is a score that indicates the compatibility of residue i with environment j. The greater the value, the greater the compatibility. Negative values indicate poor compatibility. For a complete listing of the matrix, see 3D-1D Scoring Table in File Formats.
The individual matrix elements are information values (Fano, 1961) and are calculated thus:
$$s_{i,j} = \ln\left(\frac{P(i:j)}{P_i}\right) \qquad \text{(Eq. 3)}$$
where P(i:j) is the probability of finding residue i in environment j, and Pi is the overall probability of finding residue i in any environment. For the 3D-1D scoring matrix, these probabilities were estimated from a database of 16 known protein structures and sets of homologous sequences aligned to the sequences of the 16 structures. The structures chosen and the methods for selecting and aligning the related sequences are described by Lüthy et al. (1991). For each residue position in each of the 16 sets of aligned sequences, the environment class was determined from the known structure.
The probability P(i:j) was then estimated thus:
$$P(i:j) = \frac{N_{i,j}}{\sum_{i'=1}^{20} N_{i',j}} \qquad \text{(Eq. 4)}$$
where Ni,j is the number of positions in the 16 alignments at which one or more residues of type i were found to align with environment j. The denominator is the total number of residue replacements found for all environments of type j.
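The overall probability Pi was estimated thus:
$$P_i = \frac{N_i}{\sum_{i'=1}^{20} N_{i'}} \qquad \text{(Eq. 5)}$$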
where Ni is the number of positions in the 16 alignments at which one or more residues of type i occurred. The denominator is the total number of residue replacements in the database of alignments (8273 for the 16 alignments used to generate the 3D-1D scoring matrix).
Eq. 6
For further details see Bowie et al. (1991).
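As a rough illustration of how such a matrix could be derived from counts, the sketch below assumes the information value takes the log-odds form ln(P(i:j)/Pi) with a natural logarithm, and that Ni equals the sum of Ni,j over all environments (each aligned position has exactly one environment class). The counts are invented toy values, not the published data, and the handling of zero counts is an assumption.

```python
import math


def scoring_matrix(counts):
    """Build scores[i][j] = ln(P(i:j) / P_i) from counts[i][j] = N_ij."""
    residues = list(counts)
    environments = list(next(iter(counts.values())))
    # Denominator of P(i:j): all replacements seen in environment j.
    env_totals = {j: sum(counts[i][j] for i in residues) for j in environments}
    # N_i, taken here as the row sum of N_ij (assumption).
    res_totals = {i: sum(counts[i][j] for j in environments) for i in residues}
    grand_total = sum(res_totals.values())

    scores = {}
    for i in residues:
        p_i = res_totals[i] / grand_total
        scores[i] = {}
        for j in environments:
            p_ij = counts[i][j] / env_totals[j]
            # A zero count has no finite log; -inf is a placeholder choice.
            scores[i][j] = math.log(p_ij / p_i) if p_ij > 0 else float("-inf")
    return scores


# Toy counts for two residue types and two environment classes.
toy_counts = {
    "LEU": {"B1": 30, "E": 10},
    "LYS": {"B1": 10, "E": 50},
}
toy_scores = scoring_matrix(toy_counts)
```

In this toy table leucine scores positively in the buried apolar class and negatively in the exposed class, and lysine the reverse, which is the qualitative behaviour the text describes.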
Alignment Using the Profile Method
It is possible simply to align an amino acid sequence to an environment string using the 3D-1D scoring matrix and any conventional sequence alignment algorithm. It is advantageous, however, not to do so at this stage, but instead first to construct from the environment string a 3D profile. The profile is a matrix of 22 columns and N rows, where N is the number of residue positions in the environment string. The first 20 elements of each row j are the compatibilities of each of the 20 amino acids with the environment at position j in the environment string, taken from the 3D-1D scoring matrix. The last two elements are the penalties for gap opening and gap extension, respectively, at position j. The amino acid sequence is aligned with the 3D profile using a dynamic programming algorithm (Smith and Waterman 1981).
An advantage of the profile method is that it facilitates the use of position-dependent gap penalties. This makes it possible, for example, to impose a higher penalty on gaps that occur within alpha helices or beta sheets (Lüthy et al. 1991). The rationale for this strategy is that insertions and deletions are more likely to occur in the random coil regions that separate regions of regular secondary structure.
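The construction of a profile and its local alignment with a sequence can be sketched as follows. This is an illustration under stated assumptions, not the Profiles-3D implementation: the alphabet is reduced to two residue types, the scores and gap penalties are invented, the penalty for a gap is taken from the profile row at which it occurs, and the recurrence is a generic affine-gap local alignment.

```python
def gap_penalties(env, base_open=5.0, base_extend=0.5, ss_factor=2.0):
    """Hypothetical penalties: larger inside helices and sheets."""
    factor = ss_factor if env.endswith(("helix", "sheet")) else 1.0
    return base_open * factor, base_extend * factor


def build_profile(env_string, matrix, residues):
    """One row per position: a score per residue type, then gap-open, gap-extend."""
    profile = []
    for env in env_string:
        row = [matrix[res][env] for res in residues]
        row.extend(gap_penalties(env))
        profile.append(row)
    return profile


def local_alignment_score(sequence, profile, residues):
    """Best local alignment score of sequence against the profile."""
    column = {res: k for k, res in enumerate(residues)}
    n, m = len(profile), len(sequence)
    neg = float("-inf")
    best_end = [[0.0] * (m + 1) for _ in range(n + 1)]   # best score ending at (i, j)
    gap_seq = [[neg] * (m + 1) for _ in range(n + 1)]    # ends with sequence residue unmatched
    gap_prof = [[neg] * (m + 1) for _ in range(n + 1)]   # ends with profile row unmatched
    best = 0.0
    for i in range(1, n + 1):
        row = profile[i - 1]
        g_open, g_ext = row[-2], row[-1]
        for j in range(1, m + 1):
            gap_seq[i][j] = max(best_end[i][j - 1] - g_open, gap_seq[i][j - 1] - g_ext)
            gap_prof[i][j] = max(best_end[i - 1][j] - g_open, gap_prof[i - 1][j] - g_ext)
            match = best_end[i - 1][j - 1] + row[column[sequence[j - 1]]]
            best_end[i][j] = max(0.0, match, gap_seq[i][j], gap_prof[i][j])
            best = max(best, best_end[i][j])
    return best


# Toy two-letter alphabet; a real profile has 20 residue columns per row.
RESIDUES = "LK"
MATRIX = {"L": {"B1-helix": 1.0, "E-other": -1.0},
          "K": {"B1-helix": -1.0, "E-other": 1.0}}
profile = build_profile(["B1-helix", "B1-helix", "E-other"], MATRIX, RESIDUES)
score = local_alignment_score("KKLLKLL", profile, RESIDUES)
```

Because the penalties live in the profile rows, making gaps costlier inside helices and sheets requires no change to the alignment routine itself.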
Strategies for Using the 3D Profile Method
The alignment of an amino acid sequence with a 3D profile yields an overall 3D-1D score that is a measure of the compatibility of the sequence with the structure described by the profile. There are different strategies for choosing the environment(s) and sequence(s) to be aligned, and for using the compatibility measures, depending on the problem to be addressed.
In a related problem (searching a database of sequences using a probe sequence profile), Gribskov et al. (1990) found that these length effects can be modeled by an equation of this form:
Eq. 7
where Sp is the compatibility score predicted for the alignment of a random sequence with an unrelated profile, L is the length of the sequence, and A, B, and C are empirically determined constants. More recently they have found that slightly better results are obtained using an equation of this form:
Eq. 8
Profiles-3D software uses an equation of this form to normalize the results of a search for compatible structures. For this application, L represents the length of a 3D profile. The appropriate choices for the constants A, B, and C depend on the probe sequence and on the contents of the database of profiles. The constants are estimated using the Levenberg-Marquardt method for nonlinear curve fitting (Press et al. 1988).
The curve-fitting is refined so as to eliminate extreme outlier points for which the score is unusually high. Typically these outliers are profiles that are closely related to the probe sequence. They must be eliminated from the curve-fitting that estimates these constants, because Eq. 8 is intended to model the variation in score that results purely from length effects, not from biological relatedness. To eliminate the outliers, the scores are sorted by the corresponding profile lengths and grouped into pools of equal size. Those pools for which the standard deviation of the score is unusually high (i.e., those likely to contain outliers) are not used in the estimation of A, B, and C. The pool size is 20 if the database contains 100 or more profiles. It is 10 if the number of profiles is between 50 and 100. If there are fewer than 50 profiles in the database to be searched, then the constants A, B, and C cannot be reliably estimated by curve-fitting. In such cases, A, B, and C are set to default values that appear to work well for a variety of protein families.
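The pooling step can be sketched as below. The pool sizes follow the text; the criterion for an "unusually high" standard deviation (more than twice the median pool standard deviation), the dropping of an incomplete final pool, and all names are assumptions made for illustration. The curve fit itself is not shown.

```python
import statistics


def pool_size_for(n_profiles):
    """Pool size from the database size, or None if curve-fitting is unreliable."""
    if n_profiles >= 100:
        return 20
    if n_profiles >= 50:
        return 10
    return None   # fewer than 50 profiles: fall back to default A, B, C


def points_for_fitting(lengths, scores, max_std_ratio=2.0):
    """Return (length, score) pairs from pools whose score spread is not unusual."""
    size = pool_size_for(len(scores))
    if size is None:
        return None
    pairs = sorted(zip(lengths, scores))                 # sort by profile length
    pools = [pairs[k:k + size] for k in range(0, len(pairs), size)]
    pools = [pool for pool in pools if len(pool) == size]  # drop a partial last pool
    spreads = [statistics.pstdev([s for _, s in pool]) for pool in pools]
    cutoff = max_std_ratio * statistics.median(spreads)  # hypothetical criterion
    kept = [pool for pool, sd in zip(pools, spreads) if sd <= cutoff]
    return [point for pool in kept for point in pool]
```

The points returned would then be passed to the nonlinear fit that estimates A, B, and C.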
Once A, B, and C are known, the compatibility score between the probe sequence and each profile in the database is normalized by dividing the raw score by the score predicted by Eq. 8:
$$S_n = \frac{S}{S_p} \qquad \text{(Eq. 9)}$$
where Sn is the normalized score and S is the raw 3D-1D compatibility score. The normalized score will be near 1.0 for those profiles that are unrelated to the probe sequence.
$$Z = \frac{S_n - m}{\sigma} \qquad \text{(Eq. 10)}$$
where Z is the Z-score of the normalized score Sn, and m and σ are the mean and standard deviation, respectively, of the normalized scores. Normalized scores that are much greater than 1 are excluded from the calculation of m and σ. These excluded outliers are from profiles that are probably related to the probe sequence.
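The normalization and the exclusion of outliers can be sketched together. The sketch assumes that the length-predicted score for each profile is already available, that the significance measure is the usual standard score (Sn minus the mean, divided by the standard deviation), and that "much greater than 1" means above a cutoff of 1.5. The cutoff, the use of the population standard deviation, and the names are hypothetical.

```python
import statistics


def normalized_scores(raw_scores, predicted_scores):
    """Divide each raw score by the score predicted from profile length."""
    return [s / sp for s, sp in zip(raw_scores, predicted_scores)]


def z_scores(norm_scores, outlier_cutoff=1.5):
    """Standard scores, with mean and deviation taken from the background only."""
    background = [s for s in norm_scores if s <= outlier_cutoff]
    m = statistics.mean(background)
    sigma = statistics.pstdev(background)   # fails with zero spread; not handled here
    return [(s - m) / sigma for s in norm_scores]


raw = [9.0, 10.0, 11.0, 30.0]
predicted = [10.0, 10.0, 10.0, 10.0]
z = z_scores(normalized_scores(raw, predicted))
```

In this toy case the last profile has a normalized score of 3.0, is left out of the background statistics, and therefore stands out with a large standard score.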
Interpreting the 3D-1D Score in Structure Verification
When the 3D profile method is used to verify a structure (strategy 2, above) the raw compatibility score alone is difficult to interpret. In this case it is necessary not only to compensate for length effects, but also to compare the score to those obtained using structures known to be correct. Lüthy et al. (1992) dealt with this problem by calculating the 3D-1D self-compatibility scores for all structures in the Brookhaven data bank determined at resolutions less than or equal to 2 Å and with R-factors less than 20%. They made a log-log scatterplot of these scores against sequence length and found that they fell approximately on a straight line. A linear least-squares fit to these data, transformed from the logarithmic coordinates, yielded this relation:
Eq. 11
where Scalc is the score expected for a correct structure having sequence length L. This equation is plotted as the solid curve in figure 2 of Lüthy et al. (1992). In the same figure these authors plotted the self-compatibility scores of several structures known to be incorrect. They found that severely misfolded structures typically had scores less than 0.45 Scalc. The self-compatibility scores of most structures in the Brookhaven protein data bank fall between Scalc and 0.45 Scalc. This score range provides a useful rule of thumb for interpreting the 3D-1D self-compatibility score of a hypothetical protein structure. A score below the lower end of this range indicates a structure that is almost certainly incorrect. A score above but near the lower end of the range indicates a structure that may be correct, but that is of questionable quality. A score near or above the high end indicates a reliable structure.
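The rule of thumb can be written as a small helper. The expected score Scalc is supplied by the caller, since it comes from the fitted relation above. The text does not define how close to either end of the range counts as "near", so the margin of 0.1 and the "intermediate" category for everything in between are assumptions, as are the names.

```python
def assess_self_compatibility(score, s_calc, margin=0.1):
    """Interpret a self-compatibility score relative to the expected score."""
    ratio = score / s_calc
    if ratio < 0.45:
        return "almost certainly incorrect"
    if ratio < 0.45 + margin:          # above but near the lower end
        return "questionable"
    if ratio >= 1.0 - margin:          # near or above the high end
        return "reliable"
    return "intermediate"              # category not named in the text
```

For example, a structure scoring 50 where 100 is expected falls just above the lower bound and would be flagged as questionable.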
Lüthy et al. (1992) also point out several caveats that must be considered when interpreting a self-compatibility score. Correctly folded oligomeric proteins may have unusually low scores if the score is calculated for the individual protomers, rather than for the complete oligomer. This effect results from the incompatibility of the protein interface regions with exposed environments. The effect is negligible for large proteins but can be significant for small ones. Also, a model protein with only a few incorrect regions may have an overall self-compatibility score above 0.45 Scalc. Local errors of this kind can often be found, however, by analyzing a graph of the local 3D-1D score (calculated in a fixed-length sliding window) versus position in the sequence. The results of the sliding window analysis are sensitive to the window length. A smaller window gives greater resolution but may introduce spurious noise. A larger window tends to average out the insignificant noise, but at the cost of lower spatial resolution. Lüthy et al. (1992) found that a window length of 21 was useful for analyzing grossly misfolded structures. Finally, if a portion of a hypothetical structure is essentially unfolded (i.e., all side chains in that region are completely accessible to solvent), the local 3D-1D score may not detect the abnormal structure if most of the residues in that region are compatible with exposed environments.