Modelling loops

Rodriguez et al. [131] studied the structure of surface-accessible loops because they realised that such loops are the major source of errors in modelling experiments. They studied 34 pairs of known structures and analysed what happened if they pretended not to know one of the two and then modelled it on the other member of the pair. Of course this is not a fair test, because in the real world the structure being modelled is never known, but the study teaches us a lot about what can go wrong when modelling proteins. They concentrated on loops of equal length in the two structures, to avoid having to model insertions and deletions, and found three reasons why one often observes different conformations in similar structures:

Table 1 summarises the results for the 116 loops for which the RMS deviation (RMSd) between the model and the real structure was at least twice the total RMSd.
Symmetry contact in model or in template            46
Symmetry contact in model and in template           43
No symmetry contacts in either of the two           27
Mutations involving proline                         25
Mutations involving glycine                         26
Mutations involving proline and glycine              4
Cases without any obvious reason for model problems 17
Symmetry contacts combined with proline or glycine  38
Table 1. Most probable reasons for the conformational differences between loops in homologous structures. In roughly 75% of all cases crystal symmetry contacts are involved. The numbers do not add up to 116 because multiple problems can occur at the same time. Only 17 of the 116 cases could not readily be placed in any of the three major problem categories.
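
The fractions behind Table 1 can be checked with a few lines of Python. This is only an illustrative sketch: the counts are copied from Table 1, while the dictionary keys and the helper name `percentage` are invented here and are not part of the original study.

```python
# Counts copied from Table 1; all names below are illustrative assumptions.
TOTAL_LOOPS = 116

table1 = {
    "symmetry contact in model or in template": 46,
    "symmetry contact in model and in template": 43,
    "no symmetry contacts in either of the two": 27,
    "no obvious reason": 17,
}


def percentage(count, total=TOTAL_LOOPS):
    """Return count as a percentage of the total number of loops."""
    return 100.0 * count / total


# Loops in which a symmetry contact is present in at least one structure.
with_symmetry = (
    table1["symmetry contact in model or in template"]
    + table1["symmetry contact in model and in template"]
)

print(f"symmetry contacts involved: {percentage(with_symmetry):.1f}%")
print(f"no obvious reason:          {percentage(table1['no obvious reason']):.1f}%")
```

The three symmetry rows together account for all 116 loops, whereas the proline and glycine rows overlap with them, which is why the table as a whole does not add up.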

The largest fraction of all problems is clearly caused by differing symmetry contacts. This problem puts a fundamental limit on the accuracy of the model and on the possibilities of estimating the reliability of models. Figures 15 and 16 show two examples of symmetry-induced conformational differences.

Figure 15

Figure 16,17

A more predictable scenario occurs when the backbone has to move to make space for a bulky residue, as can be seen in figure 17, in which the 2.99 Angstrom displacement of the loop formed by residues 54 to 59 in 1POH [130] relative to 1PTF [132] is probably caused by the mutation of Ser37 in 1POH to Tyr in 1PTF.

Modelling a proline is another big problem. When a proline replaces another residue, the existing backbone in many cases has torsion angles that are very unfavourable for proline, and the introduction of the proline leads to local backbone adaptations. The worst cases are often found for the Gly->Pro mutation, because glycine can adopt almost any conformation without restrictions. Figure 18 shows the superposition of 5HVP [133] and 1IVP [134]. The loop from residues 34 to 42 is shown to illustrate the change in backbone conformation that accompanies the replacement of Gly39 in 5HVP by Pro in 1IVP. The backbone at this position in 5HVP cannot accommodate a proline (φ=115.2, ψ=131.0, ω=177.9). Proline 39 in 1IVP sits in a conformation that is favourable for a proline (φ=-87.5, ψ=137.9, ω=179.6). The neighbouring residues are also influenced by these backbone torsion angle differences, and the loop shows a maximum Cα-Cα displacement of 3.2 Angstrom at the Cα of residue 40.
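
A modeller could flag such positions before building the model by testing whether the template backbone is compatible with proline. The sketch below is a minimal illustration, not a published criterion: the φ window of -100 to -40 degrees is an assumption made here for the example, and the function name is hypothetical. Only the two φ values are taken from the text above.

```python
# Assumed, illustrative window (degrees) for the phi angle of proline.
# The exact limits are an assumption of this sketch, not taken from the study.
PRO_PHI_MIN = -100.0
PRO_PHI_MAX = -40.0


def phi_allows_proline(phi_deg, lo=PRO_PHI_MIN, hi=PRO_PHI_MAX):
    """Return True if phi lies inside the window assumed accessible to proline."""
    return lo <= phi_deg <= hi


# phi values quoted in the text for position 39.
phi_at_gly39 = 115.2   # backbone that cannot accommodate a proline
phi_at_pro39 = -87.5   # Pro39 as observed in 1IVP

for label, phi in (("Gly39 backbone", phi_at_gly39), ("Pro39 in 1IVP", phi_at_pro39)):
    verdict = "compatible" if phi_allows_proline(phi) else "needs backbone change"
    print(f"{label}: phi={phi:7.1f} -> {verdict}")
```

A check of this kind only says where the backbone will probably have to move; it says nothing about how it will move.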

Figure 18

Insertions in the model

With 75% or more sequence identity between the structure and the model one seldom encounters insertions or deletions, and when they are encountered they are normally short.

One of the major problems in model building at intermediate levels of homology is the insertion of loops. If the sequence has an insertion relative to the structure, there is no template to model on, and other techniques have to be applied. The techniques used to model loops are:

Many articles have been published about these topics [141-147], but loop building is still a wide open field, as is perhaps best indicated by the fact that for more than five years no significantly new methods have appeared in the literature. The world-wide modelling competition held in 1994 [148] made clear that correct ab initio modelling of loops is not yet possible.

Verification of the quality of the model

All models built by homology will have errors. Sidechains can be placed incorrectly, or whole loops can be misplaced. As with most errors, they become less of a problem when they can be localized. For example, when modeling a protease it is probably not important that a loop far away from the active site is placed incorrectly.

The most important step in the process of model building by homology is therefore undoubtedly the verification of the model, and the estimation of the likelihood and magnitude of errors.

There are two principally different ways to estimate errors in a structure.

The key aspect is the development of criteria with sufficient discriminatory power to distinguish a good model from a bad one. An example is provided by deliberately misfolded proteins in which the sequence of a protein known to have an all-helical 3D structure is placed into a known structure of a completely different type, an antiparallel β-barrel, and vice versa. For the evaluation of the quality of these clearly incorrect hypothetical structures, intramolecular energy, calculated in vacuum using standard empirical potentials, is not a sensitive criterion [Novotny et al 84, 88]. The free energy difference between the folded and unfolded states would be an ideal criterion, but present theories are not capable of calculating free energy differences to sufficient accuracy.

Faced with the lack of an accurate theory of protein folding, empirical observations of regularities gleaned from the database of solved structures can be very useful. A variety of statistical criteria, which measure the preferential distribution of hydrophobic side chains in the interior of proteins, have been used successfully to discriminate between deliberately misfolded and native structures [64,149-151].

Normality indices for structures have already proven their power in structure verification. Many characteristics of protein structures lend themselves to normality analysis. Most of them are directly or indirectly based on the analysis of contacts, either inter-residue contacts or contacts with water. Some published examples are:

Atomic contacts are observed because they are energetically favored. Real structures cannot tolerate too many unfavorable interactions. Thus a correct model can contain only a few infrequently observed atomic contacts. We made a detailed analysis of atom-atom contacts [155]. WHAT IF [135] holds a module that compares the local contact patterns with the average contact patterns for similar residue-residue contacts found in the database. This method can be summarized as follows: if a residue-residue contact has the same contact pattern and the same spatial orientation as a contact that occurs often in the database, a high score is given. If a contact in the modeled molecule seems rather unique, either in terms of which residues make the contact or in terms of the directionality of the contact, a low score is given. This 'quality control' of local packing has proven to be a powerful tool for the detection of abnormal structures.

Most methods used for the verification of protein structures can also be used for the verification of models. Not all methods will be useful, because certain experimental errors simply are not made by the better modelling programs. In general, however, a verification report is very helpful for the modeller and her colleagues when they use the model for the analysis of experimental results or the prediction of new experiments.
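
The idea of scoring a contact by how often it is seen in the database can be illustrated with a toy example. The sketch below is emphatically not the WHAT IF algorithm: the representation of a contact as a (residue, residue, orientation bin) tuple, the function names, the scoring formula, and the tiny made-up "database" are all assumptions chosen only to show the principle that frequent contacts score high and unique ones score low.

```python
from collections import Counter


def build_contact_table(observed_contacts):
    """Count how often each (residue, residue, orientation bin) contact occurs."""
    return Counter(observed_contacts)


def contact_score(contact, table):
    """Toy score in [0, 1]: frequency of this contact relative to the most
    frequent contact recorded for the same residue pair (0.0 if the pair
    was never observed)."""
    res_a, res_b, _orientation = contact
    same_pair_counts = [
        count
        for (a, b, _bin), count in table.items()
        if (a, b) == (res_a, res_b)
    ]
    if not same_pair_counts:
        return 0.0
    return table[contact] / max(same_pair_counts)


# Made-up miniature "database" of contacts, for illustration only.
toy_database = (
    [("LEU", "ILE", "bin1")] * 8
    + [("LEU", "ILE", "bin2")] * 2
    + [("ASP", "LYS", "bin3")] * 5
)
table = build_contact_table(toy_database)

print(contact_score(("LEU", "ILE", "bin1"), table))  # common orientation
print(contact_score(("LEU", "ILE", "bin2"), table))  # rare orientation
print(contact_score(("LEU", "ASP", "bin1"), table))  # pair never observed
```

In a real verification program the score would be derived from a large set of well-refined structures and would use a far finer description of the contact geometry.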

How good are the models actually?

The quality of protein models built by homology to a template structure is normally determined from the RMS errors in models of proteins whose structure is known. Rodriguez et al. selected from the PDB [ref] 34 pairs of protein structures that superpose well, have 35% to 98% sequence identity, and have no insertions or deletions. They created this test set to analyze what the major sources of error in protein modelling and in the assessment of model quality could be.

The dataset was carefully selected to be representative of the universe of proteins, but in such a way that no big surprises would be encountered. The models are thus representative of the best scenario one can expect in practical cases, and not of a typical scenario. The 34 pairs of proteins were selected using the following criteria:

PDB	r	R	RMSd	SID %	Class	RMSe	Description 
1poh	2.00	0.14				1.978	Phosphotransferase (E. coli) 
1ptf	1.60	0.16	1.244	35.29	mixed	1.977	Phosphotransferase (S. faecalis) 
1nhk	1.90	0.17				2.410	Nucleoside Diphosphate Kinase (M. xanthus) 
1ndc	2.00	0.18	1.554	43.75	mixed	2.082	Nucleoside Diphosphate Kinase (D. discoideum) 
1bpt	2.00	0.17				2.003	Pancreatic Trypsin Inhibitor (BPTI) (B. taurus) 
1aap	1.50	0.18	0.973	44.64	mixed	1.984	PInh. Domain Of Alzheimer's Protein (H. sapiens)
5pal	1.54	0.17				1.626	Parvalbumin (T. semifasciata) 
1omd	1.85	0.17	0.776	44.86	alpha	1.375	Oncomodulin (R. norvegicus) 
1pza	1.80	0.18				1.752	Pseudoazurin (A. faecalis) 
1pmy	1.50	0.20	0.995	45.00	beta	1.807	Pseudoazurin (M. extorquens) 
1thbB	1.50	0.20				1.972	Hemoglobin (H. sapiens) 
1pbxB	2.50	0.18	1.240	45.21	alpha	1.983	Hemoglobin (P. bernacchii) 
5hvpB	2.00	0.18				1.716	HIV-1 Protease (HIV Type 1) 
1ivpA	2.50	0.20	0.892	48.48	beta	1.531	HIV-2 Protease (HIV Type 2) 
2sam	2.40	0.19				1.496	SIV-1 Protease (SIV Type 1)	
4phvB	2.10	0.18	1.030	51.52	beta	1.863	HIV-1 Protease (HIV Type 1) 
2cro	2.35	0.20				1.872	434 Cro Protein (Phage 434) 
2or1L	2.50	0.18	0.825	52.38	alpha	1.882	434 Repressor (Phage 434) 
1crb	2.10	0.19				1.423	Cellular Retinol Binding Protein (R. rattus) 
1opbC	1.90	0.17	0.718	56.39	beta	1.436	Cellular Retinol Binding Protein II (R. rattus) 
1fkf	1.70	0.17				1.287	FK-506 Binding Protein (H. sapiens)
1yat	2.50	0.18	0.818	57.01	beta	1.189	Fk-506 Binding Protein (S. cerevisiae) 
1pvaA	1.65	0.20				1.244	Parvalbumin (E. lucius) 
1cdp	1.60	0.16	0.702	62.04	alpha	1.130	Parvalbumin (C. carpio) 
2ycc	1.90	0.20				1.390	Cytochrome C (S. cerevisiae)
5cytR	1.50	0.16	0.574	62.14	alpha	1.386	Cytochrome C (T. alalunga) 
1azrA	2.40	0.17				1.469	Azurin (Pseudomonas aeruginosa)
1aizA	1.80	0.17	0.982	63.28	mixed	1.443	Azurin (Alcaligenes denitrificans)
4azuA	1.90	0.18				1.387	Azurin (Pseudomonas aeruginosa) 
1azcA	1.80	0.16	0.960	63.78	mixed	1.332	Azurin (A. denitrificans) 
1mrj	1.60	0.17				1.291	Alpha-trichosanthin (T. kirilowii maxim) 
1mom	2.16	0.19	0.626	65.04	mixed	1.350	Momordin (M. charantia) 
1cad	1.80	0.19				0.999	Rubredoxin (P. furiosus) 
8rxnA	1.00	0.15	0.604	66.67	mixed	1.001	Rubredoxin (D. vulgaris)
1tadB	1.70	0.21				1.636	Transducin-alpha (B. taurus) 
1gia	2.00	0.17	1.139	69.35	alpha	1.576	Gi Alpha 1 (R. rattus)
1hsaA	2.10	0.20				1.736	Human Class I HSA (H. sapiens) 
1vaaA	2.30	0.17	1.176	72.63	mixed	1.829	MHC Class I (M. musculus) 
1gbt	2.00	0.16				0.798	Beta-trypsin (B. taurus) 
1brcE	2.50	0.17	0.424	73.09	beta	0.865	Trypsin Variant (R. rattus) 
1babB	1.50	0.16				0.968	Hemoglobin Thionville (H. sapiens) 
1fdhG	2.50	0.32	0.513	73.29	alpha	0.933	Hemoglobin (H. sapiens) 
1dhfA	2.30	0.18				1.397	Dihydrofolate Reductase (H. sapiens) 
1dr7	2.40	0.16	0.775	75.27	mixed	1.242	Dihydrofolate Reductase (G. gallus) 
8dfr	1.70	0.19				1.335	Dihydrofolate Reductase (G. gallus) 
2dhfA	2.30	0.19	0.738	75.27	mixed	1.456	Dihydrofolate Reductase (H. sapiens)
1hna	1.85	0.23				1.611	Glutathione S-transferase (H. sapiens) 
3gstB	1.90	0.16	1.025	75.58	alpha	1.431	Glutathione S-transferase (R. rattus) 
1ala	2.25	0.20				1.042	Annexin V (G. gallus) 
1avr	2.30	0.18	0.445	77.85	alpha	0.882	Annexin V (H. sapiens) 
1bra	2.20	0.16				0.999	Trypsin (R. rattus) 
1mct	1.60	0.17	0.421	79.82	beta	1.044	Trypsin (S. scrofa)
4p2p	2.40	0.21				2.099	Phospholipase A2 (S. scrofa) 
2bpp	1.80	0.19	1.152	84.17	alpha	1.922	Phospholipase A2 (B. taurus) 
135l	1.30	0.19				1.213	Lysozyme (M. gallopavo) 
1hhl	1.90	0.17	0.732	86.82	alpha	1.184	Lysozyme (N. meleagris) 
2gbp	1.90	0.15				0.891	Galactose binding protein (E. coli) 
3gbp	2.40	0.16	0.518	94.43	mixed	0.918	Galactose binding protein (S. typhimurium) 
1emy	1.78	0.15				1.330	Myoglobin (E. maximus) 
1ymc	2.00	0.13	0.691	87.58	alpha	1.324	Sulfmyoglobin (E. caballus) 
1ovb	2.30	0.20				1.593	Ovotransferrin (Duck) 
1nnt	2.30	0.16	1.091	90.57	mixed	1.572	Ovotransferrin (G. gallus) 
2lalA	1.80	0.19				0.970	Lentil Lectin (L. culinaris) 
2ltnA	1.70	0.18	0.322	92.27	beta	0.977	Pea Lectin (P. sativum) 
2chf	1.80	0.18				1.955	Chey (S. typhimurium) 
1chn	1.76	0.19	1.376	97.62	mixed	1.963	Chey (E. coli) 
1etb1	1.70	0.16				0.678	Transthyretin (H. sapiens) 
1ttcA	1.70	0.18	0.255	98.31	beta	0.534	Transthyretin mutant (H. sapiens) 
Table 2. Structures used to study model quality [135]. RMSd: root mean square displacement between equivalenced atoms in the two molecules. RMSe: root mean square atomic misplacement between the model and the real structure. SID: percentage sequence identity between a pair of sequences. R: crystallographic R-factor. r: resolution. RMSd, RMSe and r are given in Angstrom.

Additionally the dataset should be "representative" of the universe of globular water-soluble protein structures that are amenable to modelling by homology. Roughly equal numbers of all-alpha, all-beta and mixed alpha-beta proteins were chosen, and in each of these three classes they were distributed evenly over the 35-98% pairwise sequence identity range. Table 2 lists the pairs of proteins used, as well as some vital statistics.

Most modelling procedures use the backbone of the template as the backbone of the model, and add the sidechains onto this backbone. The RMSe of the backbone will therefore be the same as the RMSd between the backbone of the real structure and that of the template. We call this the starting error. Obviously, under normal conditions the final all-atom RMSe will always be bigger than this starting error. Energy based calculations are not yet refined enough to improve the results significantly (see the section on energy minimisation). Statistical methods can indicate "where" backbone modifications are likely to be needed but, except for some simple cases, we cannot yet predict "how" to modify the backbone.
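
For readers who want to reproduce such numbers, the RMS displacement between two sets of equivalenced atoms can be computed as in the minimal sketch below. It assumes that the two structures have already been superposed and that the atoms are listed in the same order; the function name and the toy coordinates are illustrative and are not taken from the study.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square displacement between two equally long lists of
    (x, y, z) coordinates of equivalenced, already superposed atoms."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Toy coordinates in Angstrom: every atom is displaced by exactly 1.0 along z.
template_backbone = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
real_backbone = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (2.0, 0.0, 1.0)]

print(f"starting error: {rmsd(template_backbone, real_backbone):.3f} Angstrom")
```

Because the squares of the displacements are averaged, a few badly placed loop atoms can dominate the RMSd of an otherwise well-modelled structure.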

Loops normally have roughly similar conformations in similar structures. A weak correlation is found between differences in loop conformations and mutations involving proline or glycine. However, if loops are not predicted well, this is most often the result of differences in symmetry contacts between these loops in the model and the template structure.

There is a basic error of around 1.0 Angstrom in the backbone of every model, just as a result of differences between experimental structures. Surface located residues and structural changes caused by symmetry contacts add on average another 0.5 Angstrom to the RMS. In the core the error is normally much less than 1.0 Angstrom. At the surface it is often more than 2.0 Angstrom. Of course some models will have lower RMS errors, but the problem is that in practical cases one cannot know how good the models are; one can only gamble [61].

Energy minimisation

All 68 models were energy minimised using GROMOS [156] (other programs give the same or similar results), and after fixed numbers of energy minimisation steps the partially minimised structures were evaluated. The RMSe was measured, and all 68 RMSe values measured after 100, 200, etc., energy minimisation steps were simply averaged. The results are summarised in table 3. Two things are clearly seen: 1) the improvements that can be achieved are minimal, and 2) the energy minimisation run should be short, because after a while the models get worse again. Table 3 shows only averages, but inspection of the individual numbers shows that in all but three cases the optimum lies between 50 and 300 energy minimisation steps. Inspection of some individual energy minimisation runs indicates that during the first steps the largest errors (such as two atoms being a bit too close to each other, a hydrogen bond that does not have optimal geometry, or a backbone angle that was already imperfect in the template) are removed. At every step, however, many very small errors are introduced. In the beginning the removal of the big errors outweighs the introduction of the many small errors. Once all larger problems have been solved, the only thing that still happens is the introduction of many small errors.

	Steps		Ave. RMSe
	0		1.4622
	100		1.4542
	200		1.4529
	300		1.4335
	500		1.4552
	2000		1.4553
	8000		1.4553
Table 3. Average RMSe after a fixed number of energy minimisation steps. The average RMSe was calculated averaging the RMSe of the 68 individual structures in each of the energy minimisation runs.
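
The size of the effect is easily quantified from Table 3. The short sketch below uses only the averages listed in the table; the variable names are illustrative.

```python
# (number of energy minimisation steps, average RMSe in Angstrom), from Table 3.
table3 = [
    (0, 1.4622),
    (100, 1.4542),
    (200, 1.4529),
    (300, 1.4335),
    (500, 1.4552),
    (2000, 1.4553),
    (8000, 1.4553),
]

start_rmse = table3[0][1]

# Row with the lowest average RMSe.
best_steps, best_rmse = min(table3, key=lambda row: row[1])
improvement = start_rmse - best_rmse

print(f"best average RMSe {best_rmse:.4f} after {best_steps} steps")
print(f"gain over the unminimised models: {improvement:.4f} Angstrom "
      f"({100.0 * improvement / start_rmse:.1f}%)")
```

Even at the optimum the average gain is only a few hundredths of an Angstrom, which is small compared with the starting error of the models.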

Modeling without homology

Most of the above deals with modeling in three dimensions. That is, it is assumed that a good model can be built. When this is not the case, other techniques such as secondary structure prediction can help. It is often not clear why predicted secondary structures are published at all, but in the hands of a biocomputing expert some information can be extracted from the prediction. The best secondary structure prediction program available (this is written on August 17, 1996) is without doubt PHD. This program can be used via the WWW (see below).

Future developments in protein modelling will involve the use of information other than homology to build models. Such information can essentially be anything. Predicted secondary structure, accessibility or contacts can be used equally well as observed cysteine bridges, proteolytic cleavage sites or accessibilities.

Concluding remarks

Model building by homology is a young field. Many improvements can still be made, and much work still needs to be done to make these improvements. Our modeller can still learn a lot from the professional gambler, but we expect that improvements in energy calculation based software will lead to a breakthrough within 10 years. We would not be surprised if, until this happens, improving the odds of present-day methods by the inclusion of information from multiple templates, the design of new algorithms and heuristics, better and larger databases, the rapid growth of the PDB, and a few more factors that we cannot yet predict will step by step create the progress in homology modelling that is needed to close the structure gap.

	WWW addresses

	Secondary structure prediction:
http://swift.embl-heidelberg.de/predictprotein/

	Protein structure quality:
http://swift.embl-heidelberg.de/pdbreport/
http://biotech.embl-heidelberg.de:8400/

	Protein structure comparison:
http://www.ebi.ac.uk/dali/

Acknowledgements

We thank Chris Sander, Rob Hooft, Glay Chinea, Enzo de Filippis, Hans Doeberling and his team, Brigitte Altenberg and Karina Krmoian for stimulating discussions and practical help. We apologise to the people working on other good modelling programs (especially Ruben Abagyan and Andrej Sali) for not having enough space to explain their methods and programs in detail. We apologise to the numerous crystallographers who made all this work possible by depositing structures in the PDB for not referring to each of the 4000 very important articles describing these structures.
This article was written by R.Rodriguez and G.Vriend.

References

1) The relation between the divergence of sequence and structure in proteins. 
Chothia, C., Lesk, A.M., EMBO J., (1986) 5, 823-836. 

2) Database of homology-derived protein structures and the structural meaning of 
sequence alignment. Sander, C., Schneider, R., PROTEINS, (1991) 9, 56-68. 

3) Modelling by homology. Swindells, M.B., Thornton, J.M., Curr.Op.Struc.Biol., 
(1991) 1, 219-223.

4) Structural relationships of homologous proteins as a fundamental principle in 
homology modeling. Hilbert, M., Böhm, G., Jaenicke, R., PROTEINS, (1993) 17, 
138-151.

5) How different amino acid sequences determine similar protein structures: the 
structure and evolutionary dynamics of the globins. Lesk, A.M., Chothia, C., 
J.Mol.Biol., (1980) 136, 225-270.

6) On the use of sequence homologies to predict protein structure: identical 
pentapeptides can have completely different conformations. Kabsch, W., 
Sander, C., PNAS, (1984) 81, 1075-1078.

7) Evolution of proteins formed by β-sheets. I. Plastocyanin and Azurin. Chothia, 
C., Lesk, A.M., J.Mol.Biol., (1982) 160, 309-323.

8) Knowledge-based model building of proteins: concepts and examples. Bajorath, 
J., Stenkamp, R., Aruffo, A., Prot.Sci., (1993) 2, 1798-1810.

9) Homology modelling: inferences from tables of aligned sequences. Lesk, A.M., 
Boswell, D.R., Curr.Op.Struc.Biol., (1992) 2, 242-247.

10) A new method for building protein conformations from sequence alignments 
with homologues of known structure. Havel, T.F., Snow, M.E., J.Mol.Biol., 
(1991) 217, 1-7.

11) Rebuilding flavodoxin from Cα coordinates: a test study. Reid, L.S., Thornton, 
J.M., PROTEINS, (1989) 5, 170-182.

12) Comparative modeling of homologous proteins. Greer, J., Meth.Enzym., 
(1991) 202, 239-252.

13) Homology modeling of divergent proteins. Sudarsanam, S., March, C.J., 
Srinivasan, S., J.Mol.Biol., (1994) 241, 143-149.

14) Protein model building using structural homology. Lee, R.H., Nature, (1992) 
356, 543-544.

15) Comparative modelling by satisfaction of spatial restraints. Sali, A., Blundell, 
T.L., J.Mol.Biol., (1993) 234, 779-815.

16) Modelling of globular proteins. A distance based search procedure for the 
construction of insertion regions and pro <--> non-pro mutations. Summers, 
N.L., Karplus, M., J.Mol.Biol., (1990) 216,991-1016.

17) Prediction of homologous protein structures based on conformational 
searches and energetics. Schiffer, C.A., Caldwell, J.W., Kollman, P.A., Stroud, 
R.M., PROTEINS, (1990) 8, 30-43.

18) Modelling by homology. Swindells, M.B., Thornton, J.M., Curr.Op.Struc.Biol., 
(1991) 1, 219-223.

19) A large scale experiment to assess protein structure prediction methods. 
Moult, J., Pedersen, J.T., Judson, R., Fidelis, K., PROTEINS, (1995) 23, 2-4.

20) A critical assessment of comparative molecular modeling of tertiary structures 
of proteins. Mosimann, S., Meleshko, R., James, N.G., PROTEINS, (1995) 23, 
301-317.

21) Analysis of six protein structures predicted by comparative modelling 
techniques. Harrison, R.W., Chatterjee, D., Weber, I.T., PROTEINS, (1995) 23, 463-
471.

22) Homology modelling by the ICM method. Cardozo, T., Totrov, M., Abagyan, 
R., PROTEINS, (1995) 23, 403-414.

23) Homology modelling of histidine-containing phosphocarrier protein and 
eosinophil-derived neurotoxin: construction of models and comparison with 
experiment. Church, W.B., Palmer, A., Wathey, J.C., Kitson, D.H., PROTEINS, 
(1995) 23, 422-430.

24) Confronting the problem of interconnected structural changes in the 
comparative modeling of proteins. Samudrala, R., Pedersen, J.T., Zhou, H.-B., 
Luo, R., Fidelis, K., Moult, J., PROTEINS, (1995) 23, 327-336.

25) Evaluation of comparative protein modeling by MODELLER. Sali, A., 
Potterton, L., Yuan, F., Vlijmen, H. van, Karplus, M., PROTEINS, (1995) 23, 318-
326.

26) Modelling mutations and homologous proteins. Sali, A., Curr.Op.Struc.Biol., 
(1995) 6, 437-451.

27) Detection of common three dimensional substructures in proteins. Vriend, 
G., Sander, C., PROTEINS (1991) 11, 52-58.

28) Multiple protein structure alignment from tertiary structure comparison: 
assignment of global and residue confidence levels. Russell, R.B., Barton, G.J., 
PROTEINS (1992) 14, 309-323.

29) Identification of protein folds: Matching hydrophobicity patterns of sequence 
sets with solvent accessibility patterns of known structures. Bowie, J.U., Clarke, 
N.D., Pabo, C.O., Sauer, R.T., PROTEINS (1990) 7, 257-264.

30) Identification of tertiary structure resemblance in proteins using a maximal 
common subgraph isomorphism algorithm. Grindley, H.M., Artymiuk, P.J., 
Rice, D.W., Willett, P., J.Mol.Biol., (1993) 229, 707-721.

31) The alignment of protein structures in three dimensions. Zuker, M., 
Somorjai, R.L., Bull. Math.Biol. (1989) 51, 55-78.

31) A rapid method for protein structure alignment. Orengo, C.A., Taylor, W.R., 
J.Theor.Biol., (1990) 147, 517-551.

32) Comparison of three-dimensional structures of homologous proteins. 
Overington, J.P., Curr.Op.Struc.Biol., (1992) 2, 394-401.

33) A variable gap penalty function and feature weights for protein 3-D structure 
comparisons. Zhu, Z.-Y., Sali, A., Blundell, T.L., Prot.Engin., (1992) 5, 43-51.

34) Fast structure alignment for database searching. Orengo, C.A., Brown, N.P., 
Taylor, W.R., PROTEINS (1992) 14, 139-167.

35) Size independent comparison of protein three dimensional structures. 
Maiorov, V.N., Crippen, G.M., PROTEINS, (1995) 22, 273-283.

36) Common spatial arrangements of backbone fragments in homologous and 
non-homologous proteins. Alexandrov, N.N., Takahashi, K., Go, N., 
J.Mol.Biol., (1992) 225, 5-9.

37) An efficient automated computer vision based technique for detection of 
three dimensional structural motifs in proteins. Fisher, D., Bachar, O., 
Nussinov, R., Wolfson, H., J.Biomol.Struct.&Dyn., (1992) 9, 769-789.

38) Techniques for the calculation of three dimensional structural similarity 
using inter-atomic distances. Pepperrell, C., Willett, P., J.Comp.-Aid.Mol.Des., 
(1991) 5, 455-474.

39) Significance of root-mean-square deviation in comparing three-dimensional 
structures of globular proteins. Maiorov, V.N., Crippen, G.M., J.Mol.Biol., 
(1994) 235, 625-634.

40) A protein structure comparison methodology. Brown, N.P., Orengo, C.A., 
Taylor, W.R., Comp.Chem. (1996) 20, 359-380.

41) Protein structure alignment. Taylor, W.R., Orengo, C.A., J.Mol.Biol., (1988) 
208, 1-22.

42) Definition of general topological equivalence in protein structures. Sali, A., 
Blundell, T.L., J.Mol.Biol., (1990) 212, 403-428.

43) Comparison of conformational characteristics in structurally similar protein 
pairs. Flores, T.P., Orengo, C.A., Moss, D.S., Thornton, J.M., Prot.Sci., (1993) 2, 
1811-1826.

44) Protein structure comparison by alignment of distance matrices. Holm, L., 
Sander, C., J.Mol.Biol., (1993) 233, 123-138.

45) Biological meaning, statistical significance, and classification of local spatial 
similarities in nonhomologous proteins. Prot.Sci., (1994) 3, 866-875.

46) Founding fathers and families. Brändén, C.-I., Nature, (1990) 346, 607-608.

47) A database of protein structure families with common folding motifs. Holm, 
L., Ouzounis, C., Sander, C., Tuparev, G., Vriend, G., Prot.Sci., (1992) 1, 1691-
1698.

48) Searching protein structure databases has come of age. Holm, L., Sander, C., 
PROTEINS, (1994) 19, 165-173.

49) SCOP: A structural classification of proteins database for investigation of 
sequence and structures. Murzin, A.G., Brenner, S.E., Hubbard, T., Chothia, C., 
J.Mol.Biol., (1995) 247, 536-540.

50) OB (oligonucleotide/oligosaccharide binding)-fold: common structural and 
functional solution for non-homologous sequences. Murzin, A.G., EMBO J., 
(1993) 12, 861-867. 

51) Structural features can be unconserved in proteins with similar folds. Russell, 
R.B., Barton, G.J., J.Mol.Biol., (1994) 244, 332-350.

52) Different protein sequences can give rise to highly similar folds through 
different stabilizing interactions. Laurents, D.V., Subbiah, S., Levitt, M., 
Prot.Sci., (1994) 3, 1938-1944.

53) Thiol proteases. Comparative studies on the high resolution structures of 
papain and actinidin, and on amino acid sequence information for cathepsins 
B and H, and stem bromelain. Kamphuis, I.G., Drenth, J., Baker, E.N., J.Mol.Biol., 
(1985) 182, 317-329.

54) Similarity of active-site structures. Pearl, L., Nature, (1993) 362, 24.

55) Three-dimensional, sequence order-independent structural comparison of a 
serine protease against the crystallographic database reveals active site 
similarities: potential implications to evolution and to protein folding. Fisher, 
D., Wolfson, H., Lin, S.L., Nussinov, R., Prot.Sci., (1994) 3, 769-778.

56) Plastic adaptation toward mutation in proteins: structural comparison of 
thymidilate synthases. Perry, K.M., Fauman, E.B., Finer-Moore, J.S., Montfort, 
W.R., Maley, G.F., Maley, F., Stroud, R.M., PROTEINS, (1990) 8, 315-333.

57) Three dimensional structural resemblance between leucine aminopeptidase 
and carboxypeptidase A revealed by graph-theoretical techniques. Artymiuk, 
P.J., Grindley, H.M., Park, J.E., Rice, D.W., Willett, P., FEBS Lett., (1992) 303, 48-52.

58) Recurrence of a binding motif? Swindells, M.B., Orengo, C.A., Jones, D.T., 
Pearl, L.H., Thornton, J.M., Nature, (1993) 362, 299.

59) PROCHECK: a program to check the stereochemical quality of protein 
structures. Laskowski, R.A., MacArthur, M.W., Moss, D.S., Thornton, J.M., 
J.Appl.Cryst., (1993) 26, 283-291.

60) Stereochemical quality of protein-structure coordinates. Morris, A.L., 
MacArthur, M.W., Hutchinson, E.G., Thornton, J.M., PROTEINS, (1992) 12, 
345-364.

61) Errors in protein structures. Hooft, R.W.W., Vriend, G., Sander, C., Abola, 
E.E., Nature, (1996) 381, 272.

62) Recognition of errors in three dimensional structures of proteins. Sippl, M.J., 
PROTEINS, (1993) 17, 355-362.

63) Assessment of protein models with three dimensional profiles. Lüthy, R., 
Bowie, J.U., Eisenberg, D., Nature, (1992) 356, 83-85.

64) Criteria that discriminate between native proteins and incorrectly folded 
models. Novotny, J., Rashin, A.A., Bruccoleri, R.E., PROTEINS, (1988) 4, 19-30.

65) Knowledge-based prediction of protein structures and the design of novel 
molecules. Blundell, T.L., Sibanda, B.L., Sternberg, M.J.E., Thornton, J.M., 
Nature, (1987) 326, 347-352.

66) Amino acid pair interchanges at spatially conserved locations. Naor, D., 
Fisher, D., Jernigan, R.L., Wolfson, H.J., Nussinov, R., J.Mol.Biol., (1996) 256, 
924-938.

67) Environment-specific amino acid substitution tables: tertiary templates and 
prediction of protein folds. Overington, J., Donnelly, D., Johnson, M.S., Sali, A., 
Blundell, T.L., Prot.Sci., (1992) 1, 216-226.

68) Recognition of distantly related proteins through energy calculations. 
Abagyan, R., Frishman, D., Argos, P., PROTEINS, (1994) 19, 132-140.

69) An empirical energy function for threading protein sequence through the 
folding motif. Bryant, S.H., Lawrence, C.E., PROTEINS, (1993) 16, 92-112. 

70) Prediction of protein structure by evaluation of sequence-structure fitness. 
Ouzounis, C., Sander, C., Scharf, M., Schneider, R., J.Mol.Biol., (1993) 232, 805-
825.

71) Threading a database of protein cores. Madej, T., Gibrat, J.-F., Bryant, S.H., 
PROTEINS, (1995) 23, 356-369.

72) Protein structure prediction by threading methods: evaluation of current 
techniques. Lemer, C.M.-R., Rooman, M.J., Wodak, S.J., PROTEINS, (1995) 23, 
337-355.

73) Fold recognition and ab initio structure predictions using hidden Markov 
models and β-strand pair potentials. Hubbard, T.J., Park, J., PROTEINS, (1995) 
23, 398-402.

74) A branch-and-bound algorithm for optimal protein threading with pairwise 
(contact potential) amino acid interactions. Lathrop, R.H., Smith, T.F., Proc. 27th 
Hawaii Intl. Conf. on System Sciences (1994) IEEE Comp. Soc. Press. 365-374.

75) A structural basis for sequence comparisons. Johnson, M.S., Overington, J.P., 
J.Mol.Biol., (1993) 233, 716-738.

76) Structural analysis based on state-space modeling. Stultz, C.M., White, J.V., 
Smith, T.F., Prot.Sci., (1993) 2, 305-314.

77) A Method to identify protein sequences that fold into a known three 
dimensional structure. Bowie, J.U., Lüthy, R., Eisenberg, D., Science, (1991) 253, 
164-170.

78) Rapid and sensitive comparison with FASTA and FASTP. Pearson, W.R., 
Meth.Enzym., (1990) 183, 63-98.

79) Basic local alignment search tool. Altschul, S.F., Gish, W., Miller, W., Myers, 
E.W., Lipman, D.J., J.Mol.Biol., (1990) 215, 403-410.

80) Atomic environment energies in proteins defined from statistics of accessible 
and contact surface areas. Delarue, M., Koehl, P., J.Mol.Biol., (1995) 249, 675-690.

81) Evaluation of protein models by atomic solvation preference. Holm, L., 
Sander, C., J.Mol.Biol., (1992) 225, 93-105.

82) Identification of protein sequence homology by consensus template 
alignment. Taylor, W.R., J.Mol.Biol., (1986) 188, 233-258.

83) A fast and sensitive multiple sequence alignment algorithm. Vingron, M., 
Argos, P., CABIOS, (1989) 5, 115-121.

84) A method for multiple sequence alignment with gaps. Subbiah, S., Harrison, 
S.C., J.Mol.Biol., (1989) 209, 539-548.

85) Improving the sensitivity of the sequence profile method. Lüthy, R., Xenarios, 
I., Bucher, P., Prot.Sci., (1994) 3, 139-146.

86) Pattern-induced multi sequence alignment (PIMA) algorithm employing 
secondary structure-dependent gap penalties for use in comparative protein 
modelling. Smith, R.F., Smith, T.F., Prot.Engin., (1992) 5, 35-41.

87) Sequence ordinations: a multivariate analysis approach to analysing large 
sequence data sets. Higgins, D.G., CABIOS, (1992) 8, 15-22.

88) A strategy for the rapid multiple alignment of protein sequences. Confidence 
levels from tertiary structure comparisons. Barton, G.J., Sternberg, M.J.E., 
J.Mol.Biol., (1987) 198, 327-337.

89) Recognition of related proteins by iterative template refinement. Yi, T.-M., 
Lander, E.S., Prot.Sci., (1994) 3, 1315-1328.

90) The three dimensional profile method using residue preference as a 
continuous function of residue environment. Zhang, K.Y.J., Eisenberg, D., 
Prot.Sci., (1994) 3, 687-695.

91) A possible three-dimensional structure of bovine α-lactalbumin based on that 
of hen's egg-white lysozyme. Brown, W.J., North, A.C.T., Phillips, D.C., Brew, 
K., Vanaman, T.C., Hill, R.C., J.Mol.Biol., (1969) 42, 65-86.

92) Computation of structure of homologous proteins: α-lactalbumin from 
lysozyme. Warme, P.K., Momany, F.A., Rumball, S.V., Scheraga, H.A., 
Biochemistry (1974) 13, 768-782.

93) Prediction of protein side-chain conformations from local three dimensional 
homology relationships. Laughton, C.A., J.Mol.Biol., (1994) 235, 1088-1097.

94) Analysis of the relationship between side-chain conformation and secondary 
structure in globular proteins. McGregor, M.J., Islam, S.A., Sternberg, M.J.E., 
J.Mol.Biol., (1987) 198, 295-310.

95) Tertiary templates for proteins. Ponder, J.W., Richards, F.M., J.Mol.Biol., (1987) 
193, 775-791.

96) Rotamers, to be or not to be? Schrauber, H., Eisenhaber, F., Argos, P., 
J.Mol.Biol., (1993) 230, 592-612.

97) Fast and simple Monte Carlo algorithm for side chain optimization in 
proteins: application to model building by homology. Holm, L., Sander, C., 
PROTEINS, (1992) 14, 213-223.

98) Modelling of side chains, loops and insertions in proteins. Summers, N.L., 
Karplus, M., Meth.Enzym., (1991) 202, 156-205.

99) Construction of side-chains in homology modelling. Application to the C-
terminal lobe of rhizopuspepsin. Summers, N.L., Karplus, M., J.Mol.Biol., 
(1989) 210, 785-811.

100) A method to configure protein side-chains from the main-chain trace in 
homology modelling. Eisenmenger, F., Argos, P., Abagyan, R., J.Mol.Biol., 
(1993) 231, 849-860.

101) The dead-end elimination theorem and its use in protein side-chain 
positioning. Desmet, J., Maeyer, M. De., Hazes, B., Lasters, I., Nature, (1992) 356, 
539-542.

102) New paths from dead ends. Taylor, W., Nature, (1992) 356, 478-480.

103) Predicting local structural changes that result from point mutations. Filippis, 
V.de, Sander, C., Vriend, G., Prot.Engin., (1994) 7, 1203-1208.

104) Backbone-dependent rotamer library for proteins. Application to side-chain 
prediction. Dunbrack, R.L.Jr., Karplus, M., J.Mol.Biol., (1993) 230, 543-574.

105) Evidence for strained interactions between side-chains and the polypeptide 
backbone. Stites, W.E., Meeker, A.K., Shortle, D., J.Mol.Biol., (1994) 235, 27-32.

106) Conformational analysis of the backbone dependent rotamer preferences of 
protein side chains. Dunbrack, R.L.Jr., Karplus, M., Nature Struct.Biol., (1994) 5, 
334-340.

107) The use of position specific rotamers in model building by homology. 
Chinea, G., Padron, G., Hooft, R.W.W., Sander, C., Vriend, G., PROTEINS, 
(1995) 23, 415-421.

108) Detailed ab initio prediction of lysozyme-antibody complex with 1.6 Å 
accuracy. Totrov, M.M., Abagyan, R.A., Nature Struct. Biol., (1994) 1, 259-265.

109) Accurate prediction of stability and activity effects of site directed mutagenesis 
on a protein core. Lee, C., Levitt, M., Nature (1991) 352, 448-451.

110) Prediction of the stability and activity effects of site directed mutagenesis. 
Gunsteren, W.F. van, Mark, A.E., J.Mol.Biol., (1992) 227, 389-395.

111) Thermodynamics of protein peptide interactions in the ribonuclease S 
system studied by molecular dynamics and free energy calculations. Simonson, 
T., Brunger, A.T., Biochemistry (1992) 31, 8661-8674.

112) Prediction and analysis of structure, stability and unfolding of thermolysin 
like proteases. Vriend, G., Eijsink, V.G.H., J.Comp.-Aid Mol.Des. (1993) 7, 367-
396.

113) A novel search method for protein sequence-structure relations using 
property profiles. Vriend, G., Sander, C., Stouten, P.W.F., Prot.Engin. (1994) 7, 
23-29.

114) Using known substructures in protein model building and crystallography. 
Jones, T.A., Thirup, S., EMBO J., (1986) 5, 819-823.

115) Selection of representative protein data sets. Hobohm, U., Scharf, M., 
Schneider, R., Sander, C., Prot.Sci., (1992) 1, 409-417.

116) Verification of protein structures: side-chain planarity. Hooft, R.W.W., 
Sander, C., Vriend, G., CABIOS, accepted.

117) Intelligent databases. Parsaye K., Chignell, M., Khoshafian, S., Wong, H., 
John Wiley and sons, Inc., (1989).

118) PKB: A program system and data base for analysis of protein structure. 
Bryant, S.H., PROTEINS (1989) 5, 233-247.

119) Parameter relation rows: a query system for protein structure function 
relationships. Vriend, G., Prot.Engin., (1990) 4, 221-223.

120) A relational data base of protein structures designed for flexible enquiries 
about conformation. Prot.Engin., (1989) 2, 431-442.

121) An object oriented database for protein structure analysis. Gray, P.M.D., 
Paton, N.W., Kemp, G.J.L., Fothergill, J.E., Prot.Engin., (1990) 3, 235-243.

122) SESAM: A relational database for structure and sequence of macromolecules. 
Huysmans, M., Richelle, J., Wodak, S.J., PROTEINS, (1991) 11, 59-76.

123) The protein data bank: A computer based archival file for macromolecular 
structures. Bernstein, F.C., Koetzle, T.F., Williams, G.B., Meyer, E.F.Jr., Brice, 
M.D., Rodgers, J.R., Kennard, O., Shimanouchi, T., Tasumi, M., J.Mol.Biol., 
(1977) 112, 535-542.

124) IPSA-Inductive protein structure analysis. Schultze-Kremer, S., King, R.D., 
Prot.Engin., (1992) 5, 377-390.

125) GBPARSE: a parser for the GenBank flat-file format with new feature table 
format. Read, R.L., Davison, D., Chappelear, J.E., Garavelli, J.S., CABIOS, (1992) 
8, 407-408.

126) A cross reference table between the protein data bank of macromolecular 
structures and the national biomedical research foundation protein 
identification resource amino acid sequence data bank. Lesk, A.M., Boswell, 
D.R., Lesk, V.I., Lesk, V.E., Bairoch, A., Prot.Seq.Data.Anal., (1989) 2, 295-308.

127) The EMBL data library. Stoehr, P.J., Cameron, G.N., NAR, (1991) 19, 2227-
2230.

128) Protein motifs and database searching. Thornton, J.M., Gardner, S.P., TIBS, 
(1989) 14, 300-304.

129) A profile for molecular biology databases and information resources. Kamel, 
N.N., CABIOS, (1992) 8, 311-321.

130) To be published. Jia, Z., Quail, J.W., Waygood, E.B., Delbaere, L.T.J. (1993), 
Deposited in the PDB.

131) Limits to modelbuilding by homology. Rodriguez, R., Vriend, G., to be 
submitted.

132) To be published. Jia, Z., Vandonselaar, M., Hengstenberg, W., Quail, J.W., 
Delbaere, L.T.J. (1993), Deposited in the PDB.

133) Crystallographic analysis of a complex between human immunodeficiency 
virus type 1 protease and acetyl pepstatin at 2.0 Angstrom resolution. 
Fitzgerald, P.M.D., McKeever, B.M., Van Middlesworth, J.F., Springer, J.P., 
Heimbach, J.C., Leu, C.T., Herber, W.K., Dixon, R.A.F., Darke, P.L. (1990) J. 
Biol. Chem. 265, 14209-.

134) Refined 1.6 Å resolution crystal structure of the complex formed between 
porcine β-trypsin and MCTI-A, a trypsin inhibitor of the squash family. 
Huang, Q., Liu, S., Tang, Y. (1993) J. Mol. Biol. 229, 1022-.

135) WHAT IF: A molecular modelling and drug design program. G. Vriend, 
J.Mol.Graph. (1990) 8, 52-56.

136) Hubbard, R.E., In: Computer Graphics and molecular modelling. Edt. 
Fletterick, R.J., Zoller, M., Cold Spring Harbor, (1986) 9-12.

137) A graphics modelbuilding and refinement system for macromolecules. 
Jones, T.A., J.Appl.Cryst. (1978) 268-272.

138) Interactive program for visualization and modelling of proteins, nucleic 
acids and small molecules. Dayringer, H.E., Tramontano, A., Fletterick, R.J., 
J.Mol.Graph. (1986) 4, 82-87.

139) Improved methods for building protein models in electron density maps and 
the location of errors in these models. Jones, T.A., Zou, J.Y., Cowan, S.W., 
Kjeldgaard, M., Acta Cryst A (1991) 47, 110-119.

140) BRAGI: A comprehensive protein modelling program system. Schomburg, 
D., Reichelt, J., J.Mol.Graph. (1988) 6, 161-165.

141) An algorithm for determining the conformation of polypeptide segments in 
proteins by systematic search. Moult, J., James, M.N.G., PROTEINS (1986) 1, 
146-163.

142) Prediction of the folding of short polypeptide segments by uniform 
conformational sampling. Bruccoleri, R.E., Karplus, M., Biopolymers (1987) 26, 
137-168.

143) Predicting antibody hypervariable loop conformations. II: minimization and 
molecular dynamics studies of MCPC603 from many randomly generated loop 
conformations. Fine, R.M., Wang, H., Shenkin, P.S., Yarmush, D.L., Levinthal, 
C., PROTEINS (1986) 1, 342-362.

144) A new method for building protein conformations from sequence 
alignments with homologues with known structure. Havel, T.F., Snow, M.E., 
J.Mol.Biol. (1990) 217, 1-7.

145) Assembly of polypeptide and backbone conformations from low energy 
ensembles of short fragments. Sippl, M.J., Hendlich, M., Lackner, P., Prot.Sci. 
(1992) 1, 625-640.

146) Calculation of protein conformation as an assembly of stable overlapping 
segments: application to BPTI. Simon, I., Glasser, L., Scheraga, H.A., PNAS 
(1991) 88, 3661-3665.

147) On the multiple minima problem in the conformational analysis of 
polypeptides. Ripoll, D.R., Scheraga, H.A., Biopolymers (1990) 30, 165-176.

148) A large scale experiment to assess protein structure prediction methods. 
Moult, J., Judson, R., Fidelis, K., Pedersen, J.T., PROTEINS (1995) 23, ii-iv.

149) Polarity as a criterion in protein design. Baumann, G., Froemmel, C., Sander, 
C., Prot.Engin. (1989) 2, 329-334.

150) Correctly folded proteins make twice as many hydrophobic contacts. Bryant, 
S.H., Amzel, L.M., Int.J.Pept.Prot.Res. (1987) 29, 46-52.

151) Identification of native protein folds amongst a large number of incorrect 
models. Hendlich, M., Lackner, P., Weitcus, S., Floeckner, H., Froschauer, R., 
Gottsbacher, K., Cassari, G., Sippl, M.J., J.Mol.Biol. (1990) 216, 167-180.

152) Stereochemical quality of protein structure coordinates. Morris, A.L., 
MacArthur, M.W., Hutchinson, E.G., Thornton, J.M., PROTEINS (1992) 12, 
345-364.

153) Solvation energy in protein folding and binding. Eisenberg, D., McLachlan, 
A.D., Nature, (1986) 319, 199-203.

154) Novel method for the rapid evaluation of packing in protein structures. 
Gregoret, L.M., Cohen, F.E., J.Mol.Biol. (1990) 211, 959-974.

155) Quality control of protein models: directional atomic contact analysis. 
Vriend, G., Sander, C., J.Appl.Cryst. (1993) 26, 47-60.

156) GROMOS. Van Gunsteren, W.F., Berendsen, H.J., (1987) BIOMOS, 
Biomolecular software, Lab. Phys. Chem., Uni., Groningen, The Netherlands.

© June 21 2000 G Vriend