SCRATCH: A Quick Description

Methods

SSpro v 4.5

SSpro is a server for protein secondary structure prediction based on an ensemble of 100 1D-RNNs (one-dimensional recurrent neural networks). For a detailed explanation of the methods, see the references. SSpro version 1 went online on 3/13/2000. In one year it handled more than 10,000 queries from 60 domains in at least 50 countries.

From the very beginning, SSpro 3.0 was tested by the independent assessor EVA and showed a performance consistently exceeding 76% correctly classified residues on structures with no homologues in the PDB, always ranking first among the servers tested. SSpro v4.0 currently achieves an accuracy of 78.7% on the independent evaluation server EVA. SSpro v4.5 directly incorporates the secondary structure of homologous proteins and adds probabilistic methods to improve the SOV score.

Download SSpro4.0 Executable for Linux (Academic Use Only)

SSpro8

SSpro8 is an experimental extension to SSpro. Instead of using three classes (helix, strand and the rest) to assign the secondary structure of a protein, SSpro8 adopts the full DSSP 8-class output classification:

H: alpha-helix
G: 3-10-helix
I: pi-helix (extremely rare)
E: extended strand
B: beta-bridge
T: turn
S: bend
C: the rest

For a detailed description of the tests performed on SSpro8, see the references. The overall performance (Q8) of the system currently online (based on PSI-BLAST profiles) is approximately 63%. NOTE: SSpro8 is a completely different system from SSpro. Their results may not match.
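
The Q8 figure quoted above is the percentage of residues assigned to the correct class; the same measure over three classes is usually called Q3. A minimal sketch of that calculation follows. The function name and the two example strings are hypothetical and are not taken from SSpro8 or its test sets.

```python
def q_score(predicted: str, observed: str) -> float:
    """Percentage of residues whose predicted class equals the observed class."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have the same length")
    if not observed:
        raise ValueError("empty sequence")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Hypothetical 10-residue fragment: one residue out of ten is misclassified.
observed = "CEEEEECCHH"
predicted = "CEEEECCCHH"
print(q_score(predicted, observed))  # 90.0
```

The same function works for three-class and eight-class strings, since it only compares the letters position by position.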

ABTMpro

ABTMpro is a three-class predictor of transmembrane type. The input is the amino acid sequence and the output is "alpha helical transmembrane", "beta barrel transmembrane" or "non-transmembrane".

CONpro

CONpro is a server that predicts whether the number of contacts of each residue in a protein is above or below the average for that residue. The prediction is based on 1D-RNNs that take as input a multiple alignment of homologues generated by PSI-BLAST. Residues are considered in contact within a threshold radius of 12Å. The accuracy of CONpro is 73%. The complete system is an ensemble of 10 1D-RNNs. For a more detailed explanation, see the references.
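
The two-class target described above can be illustrated with a short Python sketch. It assumes that contact counts are already known for a small set of proteins, takes the average per amino acid type over that set, and labels a residue "+" only when its count is strictly above the average. The function names, the toy data and the handling of ties are assumptions, not details of CONpro.

```python
from collections import defaultdict


def average_contacts_by_residue(sequences, contact_counts):
    """Average contact count per amino acid type over a set of proteins."""
    totals = defaultdict(float)
    occurrences = defaultdict(int)
    for sequence, counts in zip(sequences, contact_counts):
        for amino_acid, count in zip(sequence, counts):
            totals[amino_acid] += count
            occurrences[amino_acid] += 1
    return {aa: totals[aa] / occurrences[aa] for aa in totals}


def label_contacts(sequence, counts, averages):
    """Return '+' where a residue has more contacts than average, '-' otherwise."""
    return "".join(
        "+" if count > averages[aa] else "-" for aa, count in zip(sequence, counts)
    )


# Hypothetical data: two tiny "proteins" with made-up contact counts.
sequences = ["AGA", "GA"]
counts = [[10, 4, 14], [6, 12]]
averages = average_contacts_by_residue(sequences, counts)
print(averages)                                             # {'A': 12.0, 'G': 5.0}
print(label_contacts(sequences[0], counts[0], averages))    # --+
```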

ACCpro

ACCpro is a server for the prediction of the relative solvent accessibility of protein residues. The prediction is based on 1D-RNNs that take as input a multiple alignment of homologues generated by PSI-BLAST. Each residue in a protein is predicted as buried or exposed, i.e. less or more accessible than a specified threshold. All thresholds between 0% and 95% in steps of 5% are available. For a 25% threshold, the 'hard' case corresponding to practically identical numbers of buried and exposed residues, ACCpro correctly classifies 77.2% of the residues, better than any other system previously described. For a more detailed explanation, see the references.

Download ACCpro Executable for Linux (Academic Use Only)

ACCpro20

ACCpro20 predicts the relative solvent accessibility at all thresholds between 0% and 100% at 5% increments.
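
To make the threshold idea concrete, the sketch below turns a list of relative solvent accessibility values (in percent) into one buried/exposed string per threshold. It is an illustration of the labelling only, not of the predictor: the values are made up, the function names are hypothetical, and treating a residue as exposed only when its accessibility is strictly greater than the threshold is an assumption. The default range of 0% to 95% follows the sample output shown later on this page.

```python
def exposure_string(rsa_percent, threshold):
    """'e' for residues more accessible than the threshold, '-' for the others."""
    return "".join("e" if rsa > threshold else "-" for rsa in rsa_percent)


def all_thresholds(rsa_percent, start=0, stop=95, step=5):
    """One exposure string per threshold, keyed by the threshold in percent."""
    return {t: exposure_string(rsa_percent, t) for t in range(start, stop + 1, step)}


# Hypothetical relative solvent accessibilities for five residues.
rsa = [62.0, 3.5, 0.0, 27.0, 96.0]
profiles = all_thresholds(rsa)
for t in (0, 25, 95):
    print(f"{t}% {profiles[t]}")
# 0% ee-ee
# 25% e--ee
# 95% ----e
```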

DOMpro

DOMpro predicts domain locations using a 1D-RNN. DOMpro takes as input the sequence profile, predicted secondary structure, and predicted relative solvent accessibility. The output of the 1D-RNN is a classification of each residue as being in a domain boundary region or not. The domains are then inferred from this output. For a more detailed explanation, see the manuscript in the references.

DISpro

DISpro uses a 1D-RNN to predict the probability that each residue is disordered. The probabilities are also thresholded at 0.5 to make a hard classification. The input to DISpro is the sequence profile, predicted secondary structure, and predicted relative solvent accessibility. For a more detailed explanation, see the manuscript in the references.
Download the dataset used in training DISpro here.
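
The hard classification can be reproduced from the probabilities with a one-line threshold. The sketch below uses the disorder probabilities from the sample reply in the Output format section; the function name is hypothetical, and calling a residue disordered only when its probability is strictly above 0.5 is an assumption (the sample contains no value of exactly 0.5).

```python
def disorder_calls(probabilities, threshold=0.5):
    """'D' for residues with disorder probability above the threshold, 'O' otherwise."""
    return "".join("D" if p > threshold else "O" for p in probabilities)


# Disorder probabilities copied from the sample reply on this page.
probs = [
    0.16, 0.07, 0.04, 0.04, 0.03, 0.02, 0.02, 0.01, 0.02, 0.02,
    0.02, 0.01, 0.01, 0.01, 0.02, 0.02, 0.03, 0.04, 0.06, 0.06,
    0.12, 0.13, 0.20, 0.17, 0.18, 0.20, 0.23, 0.48, 0.55, 0.58,
]
print(disorder_calls(probs))  # 28 'O' characters followed by 'DD'
```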

DIpro

DIpro is a cysteine disulfide bond predictor based on 2D recurrent neural networks, support vector machines, graph matching and regression algorithms. It can predict whether the sequence has disulfide bonds, estimate the number of disulfide bonds, and predict the bonding state of each cysteine and the bonded pairs. It yields the best accuracy on the benchmark dataset Sp39. It can handle any number of disulfide bonds, whereas most methods available so far can handle only fewer than 6 disulfide bonds.

Procedure: The sequence is processed in two steps. In step 1, a support vector machine classifies whether the sequence has disulfide bonds. In step 2, a neural network and a graph algorithm predict the number of bonds and the bond pattern. For a more detailed explanation, see the references.
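
The pairing part of step 2 can be pictured as choosing disjoint cysteine pairs so that the total pair score is as high as possible. The sketch below does this by exhaustive search, which is a stand-in chosen for brevity and is not the graph matching algorithm used by DIpro; it is only practical for a handful of cysteines. The cysteine positions, the pair scores and the function name are all hypothetical.

```python
def best_pairing(cysteines, pair_score, n_bonds):
    """Exhaustively choose n_bonds disjoint cysteine pairs with the highest total score."""

    def search(remaining, bonds_left):
        if bonds_left == 0:
            return 0.0, []
        if len(remaining) < 2 * bonds_left:
            return float("-inf"), []  # not enough cysteines left
        first, rest = remaining[0], remaining[1:]
        # Option 1: leave `first` unbonded.
        best_total, best_pairs = search(rest, bonds_left)
        # Option 2: bond `first` to each later cysteine in turn.
        for k, partner in enumerate(rest):
            others = rest[:k] + rest[k + 1:]
            total, pairs = search(others, bonds_left - 1)
            total += pair_score[(first, partner)]
            if total > best_total:
                best_total, best_pairs = total, [(first, partner)] + pairs
        return best_total, best_pairs

    return search(tuple(cysteines), n_bonds)


# Hypothetical cysteine positions and pairwise bonding scores.
cysteines = [5, 12, 30, 41]
scores = {
    (5, 12): 0.2, (5, 30): 0.9, (5, 41): 0.1,
    (12, 30): 0.3, (12, 41): 0.8, (30, 41): 0.4,
}
total, pairs = best_pairing(cysteines, scores, n_bonds=2)
print(pairs)  # [(5, 30), (12, 41)]
```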

CMAPpro

CMAPpro is a server for the prediction of maps of contacts between protein residues. The prediction is based on ensembles of Generalised Recurrent Neural Networks for the translation of matrices. The input of the system consists of two-dimensional profiles extracted from multiple alignments of homologues generated by PSI-BLAST, and of secondary structure and solvent accessibility predictions obtained from SSpro and ACCpro respectively. Maps at 8Å and 12Å are available, meaning that two amino acids are defined as being in contact if their C-α atoms are closer than 8Å or 12Å respectively. For a description of the tests performed on CMAPpro, see the references.
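
The contact definition above can be written down directly: given C-α coordinates, two residues are in contact when their distance is below the threshold. The sketch below computes such a map for known coordinates, which is the kind of target a contact map predictor is trained to reproduce; it is not CMAPpro itself. The coordinates are made up (five points on a line, 3.8Å apart) and the function name is hypothetical.

```python
import math


def contact_map(ca_coords, threshold):
    """Binary N x N map: 1 where two C-alpha atoms are closer than the threshold (in Å)."""
    n = len(ca_coords)
    return [
        [1 if math.dist(ca_coords[i], ca_coords[j]) < threshold else 0 for j in range(n)]
        for i in range(n)
    ]


# Hypothetical C-alpha coordinates in Å.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0), (15.2, 0.0, 0.0)]
map_8 = contact_map(coords, 8.0)
map_12 = contact_map(coords, 12.0)
print(map_8[0])   # [1, 1, 1, 0, 0]
print(map_12[0])  # [1, 1, 1, 1, 0]
```

Note that `math.dist` requires Python 3.8 or later.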

3Dpro

3Dpro is a server that predicts protein tertiary structure. 3Dpro uses predicted structural features and PDB knowledge-based statistical terms in the energy function. The conformational search uses a move set consisting of fragment replacement (using a fragment library built from the PDB) as well as random perturbations to the model. Moves are accepted or rejected based on a simulated annealing method with linear cooling. Multiple models are constructed using random seeds, and the model with the lowest energy is selected as the final prediction. 3Dpro is currently a de novo method (structural templates are not used).
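
The search strategy described above can be sketched on a toy problem. The code below runs simulated annealing with a linearly decreasing temperature from several random seeds and keeps the lowest-energy result. Everything specific in it is an assumption made for illustration: the one-dimensional energy function, the random perturbation move (there is no fragment replacement here), the Metropolis acceptance rule, the temperatures and the number of steps. It is not the 3Dpro energy function or move set.

```python
import math
import random


def anneal(energy, perturb, start, steps, t_start, t_end, rng):
    """Simulated annealing with linear cooling; returns the best state seen and its energy."""
    current = best = start
    e_current = e_best = energy(start)
    for step in range(steps):
        # Linear cooling from t_start down to t_end (t_end must stay above zero).
        temperature = t_start + (t_end - t_start) * step / max(steps - 1, 1)
        candidate = perturb(current, rng)
        e_candidate = energy(candidate)
        delta = e_candidate - e_current
        # Metropolis rule (an assumption): always accept downhill moves,
        # accept uphill moves with probability exp(-delta / temperature).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, e_current = candidate, e_candidate
            if e_current < e_best:
                best, e_best = current, e_current
    return best, e_best


def toy_energy(x):
    """Hypothetical rugged one-dimensional energy landscape."""
    return (x - 2.0) ** 2 + math.sin(5.0 * x)


def toy_move(x, rng):
    """Hypothetical move: a small random perturbation of the current state."""
    return x + rng.uniform(-0.5, 0.5)


# One run per random seed; the lowest-energy model is the final prediction.
runs = [
    anneal(toy_energy, toy_move, 0.0, 2000, 1.0, 0.01, random.Random(seed))
    for seed in range(5)
]
best_model, best_energy = min(runs, key=lambda run: run[1])
print(best_model, best_energy)
```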

The results of 3Dpro's performance at CASP6 can be found here.

Input formats

Email

Your email address, the place where the prediction will be delivered. NOTE: Check that you typed your address correctly. Approximately 5% of the queries handled by SSpro 1.0 didn't receive an answer because of incorrect typing.

Query name

An optional name for your query. We strongly suggest that you use one, especially if sending more than one query. The order in which you send your queries may not correspond to the order in which you receive the answers.

Input sequence

The sequence of amino acids:

Only a bare sequence is accepted. Please do not use FASTA format.
Spaces, newlines and tabs will be ignored, so feel free to have them in your query.
Letters not corresponding to any amino acid will be treated as X.
Non-alphabetical characters will cause the rejection of the query.
Only the 1-letter amino acid code is accepted. Please do not send nucleotide sequences; if you do, A will be treated as alanine, C as cysteine, etc.
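
The rules above can be checked locally before submitting a query. The following Python sketch mirrors them under stated assumptions: it treats only the 20 standard one-letter codes as amino acids (so letters such as B or Z become X), and it upper-cases the input, which the rules above do not specify. The function and constant names are hypothetical; this is not the server's own validation code.

```python
import string

# Assumption: only the 20 standard one-letter codes count as amino acids.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def normalize_query(raw: str) -> str:
    """Apply the input rules: drop whitespace, reject non-letters, map unknown letters to X."""
    compact = "".join(raw.split())  # removes spaces, newlines and tabs
    if not compact:
        raise ValueError("query contains no sequence")
    bad = sorted({ch for ch in compact if ch not in string.ascii_letters})
    if bad:
        # A FASTA header line is rejected here because of '>' and any digits.
        raise ValueError(f"non-alphabetical characters in query: {bad}")
    return "".join(ch if ch in STANDARD_AMINO_ACIDS else "X" for ch in compact.upper())


print(normalize_query("mqif vkt\nLTGB"))  # MQIFVKTLTGX
```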

Output format

Replies are sent by email. SSpro, SSpro8, ACCpro, ACCpro20, DOMpro, DISpro, DIpro and CONpro replies come as text embedded in the body of the email. Here is an example prediction:

Name: short
JOB ID: 98453

Amino Acids:
MQIFVKTLTGKTITLEVEPSDTIENVKAKI

Predicted Secondary Structure:
CEEEEEEECCCEEEEEECCCCCHHHHHCCC

Predicted Secondary Structure (8 Class):
CEEEEEEEESEEEEEEECCCSHHHHEECCC

Predicted Relative Solvent Accessiblity (at 25% exposed threshold):
ee---ee-eeee-e-e-eeeee-ee-eeee

Predicted Relative Solvent Accessiblity (All Thresholds):
0% ee---e--e-ee-e-e-e-eee-ee-eeee
5% ee---e--e-ee-e-e-e-eee-ee-eeee
10% ee---e--e-ee-e-e-e-eee-ee-eeee
15% ee---e--e-ee-e-e-e-eee-ee-eeee
20% ee---e--e-ee-e-e-e-eee-ee-eeee
25% ee---e--e-ee-e-e-e-eee-ee-eeee
30% ee---e--e-ee-e-e-e-eee-ee-eeee
35% ee---e--e-ee-e-e-e-eee-ee-eeee
40% ee------e--e-e-e---eee-ee-eeee
45% ee---------------------e--eeee
50% e----------------------e--eeee
55% e-------------------------e-ee
60% e---------------------------ee
65% e---------------------------ee
70% e---------------------------ee
75% e---------------------------ee
80% e---------------------------ee
85% e---------------------------ee
90% e---------------------------ee
95% ------------------------------

Predicted Contact Number:
------------------------------

Predicted Disordered Residues:
OOOOOOOOOOOOOOOOOOOOOOOOOOOODD

Predicted Disorder Probability:
0.16 0.07 0.04 0.04 0.03 0.02 0.02 0.01 0.02 0.02 0.02 0.01 0.01 0.01 0.02 0.02 0.03 0.04 0.06 0.06 0.12 0.13 0.20 0.17 0.18 0.20 0.23 0.48 0.55 0.58

Predicted Domains:
Domain 1: 1 - 30

Predicted Disulfide Bonds:
Input sequence has LESS THAN TWO cysteins and therefore cannot form disulfide bonds.

Predicted Contact Maps:
SEE ATTACHMENTS

The predictions have the following meaning:

Amino Acids: The 1-letter code of your protein primary sequence. This line is always present.
SSpro: secondary structure prediction:
- H = helix
- E = strand
- C = the rest
SSpro8: 8-class secondary structure prediction:
- H: alpha-helix
- G: 3-10-helix
- I: pi-helix (extremely rare)
- E: extended strand
- B: beta-bridge
- T: turn
- S: bend
- C: the rest
ACCpro and ACCpro20: Prediction of relative solvent accessibility:
- - : the residue is buried
- e : the residue is exposed
CONpro: Prediction of the number of residue contacts at 12Å relative to the average number of contacts for that amino acid:
- - : the residue has fewer contacts than average
- + : the residue has more contacts than average
DISpro: Prediction of residue order/disorder:
- O : the residue is ordered
- D : the residue is disordered
In addition, probabilities of disorder are provided.
DOMpro: Prediction of domains:
Domain index: start residue - end residue
DIpro: Prediction of disulfide bonds:
DIpro makes two predictions. First, DIpro predicts whether disulfide bonds exist in the protein. Second, DIpro predicts the location of the bonds. These predictions are made independently of each other. In this sample prediction, there are not enough cysteines to form a disulfide bond.
CMAPpro: Prediction of contact maps:
The predictions come as attached raw files, named contact_map.8a and contact_map.12a. If the query is N amino acids long, each file is composed of N lines, each containing N space-separated real numbers. The j-th number on the i-th line represents the estimated probability that amino acids i and j are in contact (i.e. that their C-α atoms are closer than the threshold).
3Dpro: Prediction of tertiary structure:
The predictions come in a separate email and contain a PDB file with the coordinates of the C-α atoms.
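
The contact map attachments described above are simple to read back. The sketch below parses the N-by-N text layout into a list of rows and lists the residue pairs whose contact probability reaches a cutoff. The 3-residue sample text, the 0.5 cutoff and the function names are assumptions for illustration; with a real reply you would pass in the contents of contact_map.8a or contact_map.12a instead.

```python
def parse_contact_map(text):
    """Parse N lines of N space-separated probabilities into a list of rows."""
    rows = [[float(value) for value in line.split()] for line in text.splitlines() if line.strip()]
    n = len(rows)
    if any(len(row) != n for row in rows):
        raise ValueError("contact map is not square")
    return rows


def predicted_contacts(matrix, cutoff=0.5):
    """Pairs (i, j, probability) with i < j, 1-based, whose probability reaches the cutoff."""
    n = len(matrix)
    return [
        (i + 1, j + 1, matrix[i][j])
        for i in range(n)
        for j in range(i + 1, n)
        if matrix[i][j] >= cutoff
    ]


# Hypothetical file contents for a 3-residue query.
sample = "1.00 0.90 0.10\n0.90 1.00 0.70\n0.10 0.70 1.00\n"
matrix = parse_contact_map(sample)
print(predicted_contacts(matrix))  # [(1, 2, 0.9), (2, 3, 0.7)]
```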

Note: Since CMAPpro and 3Dpro predictions are computationally intensive, only proteins of at most 400 amino acids are accepted when CMAPpro or 3Dpro predictions are selected.


References

For a general overview see:

J. Cheng, A. Randall, M. Sweredoski, P. Baldi, SCRATCH: a Protein Structure and Structural Feature Prediction Server, Nucleic Acids Research, Special Issue on Web servers, in press, 2005.

P. Baldi and G. Pollastri, "The Principled Design of Large-Scale Recursive Neural Network Architectures-DAG-RNNs and the Protein Structure Prediction Problem", Journal of Machine Learning Research, 4, 575-602, (2003).

P.Baldi, G.Pollastri, "Machine Learning Structural and Functional Proteomics", IEEE Intelligent Systems (Intelligent Systems in Biology II), March/April 2002.

For an explanation of the methods used in SSpro and SSpro8 see:

G.Pollastri, D.Przybylski, B.Rost, P.Baldi, "Improving the Prediction of Protein Secondary Structure in Three and Eight Classes Using Recurrent Neural Networks and Profiles", Proteins, 47, 228-235, 2002.

Or:
P.Baldi, S.Brunak, P.Frasconi, G.Pollastri, and G.Soda, "Exploiting the Past and the Future in Protein Secondary Structure Prediction", Bioinformatics, 15, 937-946, (1999).

Or (quick abstract):
Pollastri,G.,Baldi,P., "SSpro, a web server for protein secondary structure prediction based on recurrent neural networks"
Proceedings of CASP2000, Asilomar, CA

A more detailed description of 1D-RNNs (formally called bidirectional recurrent neural networks, or BRNNs) can be found here:
Baldi,P., Brunak,S., Frasconi,P., Pollastri,G., and Soda,G., "Bidirectional Dynamics for Protein Secondary Structure Prediction", in Sequence Learning: Paradigms, Algorithms, and Applications, R. Sun and L. Giles Editors, Springer Verlag, (2000).

For an explanation of the methods used in ACCpro and CONpro see:

P. Baldi and G. Pollastri. "The Principled Design of Large-Scale Recursive Neural Network Architectures—DAG-RNNs and the Protein Structure Prediction Problem", Journal of Machine Learning Research, 4, 575-602, 2003.

G.Pollastri, P.Baldi, P.Fariselli, R.Casadio, "Prediction of Coordination Number and Relative Solvent Accessibility in Proteins", Proteins, 47, 142-153, 2002.

Or:
Pollastri,G., Baldi,P., Fariselli,P., Casadio,R., "Improved Prediction of the Number of Residue Contacts in Proteins by Recurrent Neural Networks", Bioinformatics, 17 Suppl 1, S234-S242 (2001).

For an explanation of the methods used in DOMpro see:

J. Cheng, M. Sweredoski, P. Baldi, "DOMpro: Protein Domain Prediction Using Profiles, Secondary Structure, Relative Solvent Accessibility, and Recursive Neural Networks", submitted, 2005.

For an explanation of the methods used in DISpro see:

J. Cheng, M. Sweredoski, P. Baldi, "Accurate Prediction of Protein Disordered Regions by Mining Protein Structure Data", Data Mining and Knowledge Discovery, in press, 2005.

For an explanation of the methods used in DIpro see:

P. Baldi, J. Cheng, A. Vullo, "Large-Scale Prediction of Disulphide Bond Connectivity", Advances in Neural Information Processing Systems (NIPS 2004) 17, L. Saul, Y. Weiss, and L. Bottou editors, MIT Press, pp. 97-104, Cambridge, MA, 2005.

For an explanation of the methods used in CMAPpro see:

G.Pollastri, P.Baldi, "Prediction of Contact Maps by Recurrent Neural Network Architectures and Hidden Context Propagation from All Four Cardinal Corners", Bioinformatics, 18 Suppl 1, S62-S70 (2002).

And:
P.Baldi, G.Pollastri, "Machine Learning Structural and Functional Proteomics", IEEE Intelligent Systems (Intelligent Systems in Biology II), March/April 2002.
