SCRATCH: A Quick Description

Methods

SSpro

SSpro is a server for protein secondary structure prediction based on protein evolutionary information (sequence homology) and the secondary structure of homologous proteins (structure homology). For a detailed explanation of the methods, please refer to the references listed at the bottom of this page. SSpro currently achieves a performance exceeding 84% correctly classified residues on proteins with no homologs in the PDB and in the range of 87.8% to 98.7% correctly classified residues on proteins where homologs can be found in the PDB, ranking at the top of the tested prediction servers.

Download SSpro 6.0 (free for academic use)

SSpro8

SSpro8 is an extension to SSpro. Instead of using three classes (helix, strand and the rest) to assign the secondary structure of a protein, SSpro8 adopts the full DSSP 8-class output classification:

H: alpha-helix
G: 3-10-helix
I: pi-helix (extremely rare)
E: extended strand
B: beta-bridge
T: turn
S: bend
C: the rest

For a detailed description of the tests performed on SSpro8, see the references. The overall performance of the system currently online is approximately 72% correctly classified residues on proteins with no homologs in the PDB, and from above 78% up to 98% when homologs can be found in the PDB. NOTE: SSpro8 is a completely different system from SSpro. Their results may not match.

Download SSpro8 6.0 (free for academic use)

ABTMpro

ABTMpro is a server that predicts whether a given protein sequence is a transmembrane protein. If it is, ABTMpro further predicts the probabilities of the protein being an alpha-helical transmembrane protein or a beta-barrel transmembrane protein. The prediction framework consists of a Support Vector Machine that uses features such as amino acid composition and properties, reduced alphabet composition, predicted secondary structure, and evolutionary information. The overall accuracy of ABTMpro is upwards of 97%. It achieves MCC values of 0.93 and 0.94 on smaller data sets and MCC values of 0.85 and 0.63 on much larger tests for alpha-helical and beta-barrel transmembrane proteins, respectively.
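For readers unfamiliar with the MCC (Matthews correlation coefficient) values quoted above, the following is a minimal sketch of how the coefficient is computed from the counts of a binary confusion matrix. The counts used in the example are hypothetical and are not ABTMpro results; returning 0.0 when the denominator vanishes is a common convention assumed here.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: an empty row or column gives MCC = 0.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts, for illustration only.
example = mcc(tp=90, tn=95, fp=5, fn=10)
print(f"MCC = {example:.3f}")
```

An MCC of 1 corresponds to perfect agreement and 0 to a prediction no better than chance, which is why it is reported alongside accuracy on unbalanced data sets.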

CONpro

CONpro is a server that predicts whether the number of contacts of each residue in a protein is above or below the average for that residue. The prediction of CONpro is based on 1D-RNNs, adopting as input a multiple alignment of homologues generated by PSI-BLAST. The threshold radius at which residues are considered in contact is 12Å. The accuracy of CONpro is 73%. The complete system is an ensemble of 10 1D-RNNs. For a more detailed explanation, see the references.

ACCpro

ACCpro is a server for the prediction of the relative solvent accessibility of protein residues. The prediction of ACCpro is based on 1D-RNNs, adopting as input a multiple alignment of homologs generated by PSI-BLAST. Each residue in a protein is predicted as buried or exposed, i.e. more or less accessible than a specified threshold. For the 25% threshold, the 'hard' case corresponding to practically identical numbers of buried and exposed residues, ACCpro correctly classifies ~81% of the residues and up to 90% when homologs exist in the PDB, better than any other system previously described. For a more detailed explanation, see the references.

Download ACCpro 6.0 (free for academic use)

ACCpro20

ACCpro20 predicts the relative solvent accessibility at all thresholds between 0% and 95% at 5% increments. It is a 20-class variant of the ACCpro predictor. Performance of the system currently online is at the same level as the ACCpro predictor in the hard case (25% accessibility threshold), and higher for the other thresholds.

Download ACCpro20 6.0 (free for academic use)
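To make the 20-threshold scheme concrete, here is a small sketch that turns per-residue relative accessibility values into one buried/exposed string per threshold. It is only an illustration of the output layout: the function name accpro20_rows is hypothetical, the input values are invented, and treating "exposed" as strictly greater than the threshold is an assumption, not a statement about how ACCpro20 works internally.

```python
def accpro20_rows(rsa_percent):
    """Map relative solvent accessibility values (in %) to per-threshold strings.

    Returns a dict {threshold: string}, where each string has one character
    per residue: 'e' (exposed) if the value exceeds the threshold, '-' (buried)
    otherwise. The strict '>' comparison is an assumption of this sketch.
    """
    thresholds = range(0, 100, 5)  # 0, 5, ..., 95: twenty thresholds
    return {
        t: "".join("e" if value > t else "-" for value in rsa_percent)
        for t in thresholds
    }


# Invented accessibility values for a three-residue toy sequence.
rows = accpro20_rows([3.0, 27.5, 60.0])
for threshold, row in rows.items():
    print(f"{threshold:>3}% {row}")
```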

DOMpro

DOMpro predicts domain locations using a 1D-RNN. DOMpro takes as input the sequence profile, predicted secondary structure, and predicted relative solvent accessibility. The output of the 1D-RNN is a classification for each residue as being in a domain boundary region or not. The domains are then inferred from this output. For a more detailed explanation, see the manuscript in the references.

Download DOMpro 1.0 executable for Linux (free for academic use) and the datasets used to train DOMpro.

DISpro

DISpro uses a 1D-RNN to predict the probability that residues are disordered. The probabilities are also thresholded at probability 0.5 to make a hard classification. The input to DISpro is the sequence profile, predicted secondary structure, and predicted relative solvent accessibility. For a more detailed explanation, see the manuscript in the references.
Download the dataset used in training DISpro here.

Download DISpro 1.0 executable for Linux (free for academic use)
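The thresholding step described for DISpro can be sketched as follows. The probabilities are the ones shown in the example reply in the Output format section; the function name disorder_string is hypothetical, and whether a probability of exactly 0.5 counts as disordered is an assumption (this sketch treats it as disordered).

```python
def disorder_string(probabilities, threshold=0.5):
    """Convert per-residue disorder probabilities into an O/D string.

    'D' marks a residue whose probability is at or above the threshold,
    'O' marks an ordered residue. The '>=' comparison is an assumption.
    """
    return "".join("D" if p >= threshold else "O" for p in probabilities)


# Disorder probabilities copied from the example reply on this page.
probs = [
    0.16, 0.07, 0.04, 0.04, 0.03, 0.02, 0.02, 0.01, 0.02, 0.02,
    0.02, 0.01, 0.01, 0.01, 0.02, 0.02, 0.03, 0.04, 0.06, 0.06,
    0.12, 0.13, 0.20, 0.17, 0.18, 0.20, 0.23, 0.48, 0.55, 0.58,
]
print(disorder_string(probs))
```

With these values only the last two residues reach the threshold, so the resulting string ends in DD, as in the example reply.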

DIpro

DIpro is a cysteine disulfide bond predictor based on 2D recurrent neural networks, support vector machines, graph matching and regression algorithms. It can predict whether the sequence has disulfide bonds, estimate the number of disulfide bonds, and predict the bonding state of each cysteine and the bonded pairs. It yields the best accuracy on the benchmark dataset Sp39. It can handle any number of disulfide bonds, whereas most methods available so far can only handle fewer than 6 disulfide bonds.

Procedure: The sequence is processed in two steps. In step 1, a support vector machine classifies whether the sequence has disulfide bonds or not. In step 2, a neural network and a graph algorithm predict the number of bonds and the bond pattern. For a more detailed explanation, see the references.
Download DIpro 2.0 software (free for academic use)
Download the dataset used in training DIpro here.

CMAPpro

CMAPpro is a server for the prediction of maps of contacts between protein residues. The prediction of CMAPpro is based on ensembles of Deep Neural Networks, which take into account the spatial dependencies of contact occurrences in local neighborhoods. The input of the system consists of two-dimensional profiles extracted from multiple alignments of homologues generated by PSI-BLAST, secondary structure and solvent accessibility predictions obtained respectively from SSpro and ACCpro, and coarse contacts and orientations between secondary structure elements predicted using two-dimensional Recurrent Neural Networks. Maps at 8Å are available, meaning that two amino acids are defined as being in contact if their C-β atoms (C-α for glycines) are closer than 8Å. For a description of the tests performed on CMAPpro, see the references.
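The contact definition above (C-β atoms, or C-α for glycine, closer than 8Å) can be illustrated with a short sketch that derives a true contact map from known coordinates. This is not CMAPpro code: the function name, the residue tuple layout, and the toy coordinates are all assumptions made for the example.

```python
import math


def contact_map(residues, threshold=8.0):
    """Build an N x N boolean contact map from residue coordinates.

    residues: list of (three_letter_name, ca_xyz, cb_xyz_or_None) tuples
    (a hypothetical layout). The C-beta atom is used as the reference point,
    falling back to C-alpha for glycine or when no C-beta is given.
    """
    def reference_atom(residue):
        name, ca, cb = residue
        return ca if name == "GLY" or cb is None else cb

    points = [reference_atom(r) for r in residues]
    n = len(points)
    return [
        [math.dist(points[i], points[j]) < threshold for j in range(n)]
        for i in range(n)
    ]


# Toy coordinates in angstroms, invented for illustration.
toy = [
    ("ALA", (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)),
    ("GLY", (5.0, 0.0, 0.0), None),
    ("LEU", (20.0, 0.0, 0.0), (21.0, 0.0, 0.0)),
]
cmap = contact_map(toy)
print(cmap[0][1], cmap[0][2])
```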

SVMcon

SVMcon predicts medium- to long-range residue-residue contacts using Support Vector Machines. The contact predictions are in the CASP format (residue index 1, residue index 2, 0, 8, contact probability). The contact distance threshold is 8Å. The sequence separation between two residues is at least 6 residues. For more information, see the references.
[Download SVMcon 1.0]
[Download SVMcon Training Set]
[Download SVMcon Test Set]
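As an illustration of the five-field contact line described for SVMcon, the sketch below parses one line into a small record. The sample line is hypothetical, the names Contact and parse_contact_line are invented for this example, and reading the third and fourth fields as lower and upper distance bounds follows the usual CASP contact-format convention rather than anything stated on this page.

```python
from typing import NamedTuple


class Contact(NamedTuple):
    i: int          # residue index 1
    j: int          # residue index 2
    d_min: float    # lower distance bound (0 in SVMcon output)
    d_max: float    # upper distance bound (8 in SVMcon output)
    prob: float     # predicted contact probability


def parse_contact_line(line: str) -> Contact:
    """Parse one whitespace-separated contact line with exactly five fields."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    return Contact(
        int(fields[0]), int(fields[1]),
        float(fields[2]), float(fields[3]), float(fields[4]),
    )


# Hypothetical line in the described layout.
contact = parse_contact_line("12 45 0 8 0.73")
print(contact, "separation:", abs(contact.j - contact.i))
```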

For commercial license, please contact: igb-license [at] ics [.] uci [.]edu

3Dpro

3Dpro is a server that predicts protein tertiary structure. 3Dpro uses predicted structural features and PDB knowledge-based statistical terms in the energy function. The conformational search uses a move set consisting of fragment replacement (using a fragment library built from the PDB) as well as random perturbations to the model. Moves are selected or rejected based on a simulated annealing method with linear cooling. Multiple models are constructed using random seeds and the model with the lowest energy is selected as the final prediction. 3Dpro is currently a de novo method (structural templates are not used).
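The search strategy described above (simulated annealing with linear cooling, repeated from several random seeds, keeping the lowest-energy result) can be sketched on a toy problem. Everything here is an assumption made for illustration: the Metropolis acceptance rule, the temperature range, the quadratic toy energy, and the random-perturbation move stand in for 3Dpro's actual energy function and move set, which are not reproduced.

```python
import math
import random


def anneal(energy, perturb, state, n_steps=2000, t_start=1.0, t_end=0.01, seed=0):
    """Simulated annealing with a linear cooling schedule (toy sketch)."""
    rng = random.Random(seed)
    current, e_current = state, energy(state)
    best, e_best = current, e_current
    for step in range(n_steps):
        # Temperature decreases linearly from t_start to t_end.
        t = t_start + (t_end - t_start) * step / max(n_steps - 1, 1)
        candidate = perturb(current, rng)
        delta = energy(candidate) - e_current
        # Metropolis rule (assumed): always accept improvements, sometimes accept worse moves.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, e_current = candidate, e_current + delta
            if e_current < e_best:
                best, e_best = current, e_current
    return best, e_best


def toy_energy(x):
    """Stand-in energy: sum of squares, minimal at the origin."""
    return sum(v * v for v in x)


def toy_perturb(x, rng):
    """Stand-in move: randomly perturb one coordinate."""
    y = list(x)
    y[rng.randrange(len(y))] += rng.uniform(-0.5, 0.5)
    return y


start = [3.0, -2.0, 1.5]
# One run per seed; keep the run with the lowest final energy.
runs = [anneal(toy_energy, toy_perturb, start, seed=s) for s in range(5)]
best_model, best_energy = min(runs, key=lambda run: run[1])
print(best_energy)
```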

The results of 3Dpro's performance at CASP6 can be found here.

SOLpro

SOLpro predicts the propensity of a protein to be soluble upon overexpression in E. coli using a two-stage SVM architecture based on multiple representations of the primary sequence. Each classifier of the first layer takes as input a distinct set of features describing the sequence. A final SVM classifier summarizes the resulting predictions and predicts if the protein is soluble or not as well as the corresponding probability.

Download SOLpro (free for academic, non-commercial use).

ANTIGENpro

ANTIGENpro is a sequence-based, alignment-free and pathogen-independent predictor of protein antigenicity. The predictions are made by a two-stage architecture based on multiple representations of the primary sequence and five machine learning algorithms. A final SVM classifier summarizes the resulting predictions and predicts if the protein is likely to be antigenic or not as well as the corresponding probability. ANTIGENpro is the first predictor of whole-protein antigenicity trained using reactivity data obtained by protein microarray analysis for five pathogens.

VIRALpro

VIRALpro is a predictor capable of identifying capsid and tail protein sequences using support vector machines (SVM) with an accuracy estimated to be between 90% and 97%. Predictions are based on the protein amino acid composition, on the protein predicted secondary structure, as predicted by SSpro, and on a boosted linear combination of HMM e-values obtained from 3,380 HMMs built from multiple sequence alignments of specific fragments - called contact fragments - of both capsid and tail sequences.

Download VIRALpro 1.0 (free for academic use)

Input formats

Email

Your email address, the place where the prediction will be delivered. NOTE: Check that you typed your address correctly. Approximately 5% of the queries handled by SSpro 1.0 didn't receive an answer because of incorrect typing.

Query name

An optional name for your query. We strongly suggest that you use one, especially if sending more than one query. The order in which you send your queries may not correspond to the order in which you receive the answers.

Input sequence

The sequence of amino acids:

A bare sequence is accepted. Please do not use FASTA format.
Spaces, newlines and tabs will be ignored, so feel free to have them in your query.
Letters not corresponding to any amino acid will be treated as X.
Non-alphabetical characters will cause the rejection of the query.
Only the 1-letter amino acid code is accepted. Please do not send nucleotide sequences; if you do, A will be treated as Alanine, C as Cysteine, etc.
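To check a query before submitting it, the rules listed above can be approximated with a short sketch. This is not the server's code: the function name normalize_query is hypothetical, and the choices to fold lowercase letters to uppercase, to use the 20 standard amino acid letters as the accepted alphabet, and to reject an empty query are assumptions.

```python
# The 20 standard one-letter amino acid codes (assumed accepted alphabet).
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def normalize_query(raw: str) -> str:
    """Apply the input rules described above to a raw query string.

    Whitespace is dropped, letters outside the amino acid alphabet become X,
    and any non-alphabetical character raises ValueError (query rejected).
    """
    compact = "".join(ch for ch in raw if ch not in " \t\r\n")
    if not compact:
        raise ValueError("empty sequence")  # assumption: empty queries are rejected
    cleaned = []
    for ch in compact:
        if not (ch.isascii() and ch.isalpha()):
            raise ValueError(f"non-alphabetical character {ch!r}: query rejected")
        letter = ch.upper()  # assumption: case is ignored
        cleaned.append(letter if letter in AMINO_ACIDS else "X")
    return "".join(cleaned)


print(normalize_query("MQIF VKT\nLTG\tB"))
```

A FASTA header line such as ">seq1" contains non-alphabetical characters, so under these rules it would cause the whole query to be rejected.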

Output format

Replies are sent by email. SSpro, SSpro8, ACCpro, ACCpro20, DOMpro, DISpro, DIpro, CONpro, SOLpro, ANTIGENpro, and VIRALpro replies come as text, embedded in the body of the email. Here is an example prediction:

Name: short
Amino Acids: MQIFVKTLTGKTITLEVEPSDTIENVKAKI
Predicted Secondary Structure: CEEEEEEECCCEEEEEECCCCCHHHHHCCC
Predicted Secondary Structure (8 Class): CEEEEEEEESEEEEEEECCCSHHHHEECCC
Predicted Relative Solvent Accessiblity (at 25% exposed threshold): ee---ee-eeee-e-e-eeeee-ee-eeee
Predicted Relative Solvent Accessiblity (20 Class):
0% eeeeeeeeeeeeeeeeeeeeeeeeeeeeee
5% eeeeeeeeeeeeeeeeeeeeeeeeeeeeee
10% eeeeeeeeeeeeeeeeeeeeeeeeeeeeee
15% eee--eeeeeee-e-eeeeeeeeeeeeeee
20% eee--ee-eeee-e-eeeeeeeeee-eeee
25% eee--ee-eeee-e-e-eeeeeeee-eeee
30% ee---ee-eeee-e-e-eeeee-ee-eeee
35% ee---ee-eeee-e-e-eeeee-ee-eeee
40% ee---ee-eeee-e-e-eeeee-ee-eeee
45% ee---e--eee----e---ee--ee-eeee
50% ee--------e--------e---ee-eeee
55% e----------------------e---eee
60% e--------------------------eee
65% e---------------------------ee
70% -----------------------------e
75% ----------------------------e
80% -----------------------------e
85% -----------------------------e
90% ------------------------------
95% ------------------------------
100% ------------------------------
Predicted Contact Number: ------------------------------
Predicted Disordered Residues: OOOOOOOOOOOOOOOOOOOOOOOOOOOODD
Predicted Disorder Probability: 0.16 0.07 0.04 0.04 0.03 0.02 0.02 0.01 0.02 0.02 0.02 0.01 0.01 0.01 0.02 0.02 0.03 0.04 0.06 0.06 0.12 0.13 0.20 0.17 0.18 0.20 0.23 0.48 0.55 0.58
Predicted Domains:
Domain 1: 1 - 30
Predicted Disulfide Bonds: Input sequence has LESS THAN TWO cysteins and therefore cannot form disulfide bonds.
Predicted Contact Maps: SEE ATTACHMENTS
Predicted Solubility upon Overexpression: SOLUBLE with probability 0.901803
Predicted Capsid/Tail Sequence:
Capsid Sequence : YES (distance = 0.266294)
Tail Sequence : NO (distance = -0.344029)

The predictions have the following meaning:

Amino Acids: The 1-letter code of your protein primary sequence. This line is always present.
SSpro: secondary structure prediction:
- H = helix
- E = strand
- C = the rest
SSpro8: 8-class secondary structure prediction:
- H: alpha-helix
- G: 3-10-helix
- I: pi-helix (extremely rare)
- E: extended strand
- B: beta-bridge
- T: turn
- S: bend
- C: the rest
ACCpro and ACCpro20: Prediction of relative solvent accessibility:
- - : the residue is buried
- e : the residue is exposed
CONpro: Predictions of the number of residue contacts at 12Å relative to the amino acid's average number of contacts:
- - : the residue has fewer contacts than average
- + : the residue has more contacts than average
DISpro: Predictions of residue order/disorder:
- O : the residue is ordered
- D : the residue is disordered
In addition, probabilities of disorder are provided.
DOMpro: Prediction of domains:
Domain index: start residue - end residue
DIpro: Prediction of disulfide bonds:
DIpro makes two predictions. First, DIpro predicts if disulfide bonds exist in the protein. Second, DIpro predicts the location of the bonds. These predictions are made independently of each other. In this sample prediction, there are not enough cysteines to form a disulfide bond.
CMAPpro: Prediction of contact maps:
The predictions come as attached raw files, named contact_map.8a and contact_map.12a. If the query is N amino acids long, the files are composed of N lines, each containing N space-separated real numbers. The j-th number on the i-th line represents the estimated probability that amino acids i and j are in contact (i.e. of their C-αs being closer than the threshold).
3Dpro: Prediction of tertiary structure:
The predictions come in a separate email and contain a pdb file with the coordinates of the C-α atoms.
SOLpro: Prediction of solubility upon overexpression:
SOLpro predicts if the protein is soluble or not (SOLUBLE/INSOLUBLE) and gives the corresponding probability (≥ 0.5).
VIRALpro: Prediction of capsid & tail proteins:
VIRALpro predicts if the input protein sequence is a capsid sequence or a tail sequence (YES/NO for each). The absolute value of the corresponding score is the distance to the decision boundary of the SVM and can be used for ranking purposes.
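The CMAPpro attachment layout described above (N lines of N space-separated numbers) is straightforward to read programmatically. The sketch below parses such text into a list of rows and looks up one pair; the function names and the 3-residue sample values are hypothetical, and the check that the matrix is square is an addition of this example.

```python
def parse_contact_map(text: str):
    """Parse an N x N contact map given as N lines of N space-separated reals."""
    rows = [
        [float(token) for token in line.split()]
        for line in text.splitlines()
        if line.strip()
    ]
    n = len(rows)
    for number, row in enumerate(rows, start=1):
        if len(row) != n:
            raise ValueError(f"line {number} has {len(row)} values, expected {n}")
    return rows


def contact_probability(matrix, i: int, j: int) -> float:
    """Probability that residues i and j (1-based indices) are in contact."""
    return matrix[i - 1][j - 1]


# Invented values for a three-residue query, for illustration only.
sample = """\
1.00 0.85 0.10
0.85 1.00 0.40
0.10 0.40 1.00
"""
matrix = parse_contact_map(sample)
print(contact_probability(matrix, 1, 2))
```

In practice the text would be read from an attachment such as contact_map.8a before being passed to the parser.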

Note: Since CMAPpro and 3Dpro predictions are computationally intensive, only proteins of at most 400 amino acids will be accepted if CMAPpro or 3Dpro predictions are selected.


References

For the server and SSpro/ACCpro 4.0 software package, please refer to:

J. Cheng, A. Randall, M. Sweredoski, P. Baldi, SCRATCH: a Protein Structure and Structural Feature Prediction Server, Nucleic Acids Research, vol. 33 (web server issue), w72-76, 2005. [PDF] [PDF at NAR website] [Download SSpro/ACCpro 4.0]

For the previous version of the SSpro/ACCpro 5.2 software package, please refer to:

C.N. Magnan & P. Baldi, "SSpro/ACCpro 5: almost perfect prediction of protein secondary structure and relative solvent accessibility using profiles, machine learning and structural similarity", Bioinformatics, 30 (18), 2592-2597, 2014.
Download PDF Abstract & HTML (Bioinformatics Website)

For an explanation of the methods used in SSpro and SSpro8 see:

G.Pollastri, D.Przybylski, B.Rost, P.Baldi, "Improving the Prediction of Protein Secondary Structure in Three and Eight Classes Using Recurrent Neural Networks and Profiles", Proteins, 47, 228-235, 2002.
Download PDF, Abstract and HTML (Proteins web site).

For an explanation of the methods used in ACCpro and CONpro see:

G.Pollastri, P.Baldi, P.Fariselli, R.Casadio, "Prediction of Coordination Number and Relative Solvent Accessibility in Proteins", Proteins, 47, 142-153, 2002.
Download PDF, Abstract and HTML (Proteins web site)

For an explanation of the methods used in DOMpro see:

J. Cheng, M. Sweredoski, P. Baldi, "DOMpro: Protein Domain Prediction Using Profiles, Secondary Structure, Relative Solvent Accessibility, and Recursive Neural Networks", Data Mining and Knowledge Discovery, vol. 13, no. 1, pp. 1-10, 2006.
Download PDF

For an explanation of the methods used in DISpro see:

J. Cheng, M. Sweredoski, P. Baldi, "Accurate Prediction of Protein Disordered Regions by Mining Protein Structure Data", Data Mining and Knowledge Discovery, vol. 11, no. 3, pp. 213-222, 2005.
Download PDF, PDF at DAMI web site

For an explanation of the methods used in DIpro see:
J. Cheng, H. Saigo, P. Baldi, "Large-Scale Prediction of Disulphide Bridges Using Kernel Methods, Two-Dimensional Recursive Neural Networks, and Weighted Graph Matching", Proteins, vol. 62, no. 3, pp. 617-629, 2006. [PDF][PDF at Proteins website] [Download DIpro 2.0]
Or
P. Baldi, J. Cheng, A. Vullo, "Large-Scale Prediction of Disulphide Bond Connectivity", Advances in Neural Information Processing Systems (NIPS 2004) 17, L. Saul, Y. Weiss, and L. Bottou editors, MIT Press, pp. 97-104, Cambridge, MA, 2005.
Download PDF

For an explanation of the methods used in CMAPpro see:

P. Di Lena, K. Nagata, P. Baldi, "Deep Architectures for Protein Contact Map Prediction", Bioinformatics, 2012. In press
Download PDF,HTML abstract (Bioinformatics web site)

P. Di Lena, K. Nagata, P. Baldi, "Deep Spatio-Temporal Architectures and Learning for Protein Structure Prediction", Neural Information Processing Systems (NIPS), 2012. Accepted for presentation.

G.Pollastri, P.Baldi, "Prediction of Contact Maps by Recurrent Neural Network Architectures and Hidden Context Propagation from All Four Cardinal Corners", Bioinformatics, 18 Suppl 1, S62-S70 (2002).
Download PDF, HTML abstract (Bioinformatics web site).

For an explanation of the methods used in SVMcon see:

J. Cheng and P. Baldi. "Improved Residue Contact Prediction Using Support Vector Machines and a Large Feature Set." BMC Bioinformatics. 8:113, 2007.
[Download SVMcon 1.0]

For an explanation of the methods used in COBEpro:

Michael J. Sweredoski and Pierre Baldi. "COBEpro: a novel system for predicting continuous B-cell epitopes." Protein Engineering Design and Selection 2008; doi: 10.1093/protein/gzn075
Download PDF, HTML (Bioinformatics website)
