Nanjiang Shu


Works at Department of Biochemistry and Biophysics
Visiting address SciLifeLab, Tomtebodavägen 23a, 171 65 Solna
Postal address Institutionen för biokemi och biofysik 106 91 Stockholm


A selection from Stockholm University publication database
  • Karolis Uziela (et al.).

    Protein modeling quality is an important part of protein structure prediction. We have for more than a decade developed a set of methods using various types of protein descriptions and machine learning methods. Common to all these methods has been that the target function, i.e. the description of the quality of a residue in a protein model, has been the S-score. However, many other quality estimation functions also exist. These can roughly be divided into superposition-based functions, like S-score, and contact-based functions. The contact-based methods have been shown to be better at evaluating the quality of multi-domain proteins.

    Here, we examine the effects of retraining ProQ3D using identical inputs but different target functions. We find that using the same target and test function provides the best agreement. However, using contact-based methods provides higher correlations and a better ranking of individual models.

  • Stefano Pascarelli (et al.).

    Motivation: Detection of homologous sequences is the basis for many bioinformatics applications. Position-Specific Scoring Matrices (PSSMs) or Hidden Markov Models (HMMs) are often created from the detected homologous sequences. These are then widely used in many bioinformatics software tools in order to incorporate evolutionary information in the prediction process. However, due to the increase in the size of reference databases, there is a continuous decrease in the speed of homology detection even with faster computers.

    Results: By using PRODRES, we save on average X percent of the search time. This pipeline has been exploited in our widely used topology prediction software, TOPCONS. In total, more than 5 million PSSMs have been generated, with an average running time of about 1 minute. This corresponds to an approximate 10 times speed-up of the whole process.

    Availability and implementation: A standalone version of PRODRES can be found in the GitHub repository, while a web-server implementing the method is available for academic users at

  • 2018. Karolis Uziela (et al.). Proteins 86 (6), 654-663

    Protein modeling quality is an important part of protein structure prediction. We have for more than a decade developed a set of methods for this problem. We have used various types of description of the protein and different machine learning methodologies. However, common to all these methods has been the target function used for training. The target function in ProQ describes the local quality of a residue in a protein model. In all versions of ProQ the target function has been the S-score. However, other quality estimation functions also exist, which can be divided into superposition- and contact-based methods. The superposition-based methods, such as S-score, are based on a rigid body superposition of a protein model and the native structure, while the contact-based methods compare the local environment of each residue. Here, we examine the effects of retraining our latest predictor, ProQ3D, using identical inputs but different target functions. We find that the contact-based methods are easier to predict and that predictors trained on these measures provide some advantages when it comes to identifying the best model. One possible reason for this is that contact based methods are better at estimating the quality of multi-domain targets. However, training on the S-score gives the best correlation with the GDT_TS score, which is commonly used in CASP to score the global model quality. To take the advantage of both of these features we provide an updated version of ProQ3D that predicts local and global model quality estimates based on different quality estimates.

  • 2018. Marco Salvatore, Nanjiang Shu, Arne Elofsson. Protein Science 27 (1), 195-201

    SubCons is a recently developed method that predicts the subcellular localization of a protein. It combines predictions from four predictors using a Random Forest classifier. Here, we present the user-friendly web-interface implementation of SubCons. Starting from a protein sequence, the server rapidly predicts the subcellular localizations of an individual protein. In addition, the server accepts the submission of sets of proteins either by uploading the files or programmatically by using command line WSDL API scripts. This makes SubCons ideal for proteome-wide analyses, allowing the user to scan a whole proteome in a few days. From the web page, it is also possible to download precalculated predictions for several eukaryotic organisms. To evaluate the performance of SubCons we present a benchmark of LocTree3 and SubCons using two recent mass-spectrometry based datasets of mouse and Drosophila proteins. The server is available at

  • 2018. Konstantinos D. Tsirigos (et al.). Current opinion in structural biology 50, 9-17

    Transmembrane proteins perform a variety of important biological functions necessary for the survival and growth of the cells. Membrane proteins are built up by transmembrane segments that span the lipid bilayer. The segments can either be in the form of hydrophobic alpha-helices or beta-sheets which create a barrel. A fundamental aspect of the structure of transmembrane proteins is the membrane topology, that is, the number of transmembrane segments, their position in the protein sequence and their orientation in the membrane. Along these lines, many predictive algorithms for the prediction of the topology of alpha-helical and beta-barrel transmembrane proteins exist. The newest algorithms obtain an accuracy close to 80% both for alpha-helical and beta-barrel transmembrane proteins. However, lately it has been shown that the simplified picture presented when describing a protein family by its topology is limited. To demonstrate this, we highlight examples where the topology is either not conserved in a protein superfamily or where the structure cannot be described solely by the topology of a protein. The prediction of these nonstandard features from sequence alone was not successful until the recent revolutionary progress in 3D-structure prediction of proteins.

  • Article ProQ3D
    2017. Karolis Uziela (et al.). Bioinformatics 33 (10), 1578-1580

    Protein quality assessment is a long-standing problem in bioinformatics. For more than a decade we have developed state-of-the-art predictors by carefully selecting and optimising inputs to a machine learning method. The correlation has increased from 0.60 in ProQ to 0.81 in ProQ2 and 0.85 in ProQ3 mainly by adding a large set of carefully tuned descriptions of a protein. Here, we show that a substantial improvement can be obtained using exactly the same inputs as in ProQ2 or ProQ3 but replacing the support vector machine by a deep neural network. This improves the Pearson correlation to 0.90 (0.85 using ProQ2 input features).

  • Article SubCons
    2017. Marco Salvatore (et al.). Bioinformatics 33 (16), 2464-2470

    Motivation: Knowledge of the correct protein subcellular localization is necessary for understanding the function of a protein. Unfortunately, large-scale experimental studies are limited in their accuracy. Therefore, the development of prediction methods has been limited by the amount of accurate experimental data. However, recently large-scale experimental studies have provided new data that can be used to evaluate the accuracy of subcellular predictions in human cells. Using this data we examined the performance of state-of-the-art methods and developed SubCons, an ensemble method that combines four predictors using a Random Forest classifier. Results: SubCons outperforms earlier methods in a dataset of proteins where two independent methods confirm the subcellular localization. Given nine subcellular localizations, SubCons achieves an F1-Score of 0.79 compared to 0.70 of the second-best method. Furthermore, at a false positive rate (FPR) of 1% the true positive rate (TPR) is over 58% for SubCons compared to less than 50% for the best individual predictor.

  • 2016. Christoph Peters (et al.). Bioinformatics 32 (8), 1158-1162

    Motivation: The translocon recognizes sufficiently hydrophobic regions of a protein and inserts them into the membrane. Computational methods try to determine what hydrophobic regions are recognized by the translocon. Although these predictions are quite accurate, many methods still fail to distinguish marginally hydrophobic transmembrane (TM) helices and equally hydrophobic regions in soluble protein domains. In vivo, this problem is most likely avoided by targeting of the TM-proteins, so that non-TM proteins never see the translocon. Proteins are targeted to the translocon by an N-terminal signal peptide. The targeting is also aided by the fact that the N-terminal helix is more hydrophobic than other TM-helices. In addition, we also recently found that the C-terminal helix is more hydrophobic than central helices. This information has not been used in earlier topology predictors.

    Results: Here, we use the fact that the N- and C-terminal helices are more hydrophobic to develop a new version of the first-principle-based topology predictor, SCAMPI. The new predictor has two main advantages; first, it can be used to efficiently separate membrane and non-membrane proteins directly without the use of an extra prefilter, and second it shows improved performance for predicting the topology of membrane proteins that contain large non-membrane domains.

    Availability and implementation: The predictor, a web server and all datasets are available at

  • 2016. Sikander Hayat (et al.). Bioinformatics 32 (10), 1571-1573

    Accurate topology prediction of transmembrane beta-barrels is still an open question. Here, we present BOCTOPUS2, an improved topology prediction method for transmembrane beta-barrels that can also identify the barrel domain, predict the topology and identify the orientation of residues in transmembrane beta-strands. The major novelty of BOCTOPUS2 is the use of the dyad-repeat pattern of lipid and pore facing residues observed in transmembrane beta-barrels. In a cross-validation test on a benchmark set of 42 proteins, BOCTOPUS2 predicts the correct topology in 69% of the proteins, an improvement of more than 10% over the best earlier method (BOCTOPUS) and in addition, it produces significantly fewer erroneous predictions on non-transmembrane beta-barrel proteins.

  • Article ProQ3
    2016. Karolis Uziela (et al.). Scientific Reports 6

    Quality assessment of protein models using no other information than the structure of the model itself has been shown to be useful for structure prediction. Here, we introduce two novel methods, ProQRosFA and ProQRosCen, inspired by the state-of-the-art method ProQ2, but using a completely different description of a protein model. ProQ2 uses contacts and other features calculated from a model, while the new predictors are based on Rosetta energies: ProQRosFA uses the full-atom energy function that takes into account all atoms, while ProQRosCen uses the coarse-grained centroid energy function. The two new predictors also include residue conservation and terms corresponding to the agreement of a model with predicted secondary structure and surface area, as in ProQ2. We show that the performance of these predictors is on par with ProQ2 and significantly better than all other model quality assessment programs. Furthermore, we show that combining the input features from all three predictors, the resulting predictor ProQ3 performs better than any of the individual methods. ProQ3, ProQRosFA and ProQRosCen are freely available both as a webserver and stand-alone programs at

  • 2015. Konstantinos D. Tsirigos (et al.). Nucleic Acids Research 43 (W1), W401-W407

    TOPCONS is a widely used web server for consensus prediction of membrane protein topology. We hereby present a major update to the server, with some substantial improvements, including the following: (i) TOPCONS can now efficiently separate signal peptides from transmembrane regions. (ii) The server can now differentiate more successfully between globular and membrane proteins. (iii) The server is now even slightly faster, although a much larger database is used to generate the multiple sequence alignments. For most proteins, the final prediction is produced in a matter of seconds. (iv) The user-friendly interface is retained, with the additional feature of submitting batch files and accessing the server programmatically using standard interfaces, making it thus ideal for proteome-wide analyses. Indicatively, the user can now scan the entire human proteome in a few days. (v) For proteins with homology to a known 3D structure, the homology-inferred topology is also displayed. (vi) Finally, the combination of methods currently implemented achieves an overall increase in performance by 4% as compared to the currently available best-scoring methods and TOPCONS is the only method that can identify signal peptides and still maintain a state-of-the-art performance in topology predictions.

  • 2014. Minttu Virkki (et al.). Journal of Molecular Biology 426 (13), 2529-2538

    While early structural models of helix-bundle integral membrane proteins posited that the transmembrane α-helices [transmembrane helices (TMHs)] were orientated more or less perpendicular to the membrane plane, there is now ample evidence from high-resolution structures that many TMHs have significant tilt angles relative to the membrane. Here, we address the question whether the tilt is an intrinsic property of the TMH in question or if it is imparted on the TMH during folding of the protein. Using a glycosylation mapping technique, we show that four highly tilted helices found in multi-spanning membrane proteins all have much shorter membrane-embedded segments when inserted by themselves into the membrane than seen in the high-resolution structures. This suggests that tilting can be induced by tertiary packing interactions within the protein, subsequent to the initial membrane-insertion step.

  • Article KalignP
    2011. Nanjiang Shu, Arne Elofsson. Bioinformatics 27 (12), 1702-1703

    Kalign2 is one of the fastest and most accurate methods for multiple alignments. However, in contrast to other methods Kalign2 does not allow externally supplied position specific gap penalties. Here, we present a modification to Kalign2, KalignP, so that it accepts such penalties. Further, we show that KalignP using position specific gap penalties obtained from predicted secondary structures makes steady improvement over Kalign2 when tested on Balibase 3.0 as well as on a dataset derived from Pfam-A seed alignments.

  • 2010. Tuping Zhou, Nanjiang Shu, Sven Hovmöller. Bioinformatics 26 (4), 470-477

    Motivation: The precise prediction of one-dimensional (1D) protein structure as represented by the protein secondary structure and 1D string of discrete states of dihedral angles (i.e. Shape Strings) is a prerequisite for the successful prediction of three-dimensional (3D) structure as well as protein-protein interaction. We have developed a novel 1D structure prediction method, called Frag1D, based on a straightforward fragment matching algorithm and demonstrated its success in the prediction of three sets of 1D structural alphabets, i.e. the classical three-state secondary structure, three-state Shape Strings and eight-state Shape Strings.

    Results: By exploiting the vast protein sequence and protein structure data available, we have brought secondary structure prediction closer to the expected theoretical limit. When tested by a leave-one-out cross validation on a non-redundant set of PDB chains cut at 30% sequence identity, containing 5860 protein chains, the overall per-residue accuracy for secondary structure prediction, i.e. Q3, is 82.9%. The overall per-residue accuracies for three-state and eight-state Shape Strings are 85.1% and 71.5% respectively. We have also benchmarked our program against the latest version of PSIPRED for secondary structure prediction, and our program scored 0.3% higher in Q3 when tested on 2241 chains with the same training set. For Shape Strings, we compared our method with a recently published method using the same dataset and definition as used by that method. Our program was 2.2% more accurate for three-state Shape Strings. By quantitatively investigating the effect of database size on 1D structure prediction we show that the accuracy increases by about 1% with every doubling of the database size.

  • 2010. Nanjiang Shu (et al.).

    Predicting the three-dimensional (3D) structure of proteins is a central problem in biology. These computationally predicted 3D protein structures have been successfully applied in many fields of biomedicine, e.g. family assignments and drug discovery. The accurate detection of remotely homologous templates is critical for the successful prediction of the 3D structure of proteins. Also, the prediction of one-dimensional (1D) protein structures such as secondary structures and shape strings are useful for predicting the 3D structure of proteins and important for understanding the sequence-structure relationship. In addition, the prediction of the functional sites of proteins, such as metal-binding sites, can not only reveal the important function of proteins (even in the absence of the 3D structure) but also facilitate the prediction of the 3D structure.

    Here, three novel methods in the field of protein structure prediction are presented: PREDZINC, a method for predicting zinc-binding sites in proteins; Frag1D, a method for predicting the 1D structure of proteins; and FragMatch, a method for detecting remotely homologous proteins. These methods compete satisfactorily with the best methods previously published and contribute to the task of protein structure prediction.

  • 2008. Nanjiang Shu, Sven Hovmöller, Tuping Zhou. Current protein and peptide science 9 (4), 310-324

    Different methods for describing and comparing the structures of the tens of thousands of proteins that have been determined by X-ray crystallography are reviewed. Such comparisons are important for understanding the structures and functions of proteins and facilitating structure prediction, as well as assessing structure prediction methods. We summarize methods in this field emphasizing ways of representing protein structures as one-dimensional geometrical strings. Such strings are based on the shape symbols of clustered regions of φ/Ψ dihedral angle pairs of the polypeptide backbones as described by the Ramachandran plot. These one-dimensional expressions are as compact as secondary structure description but contain more information in loop regions. They can be used for fast searching for similar structures in databases and for comparing similarities between proteins and between the predicted and native structures.

  • 2008. Nanjiang Shu (et al.).

    A large number of proteins require certain metals to stabilize their structures or to function properly. About one third of all proteins in the Protein Data Bank (PDB) contain metals and it is estimated that approximately the same proportion of all proteins are metalloproteins.

    Zinc, the second most abundant transition metal found in eukaryotic organisms, plays key roles, mainly structural and catalytic, in many biological functions. Predicting whether a protein binds zinc and even the accurate location of binding sites is important when investigating the function of an experimentally uncharacterized protein.

    Describing and comparing protein structures with both efficiency and accuracy are essential for systematic annotation of functional properties of proteins, be it on an individual or on a genome scale. Dozens of structure comparison methods have been developed in the past decades. In recent years, several research groups have endeavoured in developing methods for fast comparison of protein structures by representing the three-dimensional (3D) protein structures as one-dimensional (1D) geometrical strings based on the shape symbols of clustered regions of φ/ψ torsion angle pairs of the polypeptide backbones. These 1D geometrical strings, shape strings, are as compact as 1D secondary structures but carry more elaborate structural information in loop regions and thus are more suitable for fast structure database searching, classification of loop regions and evaluation of model structures.

    In this thesis, a new method for predicting zinc-binding sites in proteins from amino acid sequences is described. This method predicts zinc-binding Cys, His, Asp and Glu (the four most common zinc-binding residues) with 75% precision (86% for Cys and His only) at 50% recall according to a solid 5-fold cross-validation on a non-redundant set of PDB chains containing 2727 unique chains, of which 235 bind to zinc. This method predicts zinc-binding Cys and His with about 10% higher precision at different recall levels compared to a previously published method. In addition, different methods for describing and comparing protein structures are reviewed. Some recently developed methods based on 1D geometrical representation of backbone structures are emphasized and analyzed in detail.

  • 2008. Nanjiang Shu, Tuping Zhou, Sven Hovmöller. Bioinformatics 24 (6), 775-782

    MOTIVATION: Motivated by the abundance, importance and unique functionality of zinc, both biologically and physiologically, we have developed an improved method for the prediction of zinc-binding sites in proteins from their amino acid sequences. RESULTS: By combining support vector machine (SVM) and homology-based predictions, our method predicts zinc-binding Cys, His, Asp and Glu with 75% precision (86% for Cys and His only) at 50% recall according to a 5-fold cross-validation on a non-redundant set of protein chains from the Protein Data Bank (PDB) (2727 chains, 235 of which bind zinc). Consequently, our method predicts zinc-binding Cys and His with 10% higher precision at different recall levels compared to a recently published method when tested on the same dataset. AVAILABILITY: The program is available for download at


Last updated: December 10, 2018