Arne ElofssonProfessor of Bioinformatics
I am a professor in Bioinformatics.
I teach a course in bioinformatics, which is a part of the master in molecular life sciences program, as well as a stand alone distance based course at Stocholm university.
Predicting protein-protein interactions using AI
Combining large scale life-science data with artificial intelligence is crucial for the continued progression of our understanding of the molecular processes that govern life. We propose developing novel deep-learning methods to provide an unprecedented accurate description of the human proteome. Central to the methods we will develop is that they can efficiently utilise both annotated and unannotated biological data using self-supervision. We bring together four groups with complementary skills with the ultimate goal of providing an unprecedented detailed description of the human proteomic landscape. Starting from a novel set of long-read transcripts combined with existing large-scale proteomic analysis, we will first use novel machine learning methods to identify proteoforms, i.e. splice variants and post-translational modifications. These proteoforms will then be the basis for the identification of protein-protein interactions. Two approaches will be developed; first, language models will identify permanent and transient protein-protein interactions from the proteoforms. Secondly, we will use structural modelling by Alphafold to provide atomistic models of interacting protein pairs. Finally, the project will provide atomistic models of these proteoforms and their interaction partners, including large complexes. We will validate these complexes using native mass spectrometry. We believe that this project is extremely timely, given that AlphaFold was the scientific breakthrough in Science 2021 and the method of the year in Nature Methods, and that KAW has invested heavily in data-driven life science (DDLS) and compute power (Berzelius computer at NSC)
We are applying this philosophy primarily to the studies of two very important classes of proteins: transmembrane proteins and repeat domain containing proteins. Both classes of proteins are important drug-targets and/or central to diseases. Our studies use a broad range of techniques. We are primarily using bioinformatics and other computational methods, but also to an increasing fraction biochemical and other experimental techniques.
Saman Hosseini Ashtiani, PhD student
Patrick Bryant, PhD student
Oxana Lundström, PhD student
Gabriele Pozzati, PhD student
Aditi Shenoy, PhD student
Wensi Zhu, PhD student
VR-NT, KAW, EU
A selection from Stockholm University publication database
Accurate contact-based modelling of repeat proteins predicts the structure of new repeats protein families
2021. Claudio Bassot, Arne Elofsson. PloS Computational Biology 17 (4)Article
Repeat proteins are widespread among organisms and particularly abundant in eukaryotic proteomes. Their primary sequence presents repetition in the amino acid sequences that origin structures with repeated folds/domains. Although the repeated units often can be recognised from the sequence alone, often structural information is missing. Here, we used contact prediction for predicting the structure of repeats protein directly from their primary sequences. We benchmark the methods on a dataset comprehensive of all the known repeated structures. We evaluate the contact predictions and the obtained models for different classes of repeat proteins. Further, we develop and benchmark a quality assessment (QA) method specific for repeat proteins. Finally, we used the prediction pipeline for all PFAM repeat families without resolved structures and found that forty-one of them could be modelled with high accuracy. Repeat proteins are abundant in eukaryotic proteomes. They are involved in many eukaryotic specific functions, including signalling. For many of these proteins, the structure is not known, as they are difficult to crystallise. Today, using direct coupling analysis and deep learning it is often possible to predict a protein's structure. However, the unique sequence features present in repeat proteins have been a challenge to use direct coupling analysis for predicting contacts. Here, we show that deep learning-based methods (trRosetta, DeepMetaPsicov (DMP) and PconsC4) overcomes this problem and can predict intra- and inter-unit contacts in repeat proteins. In a benchmark dataset of 815 repeat proteins, about 90% can be correctly modelled. Further, among 48 PFAM families lacking a protein structure, we produce models of forty-one families with estimated high accuracy.
The evolutionary history of topological variations in the CPA/AT transporters
2021. Govindarajan Sudha (et al.). PloS Computational Biology 17 (8)Article
CPA/AT transporters are made up of scaffold and a core domain. The core domain contains two non-canonical helices (broken or reentrant) that mediate the transport of ions, amino acids or other charged compounds. During evolution, these transporters have undergone substantial changes in structure, topology and function. To shed light on these structural transitions, we create models for all families using an integrated topology annotation method. We find that the CPA/AT transporters can be classified into four fold-types based on their structure; (1) the CPA-broken fold-type, (2) the CPA-reentrant fold-type, (3) the BART fold-type, and (4) a previously not described fold-type, the Reentrant-Helix-Reentrant fold-type. Several topological transitions are identified, including the transition between a broken and reentrant helix, one transition between a loop and a reentrant helix, complete changes of orientation, and changes in the number of scaffold helices. These transitions are mainly caused by gene duplication and shuffling events. Structural models, topology information and other details are presented in a searchable database, CPAfold (cpafold.bioinfo.se). Author summary The availability of experimentally solved transmembrane transport structures are sparse, and modelling is challenging as the families contain non-canonical transmembrane helices. Here, we present structural models for all families of CPA/AT transporters. These proteins are then classified into four fold-types, including one novel fold-type, the reentrant-helix-reentrant fold type. We find extensive structural variations within the fold with members having from three to fourteen transmembrane helices. We explore the evolutionary mechanisms that have shaped the topological variations providing a deeper understanding of membrane protein structure and evolution. We also believe our work could serve as a model system to understand the evolution of topology variations for other membrane proteins.
2021. Federico Baldassarre (et al.). Bioinformatics 37 (3), 360-366Article
Motivation: Proteins are ubiquitous molecules whose function in biological processes is determined by their 3D structure. Experimental identification of a protein's structure can be time-consuming, prohibitively expensive and not always possible. Alternatively, protein folding can be modeled using computational methods, which however are not guaranteed to always produce optimal results. GraphQA is a graph-based method to estimate the quality of protein models, that possesses favorable properties such as representation learning, explicit modeling of both sequential and 3D structure, geometric invariance and computational efficiency.
Results: GraphQA performs similarly to state-of-the-art methods despite using a relatively low number of input features. In addition, the graph network structure provides an improvement over the architecture used in ProQ4 operating on the same input features. Finally, the individual contributions of GraphQA components are carefully evaluated.