Transformer protein language models are unsupervised structure learners
Roshan Rao, Joshua Meier, Tom Sercu, S. Ovchinnikov, Alexander Rives
Published 2020 in bioRxiv
ABSTRACT
Unsupervised contact prediction is central to uncovering physical, structural, and functional constraints for protein structure determination and design. For decades, the predominant approach has been to infer evolutionary constraints from a set of related sequences. In the past year, protein language models have emerged as a potential alternative, but performance has fallen short of state-of-the-art approaches in bioinformatics. In this paper we demonstrate that Transformer attention maps learn contacts from the unsupervised language modeling objective. We find that the highest-capacity models trained to date already outperform a state-of-the-art unsupervised contact prediction pipeline, suggesting these pipelines can be replaced with a single forward pass of an end-to-end model.
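The abstract above describes extracting residue–residue contacts from a Transformer's attention maps. A minimal sketch of that idea is shown below: the paper itself fits a small learned combination over attention heads, whereas this illustration simply averages the heads, then symmetrizes and applies the standard average product correction (APC) used in coevolution-based contact prediction. The function names and head-averaging choice are illustrative assumptions, not the authors' pipeline.

```python
import numpy as np

def apc(scores: np.ndarray) -> np.ndarray:
    """Average product correction: subtract the expected background
    coupling (row mean * column mean / total) from each entry."""
    row = scores.sum(axis=0, keepdims=True)   # shape (1, L)
    col = scores.sum(axis=1, keepdims=True)   # shape (L, 1)
    return scores - (col * row) / scores.sum()

def attention_to_contacts(attn: np.ndarray) -> np.ndarray:
    """Turn attention maps into a contact-score matrix.

    attn: array of shape (num_heads, L, L) holding attention weights
    for one sequence of length L (a stand-in for the model's maps).
    """
    s = attn.mean(axis=0)      # collapse heads (the paper learns weights instead)
    s = (s + s.T) / 2.0        # contacts are symmetric, so symmetrize
    return apc(s)              # remove background coupling signal
```

For a symmetric input the APC term is itself symmetric, so the returned matrix can be read directly as pairwise contact scores, with the top-ranked off-diagonal entries taken as predicted contacts.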
PUBLICATION RECORD
- Publication year: 2020
- Venue: bioRxiv
- Publication date: 2020-12-15
- Fields of study: Biology, Computer Science
- Source metadata: Semantic Scholar