Pre-training Protein Language Models with Label-Agnostic Binding Pairs Enhances Performance in Downstream Tasks (2012.03084v1)

Published 5 Dec 2020 in q-bio.BM and cs.CL

Abstract: Less than 1% of protein sequences are structurally and functionally annotated. The NLP community has recently embraced self-supervised learning as a powerful approach to learning representations from unlabeled text, in large part due to attention-based, context-aware Transformer models. In this work we present a modification to the RoBERTa model in which the pre-training input is a mixture of binding and non-binding protein sequence pairs (from the STRING database). The pairs carry no label indicating their binding status; the model relies solely on the Masked Language Modeling (MLM) objective during pre-training. After fine-tuning, this approach surpasses models trained on single protein sequences for protein-protein binding prediction, TCR-epitope binding prediction, cellular localization, and remote homology classification tasks. We suggest that the Transformer's attention mechanism contributes to protein binding site discovery. Furthermore, we compress protein sequences by 64% with a Byte Pair Encoding (BPE) vocabulary of 10K subwords, each around 3-4 amino acids long. Finally, to expand the model's input space to even larger proteins and multi-protein assemblies, we pre-train Longformer models that support 2,048 tokens. Further work on token-level classification for secondary structure prediction is needed. Code available at: https://github.com/PaccMann/paccmann_proteomics
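
The official implementation is linked above. As a rough illustration of the recipe the abstract describes (a ~10K-subword BPE vocabulary over amino acids and label-free MLM pre-training on protein sequence pairs), a minimal sketch using the Hugging Face tokenizers and transformers libraries might look like the following. The training file, the two dummy sequences, and all hyperparameters are placeholders for illustration, not the paper's settings.

```python
# Minimal sketch (not the authors' code; see the linked repository for the
# official implementation). It illustrates the general recipe from the
# abstract: learn a ~10K-subword BPE vocabulary from raw amino acid sequences,
# then pre-train a RoBERTa model with MLM on unlabeled sequence pairs.
# "proteins.txt" and the two dummy sequences below are placeholders.
import os

from tokenizers import ByteLevelBPETokenizer
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
)

# 1) Learn a BPE vocabulary (one protein sequence per line in proteins.txt).
os.makedirs("protein_bpe", exist_ok=True)
bpe = ByteLevelBPETokenizer()
bpe.train(files=["proteins.txt"], vocab_size=10_000,
          special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])
bpe.save_model("protein_bpe")
tokenizer = RobertaTokenizerFast.from_pretrained("protein_bpe", max_len=512)

# 2) Encode a (binding or non-binding) pair as one input; no label is attached.
protein_a = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # dummy sequence
protein_b = "GSHMSLYKKAGSAAAVLEENLYFQ"            # dummy sequence
pair = tokenizer(protein_a, protein_b, truncation=True,
                 max_length=512, return_tensors="pt")

# 3) RoBERTa trained with the masked-language-modeling objective only.
config = RobertaConfig(vocab_size=tokenizer.vocab_size,
                       max_position_embeddings=514)
model = RobertaForMaskedLM(config)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)
batch = collator([{k: v.squeeze(0) for k, v in pair.items()}])
loss = model(**batch).loss    # MLM loss over the randomly masked subwords
loss.backward()
```

In a full run, step 3 would sit inside a standard training loop (or the transformers Trainer) over a dataset of STRING-derived pairs; a downstream classifier is then fine-tuned on top of the pre-trained encoder for tasks such as protein-protein binding prediction.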

Authors (4)
  1. Modestas Filipavicius (4 papers)
  2. Matteo Manica (28 papers)
  3. Joris Cadow (5 papers)
  4. Maria Rodriguez Martinez (24 papers)
Citations (13)
