Reprogramming Pretrained Language Models for Antibody Sequence Infilling (2210.07144v2)

Published 5 Oct 2022 in q-bio.BM and cs.LG

Abstract: Antibodies comprise the most versatile class of binding molecules, with numerous applications in biomedicine. Computational design of antibodies involves generating novel and diverse sequences while maintaining structural consistency. Designing the complementarity-determining region (CDR), which determines antigen binding affinity and specificity, poses challenges unique to antibodies. Recent deep learning models have shown impressive results; however, the limited number of known antibody sequence/structure pairs frequently degrades performance, particularly by reducing diversity in the generated sequences. In our work we address this challenge by leveraging Model Reprogramming (MR), which repurposes a model pretrained on a source language for tasks that are in a different language and have scarce data, where it may be difficult to train a high-performing model from scratch or to effectively fine-tune an existing pretrained model. Specifically, we introduce ReprogBert, in which a pretrained English LLM is repurposed for protein sequence infilling, thus performing cross-language adaptation with less data. Results on antibody design benchmarks show that our model, trained on a low-resource antibody sequence dataset, generates highly diverse CDR sequences, with more than a two-fold increase in diversity over the baselines, without losing structural integrity and naturalness. The generated sequences also demonstrate enhanced antigen binding specificity and virus neutralization ability. Code is available at https://github.com/IBM/ReprogBERT
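The core reprogramming idea can be sketched as follows: the pretrained English model's token embeddings stay frozen, and a small trainable matrix maps each amino-acid token to a learned mixture of those frozen embeddings, so only the cross-vocabulary projection is trained on the scarce antibody data. This is a minimal illustrative sketch in numpy, not the paper's implementation; the vocabulary sizes and the `theta` parameterization are assumptions chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: a frozen source-language embedding table (BERT's real
# vocabulary is ~30k tokens; we use 1000 here to keep the sketch light) and
# a small target vocabulary of amino acids plus special tokens.
V_src, V_tgt, d = 1000, 25, 768
E_src = rng.standard_normal((V_src, d)).astype(np.float32)  # frozen, pretrained

# Trainable reprogramming matrix (hypothetical name `theta`): each target
# token's embedding is a linear combination of frozen source embeddings.
# Only this V_tgt x V_src matrix would receive gradients during training.
theta = (0.01 * rng.standard_normal((V_tgt, V_src))).astype(np.float32)

def embed_protein(token_ids: np.ndarray) -> np.ndarray:
    """Embed amino-acid token ids via the reprogrammed embedding table."""
    E_tgt = theta @ E_src          # V_tgt x d: target table built on the fly
    return E_tgt[token_ids]        # lookup, as a normal embedding layer would

# A toy masked-CDR fragment: hidden states feed the frozen transformer stack.
ids = np.array([3, 7, 7, 12])
h = embed_protein(ids)
print(h.shape)  # (4, 768)
```

A symmetric trainable projection on the output side maps the model's predictions back from the source vocabulary to amino acids; because the large pretrained backbone stays frozen, the number of trainable parameters scales with the two vocabulary sizes rather than with the model.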
