
Large-Scale Multi-omic Biosequence Transformers for Modeling Protein-Nucleic Acid Interactions (2408.16245v5)

Published 29 Aug 2024 in cs.LG and q-bio.BM

Abstract: The transformer architecture has revolutionized bioinformatics and driven progress in the understanding and prediction of the properties of biomolecules. To date, most biosequence transformers have been trained on single-omic data (either proteins or nucleic acids) and have seen incredible success in downstream tasks in each domain, with particularly noteworthy breakthroughs in protein structural modeling. However, single-omic pre-training limits the ability of these models to capture cross-modal interactions. Here we present OmniBioTE, the largest open-source multi-omic model trained on over 250 billion tokens of mixed protein and nucleic acid data. We show that despite only being trained on unlabeled sequence data, OmniBioTE learns joint representations mapping genes to their corresponding protein sequences. We further demonstrate that OmniBioTE achieves state-of-the-art results predicting the change in Gibbs free energy (ΔG) of the binding interaction between a given nucleic acid and protein. Remarkably, we show that multi-omic biosequence transformers emergently learn useful structural information without any a priori structural training, allowing us to predict which protein residues are most involved in the protein-nucleic acid binding interaction. Lastly, compared to single-omic controls trained with identical compute, OmniBioTE demonstrates superior performance-per-FLOP across both multi-omic and single-omic benchmarks, highlighting the power of a unified modeling approach for biological sequences.

Summary

  • The paper introduces OmniBioTE, the first multi-omic transformer model that integrates nucleotide and peptide sequences for joint representation learning.
  • The method uses masked language modeling on 250B tokens to achieve a Pearson correlation of 0.940 for ΔG and 0.892 for ΔΔG predictions.
  • The model’s emergent structural understanding enhances prediction of peptide residues in nucleotide binding, offering new insights for drug discovery.

Large-Scale Multi-omic Biosequence Transformers for Modeling Peptide-Nucleotide Interactions

Overview

The paper presents a novel approach to developing multi-omic models using the transformer architecture, specifically targeting peptide-nucleotide interactions. Unlike previous research efforts that focused on single-omic domains, this work pioneers the integration of multi-omic data to capture and learn the interactions between nucleotide and peptide sequences effectively. The authors introduce the OmniBioTE suite of models, which are trained on a large, diverse corpus of biosequences, and they demonstrate the models' capabilities and generalizability through various downstream tasks and benchmarks.

Key Contributions

The primary contributions of this research can be summarized as follows:

  1. Introduction of Multi-omic Models (MOMs): The authors propose the first multi-omic transformer models trained on both nucleotide and peptide sequences. These models, termed OmniBioTE, are designed to learn joint representations from multi-omic data.
  2. Training Methodology: The models are pre-trained using masked language modeling on a substantial dataset of 250 billion tokens, encompassing various nucleotide and peptide sequences. The training leverages the proven effectiveness of transformer architectures in handling sequential data, extended to the field of multi-omic biosequences.
  3. Evaluation on Multi-omic Tasks: The researchers validate the models’ performance through two primary tasks: predicting the Gibbs free energy (ΔG) of peptide-nucleotide binding interactions and the change in Gibbs free energy (ΔΔG) due to nucleotide mutations. The models achieve state-of-the-art results, showcasing the effectiveness of multi-omic training.
  4. Emergent Structural Learning: Remarkably, the models exhibit an emergent ability to learn structural information from the primary sequences without explicit structural training. This allows the prediction of peptide residues involved in nucleotide binding interactions.
  5. Performance on Single-omic Tasks: Despite being trained on multi-omic data, the models’ performance on single-omic tasks remains competitive. This indicates minimal performance degradation and establishes the generalizability of the multi-omic approach.
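The pre-training recipe in point 2 above can be sketched as standard BERT-style masking applied to a shared nucleotide/amino-acid vocabulary. The vocabulary layout, special tokens, and masking rate below are illustrative assumptions, not details taken from the paper:

```python
import random

# Hypothetical shared vocabulary: nucleotide and amino-acid tokens in a
# single alphabet, plus special tokens. Note the single-letter overlap
# (A, C, G, T are both nucleotides and amino-acid codes).
NUCLEOTIDES = list("ACGT")
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
SPECIAL = ["[MASK]", "[SEP]"]

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """BERT-style masking: return (masked_input, targets), where targets
    hold the original token at masked positions and None elsewhere."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if tok not in SPECIAL and rng.random() < mask_rate:
            masked.append("[MASK]")
            targets.append(tok)
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets

# One multi-omic training example: a gene fragment and its peptide,
# joined by a separator so the model sees both modalities in context.
sequence = list("ATGGCTTGG") + ["[SEP]"] + list("MAW")
masked, targets = mask_tokens(sequence, mask_rate=0.3, seed=42)
```

The model is then trained to recover the original token at every masked position, which is what forces it to learn cross-modal regularities (e.g. codon-to-residue correspondence) without labels.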

Numerical Results and Claims

Multi-omic Task Performance

The models' proficiency in predicting binding interactions and mutations is notably better than recent deep-learning architectures focused on these tasks. For instance, OmniBioTE-XL achieves a Pearson correlation coefficient (PCC) of 0.940 for ΔG and 0.892 for ΔΔG, significantly surpassing the DeePNAP model, which achieves 0.825 and 0.392 respectively.
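The reported metric is the Pearson correlation coefficient between predicted and experimentally measured ΔG values. As a minimal sketch of how such a score is computed (the ΔG values below are invented for illustration, not drawn from the paper's benchmark):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative predicted vs. measured binding free energies (kcal/mol):
predicted_dG = [-9.1, -7.4, -11.2, -8.0, -10.5]
measured_dG  = [-9.0, -7.8, -11.0, -8.3, -10.1]
pcc = pearson(predicted_dG, measured_dG)
```

A PCC of 0.940 on held-out data, as reported for ΔG, means the model's ranking and spacing of binding affinities track the experimental values very closely.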

Emergent Learning and Joint Representations

The models show that peptide embeddings are substantially similar to the embeddings of their corresponding mRNA sequences, which aligns with the Central Dogma of molecular biology. This emergent learning is evident even at smaller scales but becomes more pronounced as the models scale up, implying a robust capability to learn joint representations efficiently.
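This kind of cross-modal alignment is typically measured with cosine similarity between embedding vectors: a peptide's embedding should score higher against its encoding mRNA's embedding than against an unrelated sequence's. The three-dimensional toy vectors below are invented for illustration (real model embeddings are high-dimensional):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy embeddings: a peptide, its encoding mRNA, and an unrelated sequence.
peptide_emb   = [0.9, 0.1, 0.4]
mrna_emb      = [0.8, 0.2, 0.5]
unrelated_emb = [-0.3, 0.9, -0.6]

paired   = cosine_similarity(peptide_emb, mrna_emb)
unpaired = cosine_similarity(peptide_emb, unrelated_emb)
```

In the paper's setting, the claim is that paired gene/protein embeddings cluster together in this sense despite the model never being told which gene encodes which protein.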

Implications and Future Directions

Practical Implications

The ability to predict peptide-nucleotide interactions with high accuracy has direct applications in pharmaceutical development, particularly in the design of nucleotide aptamers and understanding the impact of genetic mutations on protein functions. These models could accelerate drug discovery and aid in developing targeted therapies.

Theoretical Implications

The research opens new avenues in multi-omic modeling, suggesting that joint modality training can encapsulate complex biological relationships better than single-omic models. This could lead to a more unified understanding of biological processes and interactions at the molecular level.

Speculations on Future Developments

In the AI landscape, multi-omic models like OmniBioTE could pave the way for further innovations in biosequence modeling. Future work might involve integrating more advanced training objectives such as structure or property prediction and adopting architectural innovations from multimodal vision-language models. Moreover, scaling the models with larger datasets and compute power could yield even greater performance, potentially leading to breakthroughs in modeling other complex bio-molecular interactions.

Conclusion

The paper convincingly demonstrates the utility of transformer-based multi-omic models in biosequence analysis, particularly in modeling peptide-nucleotide interactions. OmniBioTE models show significant promise, both in terms of setting new benchmarks for specific tasks and in suggesting a paradigm shift towards more integrated and holistic modeling approaches for biological data. The research lays a strong foundation for future advancements in multi-omic bioinformatics and offers a glimpse into the transformative potential of AI in molecular biology.
