RiNALMo: General-Purpose RNA Language Models Can Generalize Well on Structure Prediction Tasks (2403.00043v2)

Published 29 Feb 2024 in q-bio.BM and cs.LG

Abstract: While RNA has recently been recognized as an interesting small-molecule drug target, many challenges remain to be addressed before we take full advantage of it. This emphasizes the necessity to improve our understanding of its structures and functions. Over the years, sequencing technologies have produced an enormous amount of unlabeled RNA data, which hides a huge potential. Motivated by the successes of protein language models, we introduce RiboNucleic Acid Language Model (RiNALMo) to unveil the hidden code of RNA. RiNALMo is the largest RNA language model to date, with 650M parameters pre-trained on 36M non-coding RNA sequences from several databases. It can extract hidden knowledge and capture the underlying structure information implicitly embedded within the RNA sequences. RiNALMo achieves state-of-the-art results on several downstream tasks. Notably, we show that its generalization capabilities overcome the inability of other deep learning methods for secondary structure prediction to generalize on unseen RNA families.


Summary

  • The paper introduces RiNALMo, a 650M-parameter RNA language model pre-trained on 36M non-coding RNA sequences that achieves state-of-the-art structure prediction results.
  • The model employs RoPE, SwiGLU, and FlashAttention-2 for efficient training and demonstrates robust cross-family generalization.
  • RiNALMo also excels at functional tasks such as splice-site and mean ribosome loading prediction, broadening its impact in computational biology.

An Expert Overview of RiNALMo: A General-Purpose RNA Language Model

The paper "RiNALMo: General-Purpose RNA Language Models Can Generalize Well on Structure Prediction Tasks" applies a large-scale language model to RNA sequences. The proposed model, RiNALMo (RiboNucleic Acid Language Model), represents a significant advance at the intersection of machine learning and bioinformatics, with a focus on RNA structure prediction.

Model and Dataset

RiNALMo is distinguished by its scale: at 650 million parameters, it is the largest RNA language model to date. It was pre-trained on 36 million non-coding RNA sequences sourced from several publicly available databases, including RNAcentral, Rfam, and others. The architecture is a BERT-style Transformer encoder incorporating rotary positional embedding (RoPE), the SwiGLU activation, and FlashAttention-2 for efficient training.
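
To make these architectural choices concrete, the sketch below shows minimal PyTorch implementations of two of them: RoPE applied to attention queries/keys and a SwiGLU feed-forward block. This is an illustrative sketch, not the authors' code; the dimensions and the rotate-half RoPE variant are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def apply_rope(x: torch.Tensor) -> torch.Tensor:
    """Rotary positional embedding (rotate-half variant).
    x: (batch, seq_len, n_heads, head_dim) queries or keys; head_dim even."""
    b, s, h, d = x.shape
    half = d // 2
    inv_freq = 1.0 / (10000 ** (torch.arange(half, dtype=torch.float32,
                                             device=x.device) / half))
    angles = torch.arange(s, dtype=torch.float32,
                          device=x.device)[:, None] * inv_freq[None, :]
    cos = angles.cos()[None, :, None, :]  # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class SwiGLU(nn.Module):
    """Gated feed-forward block: W2(SiLU(x W1) * (x W3))."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden, bias=False)  # value projection
        self.w2 = nn.Linear(hidden, dim, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```

Inside the encoder, RoPE is applied to queries and keys before attention, and FlashAttention-2 replaces the naive attention kernel for speed without changing the computed result.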

Structural Prediction Capabilities

The paper's central claim is that RiNALMo implicitly captures structural information embedded within RNA sequences, outperforming existing methods on secondary structure prediction. In quantitative evaluations, RiNALMo achieved state-of-the-art results on both intra-family and inter-family secondary structure benchmarks. Notably, it generalizes to RNA families not encountered during training, a significant improvement over conventional deep learning models, which struggle with cross-family generalization on this task.
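
For readers unfamiliar with how such predictions are scored, the sketch below computes precision, recall, and F1 over predicted versus reference base pairs, the standard metrics in this literature. It is a simplification; the paper's exact protocol (e.g., tolerance for shifted pairs) may differ.

```python
def pair_f1(pred_pairs: set[tuple[int, int]],
            true_pairs: set[tuple[int, int]]) -> tuple[float, float, float]:
    """Score predicted base pairs against a reference structure.
    Each pair (i, j), i < j, marks two paired nucleotide positions."""
    tp = len(pred_pairs & true_pairs)  # correctly predicted pairs
    precision = tp / len(pred_pairs) if pred_pairs else 0.0
    recall = tp / len(true_pairs) if true_pairs else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: a hairpin with three reference pairs, two of them recovered.
print(pair_f1({(0, 8), (1, 7)}, {(0, 8), (1, 7), (2, 6)}))  # (1.0, 0.667, 0.8)
```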

Functional Tasks and Evaluation

Beyond structure prediction, RiNALMo was evaluated on tasks tied to RNA function. It performs strongly on multi-species splice-site prediction, outperforming specialized models such as SpliceBERT and Spliceator. In predicting mean ribosome loading (MRL), RiNALMo generalized well to human 5' UTRs despite being fine-tuned exclusively on randomly derived sequences.
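
A typical fine-tuning setup for a task like MRL is a small regression head over pooled encoder embeddings. The sketch below assumes a generic encoder interface returning per-token embeddings; the actual RiNALMo fine-tuning heads and the embedding dimension (1280 here) are assumptions.

```python
import torch
import torch.nn as nn

class MRLRegressor(nn.Module):
    """Regression head on a pretrained RNA encoder (interface assumed)."""
    def __init__(self, encoder: nn.Module, dim: int = 1280):
        super().__init__()
        self.encoder = encoder  # stands in for the pretrained language model
        self.head = nn.Sequential(
            nn.Linear(dim, dim // 2),
            nn.GELU(),
            nn.Linear(dim // 2, 1),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h = self.encoder(tokens)               # (batch, seq_len, dim) embeddings
        pooled = h.mean(dim=1)                 # average-pool over the 5' UTR
        return self.head(pooled).squeeze(-1)   # one scalar MRL value per sequence
```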

Architectural and Training Considerations

The paper underscores the architectural choices and training regimen behind the model's performance. Techniques such as RoPE and the SwiGLU activation function are highlighted for enhancing model capacity and representation quality. Pre-training with masked language modeling (MLM) on a carefully curated dataset gives RiNALMo a robust foundation for both structural and functional RNA tasks.
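
The pre-training objective itself is straightforward to sketch. Below is a minimal BERT-style MLM corruption routine over RNA tokens; the 15% masking rate and 80/10/10 split follow common BERT practice, and the token ids are illustrative, not RiNALMo's actual vocabulary.

```python
import torch

MASK_ID = 4  # assumed vocabulary: A=0, C=1, G=2, U=3, [MASK]=4

def mask_tokens(tokens: torch.Tensor, mask_prob: float = 0.15):
    """BERT-style corruption: select ~15% of positions; of those, 80% become
    [MASK], 10% a random nucleotide, 10% stay unchanged. Returns (inputs, labels)."""
    labels = tokens.clone()
    selected = torch.rand_like(tokens, dtype=torch.float) < mask_prob
    labels[~selected] = -100  # CrossEntropyLoss ignores these positions
    corrupted = tokens.clone()
    roll = torch.rand_like(tokens, dtype=torch.float)
    corrupted[selected & (roll < 0.8)] = MASK_ID          # 80%: [MASK]
    swap = selected & (roll >= 0.8) & (roll < 0.9)
    corrupted[swap] = torch.randint_like(tokens, 0, 4)[swap]  # 10%: random base
    return corrupted, labels                               # rest: unchanged
```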

Implications and Future Work

The introduction of RiNALMo has multiple implications for computational biology. The ability to predict RNA structure and function with improved generalization opens new possibilities in RNA studies and drug discovery. The success of RiNALMo suggests that language models, traditionally used for natural language processing, can be repurposed to decode the complex biological information contained in RNA sequences.

Looking forward, the authors suggest applying the model to tertiary structure prediction, potentially creating a unified framework for RNA structural and functional prediction. The findings may also inspire further research into other biomolecular language models, extending similar methodologies to proteins, DNA, and other complex biological molecules.

Conclusion

RiNALMo represents a substantial stride in RNA language modeling, offering a more general approach to RNA structure prediction. Its ability to generalize to RNA families unseen during training addresses a key limitation of existing deep learning approaches. The model's applicability to a wide range of tasks, coupled with its robust architecture, positions it as a promising tool for advancing RNA research, signaling a broader shift toward data-driven methods in understanding biological complexity.
