RiNALMo: General-Purpose RNA Language Models Can Generalize Well on Structure Prediction Tasks (2403.00043v2)
Abstract: While RNA has recently been recognized as an interesting target for small-molecule drugs, many challenges remain before we can take full advantage of it. This underscores the need to improve our understanding of RNA structure and function. Over the years, sequencing technologies have produced enormous amounts of unlabeled RNA data, which hold huge untapped potential. Motivated by the successes of protein language models (LMs), we introduce the RiboNucleic Acid Language Model (RiNALMo) to unveil the hidden code of RNA. RiNALMo is the largest RNA LM to date, with 650M parameters pre-trained on 36M non-coding RNA sequences from several databases. It extracts hidden knowledge and captures the structural information implicitly embedded within RNA sequences. RiNALMo achieves state-of-the-art results on several downstream tasks. Notably, we show that its generalization capability overcomes the inability of other deep learning methods for secondary structure prediction to generalize to RNA families unseen during training.
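The self-supervised pre-training the abstract describes follows the recipe of the protein LMs it cites as motivation: a Transformer encoder trained with BERT-style masked language modeling over nucleotide tokens. Below is a minimal sketch of that paradigm under those assumptions; the toy tokenizer, model dimensions, and masking scheme are illustrative stand-ins, not RiNALMo's actual 650M-parameter configuration.

```python
# Minimal sketch of masked-language-model pre-training on RNA sequences.
# Assumption: BERT-style masking of nucleotide tokens; hyperparameters are toy values.
import torch
import torch.nn as nn

VOCAB = {"<pad>": 0, "<mask>": 1, "A": 2, "C": 3, "G": 4, "U": 5}

def tokenize(seq: str) -> torch.Tensor:
    """Map an RNA string to a tensor of vocabulary indices."""
    return torch.tensor([VOCAB[nt] for nt in seq])

class TinyRnaLM(nn.Module):
    """Transformer encoder with a masked-token prediction head."""
    def __init__(self, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(len(VOCAB), d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, len(VOCAB))

    def forward(self, tokens):
        return self.head(self.encoder(self.embed(tokens)))

def mlm_loss(model, tokens, mask_prob=0.15):
    """Corrupt ~15% of positions with <mask> and score recovery of the originals."""
    mask = torch.rand(tokens.shape) < mask_prob
    if not mask.any():          # guarantee at least one masked position
        mask[..., 0] = True
    corrupted = tokens.clone()
    corrupted[mask] = VOCAB["<mask>"]
    logits = model(corrupted)
    # Cross-entropy is computed only at the masked positions.
    return nn.functional.cross_entropy(logits[mask], tokens[mask])

model = TinyRnaLM()
batch = tokenize("GGGAAACUUCGGUUUCCC").unsqueeze(0)  # toy hairpin-like sequence
loss = mlm_loss(model, batch)
loss.backward()
```

Once pre-trained this way, the encoder's per-token embeddings serve as inputs to downstream heads, which is how tasks such as secondary structure prediction are typically fine-tuned on top of a frozen or jointly-tuned LM backbone.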