Impact of MARS-scale RNA pretraining data on RNA language model performance
Determine whether pretraining RNA language models on the Master database of all possible RNA Sequences (MARS) — which integrates RNAcentral, transcriptome assemblies, metagenome assemblies, and genomic sequences into roughly 1 billion RNA sequences — improves model performance relative to pretraining on smaller datasets such as RNAcentral alone, particularly on downstream tasks of RNA secondary structure prediction and RNA family/type classification.
References
However, it remains uncertain whether pretraining on a large RNA database such as MARS would also improve performance, because UNI-RNA is not yet openly available.
— A Comparative Review of RNA Language Models
(2505.09087 - Wang et al., 14 May 2025) in Discussion