Impact of MARS-scale RNA pretraining data on RNA language model performance

Determine whether pretraining RNA language models on the Master database of all possible RNA Sequences (MARS), which integrates RNAcentral, transcriptome assemblies, metagenome assemblies, and genomic sequences (~1 billion RNA sequences in total), improves model performance relative to pretraining on smaller datasets such as RNAcentral alone, particularly for RNA secondary structure prediction and RNA family/type classification.

Background

The paper conducts a zero-shot comparative evaluation of 13 RNA language models (LMs), with DNA and protein LMs as controls, focusing on RNA secondary structure prediction and RNA classification. It identifies a trade-off between models that excel at structural prediction and those that excel at functional classification, and notes the influence of model size and training data on performance.
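
To make the zero-shot evaluation setting concrete, the sketch below shows one generic way such an evaluation could be scored: per-residue embeddings from an RNA LM are turned into pairwise scores, thresholded into predicted base pairs, and compared to a reference structure with F1. This is a minimal illustration, not the paper's pipeline; the embeddings are random stand-ins (no real checkpoint is loaded), and the similarity threshold and pairing heuristic are arbitrary assumptions.

```python
# Hedged sketch of zero-shot secondary-structure scoring from RNA LM embeddings.
# Random vectors stand in for real per-residue embeddings; the cosine-similarity
# pairing heuristic and the 0.3 threshold are illustrative choices only.
import numpy as np

def pairing_scores(embeddings: np.ndarray) -> np.ndarray:
    """Score every residue pair by cosine similarity of their embeddings."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-8, None)
    return unit @ unit.T

def predict_pairs(scores: np.ndarray, threshold: float = 0.3) -> set[tuple[int, int]]:
    """Thresholded pairing: keep high-scoring pairs (i, j) with a minimal loop length."""
    n = scores.shape[0]
    return {(i, j) for i in range(n) for j in range(i + 4, n) if scores[i, j] > threshold}

def f1(pred: set, ref: set) -> float:
    """F1 between predicted and reference base-pair sets."""
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(ref)
    return 2 * precision * recall / (precision + recall)

# Stand-in for per-residue embeddings of a 30-nt RNA from a pretrained model.
rng = np.random.default_rng(0)
emb = rng.normal(size=(30, 64))
reference_pairs = {(0, 29), (1, 28), (2, 27)}  # toy reference base pairs

pred = predict_pairs(pairing_scores(emb))
print(f"zero-shot F1 = {f1(pred, reference_pairs):.3f}")
```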

In discussing paths for improvement, the authors assert that larger model sizes are likely beneficial but explicitly question whether simply increasing the size of the RNA pretraining dataset—specifically using the large-scale MARS database—would further improve RNA LM performance. They note that this question is unresolved in part because UNI-RNA, a model associated with MARS-scale training, is not openly available for evaluation.
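
To make concrete what "simply increasing the size of the RNA pretraining dataset" would mean on the data side, the sketch below streams sequences from a MARS-scale FASTA file mixed with RNAcentral and applies masked-token corruption for masked-language-model pretraining. The file names (rnacentral.fa, mars.fa), the 90/10 sampling ratio, and the 80/10/10 masking recipe are assumptions for illustration, not details reported for MARS, UNI-RNA, or any specific RNA LM.

```python
# Minimal sketch of swapping an RNAcentral-only pretraining stream for a
# mixture dominated by a MARS-scale corpus, with BERT-style nucleotide masking.
# File names, sampling ratio, and masking rates are hypothetical.
import random

def read_fasta(path):
    """Yield one sequence string per FASTA record, with T mapped to U."""
    seq = []
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                if seq:
                    yield "".join(seq)
                    seq = []
            else:
                seq.append(line.strip().upper().replace("T", "U"))
    if seq:
        yield "".join(seq)

def mix_corpora(small, large, p_large=0.9):
    """Interleave two sequence streams, drawing from the larger corpus with probability p_large."""
    while True:
        yield next(large if random.random() < p_large else small)

def mask_tokens(seq, mask_rate=0.15):
    """BERT-style masking: of selected positions, 80% become [MASK], 10% a random base, 10% stay."""
    tokens, labels = list(seq), [None] * len(seq)
    for i in range(len(tokens)):
        if random.random() < mask_rate:
            labels[i] = tokens[i]
            r = random.random()
            if r < 0.8:
                tokens[i] = "[MASK]"
            elif r < 0.9:
                tokens[i] = random.choice("ACGU")
    return tokens, labels

# Hypothetical usage with two local FASTA files (not provided here):
# stream = mix_corpora(read_fasta("rnacentral.fa"), read_fasta("mars.fa"))
# for _ in range(4):
#     print(mask_tokens(next(stream)))
```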

References

"However, it is not certain if a large RNA database (MARS) will also help because UNI-RNA is not yet openly available."

A Comparative Review of RNA Language Models (2505.09087 - Wang et al., 14 May 2025) in Discussion