Explain RiNALMo's Poor Generalization to Telomerase RNA
Determine the factors that cause RiNALMo, a 650-million-parameter encoder-only Transformer language model pre-trained with masked language modeling on 36 million non-coding RNA sequences, to achieve a markedly low F1 score on telomerase RNAs in the inter-family secondary structure prediction benchmark of Szikszai et al. (2022). In that benchmark, the model is trained on eight RNA families and evaluated on the held-out telomerase RNA family.
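For context on the metric at issue, the following is a minimal sketch (not RiNALMo's actual evaluation code) of the base-pair F1 score commonly used in secondary structure benchmarks: precision and recall are computed over predicted versus reference base pairs, here extracted from dot-bracket strings.

```python
# Hypothetical illustration of base-pair F1 for RNA secondary structure.
# Structures are in dot-bracket notation; each matched "()" is a base pair.

def pairs(db: str) -> set:
    """Extract the set of base pairs (i, j) from a dot-bracket string."""
    stack, out = [], set()
    for i, c in enumerate(db):
        if c == "(":
            stack.append(i)
        elif c == ")":
            out.add((stack.pop(), i))
    return out

def f1(pred: str, ref: str) -> float:
    """Base-pair F1: harmonic mean of precision and recall over pairs."""
    p, r = pairs(pred), pairs(ref)
    tp = len(p & r)  # true positives: pairs present in both structures
    if tp == 0:
        return 0.0
    precision = tp / len(p)
    recall = tp / len(r)
    return 2 * precision * recall / (precision + recall)
```

Under an inter-family split, a low F1 on telomerase RNAs means the model recovers few of the reference base pairs for sequences from a family it never saw during fine-tuning.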
References
We are currently unable to conclude why RiNALMo fails on telomerase RNAs, but will take more focus on this problem in the future.
— RiNALMo: General-Purpose RNA Language Models Can Generalize Well on Structure Prediction Tasks
(2403.00043 - Penić et al., 29 Feb 2024) in Section 4.1 (Secondary Structure Prediction), paragraph discussing inter-family generalization; after Table 2