Explain RiNALMo's Poor Generalization to Telomerase RNA

Determine the factors that cause RiNALMo, a 650-million-parameter encoder-only Transformer language model pre-trained on 36 million non-coding RNA sequences using masked language modeling, to achieve a markedly low F1 score on telomerase RNAs in the inter-family secondary structure prediction benchmark of Szikszai et al. (2022). In this benchmark, the model is trained on eight RNA families and evaluated on the held-out telomerase RNA family.

Background

The inter-family generalization dataset contains 3,865 RNAs from nine families and is split nine times such that one family is held out for evaluation while the other eight are used for training and validation. RiNALMo significantly outperforms thermodynamics-based and deep learning methods on eight of the nine families, but its F1 score on telomerase RNA (0.12) is notably low compared to its performance on the other families.
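The leave-one-family-out protocol described above can be sketched in a few lines. This is an illustrative reconstruction, not the benchmark's actual code; the record format and family labels are assumptions.

```python
# Sketch of the inter-family (leave-one-family-out) split protocol,
# assuming records of the form (sequence, structure, family).
# Family names and sequences below are illustrative placeholders.
records = [
    ("GGCAUC", "((..))", "tRNA"),
    ("AUGCCA", ".(..).", "5S_rRNA"),
    ("CGAUGCAUA", "(((...)))", "telomerase"),
]

def inter_family_splits(records):
    """Yield (held_out_family, train_set, test_set), one split per family."""
    families = sorted({fam for _, _, fam in records})
    for held_out in families:
        train = [r for r in records if r[2] != held_out]
        test = [r for r in records if r[2] == held_out]
        yield held_out, train, test

for fam, train, test in inter_family_splits(records):
    print(fam, len(train), len(test))
```

With nine families, this yields nine train/test splits; the telomerase split is the one where RiNALMo's F1 drops sharply.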

The authors observe that telomerase RNAs are the longest in the dataset, on average approximately 25% longer than those of the second-longest family, and that t-SNE embeddings show telomerase RNAs clustering without a clear boundary separating them from SRP RNAs. Interestingly, UFold performs best on telomerase RNAs while performing much worse on the other families. The specific reasons underlying RiNALMo's failure on telomerase RNAs remain undetermined and are highlighted for future investigation.
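The F1 scores discussed above are conventionally computed over sets of predicted versus reference base pairs. A minimal sketch, assuming dot-bracket notation without pseudoknots (the paper's exact evaluation code is not reproduced here):

```python
# Base-pair F1 for RNA secondary structure, computed from dot-bracket
# strings. Simplifying assumption: only '(' , ')' and '.' appear
# (no pseudoknot brackets).
def dotbracket_pairs(struct):
    """Extract the set of base pairs (i, j) from a dot-bracket string."""
    stack, pairs = [], set()
    for i, c in enumerate(struct):
        if c == "(":
            stack.append(i)
        elif c == ")":
            pairs.add((stack.pop(), i))
    return pairs

def pair_f1(pred, ref):
    """F1 over predicted vs. reference base-pair sets."""
    tp = len(pred & ref)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(ref)
    return 2 * precision * recall / (precision + recall)

ref = dotbracket_pairs("((..)).((..))")   # 4 reference pairs
pred = dotbracket_pairs("((..)).......")  # recovers 2 of them
print(round(pair_f1(pred, ref), 3))       # 0.667
```

An F1 of 0.12, as reported for telomerase RNA, means only a small fraction of predicted pairs match the reference and vice versa.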

References

We are currently unable to conclude why RiNALMo fails on telomerase RNAs, but will place more focus on this problem in future work.

RiNALMo: General-Purpose RNA Language Models Can Generalize Well on Structure Prediction Tasks (2403.00043 - Penić et al., 29 Feb 2024) in Section 4.1 (Secondary Structure Prediction), paragraph discussing inter-family generalization; after Table 2