Reproducibility of BEACON benchmark evaluations
Determine whether the BEACON (Benchmark for Comprehensive RNA Tasks and Language Models) evaluation can be made reproducible by validating the custom implementations of its metrics, verifying the integrity of its datasets, and identifying and resolving the missing datasets that currently prevent us from reporting results.
We have completed data organization and compiled the benchmark tasks. However, we cannot yet report results because the experimental outcomes have not been reproduced. Our next steps are to verify the custom implementations of the evaluation metrics and to check the integrity of the datasets. In addition, some datasets are still missing from the BEACON benchmark; we have filed an issue with the authors and are awaiting a response.
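One way to make the dataset-integrity check concrete is to record SHA-256 digests from a known-good copy of each dataset file and verify subsequent downloads against them. The sketch below is a minimal, hypothetical illustration: BEACON does not ship such a manifest, so `verify_datasets` and the manifest layout are our own illustrative names, and the digests would first have to be recorded from a trusted download.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large dataset files never load fully into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_datasets(root: Path, manifest: dict) -> list:
    """Check dataset files under `root` against a manifest of expected digests.

    `manifest` maps relative paths to known-good SHA-256 hex digests
    (recorded once from a trusted copy). Returns a list describing every
    file that is missing or whose contents differ from the recorded digest.
    """
    problems = []
    for rel, expected in manifest.items():
        path = root / rel
        if not path.exists():
            problems.append(f"missing: {rel}")
        elif sha256_of(path) != expected:
            problems.append(f"corrupt: {rel}")
    return problems
```

An empty return value would mean every manifest entry is present and unmodified; "missing" entries would correspond to the datasets we are currently awaiting from the authors.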