Reproducibility of BEACON benchmark evaluations
Determine whether the BEACON (Benchmark for Comprehensive RNA Tasks and Language Models) evaluation can be made reproducible by validating the custom implementations of its metrics, verifying the integrity of its datasets, and identifying and resolving the missing datasets that currently prevent us from reporting results.
We have completed data organization and compiled the benchmark tasks. However, we cannot yet report results because the experimental outcomes have not been reproduced. Our next steps are to verify the custom implementations of the evaluation metrics and to check the integrity of the datasets. In addition, some datasets are still missing from the BEACON benchmark; we have filed an issue with the authors and are awaiting a response.
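One way to make the dataset-integrity check concrete is to record SHA-256 digests from a known-good copy of each dataset file and verify subsequent downloads against them. The sketch below is a minimal, hypothetical illustration: BEACON does not ship such a manifest, so `verify_datasets` and the manifest layout are our own illustrative names, and the digests would first have to be recorded from a trusted download.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large dataset files never load fully into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_datasets(root: Path, manifest: dict) -> list:
    """Check dataset files under `root` against a manifest of expected digests.

    `manifest` maps relative paths to known-good SHA-256 hex digests
    (recorded once from a trusted copy). Returns a list describing every
    file that is missing or whose contents differ from the recorded digest.
    """
    problems = []
    for rel, expected in manifest.items():
        path = root / rel
        if not path.exists():
            problems.append(f"missing: {rel}")
        elif sha256_of(path) != expected:
            problems.append(f"corrupt: {rel}")
    return problems
```

An empty return value would mean every manifest entry is present and unmodified; "missing" entries would correspond to the datasets we are currently awaiting from the authors.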