Automatic or semi-automatic data collection for LMR-Bench

Develop automatic or semi-automatic methods for collecting high-quality data points for LMR-Bench, a benchmark that evaluates large language model agents on reproducing code from NLP research papers and their associated repositories. The goal is to reduce the cost of PhD-level manual annotation and improve scalability while maintaining benchmark quality.

Background

LMR-Bench is curated through a labor-intensive process requiring experienced researchers to select suitable papers, align algorithms to repository code, refactor and mask functions, and construct unit tests within Dockerized environments. This expert-driven workflow ensures quality but is costly and difficult to scale.
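One step of this workflow that lends itself to tooling is function masking: keeping a target function's signature while replacing its body with a stub the agent must reimplement. The sketch below illustrates this step under stated assumptions; the helper name, stub message, and use of Python's ast module are hypothetical and not taken from the LMR-Bench tooling described in the paper.

```python
import ast


def mask_function(source: str, func_name: str) -> str:
    """Replace the body of `func_name` with a NotImplementedError stub.

    Hypothetical helper illustrating the masking step of benchmark curation;
    the signature is preserved so an agent can be asked to fill the body back in.
    """
    tree = ast.parse(source)
    lines = source.splitlines(keepends=True)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == func_name:
            body_start = node.body[0].lineno - 1      # first body line (0-indexed)
            body_end = node.end_lineno                # last line of the function
            indent = " " * node.body[0].col_offset
            stub = indent + "raise NotImplementedError('masked for reproduction task')\n"
            return "".join(lines[:body_start]) + stub + "".join(lines[body_end:])
    raise ValueError(f"function {func_name!r} not found")
```

In a semi-automatic pipeline, the masked file would still be reviewed by an annotator and paired with the unit tests and Dockerized environment described above.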

The authors emphasize that the reliance on PhD-level expertise makes large-scale expansion challenging. Automating or semi-automating the data point collection pipeline—while preserving the benchmark’s methodological rigor and reproducibility—would address these scalability and cost constraints.
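A plausible semi-automatic design keeps humans in the loop for verification while delegating the alignment step to an LLM: given an algorithm description extracted from the paper, the model proposes candidate repository functions to mask, and an annotator accepts or rejects them. The sketch below is a minimal illustration of that idea, not the authors' method; the function name, prompt, and the text-in/text-out `llm` callable are assumptions.

```python
import json
from pathlib import Path
from typing import Callable


def propose_mask_candidates(repo_dir: str, algorithm_summary: str,
                            llm: Callable[[str], str], top_k: int = 3) -> list[dict]:
    """Ask an LLM to align an algorithm description with repository functions.

    `llm` is any prompt-in/text-out callable (e.g. a wrapper around a chat API).
    Returned candidates are intended for human review, not direct inclusion.
    """
    # Collect Python sources, skipping test files and very large files.
    snippets = []
    for path in Path(repo_dir).rglob("*.py"):
        text = path.read_text(errors="ignore")
        if "test" in path.name or len(text) > 20_000:
            continue
        snippets.append(f"# file: {path}\n{text}")

    prompt = (
        "You are helping curate a code-reproduction benchmark.\n"
        f"Algorithm description from the paper:\n{algorithm_summary}\n\n"
        f"Repository code:\n{''.join(snippets)[:60_000]}\n\n"
        f"Return a JSON list of at most {top_k} objects with keys "
        "'file', 'function', and 'rationale' for functions that implement "
        "the described algorithm and could be masked."
    )
    try:
        candidates = json.loads(llm(prompt))
    except json.JSONDecodeError:
        candidates = []  # malformed output falls back to fully manual annotation
    return candidates
```

Any such proposal stage would need the same expert checks the current pipeline applies (correct alignment, meaningful unit tests, reproducible Docker environments) to preserve benchmark quality.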

References

To ensure the high quality of our benchmark, the annotation cost is high and scalability is difficult since it requires PhD-level expertise. How to enable automatic or semi-automatic data point collection is an open problem.

LMR-BENCH: Evaluating LLM Agent's Ability on Reproducing Language Modeling Research (arXiv:2506.17335, Yan et al., 19 Jun 2025), Section: Limitations