Automatic or semi-automatic data collection for LMR-Bench
Develop automatic or semi-automatic methods for collecting high-quality data points for LMR-Bench, a benchmark that evaluates large language model agents on reproducing code from NLP research papers and their associated repositories. The goal is to reduce the cost of PhD-level manual annotation and improve scalability while maintaining benchmark quality.
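One plausible shape for such a pipeline is a cheap automatic triage stage that rejects unusable paper–repository candidates outright and routes borderline ones to expert annotators, so that human effort is spent only where it is needed. The sketch below is purely illustrative: the `Candidate` schema, field names, and thresholds are assumptions, not LMR-Bench's actual data format or selection criteria.

```python
from dataclasses import dataclass

# Hypothetical schema: assume each candidate data point pairs a paper with a
# repository and a target function whose implementation the agent must
# reproduce. All names and thresholds here are illustrative assumptions.
@dataclass
class Candidate:
    paper_id: str
    repo_url: str
    has_unit_tests: bool   # can reproduction be verified automatically?
    stars: int             # crude proxy for repository quality
    target_fn_loc: int     # lines of code in the function to be masked

def triage(candidates, min_stars=20, max_loc=200):
    """Split candidates into auto-accepted and needs-expert-review buckets.

    Automatic filters handle the easy rejects; everything that survives
    still passes through a human annotator for final scoping, which is
    what makes the pipeline semi-automatic rather than fully automatic.
    """
    auto, review = [], []
    for c in candidates:
        if not c.has_unit_tests or c.stars < min_stars:
            continue  # reject: no automatic way to verify a reproduction
        if c.target_fn_loc > max_loc:
            review.append(c)  # large target functions need expert scoping
        else:
            auto.append(c)
    return auto, review

auto, review = triage([
    Candidate("2401.00001", "https://example.org/a", True, 50, 100),
    Candidate("2401.00002", "https://example.org/b", False, 100, 50),
    Candidate("2401.00003", "https://example.org/c", True, 30, 300),
])
```

In this toy run the second candidate is rejected (no unit tests), the third is routed to expert review (oversized target function), and only the first is auto-accepted; the real open problem is designing filters and verification signals strong enough that the review bucket stays small without degrading benchmark quality.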
References
To ensure the high quality of our benchmark, the annotation cost is high and scalability is difficult since it requires PhD-level expertise. How to enable automatic or semi-automatic data point collection is an open problem.
— LMR-BENCH: Evaluating LLM Agent's Ability on Reproducing Language Modeling Research
(2506.17335 - Yan et al., 19 Jun 2025) in Section: Limitations