LMR-Bench: Benchmark for LLM Code Synthesis

Updated 30 July 2025
  • LMR-Bench is a benchmark framework designed to systematically evaluate LLMs' capability to reproduce masked research code extracted from scientific papers and repositories.
  • It employs a dual evaluation approach using unit tests and an LLM-as-a-judge to assess both logical correctness and implementation fidelity.
  • The benchmark features 28 reproduction tasks from 23 NLP papers across 9 research areas, highlighting challenges in repository synthesis and scientific reasoning.

LMR-Bench is a benchmark framework designed to systematically evaluate the ability of LLM agents to autonomously reproduce code from leading research papers in language modeling research. The benchmark targets the complex intersection of scientific reasoning, code synthesis, and multi-file repository comprehension, providing a rigorous, real-world testing ground for LLM-driven research automation and code understanding (Yan et al., 19 Jun 2025).

1. Benchmark Definition and Motivation

LMR-Bench addresses a critical gap in LLM evaluation: the capability to accurately reproduce research code from NLP literature based solely on a paper and its corresponding open-source repository. The benchmark is structured around the following scenario: each task supplies an agent with a research paper and a repository in which one or more key functions have been "masked" (removed or obfuscated). The agent must synthesize implementations for these functions using information extracted from both the manuscript and the remainder of the codebase.
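
To make the task setup concrete, the sketch below defines a hypothetical task record. The field names and example paths (`paper_pdf`, `repo_path`, `masked_functions`) are illustrative assumptions for exposition, not the benchmark's actual data schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MaskedFunction:
    """One function removed from the repository that the agent must re-implement."""
    file_path: str        # file inside the repo containing the masked stub
    function_name: str    # name of the function to synthesize
    signature: str        # explicit input/output signature shown to the agent

@dataclass
class ReproductionTask:
    """A single LMR-Bench-style task: one paper, one repo, one or more masked functions."""
    paper_pdf: str                                     # path to the research paper
    repo_path: str                                     # path to the open-source repository
    masked_functions: List[MaskedFunction] = field(default_factory=list)

# Hypothetical example of what a task instance might look like:
task = ReproductionTask(
    paper_pdf="papers/example_paper.pdf",
    repo_path="repos/example_repo",
    masked_functions=[
        MaskedFunction(
            file_path="src/losses.py",
            function_name="contrastive_loss",
            signature="contrastive_loss(z1: Tensor, z2: Tensor, temperature: float) -> Tensor",
        )
    ],
)
```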

The underlying motivation is to measure scientific code reproduction—a task requiring deep comprehension, algorithmic reasoning, and the capacity to synthesize multi-file logic. This goes beyond superficial code generation benchmarks and targets fundamental bottlenecks in scientific automation using LLMs.

2. Dataset Composition and Task Categories

LMR-Bench consists of 28 reproduction tasks derived from 23 peer-reviewed NLP research papers published in top-tier venues over the last five years. Each task is constructed by:

  • Manually mapping high-level components (algorithms, procedures, or model elements) described in the paper to their implementation in the codebase.
  • Refactoring the relevant functions into self-contained forms with explicit input and output signatures to minimize extraneous dependencies (an illustrative masked stub is sketched after this list).

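The following is a hypothetical masked stub of the kind an agent might receive after this refactoring; the function name, module, and docstring are illustrative and not taken from any specific benchmark task.

```python
import torch

def contrastive_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Compute the contrastive training objective described in the paper.

    Args:
        z1: [batch, dim] embeddings of the first view.
        z2: [batch, dim] embeddings of the second view.
        temperature: softmax temperature.

    Returns:
        Scalar loss tensor.
    """
    # Implementation removed for the benchmark; the agent must reconstruct it
    # from the paper and the surrounding repository.
    raise NotImplementedError
```
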
The coverage spans nine foundational research areas, including:

  • Training objectives and optimization methods
  • Prompt engineering and instruction tuning
  • Neural network architectures (including transformer variants)
  • Representation learning mechanisms
  • Evaluation metrics and data pipeline components

This diversity ensures that the benchmark probes a broad spectrum of code understanding and synthesis challenges, from mathematical module recreation to the integration of custom neural layers and dataset loaders.

3. Evaluation Protocols and Metrics

LMR-Bench implements a dual evaluation paradigm:

  1. Unit-Test Based Evaluation: Each predicted function is inserted into the repository and executed in a dedicated Docker environment against problem-specific test cases. A task is marked "pass" only if all tests are satisfied (a minimal harness sketch follows this list). Formally,

$$\text{Accuracy} = \frac{\#\,\text{functions passing all tests}}{\#\,\text{total functions}}$$

  2. LLM-as-a-Judge: Generated code is compared against the gold implementation by an independent LLM judge (a prompt sketch appears at the end of this section). Outputs are classified as:
    • Logically Incorrect
    • Logically Correct but Incorrectly Implemented
    • Completely Correct

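A minimal sketch of how the unit-test protocol and accuracy metric could be scripted is shown below; the Docker image name, directory layout, and test command are assumptions for illustration, not the benchmark's actual harness.

```python
import subprocess
from pathlib import Path

def run_task_tests(repo_dir: Path, test_path: str, image: str = "lmr-bench-env") -> bool:
    """Run a task's unit tests inside a Docker container; True only if every test passes."""
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "-v", f"{repo_dir.resolve()}:/workspace",   # mount the repo with the predicted function inserted
            image,
            "pytest", "-q", f"/workspace/{test_path}",  # problem-specific test cases
        ],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0  # pytest exits 0 only when all tests pass

def accuracy(results: list[bool]) -> float:
    """Accuracy = (# functions passing all tests) / (# total functions)."""
    return sum(results) / len(results) if results else 0.0

# Example: three predicted functions, one of which passes all of its tests.
# print(accuracy([True, False, False]))  # -> 0.333...
```
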
Experiments cover two settings: "standard prompting," in which the code synthesis prompt is carefully engineered, and an "LLM agent" setting, in which the agent is given the complete file structure and the paper PDF, simulating end-to-end automated research code reproduction.
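
For the three-way judge classification above, a judging step might look like the sketch below; the prompt wording, label strings, and the `call_judge_model` helper are assumptions, since the exact judge prompt is not reproduced here.

```python
from enum import Enum

class JudgeLabel(str, Enum):
    LOGICALLY_INCORRECT = "Logically Incorrect"
    LOGICALLY_CORRECT_BAD_IMPL = "Logically Correct but Incorrectly Implemented"
    COMPLETELY_CORRECT = "Completely Correct"

JUDGE_PROMPT = """You are reviewing a candidate implementation of a research algorithm.

Gold implementation:
{gold}

Candidate implementation:
{candidate}

Classify the candidate as exactly one of:
- Logically Incorrect
- Logically Correct but Incorrectly Implemented
- Completely Correct
Answer with the label only."""

def judge(gold: str, candidate: str, call_judge_model) -> JudgeLabel:
    """Ask an independent LLM (passed in as `call_judge_model`) to classify the candidate."""
    reply = call_judge_model(JUDGE_PROMPT.format(gold=gold, candidate=candidate)).strip()
    return JudgeLabel(reply)  # raises ValueError if the reply is not one of the three labels
```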

4. Experimental Results and Analysis

Empirical results reveal persistent limitations in the scientific reasoning and code synthesis capabilities of current state-of-the-art LLMs:

  • Unit test accuracy across leading models (e.g., GPT-4 variants, o4-mini) remains in the 39%–43% range for complete passes.
  • LLM-judge assessment further shows that while a minority of outputs are "completely correct," many are only "logically correct" or fail for subtle implementation issues.
  • Comparison of prompting strategies indicates that although agents may appear to generate more "paper-faithful" code, end-to-end performance is hampered by failure modes such as incomplete extraction of algorithmic detail, brittle handling of repository dependencies, and cross-file linkage errors.
  • Statistical regression analysis demonstrates that structural factors such as repository organization, average directory depth, and branching factor are predictive of code reproduction success (a regression sketch appears at the end of this section).

This quantitative and qualitative evidence exposes significant shortcomings in the autonomy of LLMs for scientific code tasks, especially when code requires integrating information spread across text and multi-file repositories.
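
The structural-factor analysis can be illustrated with a simple logistic regression over repository features; the feature names and data below are placeholders, not the paper's actual regression setup or results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder features per task: [avg directory depth, branching factor, file count]
X = np.array([
    [2.0, 3.1, 14],
    [4.5, 6.0, 60],
    [3.0, 4.2, 25],
    [5.1, 7.3, 88],
])
# Placeholder outcomes: 1 = all unit tests passed, 0 = failed
y = np.array([1, 0, 1, 0])

model = LogisticRegression().fit(X, y)
print(dict(zip(["depth", "branching", "files"], model.coef_[0])))
```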

5. Core Challenges: Scientific Reasoning and Repository Synthesis

LMR-Bench demonstrates several key obstacles to reliable scientific code reproduction:

  • Mathematical/Algorithmic Extraction: LLMs frequently struggle to extract concrete, unambiguous algorithmic steps from the often abstract, high-level descriptions typical in research papers.
  • Detailed Implementation: Translation of algorithmic intent to code, especially with research-specific edge cases, is error-prone. LLMs may produce skeletons or templates but miss crucial input validation, shape handling, or numerical details (an illustrative pitfall is sketched at the end of this section).
  • Inter-File and Contextual Reasoning: Repository-level integration, in which complete functionality depends on interdependent files (e.g., model, dataset, utility modules), remains a challenging bottleneck due to limited context retrieval and weak multi-step reasoning.

These challenges directly impact reproducibility rates and indicate that current models do not yet achieve reliable paper-to-code synthesis for complex scientific projects.
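
As an illustration of the kind of detail that separates "logically correct" from "completely correct" code (not an example taken from the benchmark), consider a softmax that matches the textbook formula but is numerically unstable:

```python
import numpy as np

def softmax_naive(x: np.ndarray) -> np.ndarray:
    # Matches the mathematical definition, but overflows for large logits.
    e = np.exp(x)
    return e / e.sum()

def softmax_stable(x: np.ndarray) -> np.ndarray:
    # Subtracting the max leaves the result mathematically unchanged
    # while keeping the exponentials in a safe numerical range.
    e = np.exp(x - x.max())
    return e / e.sum()

print(softmax_naive(np.array([1000.0, 0.0])))   # [nan, 0.] due to overflow
print(softmax_stable(np.array([1000.0, 0.0])))  # [1., 0.]
```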

6. Impact, Limitations, and Future Directions

LMR-Bench provides a robust measurement standard that makes several research avenues explicit:

  • Automatic Data Scaling: The current benchmark relies on manual curation of code and gold implementations. The authors argue for increased automation and semi-automation in data construction to reduce overhead and increase coverage.
  • Deeper Semantic Parsing: Improved capabilities in parsing papers—including accurate extraction of mathematical notation, complex equations, and layout—are critical for further advances.
  • Enhanced Agent Design: Improved context retrieval, iterative solution refinement, and dynamic multi-file code integration are highlighted as research priorities (a refinement-loop sketch follows this list).
  • Fine-Grained Error Attribution: Richer evaluation strategies, possibly with dynamic policy revision and reranking, will be needed to distinguish between errors arising from semantic misunderstanding and implementation defects.
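
One way to realize the iterative refinement direction is a generate-test-refine loop like the sketch below; `generate_patch`, `apply_patch`, and `run_tests` are assumed helper callables supplied by the caller, not components of LMR-Bench itself.

```python
def reproduce_with_refinement(task, generate_patch, apply_patch, run_tests, max_rounds: int = 3):
    """Iteratively generate a candidate implementation, run the task's unit tests,
    and feed the failure log back to the model until the tests pass or rounds run out."""
    feedback = ""
    for round_idx in range(max_rounds):
        patch = generate_patch(task, feedback)   # LLM proposes an implementation
        apply_patch(task.repo_path, patch)       # insert it into the masked repository
        passed, log = run_tests(task)            # execute the Dockerized unit tests
        if passed:
            return patch, round_idx + 1
        feedback = log                           # next round sees the failure output
    return None, max_rounds
```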

LMR-Bench is positioned to be a key resource for tracking progress toward automated, reproducible research and clarifies the state of scientific reasoning in modern LLMs.

7. Summary Table: Benchmark Structure

| Component | Quantity / Type | Role |
|---|---|---|
| Papers | 23 (NLP, top-tier venues) | Source of algorithmic tasks |
| Reproduction Tasks | 28 | Code function synthesis benchmarks |
| Categories | 9 (objectives, prompts, architectures, …) | Coverage of core research challenges |
| Evaluation Methods | Unit test, LLM-as-a-judge | Pass/fail + fine-grained correctness |
| Repository Form | Masked functions, Dockerized repositories | End-to-end dependency assessment |

This structure ensures that LMR-Bench is comprehensive, challenging, and reflective of real scientific engineering requirements.


By establishing a challenging and multi-faceted benchmark for code reproduction grounded in actual research output, LMR-Bench enables precise quantification and systematic identification of gaps in the synthesis and reasoning abilities of LLMs, setting the stage for targeted advancements in scientific AI.
