LMR-Bench: Evaluating Code Reproduction by LLMs

Updated 6 October 2025
  • LMR-Bench is a systematic benchmark that evaluates LLM agents' ability to reproduce code from academic research papers using real repository challenges with masked functions.
  • The framework employs both automated unit tests and LLM-based code reviews to rigorously assess functional correctness and logical consistency.
  • By simulating complete research workflows, LMR-Bench identifies current limitations and guides future improvements in LLM architectures and prompt engineering.

LMR-Bench is a systematic benchmark framework designed to evaluate the ability of LLM agents to autonomously reproduce code from academic research papers in the language modeling domain. The framework was developed to address a critical gap in assessing LLM agents' capacity for scientific code synthesis, particularly in settings that demand complex reasoning, algorithmic comprehension, and multi-file repository navigation. LMR-Bench consists of a curated set of code reproduction tasks extracted from influential NLP research papers, each presenting masked functions within real repository environments and accompanied by detailed instructions. Evaluation of agent outputs is performed through both automated unit tests and LLM-based code reviews to provide a rigorous assessment of functional and logical correctness.

1. Design Principles and Benchmark Structure

LMR-Bench was motivated by the observation that, while LLM agents have demonstrated utility across a range of program synthesis and scientific tasks, their proficiency at code reproduction from actual research publications—especially under constraints of incomplete repository context—remains insufficiently characterized.

The benchmark comprises 28 distinct code reproduction tasks, drawn from 23 original papers published in leading NLP venues over the preceding five years. Each task falls into one of nine research categories, reflecting pivotal areas in language modeling such as generative modeling, reinforcement learning, prompt engineering, training objectives, decoding methods, and others.

For each reproduction task, annotators provide:

  • The full research paper or parsed algorithmic description.
  • A repository in which one or more critical functions have been masked (removed or obfuscated).
  • Explicit instructions detailing the requirements for each masked function, including cross-file dependencies and design intent.

This configuration creates a realistic scenario in which an LLM agent must integrate information from the paper, traverse repository structure, and reconstruct the missing code logic so that the overall repository remains functional.
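
To make the setup concrete, the following is a minimal sketch of how a masked function might appear inside a benchmark repository. The file path, function name, and signature are invented for illustration and are not taken from LMR-Bench itself.

# repo/src/decoding.py (hypothetical file in a masked repository)
import torch

def contrastive_decode(logits_expert: torch.Tensor,
                       logits_amateur: torch.Tensor,
                       alpha: float = 0.1) -> torch.Tensor:
    """Return adjusted next-token scores as specified in the paper's method section.

    The original body has been removed; the agent must reconstruct it from the
    paper text and the surrounding repository, including any cross-file helpers.
    """
    raise NotImplementedError("Masked for the reproduction task")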

2. Methodology: Data Annotation and Experimental Protocol

LMR-Bench relies on rigorous data preparation performed by experienced NLP annotators. Candidate papers for the benchmark are selected based on three critical criteria:

  • The presence of a methodological contribution, as opposed to survey or expository essays.
  • An official, reproducible code repository with resolved implementation issues.
  • Availability of a well-documented algorithmic component suitable for functional isolation.

After selection, a specific block or algorithm is mapped from the paper to its corresponding code section. Non-modular code is refactored into self-contained functions, and all dependencies, both intra- and inter-file, are documented for agent access. The target code blocks are then masked, and the reference (“golden”) implementations are stored for later evaluation and unit testing.
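
The exact annotation tooling is not described in detail, but the bookkeeping can be pictured roughly as follows; the manifest schema, directory layout, and function names here are assumptions made only for illustration.

import json
import shutil
from pathlib import Path

def register_masked_task(repo_dir: Path, rel_path: str, func_name: str,
                         golden_dir: Path, manifest_path: Path) -> None:
    """Store the untouched file as the golden reference and record task metadata."""
    golden_dir.mkdir(parents=True, exist_ok=True)
    # Keep the original ("golden") implementation for evaluation and unit testing.
    shutil.copy(repo_dir / rel_path, golden_dir / Path(rel_path).name)

    entry = {
        "file": rel_path,
        "function": func_name,
        "dependencies": [],  # filled in by the annotator: intra- and inter-file
    }
    manifest = json.loads(manifest_path.read_text()) if manifest_path.exists() else []
    manifest.append(entry)
    manifest_path.write_text(json.dumps(manifest, indent=2))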

Benchmark execution is performed in two primary settings:

  1. Standard Prompting: The agent receives the masked code, relevant paper excerpts (structured as JSON), supporting code snippets, and a task description via a fixed prompt. Context is constrained by the model's token ceiling, so long papers must be subsampled (a sketch of such a payload appears after this list).
  2. LLM Agent Setting: The agent operates on an entire project directory containing the full repository and paper PDF, interacting through file queries, code execution, and information retrieval (e.g., using frameworks such as OpenHands), simulating end-to-end research workflows.
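
A rough sketch of the kind of payload assembled in the standard prompting setting is shown below; the field names and prompt wording are assumptions for illustration, not the benchmark's actual format.

import json

task_payload = {
    "paper_excerpts": ["...method section text parsed from the paper..."],
    "masked_function": "def contrastive_decode(logits_expert, logits_amateur, alpha=0.1):",
    "supporting_snippets": ["...helper functions copied from other repository files..."],
    "instructions": "Implement the masked function so that existing callers keep working.",
}

prompt = (
    "You are reproducing code from a research paper.\n"
    "Complete the masked function described in the JSON task below.\n\n"
    + json.dumps(task_payload, indent=2)
)
# `prompt` is then sent to the model under evaluation; long papers must be
# truncated or subsampled to respect the model's context window.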

3. Evaluation Metrics and Assessment Methodology

LMR-Bench employs two complementary evaluation strategies, both automated and scalable:

  • Unit Test Evaluation: For each masked function, human experts design a targeted suite of approximately three rigorous unit tests. Generated code is executed within an isolated Docker container that replicates the original runtime environment. The primary quantitative metric is unit-test accuracy, calculated as the proportion of tasks for which the model-generated function passes all tests:

\text{accuracy} = \frac{L}{N}

where L is the number of functions passing all tests and N is the total number of attempted reproduction tasks.
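
A minimal sketch of this evaluation loop, assuming a prebuilt Docker image and a per-task directory layout (both placeholders), might look like:

import subprocess
from pathlib import Path

def run_task_tests(image: str, task_dir: str) -> bool:
    """Return True iff every unit test for one task passes inside an isolated container."""
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{Path(task_dir).resolve()}:/workspace", "-w", "/workspace",
        image,
        "pytest", "-q", "tests/",
    ]
    return subprocess.run(cmd).returncode == 0

# Aggregate the pass/fail outcomes into unit-test accuracy (L passing tasks out of N).
outcomes = [run_task_tests("lmr-bench-env:latest", f"tasks/task_{i:02d}")
            for i in range(1, 29)]
accuracy = sum(outcomes) / len(outcomes)
print(f"unit-test accuracy = {accuracy:.2%}")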

  • LLM-as-a-Judge Evaluation: In parallel, an LLM is used to classify the correctness of each submission, considering:
    1. Algorithmic logic alignment with the research paper’s specification.
    2. Implementation quality, covering edge case handling, code robustness, and stylistic consistency.

Each submission is categorized as:

  • Logically Incorrect,
  • Logically Correct but Incorrectly Implemented, or
  • Completely Correct.

The judge prompt is standardized to elicit a concise rationale and a category label in JSON format.
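
For instance, the judge's JSON output could be validated with a small parser along these lines; the response shape and category labels are assumptions consistent with the description above, not the benchmark's exact schema.

import json

CATEGORIES = {
    "logically_incorrect",
    "logically_correct_incorrectly_implemented",
    "completely_correct",
}

def parse_judge_response(raw: str) -> dict:
    """Expect a JSON object such as {"rationale": "...", "category": "..."}."""
    verdict = json.loads(raw)
    if verdict.get("category") not in CATEGORIES:
        raise ValueError(f"Unexpected category: {verdict.get('category')!r}")
    return verdict

example = ('{"rationale": "Loss matches the paper but ignores padding masks.", '
           '"category": "logically_correct_incorrectly_implemented"}')
print(parse_judge_response(example)["category"])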

These two axes of assessment—unit tests for functional correctness and LLM-based review for logical/algorithmic soundness—provide robust, multidimensional performance profiles for each LLM agent under evaluation.

4. Experimental Findings and Error Analysis

Empirical results reveal considerable limitations in current state-of-the-art LLM agents, including GPT-4o, GPT-4.1, and o4-mini. The highest observed unit-test accuracy reaches only ~42–43%, with many models failing the majority of tasks. Notably, agent-style repository interaction (via OpenHands) increases the frequency of logically correct implementations but does not necessarily improve pass rates for rigorous functional tests.

Error analysis identifies the following recurrent failure modes:

  • Algorithmic Misinterpretation: Difficulty extracting precise logic from complex paper descriptions, especially for novel or multi-step algorithms.
  • Cross-File Dependency Handling: Inadequate integration of dependencies across repository files, leading to semantically incomplete or brittle code.
  • Robustness Issues: Syntax errors, unhandled edge cases, and misconstrued input/output protocols.
  • Prompt/Context Limitations: Document parsing failures, out-of-context prompts, or restrictive safety policies that preclude code synthesis.
  • Execution Misalignment: Discrepancy between reasoning steps (“think” actions) and code execution, with agent efficacy more closely linked to the ratio of planning to running than to absolute action count.

Action-level logging further reveals that the composition of agent interactions—rather than sheer quantity—serves as a predictor for task fidelity.
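
As a rough illustration of that statistic, the planning-to-execution ratio can be computed from an action trace as follows; the action labels are assumed for this sketch rather than taken from any particular agent framework.

from collections import Counter

def plan_to_run_ratio(actions: list[str]) -> float:
    """Ratio of planning ("think") actions to code-execution ("run") actions in a trace."""
    counts = Counter(actions)
    runs = counts.get("run", 0)
    return counts.get("think", 0) / runs if runs else float("inf")

trace = ["think", "read_file", "think", "edit", "run", "think", "run"]
print(f"plan/run ratio = {plan_to_run_ratio(trace):.2f}")  # 1.50 for this toy trace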

5. Significance, Implications, and Future Directions

LMR-Bench conclusively demonstrates that, despite strong progress in generic program synthesis, contemporary LLM agents retain notable shortcomings in scientific reasoning and nuanced code reproduction. These gaps highlight fundamental challenges in automating reproducibility from research publications—particularly with respect to understanding algorithmic abstractions and implementing them cohesively across distributed and interdependent code bases.

Implications for future LLM research include:

  • Enhanced Architectures and Prompting Regimes: Emphasis on models and interaction protocols that better support multi-file retrieval, memory, and dependency resolution.
  • Benchmark Extension and Automation: Towards scalable, semi-automatic benchmark generation frameworks leveraging annotated repositories and algorithmic descriptions.
  • Iterative Error Correction: Incorporating feedback-driven refinement mechanisms to progress toward complete and functional code synthesis, informed by unit test failures and LLM review.
  • Comprehensive Evaluation: Combining automated, LLM-based, and human-in-the-loop evaluation to establish robust, multi-faceted measures of reproduction fidelity.

This benchmark thereby sets a foundation for the next generation of scientific code reproduction research, underscoring the need for sophisticated, context-aware, and robust LLM agents capable of reliably supporting reproducibility and verification in complex research environments.

6. Benchmark Availability and Usage Protocol

LMR-Bench provides reproducibility support through access to the full set of curation instructions, masked code repositories, and golden references, supporting automated assessment pipelines. Researchers are encouraged to implement both standard prompt-based and repository-level agent workflows, following detailed task and evaluation guidelines.

The framework is positioned to serve as both an empirical testing platform for future agent architectures and a practical diagnostic suite for assessing critical capabilities in LLM-driven scientific research synthesis. Its integration of rigorous annotation, multi-faceted evaluation, and real-world repository challenges represents a significant contribution to methodological rigor in automated code reproduction studies.
