SR-Eval Benchmark: Multi-turn Code Generation

Updated 27 September 2025
  • SR-Eval Benchmark is a multi-turn code generation evaluation framework that models iterative developer interactions through staged requirement refinement.
  • It employs a multi-agent pipeline with requirement decomposition and semantic-aware test case generation to assess LLMs on both function-level and repository-level tasks.
  • Empirical results reveal challenges in context management and error propagation, underscoring the need for improved prompt strategies and adaptive reasoning.

SR-Eval Benchmark is a multi-turn code generation evaluation framework designed to assess LLMs under stepwise requirement refinement. Unlike conventional, single-turn code completion benchmarks, SR-Eval formalizes the code generation task as a progressive, iterative workflow spanning both function-level and repository-level programming in Python and Java. Its methodology integrates a multi-agent requirement decomposition pipeline, semantic-aware discriminative test case generation, and three prompting strategies to simulate realistic developer interactions. Empirical results reveal significant gaps between current LLM capabilities and human-like iterative development, motivating research into more robust context management and adaptive prompt techniques (Zhan et al., 23 Sep 2025).

1. Motivation and Scope

SR-Eval originates from the observation that prevailing code generation benchmarks, such as HumanEval and MBPP, insufficiently capture the iterative nature of real-world software development. In practice, developer requirements evolve incrementally, necessitating repeated code editing and adaptation over multiple interaction rounds. SR-Eval addresses this unmet need by reframing code generation evaluation as a stepwise, multi-turn process, focusing on the ability of LLMs to handle changing requirements, preserve context, and produce consistently correct outputs across progressive modifications.

The benchmark covers:

  • Function-level tasks: iterative modification of single functions.
  • Repository-level tasks: multi-file, context-rich programming scenarios.

SR-Eval comprises 443 multi-turn tasks and 1,857 total evaluation points, split across both granular and global code contexts.

2. Requirement Generation Pipeline

SR-Eval employs a multi-agent system to simulate the creation and evolution of software requirements, generating authentic multi-turn interaction traces that are not available in existing public datasets. The requirement generation pipeline involves three agents:

  • Decomposer: Splits complex requirements into a core (seed) requirement and several subsequent refinement steps, mimicking the logical progression of real-world development.
  • Evaluator: Assesses each sub-requirement for testability, scenario authenticity, completeness, and distinctiveness. Only well-posed and discriminative steps are retained.
  • Analyzer: Organizes the finalized requirements into a directed acyclic graph (DAG) to maintain a valid execution order for code and test generation.

This framework reconstructs iterative interaction chains by expanding otherwise static benchmarks into multi-stage problem settings that force LLMs to engage with temporally evolving specifications.
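
To make the pipeline concrete, the following Python sketch shows one way the three agents described above could be composed; the agent objects, their method names (split, is_well_posed, order_as_dag), and the RequirementStep layout are hypothetical stand-ins for illustration, not the paper's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class RequirementStep:
    """One refinement step in the requirement DAG (illustrative layout)."""
    step_id: int
    text: str
    depends_on: list[int] = field(default_factory=list)  # DAG edges to earlier steps


def build_requirement_dag(raw_requirement: str, decomposer, evaluator, analyzer) -> list[RequirementStep]:
    """Sketch of the three-agent pipeline: decompose, filter, order."""
    # Decomposer: split the original task into a seed requirement plus
    # several incremental refinement steps.
    candidate_steps = decomposer.split(raw_requirement)

    # Evaluator: keep only steps that are testable, realistic, complete,
    # and distinct from the steps retained so far.
    kept_steps = [step for step in candidate_steps if evaluator.is_well_posed(step)]

    # Analyzer: arrange the surviving steps into a directed acyclic graph
    # so code and test generation follow a valid execution order.
    return analyzer.order_as_dag(kept_steps)
```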

3. Semantic-Aware Test Case Generation

To enforce discriminative, turn-specific evaluation, SR-Eval introduces a semantic-aware test case generation component. The process, formalized in Algorithm 1 (“Semantic-aware Discriminative Test Case Generation”), operates as follows:

  1. Correctness Validation: Candidate code is compiled/executed in a sandbox (for function-level tasks) or within the repository context (for multi-file tasks) and validated against reference test cases.
  2. Distinctiveness Validation: For each requirement step, test cases are generated to specifically detect differences from previous turns, avoiding false positives arising from unchanged behavior or semantically unrelated changes.
  3. Semantic Alignment: An auxiliary LLM-based evaluator verifies that newly issued test cases accurately capture the newly specified requirements.

Additional validation is standardized through Algorithm 2 (“Unified Check”). This module ensures that both objective correctness and subjective semantic intent are maintained across all interaction turns.
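
A minimal sketch of that per-turn checking loop is shown below, assuming hypothetical sandbox and llm_judge helpers and illustrative field names; it mirrors the correctness, distinctiveness, and semantic-alignment checks but is not a transcription of the paper's Algorithm 1.

```python
def generate_discriminative_tests(turn, previous_turn, sandbox, llm_judge):
    """Per-turn test filtering loop (illustrative, not the paper's Algorithm 1)."""
    accepted = []
    for test in turn.candidate_tests:
        # Correctness validation: the current turn's reference solution must
        # pass the test when executed in the sandbox or repository context.
        if not sandbox.passes(turn.reference_solution, test):
            continue

        # Distinctiveness validation: the previous turn's reference solution
        # should fail the test, so the test detects the newly added behavior
        # rather than something that was already true.
        if previous_turn is not None and sandbox.passes(previous_turn.reference_solution, test):
            continue

        # Semantic alignment: an auxiliary LLM-based judge confirms the test
        # targets the newly specified requirement, not an incidental change.
        if not llm_judge.aligns_with(test, turn.new_requirement):
            continue

        accepted.append(test)
    return accepted
```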

4. Evaluation Protocol

SR-Eval benchmarks eleven representative LLMs under three prompting strategies, each designed to emulate a different user–model interaction modality (a prompt-assembly sketch combining these strategies with the context settings follows the lists below):

  • Full History: The prompt contains the complete conversation history, including every previous instruction and model output.
  • Code Edit: Only the previously generated code and the immediate new requirement are provided, discarding historical instructions.
  • Cumulative Instruction: Instructions are accumulated across turns, omitting intermediate model-generated code.

Evaluation proceeds under two main context settings:

  • Basic Setting: History is constructed from prior model-generated outputs, potentially propagating errors from earlier rounds.
  • Golden Setting: History uses ground-truth reference implementations, preventing error accumulation.
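
The sketch below illustrates how a single turn's prompt could be assembled from these strategies and history settings; the turn record fields (instruction, model_code, reference_code) and the prompt wording are assumptions for illustration only.

```python
def build_prompt(history, new_instruction, strategy, setting):
    """Assemble one turn's prompt under a given strategy and history setting."""
    # Basic setting reuses the model's own earlier outputs (errors can
    # propagate); golden setting substitutes ground-truth reference code.
    def code_of(turn):
        return turn.reference_code if setting == "golden" else turn.model_code

    if strategy == "full_history":
        # Every previous instruction together with its code output.
        parts = [f"Instruction: {t.instruction}\nCode:\n{code_of(t)}" for t in history]
        parts.append(f"Instruction: {new_instruction}")
        return "\n\n".join(parts)

    if strategy == "code_edit":
        # Only the most recent code plus the new requirement.
        last_code = code_of(history[-1]) if history else ""
        return f"Current code:\n{last_code}\n\nNew requirement: {new_instruction}"

    if strategy == "cumulative_instruction":
        # All instructions so far, with no intermediate model-generated code.
        instructions = [t.instruction for t in history] + [new_instruction]
        return "Requirements so far:\n" + "\n".join(f"- {r}" for r in instructions)

    raise ValueError(f"Unknown prompting strategy: {strategy}")
```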

Primary metrics are:

  • pass@1: The proportion of tasks for which the candidate code passes all test cases on the first attempt.
  • Per-Turn Accuracy: Fraction of turns solved correctly.
  • Complete Rate (CR): Proportion of multi-turn tasks completed successfully throughout all steps.
  • Average Token Cost (ATC): Measures contextual efficiency per prompting strategy.
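
As a concrete reading of these metrics, the following sketch computes per-turn accuracy, complete rate, and average token cost from hypothetical per-turn result records; with a single sample per turn, pass@1 coincides with per-turn accuracy in this simplified form, and the record fields are assumptions rather than the benchmark's actual data format.

```python
def score_benchmark(results):
    """Compute headline metrics from per-turn records (illustrative layout).

    `results` maps task id -> list of per-turn records with boolean `passed`
    and integer `prompt_tokens` fields.
    """
    total_turns = sum(len(turns) for turns in results.values())
    solved_turns = sum(t.passed for turns in results.values() for t in turns)
    total_tokens = sum(t.prompt_tokens for turns in results.values() for t in turns)

    # Per-turn accuracy: fraction of individual turns whose generated code
    # passes that turn's discriminative test suite.
    per_turn_accuracy = solved_turns / total_turns

    # Complete rate (CR): fraction of multi-turn tasks in which every turn
    # passes, i.e. the whole iterative task is finished successfully.
    complete_rate = sum(all(t.passed for t in turns) for turns in results.values()) / len(results)

    # Average token cost (ATC): mean prompt size per turn, used to compare
    # the contextual efficiency of the prompting strategies.
    avg_token_cost = total_tokens / total_turns

    return {
        "per_turn_accuracy": per_turn_accuracy,
        "complete_rate": complete_rate,
        "avg_token_cost": avg_token_cost,
    }
```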

5. Empirical Findings

Results show:

  • The best-performing LLM achieves a 22.67% completion rate on function-level tasks and 20.00% on repository-level tasks under basic (“realistic”) conditions.
  • Performance consistently improves in the golden setting, where prior errors do not propagate, indicating LLMs are sensitive to context contamination.
  • Increasing model scale leads to moderate gains in accuracy, but explicit reasoning-oriented models can exhibit “overthinking,” degrading performance in some iterative scenarios.
  • Prompting strategies significantly affect outcomes; the Code Edit approach generally offers the most efficient trade-off between accuracy and token cost.

Observed limitations include:

  • Difficulties in managing evolving, complex context across turns.
  • Accumulation and propagation of errors as requirements change.
  • Challenges in discriminative test case construction for nuanced requirement updates.

6. Benchmark Construction and Technical Details

Key technical aspects:

  • Both Python and Java are covered, enabling assessment across different programming ecosystems.
  • Each multi-turn task is organized as per-turn triplets: the next instruction, a reference solution, and a discriminative test suite (a data-layout sketch closes this section).
  • The pipeline employs algorithms that automate both requirement decomposition and rigorous test case creation, ensuring reproducibility and consistent evaluation per turn.
  • The semantic-aware mechanism is central to discriminative verification, which enhances the fidelity of iterative benchmarking.

No complex mathematical formulas are introduced, but evaluation flows are documented via algorithmic pseudocode in the paper.
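
To make the per-turn triplet layout concrete, here is a minimal data-structure sketch; the class and field names are illustrative assumptions, not the released dataset schema.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One refinement round of a multi-turn task."""
    instruction: str          # the next requirement shown to the model
    reference_solution: str   # ground-truth code for this turn
    test_suite: list[str]     # discriminative tests specific to this turn


@dataclass
class SREvalTask:
    """A multi-turn SR-Eval task; 443 tasks expand to 1,857 evaluation points."""
    task_id: str
    language: str             # "python" or "java"
    level: str                # "function" or "repository"
    turns: list[Turn]
```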

7. Future Research Directions

SR-Eval points toward necessary innovations in code-generation evaluation and LLM design:

  • Integration of self-correcting architectures and code-review mechanisms to handle error accumulation.
  • Advanced context management, such as pruning, summarization, or selective history inclusion to reduce token overhead and context drift.
  • Explicit, task-aware activation of reasoning modules only for suitable iterative scenarios.
  • Enhanced techniques for generating discriminative, semantically aligned test cases.

The benchmark establishes a new standard for multi-turn, iterative code generation assessment, bridging the gap between academic static benchmarks and the context-rich, evolving workflows characteristic of practical software engineering. Further development of adaptive LLM-based approaches is required to achieve robust, human-like performance in iterative coding environments (Zhan et al., 23 Sep 2025).
