ConvCodeWorld: Benchmarking Conversational Code Generation in Reproducible Feedback Environments

Published 27 Feb 2025 in cs.SE, cs.AI, and cs.CL | (2502.19852v1)

Abstract: LLMs have proven invaluable for code generation, particularly in interactive settings. However, existing code generation benchmarks fail to capture the diverse feedback encountered in multi-turn interactions, limiting our ability to evaluate LLMs in these contexts. To address this gap, we present a set of novel benchmarks that explicitly model the quality of feedback provided to code generation LLMs. Our contributions are threefold: First, we introduce CONVCODEWORLD, a novel and reproducible environment for benchmarking interactive code generation. CONVCODEWORLD simulates 9 distinct interactive code generation scenarios while systematically combining three types of feedback: (a) compilation feedback; (b) execution feedback with varying test coverage; (c) verbal feedback generated by GPT-4o with different levels of expertise. Second, we introduce CONVCODEBENCH, a fast, static version of benchmark that uses pre-generated feedback logs, eliminating the need for costly dynamic verbal feedback generation while maintaining strong Spearman's rank correlations (0.82 to 0.99) with CONVCODEWORLD. Third, extensive evaluations of both closed-source and open-source LLMs including R1-Distill on CONVCODEWORLD reveal key insights: (a) LLM performance varies significantly based on the feedback provided; (b) Weaker LLMs, with sufficient feedback, can outperform single-turn results of state-of-the-art LLMs without feedback; (c) Training on a specific feedback combination can limit an LLM's ability to utilize unseen combinations; (d) LLMs solve problems in fewer turns (high MRR) may not solve as many problems overall (high Recall), and vice versa. All implementations and benchmarks will be made publicly available at https://huggingface.co/spaces/ConvCodeWorld/ConvCodeWorld

Abstract PDF Upgrade to Chat

Summary

The paper establishes that multi-turn feedback—combining compilation, execution, and verbal cues—is crucial for realistic code generation evaluation.
It demonstrates that weaker LLMs can rival state-of-the-art performance through iterative refinement with expert-level feedback.
The study reveals a trade-off between rapid solution generation (MRR) and overall task completion (Recall), informing model selection based on specific deployment needs.

Benchmarking Conversational Code Generation with ConvCodeWorld

Introduction

The ConvCodeWorld benchmark represents a systematic advance in evaluating interactive code generation using LLMs. Traditional code generation evaluation protocols primarily target single-turn performance and fail to reflect the diversity and complexity of real-world human-AI code development. ConvCodeWorld addresses this gap through a comprehensive, reproducible experimental environment that enables controlled yet realistic experimentation over the feedback channel—considering compilation, execution, and critical verbal modalities, both at novice and expert levels. Complementing this, ConvCodeBench offers a static and cost-effective alternative for scalable, high-fidelity benchmarking, validated by extremely high rank correlations (0.82–0.99) with ConvCodeWorld.

Figure 1: ConvCodeWorld and ConvCodeBench: dynamic and static benchmark frameworks enabling scalable, feedback-diverse code generation evaluation.

Feedback Combinatorics: Simulation of Real-world Pair Programming

ConvCodeWorld formalizes interactive code generation as a sequence of (problem, solution, feedback) tuples, iterated over at most 10 turns. Each step incorporates feedback in three orthogonal axes:

Compilation Feedback ( $f_c$ ): Syntax/type error diagnostics from the compiler, offering minimal and always-available guidance.
Execution Feedback: Either partial ( $f_e$ ) or full ( $f^*$ ) test coverage. The test suite's coverage is carefully controlled for branch comprehensiveness, isolating test generalization versus test utilization ability.
Verbal Feedback ( $f_v$ ): Synthesized using a powerful LLM (GPT-4o), with differentiation between novice (potentially noisy and paraphrastic) and expert (targeted, reference-based analysis) modalities.

Combining these axes, ConvCodeWorld spans 9 distinct feedback settings, each corresponding to a plausible real-world scenario (e.g., novice feedback only, full test cases with expert coaching, etc.), allowing targeted model dissection and robustness stress-testing.

ConvCodeBench: Efficient and Reproducible Static Benchmark

Dynamic LLM-in-the-loop benchmarks incur significant monetary and latency costs, especially at scale or under closed API constraints. ConvCodeBench mitigates these by leveraging pre-generated feedback logs with a fixed reference model (CodeLlama-7B-Instruct), covering all settings where LLM participation is non-trivial (i.e., those involving verbal feedback). Extensive experiments demonstrate that rankings and metric differentials (MRR, Recall) on ConvCodeBench closely track those of ConvCodeWorld, validating its use as a reliable and efficient proxy, with Spearman’s $\rho$ routinely in the 0.9–0.99 range.

Figure 2: Correlation between MRR on ConvCodeBench and ConvCodeWorld across feedback settings—demonstrating high-fidelity proxy behavior.

Experimental Protocol and Metrics

The authors employ an extensive experimental grid:

Models: 21 open- and closed-source LLMs, including GPT-4o, Llama-3.1-70B, DeepSeek-Coder-V2, ReflectionCoder-DS, and R1-Distill models.
Dataset: An extended version of BigCodeBench-Full-Instruct (1,140 Python tasks, high branch/test coverage).
Evaluation: Core focus on Pass@1 at each turn, with overall assessment via Mean Reciprocal Rank (MRR, efficiency of problem completion in fewer turns) and Recall (task completion rate within 10 turns). These two metrics provide insight into both the model’s proficiency per attempt and total coverage across diverse benchmarks.

Key Results

Impact of Feedback Diversity

Model performance is highly sensitive to the specific feedback scenario. Ranking order among LLMs varies significantly depending on the available signals (compilation, execution, verbal). This heterogeneity directly challenges the validity of single-turn or single-feedback-point evaluations.

Figure 3: Iterative Pass@1 of multiple LLMs on ConvCodeWorld, for various feedback combinations.

Exploiting Feedback: Weaker LLMs Surpass SOTA with Sufficient Feedback

A critical empirical claim is that relatively weak LLMs (e.g., DeepSeek-Coder-6.7B, CodeQwen1.5-7B) can achieve or even exceed the single-turn performance of state-of-the-art LLMs (GPT-4o, GPT-4-Turbo) via multi-turn refinement, provided they have access to rich, high-quality feedback (in particular, expert-level verbal feedback in addition to test feedback).

Generalization Limits

Specialization to a single feedback regime during fine-tuning harms adaptability to new ones. The ReflectionCoder-DS family, explicitly optimized for a single feedback setting, fails to generalize as effectively when evaluated outside of its training scenario. This suggests that coverage of feedback axes during training is critical for robust model deployment in the wild.

MRR-Recall Tradeoff

The study finds a notable MRR-Recall trade-off: LLMs able to complete tasks rapidly (high MRR) are not always those that maximize the number of tasks solved overall (high Recall), and vice versa. Model selection should thus be feedback-regime and downstream-deployment-objective aware.

Figure 4: LLM-specific iterative Pass@1 trajectories, highlighting model-specific feedback utilization characteristics.

Cost and Reproducibility

ConvCodeWorld can scale benchmarking efficiently: LLM-simulated feedback (via strong expert LLMs) costs approximately 1.5% of equivalent human annotation while maintaining high reproducibility and systematization. ConvCodeBench further reduces costs by eliminating dynamic LLM rollout, yet maintains robust rank fidelity with the dynamic benchmark.

Implications and Future Outlook

The paradigm established by ConvCodeWorld makes several clear contributions:

Benchmark Design: Demonstrates that a high-fidelity code generation benchmark must account for the combinatorics of feedback, especially verbal feedback from simulated users of varying expertise.
Model Development Guidance: Encourages future code LLMs to be robust not just to program synthesis but to the full diversity of feedback channels encountered in real developer environments, likely incentivizing multitask/few-shot and feedback-adaptive model designs.
Evaluation Protocols: ConvCodeBench's static architecture points to a future where benchmarks can efficiently scale to hundreds of models or RL policy rollouts without cost prohibitions or API bottlenecks.
Limitations and Future Directions: Feedback “utility curves” and model adaptability to out-of-domain feedback types remain open research problems. Further, as LLMs increasingly serve as both code generators and feedback providers, co-evolutionary approaches and adversarial benchmarks (e.g., noisy/hallucinated or malicious verbal feedback) could reveal model limitations and align with real-world robustness objectives.

Conclusion

ConvCodeWorld and ConvCodeBench collectively advance the methodological rigor with which multi-turn, feedback-driven conversational code generation is evaluated. The diversity and granularity of feedback supported by this benchmark enable robust and nuanced model comparison, reflecting requirements for real-world deployment. High correlation between the static and dynamic benchmarks indicates that practical, cost-effective large-scale evaluation is within reach, even for the broader research community. This framework sets a new standard for multi-turn code generation evaluation, and calls for future models explicitly adapted to feedback-heterogeneous scenarios.

Markdown Report Issue