How Many Tries Does It Take? Iterative Self-Repair in LLM Code Generation Across Model Scales and Benchmarks

Published 12 Apr 2026 in cs.SE and cs.AI | (2604.10508v1)

Abstract: LLMs frequently fail to produce correct code on their first attempt, yet most benchmarks evaluate them in a single-shot setting. We investigate iterative self-repair (feeding execution errors back to the model for correction) across seven models spanning three families and both open-weight and proprietary providers: Llama 3.1 8B, Llama 3.3 70B, Llama 4 Scout (MoE, 16 experts), Llama 4 Maverick (MoE, 128 experts), Qwen3 32B, Gemini 2.5 Flash, and Gemini 2.5 Pro. On HumanEval (164 problems) and MBPP Sanitized (257 problems) with up to five attempts, self-repair universally improves pass rates: +4.9 to +17.1 pp on HumanEval and +16.0 to +30.0 pp on MBPP. Gemini 2.5 Flash achieves the highest final pass rates (96.3% HumanEval, 93.8% MBPP). Most gains concentrate in the first two rounds.Error-type analysis shows assertion errors (logical mistakes) are the hardest to repair at ~45%, while syntax and name errors are repaired at substantially higher rates, connecting to broader findings on the limits of LLM self-correction. Prior work found that weaker models fail at self-repair or require fine-tuning; we show that modern instruction-tuned models succeed with prompting alone, even at 8B scale. We also provide the first comparison of dense and MoE architectures for self-repair, and extend the repair-vs-resampling tradeoff analysis to modern models. A prompt ablation reveals chain-of-thought repair yields up to +5.5 pp additional self-repair gain (measured as improvement in repair delta) over minimal prompting for capable models.

Abstract PDF Upgrade to Chat

Authors (1)

Johin Johny Arimbur

Summary

The paper demonstrates that iterative self-repair consistently improves code generation accuracy across diverse LLM architectures, with gains up to +30.0 percentage points.
It quantifies error repair rates by category, revealing that assertion errors are less fixable (~45%) than name errors (~77%), thereby outlining key model limitations.
The study shows that most improvements occur in the first two repair rounds, emphasizing efficiency gains and the added benefits of chain-of-thought prompts in larger models.

Iterative Self-Repair in LLM Code Generation: Scaling, Error Analysis, and Prompting

Introduction and Motivation

This paper presents an exhaustive empirical study of iterative self-repair for LLM code generation across diverse architectures, parameter scales, and benchmarks (2604.10508). The central focus is the effectiveness and characteristics of self-repair—where a model, upon producing a faulty code sample, is prompted to correct itself using execution error feedback and its prior attempt. Evaluation covers seven models from Meta, Alibaba, and Google, including both dense and Mixture-of-Experts (MoE) configurations, on HumanEval and MBPP.

Single-pass pass@1 benchmarks are standard, but they ignore the iterative, error-driven workflow that dominates human programming. This work systematically quantifies (1) how much accuracy can be recovered via iterative self-repair, (2) how efficiency and gains change with model family and scale, and (3) which error types define the practical limits of LLM self-correction.

Experimental Design

Seven models ranging from Llama 3.1 8B through Gemini 2.5 Pro are compared, with key axes including parameter scale, dense vs. MoE architectures, and open-weight vs. proprietary family. HumanEval (n=164) and MBPP Sanitized (n=257) serve as benchmarks, with each sample going through up to five attempts—initial plus four repair rounds.

All models employ greedy decoding for determinism. Error messages, code outputs, and previous attempts are composed into prompts fed to the model for self-repair. Error types are categorized (assertion, syntax, name, type/value, index/key, timeout) to analyze which failures are most/least self-correctable. A prompt ablation investigates the impact of minimal, explain-then-fix, and chain-of-thought (CoT) prompting.

Self-Repair Performance Across Models and Benchmarks

Iterative self-repair yields universal gains for all models, regardless of scale and architecture. On HumanEval, self-repair improvements ( $\Delta$ ) range from +4.9 to +17.1 percentage points (pp), with Gemini 2.5 Flash achieving the highest final pass@1 (96.3%).

Figure 1: Cumulative pass@1 rate on HumanEval across repair rounds for all seven models, with Gemini 2.5 Flash reaching 96.3%; most gains are realized within the first two rounds.

MBPP sees even larger repair gains (up to +30.0 pp for Gemini 2.5 Flash), reflecting more headroom and the impact of error type.

Figure 2: Self-repair gains are consistently larger for MBPP than HumanEval, with Gemini 2.5 Flash outperforming all other models (+30.0 pp on MBPP).

Gemini 2.5 Flash consistently establishes a new ceiling for open-weight and proprietary models, outpacing even larger dense and deeper MoE models. Notably, self-repair works robustly with only prompt-based error feedback at the 8B scale, contrary to prior work that required fine-tuning to enable self-repair for weaker models.

Error Taxonomy and Self-Repair Limits

A critical, novel contribution is the breakdown of error repairability by category. Across all models, assertion errors are the predominant failure mode and are the hardest to repair: only $\sim$ 45% are fixed through iterative self-repair. In contrast, name errors (undefined variables etc.) are repaired at rates $\sim$ 77%.

Figure 3: Distribution of error types at the initial attempt (R₀) on HumanEval for each model.

Figure 4: Repair success rate by error type; assertion errors demonstrate marked resistance to self-repair compared to name and syntax errors.

This suggests fundamental limitations: while LLMs reliably fix syntactic or superficial bugs via error feedback, “silent” logical mistakes that yield assertion errors remain challenging due to insufficient specificity in error signals.

Repair Dynamics and Marginal Returns

The incremental benefit of each repair round exhibits rapidly diminishing returns.

Figure 5: The first repair round always provides the largest marginal gain; additional rounds have sharply diminishing returns across all models.

Across all benchmarks and models, the majority of repairable failures are fixed in the first two rounds (76–95% of total possible gain). For deployment, this implies that allocating budget for two repair iterations captures almost all practical benefit from self-repair.

Prompt Design and Repair Efficacy

Prompt ablation reveals that more sophisticated error feedback strategies—specifically chain-of-thought (CoT)—yield notable additive gains for stronger models. The CoT strategy provides up to +5.5 pp additional self-repair gain on the 70B model relative to a minimal prompt. This effect scales positively with model capacity.

Figure 6: Chain-of-thought repair prompts consistently generate the highest pass rates; the benefit over minimal prompting increases with model capability.

Self-Repair vs. Independent Resampling

The study contrasts self-repair with multiple independent resampling (e.g., pass@5 with stochastic sampling). For models ≥70B, self-repair is consistently more token-efficient and either matches or exceeds the best-of-five pass rate. For smaller models, resampling can provide slightly higher coverage, highlighting a nuanced trade-off.

Figure 7: Self-repair is more token-efficient and matches/exceeds resampling for all capable models; only the 8B model falls behind resampling in final pass rate.

Implications for LLM Code Generation

Self-repair is now a universally valid, training-free mechanism for model improvement. Even 8B-scale, instruction-tuned models achieve reliable gains with only prompt-based feedback.
Practical systems should allocate at least one but likely not more than two repair rounds, trading off computation against sharply diminishing marginal returns.
Prompt strategy matters non-trivially for high-end models; CoT strategies further close gaps.
Error-type analysis provides actionable cues: for failures dominated by assertion errors, alternative approaches (like program synthesis or diversified resampling) may be necessary.
Dense and MoE architectures exhibit distinct gains and plateau points; however, sufficiently large dense models are competitive with or surpass MoE alternatives in self-repair efficiency.

Theoretical and Practical Considerations

The observed ceiling on assertion error repair suggests a disconnect between error feedback granularity and the required logical revision, aligning with limitations found in general LLM self-correction. Theoretical work is needed to better characterize when error signals can effectively inform semantic correction, and how to bridge the “reasoning gap” for assertion-type failures.

From a deployment standpoint, models like Gemini 2.5 Flash exemplify architectures and training paradigms that maximize both first-pass accuracy and repair capability. This has implications for the design of developer-facing LLM tools, continuous integration bots, and real-time or iterative code synthesis environments.

Conclusion

This paper establishes prompt-based iterative self-repair as a highly effective method for improving the functional correctness of code generated by modern LLMs. Gains are robust across parameter scales, architectures, and benchmark domains, and extend to small models that previously required fine-tuning for self-repair. The marginal utility of each repair iteration and error-type sensitivity are quantified in detail, providing deployment guidelines and identifying future research directions for tool-augmented reasoning and richer error feedback mechanisms.