- The paper quantifies large language model (LLM) instability in automated bug injection and correction by measuring both syntactic (Levenshtein similarity) and semantic (Output Equivalence Rate, OER) variability.
- The paper analyzes the impact of temperature settings and problem types, showing that higher temperatures increase output variability and reduce functional equivalence.
- The paper proposes stabilization protocols, including multi-sample ensembling and patch clustering, to enhance reliability in software engineering workflows.
Instability of LLMs in Automated Bug Injection and Correction
Introduction
This paper presents a systematic empirical analysis of the instability exhibited by LLMs, specifically ChatGPT (GPT-4), in automated bug injection and correction tasks. While LLMs have demonstrated strong capabilities in code generation and automated program repair (APR), their non-deterministic behavior—manifested as output variability across repeated runs with identical inputs—poses significant challenges for reliability, reproducibility, and integration into software engineering workflows. The paper focuses on quantifying both structural (syntactic) and functional (semantic) instability, examining the influence of the temperature parameter and problem type, and proposing methodological protocols for more robust LLM-based error correction.
Experimental Design and Metrics
The analysis leverages the QuixBugs benchmark, comprising 20 Python algorithmic problems with diverse error types (logic, boundary, control flow). For each buggy code fragment, ChatGPT is prompted nine times at each of three temperature settings (0.0, 0.5, and 1.0), yielding 540 outputs in total (20 problems × 3 temperatures × 9 runs). Two primary metrics are employed:
- Levenshtein Similarity: Measures character-level syntactic similarity between generated fixes.
- Output Equivalence Rate (OER): Assesses functional equivalence by comparing outputs across a finite test set.
This dual-metric approach enables differentiation between superficial code similarity and true semantic correctness, addressing the limitations of relying solely on syntactic measures.
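The sketch below makes the two metrics concrete under assumed but common formulations: Levenshtein similarity normalized by the longer string's length, and OER computed as the share of test inputs on which a candidate fix reproduces a reference program's output. The normalization and the test harness shown here are illustrative assumptions, not the paper's exact implementation.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; normalizing by the longer string is an assumed convention."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / max(len(a), len(b))


def output_equivalence_rate(candidate, reference, test_inputs) -> float:
    """Fraction of test inputs on which the candidate's output equals the reference's."""
    matches = 0
    for args in test_inputs:
        try:
            if candidate(*args) == reference(*args):
                matches += 1
        except Exception:
            pass  # a crashing candidate counts as a mismatch on that input
    return matches / len(test_inputs)
```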
Results: Instability Across Temperature and Problem Types
Syntactic Instability
- Temperature Sensitivity: As temperature increases, average Levenshtein similarity decreases, and the proportion of low-similarity (<0.7) cases rises (the aggregation behind these statistics is sketched after this list). At temperature 0.0, many problems (e.g., flatten, kth, subsequences) yield highly similar fixes, while others (e.g., bucketsort, breadth_first_search) exhibit substantial diversity even at low temperature.
- Problem-Specific Effects: Template-like and arithmetic problems are robust to temperature changes, maintaining high similarity and low variance. In contrast, graph traversal and control flow-intensive problems show pronounced instability, with large drops in similarity and increased dispersion at higher temperatures.
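Statistics of this kind (mean pairwise similarity, its dispersion, and the share of low-similarity pairs) can be reproduced with a simple aggregation over the sampled fixes. The sketch below assumes the fixes are already grouped per problem and temperature and reuses the levenshtein_similarity helper from the previous sketch; the 0.7 threshold mirrors the one reported above.

```python
from itertools import combinations
from statistics import mean, pstdev


def syntactic_stability(fixes_by_cell):
    """Summarize pairwise similarity per (problem, temperature) cell.

    fixes_by_cell maps (problem_name, temperature) -> list of generated fixes;
    levenshtein_similarity is the helper sketched earlier.
    """
    summary = {}
    for cell, fixes in fixes_by_cell.items():
        sims = [levenshtein_similarity(a, b) for a, b in combinations(fixes, 2)]
        summary[cell] = {
            "mean_similarity": mean(sims),
            "similarity_std": pstdev(sims),  # the run-to-run "error bars"
            "low_similarity_share": sum(s < 0.7 for s in sims) / len(sims),
        }
    return summary
```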
Semantic Instability
- Functional Equivalence: OER declines monotonically with temperature (0.70 at T=0.0, 0.67 at T=0.5, 0.62 at T=1.0). Across the same range, the share of fully successful problems drops from 55% (T=0.0) to 40% (T=1.0), while failures increase from 35% to 45%.
- Task Sensitivity: Deterministic tasks (bitcount, flatten, kth) achieve near-perfect OER across all settings. Graph-based and logic-intensive tasks (breadth_first_search, detect_cycle, shortest_path_length) suffer high failure rates, with OER approaching zero in some cases.
- Residual Non-Determinism: Even at temperature 0.0, full determinism is not achieved; output variability persists due to stochastic elements in the generation stack beyond temperature control.
Variance and Reliability
- Error Bars and Dispersion: Standard deviation analysis reveals that some problems (e.g., lis, next_permutation, shunting_yard) exhibit high run-to-run variability, especially as temperature increases. Others remain stable, indicating that instability is not uniform across tasks.
Methodological Implications and Stabilization Protocols
The findings underscore the necessity of multi-sample, variance-aware evaluation protocols for LLM-based bug fixing. The paper proposes several stabilization techniques:
- Multi-Sample Ensembling: Generate multiple candidate fixes and select among them using test-aware criteria such as an OER threshold (see the combined sketch after this list).
- Patch Clustering and Standardization: Group structurally similar patches and select representatives to reduce review burden and maintenance overhead.
- CI/CD Integration: Employ OER-based admission gates to filter out chance-passing patches, enhancing reliability in deployment pipelines.
- Longitudinal and Comparative Evaluation: Use stability metrics to benchmark LLMs, decoding settings, and model versions, supporting reproducibility and governance.
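The sketch below combines the first three protocols into one illustrative pipeline: sample several candidate fixes, keep only those clearing an OER admission gate, and greedily cluster the survivors by Levenshtein similarity so a reviewer sees one representative per cluster. The threshold values and the greedy clustering strategy are illustrative choices, not the paper's prescribed procedure; levenshtein_similarity and output_equivalence_rate are the helpers sketched earlier.

```python
OER_GATE = 1.0           # admission gate: the fix must match the reference on every test
CLUSTER_THRESHOLD = 0.9  # candidates at least this similar share a cluster


def compile_candidate(source: str, entry_point: str):
    """Compile a candidate fix and return the named function, or None on failure."""
    namespace = {}
    try:
        exec(source, namespace)  # sketch only; sandbox untrusted code in practice
    except Exception:
        return None
    return namespace.get(entry_point)


def select_patches(candidates, entry_point, reference, test_inputs):
    """Filter candidate fixes through an OER gate, then deduplicate by similarity."""
    # 1. Multi-sample ensembling with a test-aware (OER) admission gate.
    admitted = []
    for source in candidates:
        fn = compile_candidate(source, entry_point)
        if fn and output_equivalence_rate(fn, reference, test_inputs) >= OER_GATE:
            admitted.append(source)
    # 2. Greedy clustering: keep one representative per group of similar patches.
    representatives = []
    for source in admitted:
        if all(levenshtein_similarity(source, rep) < CLUSTER_THRESHOLD
               for rep in representatives):
            representatives.append(source)
    return representatives
```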
These protocols are particularly critical in safety-critical domains (finance, healthcare, automotive) and large-scale codebases, where instability can translate into significant risk.
Limitations and Future Directions
The paper is limited to single-file Python problems and a single LLM architecture (GPT-4). Future research should extend to multi-file and multi-language scenarios, alternative LLMs, and more advanced equivalence metrics (e.g., AST-based, semantic similarity). Long-term stability testing and the impact of model version drift are identified as important areas for further investigation.
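As an illustration of the AST-based direction mentioned above, a minimal structural-equivalence check can be built on Python's standard ast module. It ignores formatting and comments but remains far stricter than true semantic equivalence, and is offered only as a sketch of the idea, not as the paper's proposed metric.

```python
import ast


def ast_equivalent(source_a: str, source_b: str) -> bool:
    """Two fixes count as structurally equivalent if their parsed ASTs match."""
    try:
        tree_a, tree_b = ast.parse(source_a), ast.parse(source_b)
    except SyntaxError:
        return False
    # ast.dump omits line/column attributes by default, so formatting differences vanish.
    return ast.dump(tree_a) == ast.dump(tree_b)
```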
Conclusion
This work provides a rigorous quantification of LLM instability in automated bug injection and correction, demonstrating that output variability is pervasive, temperature-dependent, and highly task-sensitive. The inability to achieve full determinism, even at zero temperature, highlights fundamental limitations in current LLM architectures for reliable software engineering applications. The proposed stability-oriented protocols offer practical pathways to mitigate these risks, but unsupervised or one-off use of LLMs for bug fixing remains inadvisable in high-stakes contexts. Methodological advances in sampling, selection, and standardization are essential for integrating LLMs into robust software development workflows. The implications extend to model evaluation, governance, and the broader adoption of LLMs in automated program repair.