- The paper quantifies large language model (LLM) instability in automated bug injection and correction by measuring both syntactic (Levenshtein similarity) and semantic (Output Equivalence Rate, OER) variability.
- The paper analyzes the impact of temperature settings and problem types, showing that higher temperatures increase output variability and reduce functional equivalence.
- The paper proposes stabilization protocols, including multi-sample ensembling and patch clustering, to enhance reliability in software engineering workflows.
Instability of LLMs in Automated Bug Injection and Correction
Introduction
This paper presents a systematic empirical analysis of the instability exhibited by LLMs, specifically ChatGPT (GPT-4), in automated bug injection and correction tasks. While LLMs have demonstrated strong capabilities in code generation and automated program repair (APR), their non-deterministic behavior—manifested as output variability across repeated runs with identical inputs—poses significant challenges for reliability, reproducibility, and integration into software engineering workflows. The paper focuses on quantifying both structural (syntactic) and functional (semantic) instability, examining the influence of the temperature parameter and problem type, and proposing methodological protocols for more robust LLM-based error correction.
Experimental Design and Metrics
The analysis leverages the QuixBugs benchmark, comprising 20 Python algorithmic problems with diverse error types (logic, boundary, control flow). For each buggy code fragment, ChatGPT is prompted nine times at each of three temperature settings (0.0, 0.5, and 1.0), yielding 540 outputs in total (20 problems × 3 temperatures × 9 runs). Two primary metrics are employed:
- Levenshtein Similarity: Measures character-level syntactic similarity between generated fixes.
- Output Equivalence Rate (OER): Assesses functional equivalence by comparing outputs across a finite test set.
This dual-metric approach enables differentiation between superficial code similarity and true semantic correctness, addressing the limitations of relying solely on syntactic measures.
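The sketch below makes the two metrics concrete under assumed but common formulations: Levenshtein similarity normalized by the longer string's length, and OER computed as the share of test inputs on which a candidate fix reproduces a reference program's output. The normalization and the test harness shown here are illustrative assumptions, not the paper's exact implementation.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; normalizing by the longer string is an assumed convention."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / max(len(a), len(b))


def output_equivalence_rate(candidate, reference, test_inputs) -> float:
    """Fraction of test inputs on which the candidate's output equals the reference's."""
    matches = 0
    for args in test_inputs:
        try:
            if candidate(*args) == reference(*args):
                matches += 1
        except Exception:
            pass  # a crashing candidate counts as a mismatch on that input
    return matches / len(test_inputs)
```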
Results: Instability Across Temperature and Problem Types
Syntactic Instability
- Temperature Sensitivity: As temperature increases, average Levenshtein similarity decreases, and the proportion of low-similarity (<0.7) cases rises (the aggregation behind these statistics is sketched after this list). At temperature 0.0, many problems (e.g., flatten, kth, subsequences) yield highly similar fixes, while others (e.g., bucketsort, breadth_first_search) exhibit substantial diversity even at low temperature.
- Problem-Specific Effects: Template-like and arithmetic problems are robust to temperature changes, maintaining high similarity and low variance. In contrast, graph traversal and control flow-intensive problems show pronounced instability, with large drops in similarity and increased dispersion at higher temperatures.
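Statistics of this kind (mean pairwise similarity, its dispersion, and the share of low-similarity pairs) can be reproduced with a simple aggregation over the sampled fixes. The sketch below assumes the fixes are already grouped per problem and temperature and reuses the levenshtein_similarity helper from the previous sketch; the 0.7 threshold mirrors the one reported above.

```python
from itertools import combinations
from statistics import mean, pstdev


def syntactic_stability(fixes_by_cell):
    """Summarize pairwise similarity per (problem, temperature) cell.

    fixes_by_cell maps (problem_name, temperature) -> list of generated fixes;
    levenshtein_similarity is the helper sketched earlier.
    """
    summary = {}
    for cell, fixes in fixes_by_cell.items():
        sims = [levenshtein_similarity(a, b) for a, b in combinations(fixes, 2)]
        summary[cell] = {
            "mean_similarity": mean(sims),
            "similarity_std": pstdev(sims),  # the run-to-run "error bars"
            "low_similarity_share": sum(s < 0.7 for s in sims) / len(sims),
        }
    return summary
```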
Semantic Instability
- Functional Equivalence: OER declines monotonically with temperature (0.70 at T=0.0, 0.67 at T=0.5, 0.62 at T=1.0). Across the same range, the share of fully successful problems drops from 55% (T=0.0) to 40% (T=1.0), while failures increase from 35% to 45%.
- Task Sensitivity: Deterministic tasks (bitcount, flatten, kth) achieve near-perfect OER across all settings. Graph-based and logic-intensive tasks (breadth_first_search, detect_cycle, shortest_path_length) suffer high failure rates, with OER approaching zero in some cases.
- Residual Non-Determinism: Even at temperature 0.0, full determinism is not achieved; output variability persists due to stochastic elements in the generation stack beyond temperature control.
Variance and Reliability
- Error Bars and Dispersion: Standard deviation analysis reveals that some problems (e.g., lis, next_permutation, shunting_yard) exhibit high run-to-run variability, especially as temperature increases. Others remain stable, indicating that instability is not uniform across tasks.
Methodological Implications and Stabilization Protocols
The findings underscore the necessity of multi-sample, variance-aware evaluation protocols for LLM-based bug fixing. The paper proposes several stabilization techniques:
- Multi-Sample Ensembling: Generate multiple candidate fixes and select among them using test-aware criteria such as an OER threshold (see the combined sketch after this list).
- Patch Clustering and Standardization: Group structurally similar patches and select representatives to reduce review burden and maintenance overhead.
- CI/CD Integration: Employ OER-based admission gates to filter out chance-passing patches, enhancing reliability in deployment pipelines.
- Longitudinal and Comparative Evaluation: Use stability metrics to benchmark LLMs, decoding settings, and model versions, supporting reproducibility and governance.
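The sketch below combines the first three protocols into one illustrative pipeline: sample several candidate fixes, keep only those clearing an OER admission gate, and greedily cluster the survivors by Levenshtein similarity so a reviewer sees one representative per cluster. The threshold values and the greedy clustering strategy are illustrative choices, not the paper's prescribed procedure; levenshtein_similarity and output_equivalence_rate are the helpers sketched earlier.

```python
OER_GATE = 1.0           # admission gate: the fix must match the reference on every test
CLUSTER_THRESHOLD = 0.9  # candidates at least this similar share a cluster


def compile_candidate(source: str, entry_point: str):
    """Compile a candidate fix and return the named function, or None on failure."""
    namespace = {}
    try:
        exec(source, namespace)  # sketch only; sandbox untrusted code in practice
    except Exception:
        return None
    return namespace.get(entry_point)


def select_patches(candidates, entry_point, reference, test_inputs):
    """Filter candidate fixes through an OER gate, then deduplicate by similarity."""
    # 1. Multi-sample ensembling with a test-aware (OER) admission gate.
    admitted = []
    for source in candidates:
        fn = compile_candidate(source, entry_point)
        if fn and output_equivalence_rate(fn, reference, test_inputs) >= OER_GATE:
            admitted.append(source)
    # 2. Greedy clustering: keep one representative per group of similar patches.
    representatives = []
    for source in admitted:
        if all(levenshtein_similarity(source, rep) < CLUSTER_THRESHOLD
               for rep in representatives):
            representatives.append(source)
    return representatives
```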
These protocols are particularly critical in safety-critical domains (finance, healthcare, automotive) and large-scale codebases, where instability can translate into significant risk.
Limitations and Future Directions
The paper is limited to single-file Python problems and a single LLM architecture (GPT-4). Future research should extend to multi-file and multi-language scenarios, alternative LLMs, and more advanced equivalence metrics (e.g., AST-based, semantic similarity). Long-term stability testing and the impact of model version drift are identified as important areas for further investigation.
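As an illustration of the AST-based direction mentioned above, a minimal structural-equivalence check can be built on Python's standard ast module. It ignores formatting and comments but remains far stricter than true semantic equivalence, and is offered only as a sketch of the idea, not as the paper's proposed metric.

```python
import ast


def ast_equivalent(source_a: str, source_b: str) -> bool:
    """Two fixes count as structurally equivalent if their parsed ASTs match."""
    try:
        tree_a, tree_b = ast.parse(source_a), ast.parse(source_b)
    except SyntaxError:
        return False
    # ast.dump omits line/column attributes by default, so formatting differences vanish.
    return ast.dump(tree_a) == ast.dump(tree_b)
```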
Conclusion
This work provides a rigorous quantification of LLM instability in automated bug injection and correction, demonstrating that output variability is pervasive, temperature-dependent, and highly task-sensitive. The inability to achieve full determinism, even at zero temperature, highlights fundamental limitations in current LLM architectures for reliable software engineering applications. The proposed stability-oriented protocols offer practical pathways to mitigate these risks, but unsupervised or one-off use of LLMs for bug fixing remains inadvisable in high-stakes contexts. Methodological advances in sampling, selection, and standardization are essential for integrating LLMs into robust software development workflows. The implications extend to model evaluation, governance, and the broader adoption of LLMs in automated program repair.