LLMs in Software Testing

Updated 10 March 2026

Large language models (LLMs) are applied in software testing to generate, maintain, and evaluate tests with varying degrees of autonomy.
These approaches combine formal taxonomies, prompt engineering techniques, and tool interactions to improve test coverage and bug detection across diverse applications.
Empirical evaluations show that models such as GPT-4o reach high overall coverage yet struggle with complex control flows and still require human oversight for accurate test validation.

LLMs are now central tools in the automation and augmentation of software testing tasks, ranging from test case generation to test maintenance and software testing education. Their ability to synthesize, reason, and interact with both code and natural language input makes them uniquely valuable in constructing, maintaining, and evaluating software tests across a variety of domains and at different levels of abstraction. This article synthesizes the state of the art, formal taxonomies, empirical methodologies, and technical challenges of utilizing LLMs in software testing, as established by the primary research literature.

1. Formal Taxonomies and Applications of LLMs in Software Testing

Feldt et al. established a formal taxonomy based on the autonomy of LLM-driven testing agents, indexed by the function $\alpha: \{M_0,\dots,M_4\}\to\{0,\dots,4\}$, where each mode $M_i$ represents increasing self-direction, from simple code completion (M₀, $\alpha=0$) to fully autonomous conversational testing agents (M₄, $\alpha=4$) with memory and planning capabilities (Feldt et al., 2023). The taxonomy is characterized as follows:

Mode	Driver	Interaction	Info Sources	Autonomy $\alpha$
M₀	Front-end	No	Local code context	0
M₁	Front-end	No	Examples, templates	1
M₂	Human	Yes	Templates, dialogue	2
M₃	Human, tools	Yes	Tool outputs	3
M₄	Human, LLM, tools	Yes	All above + memory	4
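
The table above can be encoded as a small lookup structure. The following is a minimal sketch; the names `Autonomy`, `Mode`, `TAXONOMY`, `alpha`, and `interactive_modes` are illustrative assumptions and do not come from Feldt et al.

```python
from dataclasses import dataclass
from enum import IntEnum


class Autonomy(IntEnum):
    """Autonomy level alpha for each mode M0..M4, as listed in the table."""
    M0 = 0
    M1 = 1
    M2 = 2
    M3 = 3
    M4 = 4


@dataclass(frozen=True)
class Mode:
    """One row of the taxonomy table (field names are illustrative)."""
    driver: str
    interactive: bool
    info_sources: str


TAXONOMY = {
    Autonomy.M0: Mode("Front-end", False, "Local code context"),
    Autonomy.M1: Mode("Front-end", False, "Examples, templates"),
    Autonomy.M2: Mode("Human", True, "Templates, dialogue"),
    Autonomy.M3: Mode("Human, tools", True, "Tool outputs"),
    Autonomy.M4: Mode("Human, LLM, tools", True, "All above + memory"),
}


def alpha(mode_name: str) -> int:
    """Map a mode name such as 'M3' to its autonomy level."""
    return int(Autonomy[mode_name])


def interactive_modes() -> list[str]:
    """Return the names of the modes that involve interaction."""
    return [level.name for level, mode in TAXONOMY.items() if mode.interactive]


print(alpha("M3"))          # 3
print(interactive_modes())  # ['M2', 'M3', 'M4']
```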

At higher autonomy levels, LLMs benefit from integration with external tools and stateful workflows, enabling dynamic invocation of test-generation or coverage tools (e.g., EvoSuite, JaCoCo). The “SocraTest” three-tier architecture (user, middleware, LLM) exemplifies this, where the LLM not only generates tests but interacts iteratively with middleware to refine strategies and maximize coverage.

Other comprehensive surveys divide the software testing life cycle into unit-test-case generation, test-oracle synthesis, system-test input generation, bug analysis, debugging, and program repair (Wang et al., 2023, Augusto et al., 29 Sep 2025). LLMs are reported as highly effective for unit and system test generation, assertion synthesis, test suite augmentation, and various levels of program analysis and repair.

2. Empirical Evaluation Frameworks and Benchmarks

A set of public benchmarks has been developed to systematically quantify LLMs’ effectiveness in generating and validating test cases. TESTEVAL offers coverage-based evaluation (overall, targeted branch/line/path) across 210 Python programs with diverse cyclomatic complexity. Key metrics are statement and branch coverage ($C_{\text{overall}}$ and $C_{\text{branch}}$) and targeted accuracy ($A_{\text{line}}$, $A_{\text{branch}}$) (Wang et al., 2024):

State-of-the-art LLMs (e.g., GPT-4o) reached $C_{\text{overall}}\approx 98.7\%$ and $C_{\text{branch}}\approx 97.2\%$.
Targeted branch/line accuracy remains lower (≈81%) and precise path coverage is limited (best ≈57%).
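
The difference between overall coverage and targeted accuracy can be illustrated with a short sketch. The definitions below are assumptions made for illustration (coverage as the fraction of items executed, targeted accuracy as the fraction of requested targets actually hit); the benchmark's exact formulas are not reproduced here, and the function names are hypothetical.

```python
def coverage(covered: set[int], all_items: set[int]) -> float:
    """Fraction of lines (or branches) executed by a whole test suite."""
    if not all_items:
        return 1.0  # nothing to cover
    return len(covered & all_items) / len(all_items)


def targeted_accuracy(results: list[tuple[int, set[int]]]) -> float:
    """Fraction of tasks where the generated test executed the requested target.

    Each entry is (target_line, lines_executed_by_the_generated_test).
    """
    if not results:
        return 0.0
    hits = sum(1 for target, executed in results if target in executed)
    return hits / len(results)


# A suite may cover most lines overall...
print(coverage({1, 2, 3}, {1, 2, 3, 4}))                     # 0.75
# ...while individual tests still miss the specific line they were asked to reach.
print(targeted_accuracy([(3, {1, 2, 3}), (5, {1, 2})]))      # 0.5
```

Under these assumed definitions, a model can score highly on the first metric by exercising many lines collectively while scoring lower on the second, which asks for a specific line or branch per test.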

The GBCV framework generates synthetic programs parameterized by control-flow and variable usage to expose LLM weaknesses on composite logic, arithmetic, and iteration (Chang et al., 5 Feb 2025). Findings include incomplete test rates (GPT-3.5: 32.7%, GPT-4o-mini: 6.1%), strong boundary-value detection, and error rates up to 88.2% for complex conditional branches.

TestBench targets class-level Java methods and provides five quality metrics: syntactic correctness, compilation, runtime correctness, coverage, and mutation kill rate (Zhang et al., 2024). Its results indicate that larger models (e.g., GPT-4) make better use of complex contexts but still achieve only moderate mutation scores.
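
A mutation kill rate measures how many deliberately faulty variants of a program a test suite can distinguish from the original. The sketch below is written in Python for brevity and uses a toy function, hand-written mutants, and hypothetical helper names; it does not reflect TestBench's Java tooling.

```python
from typing import Callable

Test = tuple[int, bool]  # (input, expected output)


def original(age: int) -> bool:
    """Toy function under test."""
    return age >= 18


# Hand-written mutants of `original` (illustrative only).
MUTANTS: list[Callable[[int], bool]] = [
    lambda age: age > 18,    # boundary shifted by one
    lambda age: age <= 18,   # relational operator flipped
    lambda age: age >= 21,   # constant replaced
]


def is_killed(mutant: Callable[[int], bool], tests: list[Test]) -> bool:
    """A mutant is killed if at least one test observes a wrong result."""
    return any(mutant(value) != expected for value, expected in tests)


def mutation_score(mutants: list[Callable[[int], bool]], tests: list[Test]) -> float:
    """Fraction of mutants killed by the test suite."""
    killed = sum(is_killed(m, tests) for m in mutants)
    return killed / len(mutants)


weak_tests: list[Test] = [(30, True), (5, False)]
strong_tests: list[Test] = weak_tests + [(18, True)]  # adds the boundary value

print(mutation_score(MUTANTS, weak_tests))    # one of three mutants killed
print(mutation_score(MUTANTS, strong_tests))  # 1.0
```

Both suites pass on the original function, but only the suite that includes the boundary value kills every mutant, which is why mutation scores can stay moderate even when coverage is high.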

3. Methodological Advances in Prompt Engineering and Test Generation

Prompt engineering dominates as the utilization approach (89% of surveyed studies), with zero- and few-shot templates, role instructions, and chain-of-thought (CoT) decomposition shown to boost correctness and coverage (Chu et al., 26 Nov 2025). Context enrichment techniques—embedding relevant code signatures, imports, and dynamic feedback (e.g., uncovered branches, surviving mutants)—are empirically shown to increase compilation and execution rates by up to 14.9 percentage points (Chu et al., 26 Nov 2025, Zhang et al., 2024). Retrieval-augmented generation (RAG) and structured output specification (JSON, Markdown) further reduce hallucinations and error rates (Sami et al., 2024).
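
Context enrichment and structured output specification can be combined in a single prompt template. The following is a minimal sketch; the prompt wording, the JSON schema with a `tests` key, and the function names `build_prompt` and `parse_response` are assumptions for illustration and are not taken from the cited studies.

```python
import json


def build_prompt(focal_source: str, imports: list[str], uncovered_branches: list[str]) -> str:
    """Assemble a context-enriched prompt (the layout is an assumption)."""
    sections = [
        "You are a senior test engineer. Write pytest unit tests.",  # role instruction
        "Imports available in the module under test:",
        "\n".join(imports) or "(none)",
        "Code under test:",
        focal_source,
        "Branches not yet covered by existing tests:",               # dynamic feedback
        "\n".join(f"- {b}" for b in uncovered_branches) or "(none)",
        'Reply with JSON only, in the form {"tests": ["<test source>", ...]}.',
    ]
    return "\n\n".join(sections)


def parse_response(raw: str) -> list[str]:
    """Reject any reply that does not match the requested structure.

    json.loads raises json.JSONDecodeError (a ValueError subclass) on bad JSON.
    """
    data = json.loads(raw)
    tests = data.get("tests") if isinstance(data, dict) else None
    if not isinstance(tests, list) or not all(isinstance(t, str) for t in tests):
        raise ValueError("response does not match the expected schema")
    return tests


prompt = build_prompt("def f(x):\n    return abs(x)", ["import math"], ["x < 0"])
print(parse_response('{"tests": ["t1", "t2"]}'))  # ['t1', 't2']
```

Requesting a fixed schema lets malformed or off-topic replies be rejected mechanically before any test is compiled or run.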

Iterative “generate–validate–repair” loops have become standard mechanisms for industrial-grade test usability, raising raw test pass rates from as low as 24% to over 70% through automated compilation checks and re-prompting with error feedback. Mutation-guided LLM prompting (MuTAP) and hybrid tool–LLM pipelines (e.g., with EvoSuite, KLEE) increase behavior diversity and raise mutation scores to as much as 93.6% (Chu et al., 26 Nov 2025).

4. Quantitative Results, Metrics, and Performance Limits

Recent systematic evaluations report the following empirical findings:

Fine-tuned, decoder-only LLMs (e.g., DeepSeek-Coder-6B, CodeLlama-7B) outperform prior SOTA by up to 2x across test generation, assertion, and test evolution tasks in Java (passing test rates up to 36%, assertion EM up to 71%) (Shang et al., 2024).
Prompt engineering alone (e.g., GPT-3.5, Llama3 zero-shot) can match or exceed fine-tuned baselines on code generation, but not on precise assertion EM.
On Defects4J, the best-performing LLM (DeepSeek-Coder-6B) yields a correct generation rate of 33.68% but detects only 8 of 163 bugs, with a precision of 0.74% and a high false-alarm rate (135 false alarms per true bug).
For mutation testing, LLM-generated mutants achieve a fault detection rate of 79.1% versus 41.6% for rule-based approaches (a 90.1% relative improvement), but at the cost of lower compilability and more duplicate or equivalent mutants (Wang et al., 2024).
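
Two of the figures above can be checked with simple arithmetic. The sketch assumes precision is defined as true positives divided by all reported positives and that the 90.1% figure is a relative improvement over the rule-based baseline; the function names are illustrative.

```python
def precision_from_false_alarm_ratio(false_alarms_per_true_bug: float) -> float:
    """Precision = TP / (TP + FP), with FP expressed per single true positive."""
    return 1.0 / (1.0 + false_alarms_per_true_bug)


def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`."""
    return (new - baseline) / baseline


# 135 false alarms per true bug corresponds to a precision of about 0.74%.
print(round(100 * precision_from_false_alarm_ratio(135), 2))  # 0.74

# 79.1% versus 41.6% fault detection is a relative gain of about 90.1%.
print(round(100 * relative_gain(79.1, 41.6), 1))              # 90.1
```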

In practical deployments, agent-oriented frameworks integrating test generation, call graph visualization, and automated test execution can sustain coverage above 90% across diverse applications with end-to-end runtimes of 80–90 seconds per generation–execution loop (Sherifi et al., 2024).

5. Limitations, Failure Modes, and Open Research Challenges

Despite considerable advances, prominent challenges persist:

Semantic Weakness: LLMs often fail to capture deep program logic—especially in path coverage and complex control flows—even with strong coverage on “happy-path” and boundary inputs (Wang et al., 2024, Chang et al., 5 Feb 2025).
Hallucination and Fault Detection: While fabricated outputs can expose edge cases, unverified hallucinations also introduce spurious tests and incorrect oracles, necessitating tool-based verification and human oversight (Feldt et al., 2023, Chu et al., 26 Nov 2025).
Prompting Sensitivity: Small changes in prompt design yield large variations in output quality; robust promptware validation and self-refinement chains are active areas of research (Augusto et al., 29 Sep 2025).
Benchmark Gaps: The prevalence of data leakage from public benchmarks (e.g., Defects4J) into model training sets risks inflating reported results. Larger, decontaminated, multi-language benchmarks are needed (Shang et al., 2024).
Scalability and Integration: Maintaining up-to-date, context-relevant prompt libraries and integrating LLMs into CI/CD pipelines require automation and human governance frameworks (Santana et al., 20 Oct 2025).
Test Maintenance: Applying LLMs in the test maintenance process (detection of outdated/obsolete cases) in industry is feasible but currently achieves modest precision/recall (e.g., F1=29.3%) (Liu et al., 2024).

6. Roadmaps and Best Practices for Responsible Adoption

Best-practice guidelines stress iterative prompt design, strict human-in-the-loop validation, and the adoption of structured prompt libraries mapped to specific testing artifacts and workflows (Santana et al., 20 Oct 2025). Policy-level recommendations include:

Deploy on-premise models for sensitive code,
Log all prompt–output pairs for traceability,
Combine LLMs with static/dynamic program analysis for test relevance and coverage optimization,
Augment LLM outputs with post-processing for syntax/compilation repair.
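
The traceability recommendation can be met with an append-only log of prompt–output pairs. The following is a minimal sketch using JSON Lines; the record fields and the function name `log_interaction` are assumptions for illustration, not a prescribed format.

```python
import hashlib
import io
import json
from datetime import datetime, timezone
from typing import TextIO


def log_interaction(stream: TextIO, model: str, prompt: str, output: str) -> dict:
    """Append one prompt-output pair to a JSON Lines stream and return the record."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "prompt": prompt,
        "output": output,
        # The hash lets identical prompts be grouped without comparing full text.
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
    }
    stream.write(json.dumps(record) + "\n")
    return record


# In practice the stream would be a file opened in append mode;
# an in-memory buffer keeps the example self-contained.
buffer = io.StringIO()
log_interaction(buffer, "local-model", "Write a test for f()", "def test_f(): ...")
print(len(buffer.getvalue().splitlines()))  # 1
```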

In educational settings, advanced prompt engineering and role priming can dramatically boost question/explanation alignment in ISTQB-style domains, enabling virtual tutors and scalable assessment platforms (Ngo et al., 25 Oct 2025).

Leading surveys identify the next research directions: robust hybridization with symbolic methods for path-sensitive coverage, development of fully autonomous testing agents, formalization of test adequacy metrics for LLMs, and expansion into non-functional testing and large-scale, real-world, and multi-agent software systems (Wang et al., 2023, Augusto et al., 29 Sep 2025, Chu et al., 26 Nov 2025).

In summary, LLMs have become established tools in both academic and industrial software testing, offering measurable gains in efficiency, coverage, and process automation. Their integration into test generation, oracle synthesis, maintenance, education, and mutation testing is now empirically grounded, yet their current limitations in path sensitivity, fault detection, and reproducibility require continued methodological and theoretical advances, particularly in benchmarking, tool integration, and human–AI collaboration (Feldt et al., 2023, Xiong et al., 2023, Sami et al., 2024, Wang et al., 2024, Chang et al., 5 Feb 2025, Junior et al., 2023, Chu et al., 26 Nov 2025).