
Execution-Free Coverage-Guided Testing

Updated 30 December 2025
  • Execution-free coverage-guided testing is a methodology that uses LLMs to statically predict code coverage, eliminating the need for runtime instrumentation.
  • It employs predictive models and adaptive two-phase feedback to prioritize tests that maximize new coverage and trigger runtime errors.
  • Empirical results show reduced computational overhead and improved error discovery rates compared to traditional execution-based fuzzing methods.

Execution-free coverage-guided testing is a class of software testing methodologies that direct test generation or selection using estimates of code coverage, but without requiring code execution or runtime instrumentation. These approaches address the substantial computational overhead characteristic of traditional (execution-based) coverage-guided fuzzing and enable efficient automated error detection—especially in contexts where running code is infeasible or undesirable. Modern execution-free frameworks leverage LLMs and predictive methods to statically estimate test effectiveness, driving new advances in both static analysis and AI-driven software engineering.

1. Conceptual Foundations

Coverage-guided testing traditionally refers to techniques where test generation is adaptively driven by feedback about which code regions have been exercised by previous inputs. Classical implementations rely on dynamic instrumentation of binaries or source code to obtain accurate per-test coverage data. However, such approaches are limited by the high cost of repeated code execution and tracing, particularly given that only a small fraction of test cases actually lead to new coverage increases. As observed empirically, fewer than 1 in 10,000 inputs typically increases coverage after initial fuzzer convergence, and this rate decays exponentially over time (Nagy et al., 2018).

Execution-free coverage-guided testing bypasses dynamic execution, substituting predictive models for runtime feedback. Formally, this requires a static predictor $f_\theta$ capable of estimating the coverage vector for a program $P$ under test case $T$ with test input $I$: $f_\theta(P, T, I) \in [0,1]^n$, where $n$ is the number of statements or code units (Tufano et al., 2023). The predictor enables testers to prioritize, score, or mutate inputs based on estimated (rather than measured) coverage, facilitating aggressive test exploration with minimal computational resource use.
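As a concrete illustration, such a predictor can be exposed as a single function returning per-statement execution probabilities. The following is a minimal sketch assuming the LLM is supplied as a caller-provided callable; the prompt format and the llm parameter are illustrative assumptions, not the interface of any published system.

from typing import Callable, List

def f_theta(program: str, test_case: str, test_input: str,
            llm: Callable[[str], List[int]]) -> List[float]:
    """Statically estimate per-statement coverage probabilities in [0, 1].

    llm is any function mapping a prompt to one 0/1 label per program
    line (a hypothetical interface, not a specific API).
    """
    lines = program.splitlines()
    prompt = (
        "For each numbered line below, answer 1 if the test would "
        "execute it and 0 otherwise.\n"
        + "\n".join(f"{i}: {src}" for i, src in enumerate(lines))
        + f"\n\nTest:\n{test_case}\n\nInput:\n{test_input}"
    )
    labels = llm(prompt)
    return [float(x) for x in labels[: len(lines)]]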

2. Frameworks and Methodologies

Modern execution-free coverage-guided testing frameworks fall into two primary architectures:

  • Predictive Coverage-Guided Test Selection: These methods employ a static model, often based on LLMs, to estimate per-statement or per-branch coverage for each test candidate. The most promising test (maximizing expected new coverage) is selected for further exploration. Mutation guidance and prioritization are steered by coverage predictions. For example, in (Tufano et al., 2023), LLMs are prompted with program code and candidate tests to annotate whether each source line would be executed, enabling surrogate coverage-guided search.
  • Execution-Free, Multi-Agent Test Generation: Frameworks such as Cerberus (Dhulipala et al., 24 Dec 2025) deploy multiple LLM-based agents, a test case generator (TCG) and a predictive executor (PE), operating in a feedback loop. The TCG proposes test candidates; the PE simulates their execution, predicting both the coverage set $\widehat{\mathrm{cov}}(P, T)$ and any runtime exceptions. The testing process is organized into phases, adaptively switching LLM prompting strategies between maximizing coverage gain and maximizing error discovery, depending on estimated coverage progression.

Key elements of these frameworks include surrogate coverage metrics (e.g., $C(T, P) = \frac{|\bigcup_{t \in T} \widehat{\mathrm{cov}}(P, t)|}{|S_P|}$, where $S_P$ is the set of executable statements), two-phase prompting schemes, and test scoring functions optimized for either coverage expansion or error triggering (Dhulipala et al., 24 Dec 2025).
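Treating each test's predicted coverage as a set of statement indices, the surrogate metric reduces to a union-over-statements ratio. A minimal sketch, assuming predicted coverage sets are available as Python sets (function and variable names are illustrative):

from typing import Dict, Set

def surrogate_coverage(pred_cov: Dict[str, Set[int]],
                       executable_stmts: Set[int]) -> float:
    """C(T, P): fraction of executable statements covered by the union
    of the per-test predicted coverage sets."""
    covered = set().union(*pred_cov.values()) if pred_cov else set()
    return len(covered & executable_stmts) / len(executable_stmts)

# Two tests jointly predicted to cover 3 of 4 statements -> 0.75
print(surrogate_coverage({"t1": {0, 1}, "t2": {1, 2}}, {0, 1, 2, 3}))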

3. Predictive Models and Coverage Estimation

Predicting precise per-test code coverage without execution is a challenging learning problem. Approaches in (Tufano et al., 2023) formalize the coverage prediction task for LLMs: they task a model with outputting, for each line/statement of $P$, whether it would be executed by test $T$. Training uses datasets such as COVERAGEEVAL, which consists of (program, test, coverage vector) triplets derived from real code and test executions.
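For intuition, a single such triplet pairs a focal method and a test with a per-line coverage vector. The layout below is schematic only and is not the exact COVERAGEEVAL serialization:

# Schematic (program, test, coverage) triplet; the field names and
# encoding are illustrative, not the dataset's actual schema.
example = {
    "program": ("def absval(x):\n"
                "    if x < 0:\n"
                "        return -x\n"
                "    return x\n"),
    "test": "assert absval(-3) == 3",
    "coverage": [1, 1, 1, 0],  # final return line not executed for -3
}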

Prompting strategies such as zero-shot, one-shot, and multi-shot prompting (with examples of code annotated with coverage outcomes) enable high statement-level prediction accuracy (e.g., up to 90.7% for GPT-4 in one-shot mode) but substantially lower accuracy for branch-sensitive lines (∼22%). These predictive capabilities are then plugged into test selection or generation loops, as illustrated in the following pseudocode, adapted from (Tufano et al., 2023):

# Greedy predictive test selection: score each candidate by how much
# not-yet-covered code it is predicted to execute, then pick the best.
scores = {}
for T in candidate_tests:
    pred = f_theta(P, T, I)          # predicted per-statement probabilities
    scores[T] = sum((1 - covered[j]) * pred[j] for j in range(len(covered)))
T_star = max(scores, key=scores.get)
covered = [max(c, p) for c, p in zip(covered, f_theta(P, T_star, I))]

As predictions are imperfect, hybrid approaches may combine static predictors with occasional true instrumentation to recalibrate predictions on difficult or high-value code paths.
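One hedged realization of such a hybrid: trust the static predictor unless its per-statement probabilities are too uncertain, in which case fall back to real instrumented execution. Both pred and measure below are caller-supplied callables and purely illustrative:

def coverage_with_fallback(pred, measure, P, T, I, tau=0.3):
    """Hybrid estimate: use the static predictor pred unless some
    per-statement probability lies within tau of the maximally
    uncertain value 0.5; then call measure for ground-truth coverage.
    Both callables are assumed interfaces, not a published API."""
    probs = pred(P, T, I)
    if any(abs(p - 0.5) < tau for p in probs):
        return [float(b) for b in measure(P, T, I)]
    return probs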

4. Two-Phase Feedback and Multi-Agent Systems

Execution-free methodologies benefit from feedback-driven strategies originally developed for execution-based fuzzing. In Cerberus (Dhulipala et al., 24 Dec 2025), the testing algorithm is organized into two main phases:

  1. Joint Coverage and Error Discovery: The test generation agent is prompted to maximize both coverage gain and exception triggering. The score for a candidate $t$ is $S_1(t) = |\widehat{\mathrm{cov}}(P, t) \setminus \mathcal{C}| + \gamma \cdot \mathbf{1}_{\{\text{error}(t)\}}$, where $\mathcal{C}$ is the cumulative coverage set and $\gamma$ is a weighting parameter.
  2. Error Maximization: Once predicted coverage plateaus or the statement set is exhausted, the system shifts exclusively to maximizing error discovery: $S_2(t) = \mathbf{1}_{\{\text{error}(t)\}}$. This adaptive phase switching enables greater test diversity and higher bug-finding rates than static prompting.

Each iteration involves generating tests, deduplicating them, simulating their coverage and exceptions, and updating the cumulative coverage set, repeating until a time budget is exhausted or a saturation threshold is met.
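A compact sketch of this two-phase loop follows, assuming the generator and predictive executor are exposed as plain callables; generate, predict_cov, and predict_err are illustrative names, not Cerberus's actual interface:

def two_phase_loop(generate, predict_cov, predict_err,
                   gamma=5.0, budget=50, patience=3):
    """Cerberus-style phase switching (all callables are assumptions):
    phase 1 scores S1 = |new coverage| + gamma * 1{error};
    phase 2, entered after coverage stalls, scores S2 = 1{error}."""
    covered, kept, stalls, phase = set(), [], 0, 1
    for _ in range(budget):
        before = len(covered)
        for t in generate(phase):              # LLM test case generator
            cov = predict_cov(t)               # predicted statement set
            err = 1 if predict_err(t) else 0   # predicted runtime exception?
            score = len(cov - covered) + gamma * err if phase == 1 else err
            if score > 0:                      # keep only useful tests
                kept.append(t)
                covered |= cov
        stalls = stalls + 1 if len(covered) == before else 0
        if phase == 1 and stalls >= patience:  # coverage plateau detected
            phase = 2                          # switch to error maximization
    return kept, covered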

5. Empirical Evaluation and Comparative Metrics

Frameworks like Cerberus demonstrate notable empirical benefits. On Java and Python benchmarks, Cerberus attains higher error trigger rates (ETR), bug discovery rates (BDR), and superior coverage with orders of magnitude fewer tests compared to baseline execution-based fuzzers such as Jazzer and LLM-based methods like Fuzz4All (e.g., Cerberus: ETR 33.3% with 9 tests versus Jazzer: ETR 9.6% with ∼3.1M tests on Java snippets) (Dhulipala et al., 24 Dec 2025). Evaluation metrics include:

  • Statement/Branch Accuracy: Per-statement or per-branch prediction accuracy (statement-level ∼90%; branch-level ∼20–22%).
  • Exact-Match: Fraction of instances where predicted and actual coverage vectors are identical (both metrics are sketched after this list).
  • Precision/Recall/F1 (for exception prediction): E.g., Cerberus achieves precision 0.72, recall 0.56, F1 0.63 for Java; precision 0.743, recall 0.798, F1 0.770 for Python (Dhulipala et al., 24 Dec 2025).
  • Coverage Plateau Curves: Number of candidate tests versus predicted code coverage progression.
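A minimal sketch of the first two metrics, assuming coverage vectors are binary Python lists (this is not the papers' actual evaluation code):

def stmt_accuracy(pred, truth):
    """Fraction of statements whose covered/uncovered label is correct."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

def exact_match(pred, truth):
    """1.0 iff the predicted coverage vector matches exactly."""
    return float(pred == truth)

pred, truth = [1, 1, 0, 1], [1, 1, 1, 1]
print(stmt_accuracy(pred, truth))  # 0.75
print(exact_match(pred, truth))    # 0.0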

These results indicate that predictive, execution-free methods can often match or outperform conventional coverage-guided fuzzers in several software testing regimes.

6. Limitations and Directions for Further Research

The main constraints of current execution-free coverage-guided testing approaches include:

  • Prediction Noise: LLM-based coverage estimators exhibit suboptimal accuracy for complex conditional logic, leading to possible coverage misestimation.
  • Generality: Predictive methods may degrade on codebases using unfamiliar libraries or exhibiting high complexity; results are currently best for code snippets up to a few hundred lines.
  • Dependence on LLM Stability: Output variance and hallucination risk require mitigation, usually via low-temperature sampling or fallback to direct execution on stable codebases.
  • Scalability and Cost: Per-query API costs and inference times (e.g., ≈5 s per code snippet on Cerberus) may impact test throughput for large-scale projects (Dhulipala et al., 24 Dec 2025).

Potential research directions encompass improved static branch/edge coverage prediction, hybrid online-offline calibration schemes, integration with CI pipelines, multi-language and web-page coverage support, and reduction of predictor error via specialized architectures or tailored pretraining (Dhulipala et al., 24 Dec 2025, Tufano et al., 2023). Extending beyond LLMs, exploration of SAT/SMT-based path estimation is suggested as the next step for truly execution-free coverage filtering—precomputing per-block predicates enabling mutation rejection entirely statically (Nagy et al., 2018).

7. Relationship to Traditional and Hybrid Coverage-Guided Fuzzing

Execution-free coverage-guided testing is directly motivated by the limitations of traditional execution-based fuzzing, where coverage tracing remains the dominant overhead, especially as the fraction $p(t)$ of coverage-increasing tests approaches zero. Solutions such as UnTracer (Nagy et al., 2018) mitigate this by encoding the coverage frontier as software interrupts in the binary, filtering out most non-coverage-increasing tests and tracing only the rare "frontier-crossing" events. Analytical models show that UnTracer's average overhead dips below 1% within an hour and approaches zero at 24 h, as opposed to 36% (AFL-Clang) to over 600% (AFL-QEMU) for always-instrumented approaches.

Hybrid schemes could start with full tracing, switching to execution-free or predictive methods once the coverage-increasing rate falls below a calibrated threshold (Nagy et al., 2018). This approach offers a continuum, balancing empirical accuracy, computational cost, and code-under-test context.
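A hedged sketch of such a schedule, where trace and predict stand in for full instrumentation and the execution-free estimator respectively (both are assumed callables returning statement sets):

def hybrid_feedback(tests, trace, predict, threshold=1e-4, window=1000):
    """Start with ground-truth tracing; switch permanently to the
    execution-free predictor once the fraction of coverage-increasing
    tests in the last window tests drops below threshold."""
    covered, gains, tracing = set(), [], True
    for t in tests:
        cov = trace(t) if tracing else predict(t)
        gains = (gains + [bool(cov - covered)])[-window:]
        covered |= cov
        if tracing and len(gains) == window and sum(gains) / window < threshold:
            tracing = False   # p(t) has effectively hit zero: go predictive
    return covered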

Approach | Coverage Feedback Source | Per-Test Overhead | Coverage Accuracy
AFL-QEMU / AFL-Clang | Runtime, full trace | High (36–612%) | Ground-truth
UnTracer | Oracle + trace for increments | ∼0–1% (after 1–24 h) | Ground-truth
LLM Predictive | Static prediction, no execution | ≪1% | 85–91% (statements)
