
TestPrune: Issue-Based Test Minimization

Updated 23 October 2025
  • TestPrune is an automated test suite minimization technique that selects a minimal yet effective set of regression tests by targeting suspicious code regions inferred from issue reports.
  • It leverages LLM-driven predictions to map test coverage and rank tests based on the extent of suspicious code execution, ensuring precise and efficient bug reproduction.
  • TestPrune significantly reduces test suite size from thousands to around 9–11 tests per instance, cutting execution time by 27× and lowering validation costs.

TestPrune is an automated, issue-guided test suite minimization technique developed to enhance the efficiency and effectiveness of regression test selection for software engineering (SWE) issue reproduction and patch validation. In the era of LLM-driven bug repair and code modification, test suites are often too large to fit within LLM context limits, and include many irrelevant or noisy cases that hinder agentic debugging and inflate inference costs. TestPrune addresses this problem by synthesizing fine-grained, minimal sets of regression tests that cover the most likely suspicious code regions as inferred from issue reports, enabling scalable and precision-focused integration with patch generation and validation pipelines.

1. Purpose and Scope

TestPrune focuses on two central software engineering activities:

  • Bug Reproduction: By directly leveraging natural language issue descriptions (including stack traces or code excerpts), TestPrune identifies the relevant code components likely responsible for a reported bug and selects only those regression tests that exercise these components.
  • Patch Validation: After candidate patches are generated (for example, by LLM agents), TestPrune deploys the minimized suite to rapidly validate that correct behavior is preserved, while reducing both cost and LLM context utilization.

The methodology is agnostic to the specific agentic repair pipeline and can be integrated as a preprocessing module within varied debugging platforms.

2. Technical Approach: Issue-Based Test Minimization

The core contribution of TestPrune is an automated “issue-based test-suite minimization” process that prioritizes tests based on their execution coverage of suspicious methods linked to a user-reported issue. This is achieved in several steps:

  1. Suspicious Method Prediction: Given an issue description $d$, an LLM is prompted to produce a set of methods $\mathcal{F}(d)$ suspected to be involved in the bug.
  2. Test Coverage Mapping: Each regression test $t_j$ is statically or dynamically instrumented to obtain a coverage vector

$$\mathrm{vec}(\mathrm{cov}_{t_j}) = \langle c_{j1}, c_{j2}, \ldots, c_{jp} \rangle,$$

where $c_{ji}$ is the number of lines of method $M_i$ executed by test $t_j$.

  3. Test Selection Objective: The task is to select the smallest subset $\mathcal{R} \subset RT$ of the regression test suite $RT$ such that, for every method in $\mathcal{F}(d)$, all covered code is exercised by at least one test in $\mathcal{R}$. This is cast as a weighted minimal hitting set, an NP-hard problem.
  4. Algorithmic Solution: Two heuristic strategies are proposed:
    • Greedy-Additional: Iteratively select the test that covers the largest number of uncovered suspicious lines until all such lines are covered.
    • Greedy-Total: Rank and select tests based on total suspicious line coverage.

In cases where multiple tests tie, an LLM-based tie-break is applied if more than three candidates remain.

This mechanism ensures that TestPrune returns a compact suite that maximally exercises issue-relevant code, with minimal redundancy.
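The two heuristics can be illustrated with a minimal Python sketch. It is not the authors' implementation: the data layout (a map from test name to the set of `(method, line)` pairs it executes), the function names, the alphabetical tie-break (the source describes an LLM-based tie-break instead), and the budget `k` for Greedy-Total are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

Line = Tuple[str, int]            # (method name, line number); hypothetical layout
Coverage = Dict[str, Set[Line]]   # test name -> lines the test executes


def suspicious_coverage(coverage: Coverage, suspicious: Set[str]) -> Coverage:
    """Restrict each test's coverage to lines inside the suspicious methods."""
    return {
        test: {line for line in lines if line[0] in suspicious}
        for test, lines in coverage.items()
    }


def greedy_additional(coverage: Coverage, suspicious: Set[str]) -> List[str]:
    """Repeatedly pick the test that adds the most not-yet-covered suspicious lines."""
    relevant = suspicious_coverage(coverage, suspicious)
    uncovered: Set[Line] = set().union(*relevant.values())
    selected: List[str] = []
    while uncovered:
        # Assumption: ties go to the alphabetically first test name.
        best = max(sorted(relevant), key=lambda t: len(relevant[t] & uncovered))
        selected.append(best)
        uncovered -= relevant[best]
    return selected


def greedy_total(coverage: Coverage, suspicious: Set[str], k: int) -> List[str]:
    """Rank tests by total suspicious-line coverage and keep the top k (k is assumed)."""
    relevant = suspicious_coverage(coverage, suspicious)
    ranked = sorted(relevant, key=lambda t: (-len(relevant[t]), t))
    return [t for t in ranked[:k] if relevant[t]]


# Toy data: four tests, one suspicious method "parse".
coverage = {
    "test_a": {("parse", 1), ("parse", 2), ("render", 10)},
    "test_b": {("parse", 2), ("parse", 3)},
    "test_c": {("parse", 3), ("parse", 4)},
    "test_d": {("render", 10), ("render", 11)},
}
suspicious_methods = {"parse"}

print(greedy_additional(coverage, suspicious_methods))  # ['test_a', 'test_c']
print(greedy_total(coverage, suspicious_methods, k=2))  # ['test_a', 'test_b']
```

On this toy input, Greedy-Additional covers all four suspicious lines with two tests, while Greedy-Total with the same budget picks two overlapping tests and leaves line 4 of `parse` uncovered. This is the usual trade-off between the two strategies: ranking by total coverage is simpler but can select redundant tests.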

3. Integration with Bug Repair and Agentic Pipelines

TestPrune is designed to serve as a modular front-end to existing reproducibility and repair agents, particularly those based on LLMs. Its most salient points of integration are:

  • Reproduction Test Generation: Integrated into frameworks like Otter, TestPrune provides a focused context for generating or validating fail-to-pass (F→P) tests corresponding to recent issues, leading to higher issue reproduction rates.
  • Patch Validation: In frameworks such as Agentless, minimized regression sets from TestPrune are used to validate patches proposed by agents, more reliably detecting both regressions and unpatched behavior.

Since TestPrune typically selects only 9–11 tests per instance (vs. thousands in a complete suite), it enables full utilization of LLM context while filtering out unrelated or spurious cases that would reduce debugging precision.

4. Empirical Evaluation

On benchmarks SWE-Bench Lite and SWE-Bench Verified, TestPrune yields the following empirical benefits:

  • Issue Reproduction: Integration with Otter increases the F→P (fail to pass) test reproduction rate by 6.2%–9.0% (relative increase versus using a less-focused or full test set).
  • Patch Selection: When coupled with Agentless for validation, the patch selection (issue resolution) rate increases by 9.4%–12.9%.
  • Test Suite Size and Execution Time: The average minimized suite contains 9–11 tests (down from thousands), and average test execution time is reduced by 27×.
  • Cost: Additional cost per SWE-Bench instance is approximately $0.02 (GPT-4o) or $0.05 (Claude-3.7-Sonnet), which is negligible compared to total pipeline costs (<$1/sample).

The following table summarizes key empirical metrics reported:

| Metric | TestPrune | Baseline (Full Suite) | Relative Change |
|---|---|---|---|
| Reproduction Rate | ↑ 6.2–9.0% | baseline | |
| Patch Resolution Rate | ↑ 9.4–12.9% | baseline | |
| Suite Size (tests/sample) | 9–11 | thousands | |
| Execution Time | 27× reduction | | |
| Cost per Instance (GPT-4o) | $0.02 | baseline | |

5. Formulation and Evaluation Metrics

TestPrune introduces specific evaluation formulas to quantify the minimization quality. Given the selected minimized regression test set $MRT$ and a ground-truth set $GT$ (i.e., those tests actually exercising buggy lines), the metrics are:

$$\text{Precision} = \frac{|MRT \cap GT|}{|MRT|}, \qquad \text{Coverage Recall} = \frac{|\mathcal{L}(MRT)|}{|\mathcal{L}(GT)|}$$

where $\mathcal{L}(\cdot)$ computes the set of buggy lines covered by the test set. These enable assessment of both the relevance and the sufficiency of the minimized suite for the issue at hand.
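As a worked illustration of the two metrics, the following Python sketch computes them on a toy example. The data layout, the function names, and the convention of returning 0.0 for an empty denominator are assumptions for illustration, not part of the source.

```python
from typing import Dict, Set, Tuple

Line = Tuple[str, int]  # (method name, line number); hypothetical layout


def covered_buggy_lines(
    tests: Set[str], coverage: Dict[str, Set[Line]], buggy: Set[Line]
) -> Set[Line]:
    """L(.): the buggy lines executed by at least one test in the given set."""
    lines: Set[Line] = set()
    for test in tests:
        lines |= coverage.get(test, set()) & buggy
    return lines


def precision(mrt: Set[str], gt: Set[str]) -> float:
    """|MRT ∩ GT| / |MRT|; defined as 0.0 here when MRT is empty (an assumption)."""
    return len(mrt & gt) / len(mrt) if mrt else 0.0


def coverage_recall(
    mrt: Set[str], gt: Set[str], coverage: Dict[str, Set[Line]], buggy: Set[Line]
) -> float:
    """|L(MRT)| / |L(GT)|; defined as 0.0 here when L(GT) is empty (an assumption)."""
    reference = covered_buggy_lines(gt, coverage, buggy)
    if not reference:
        return 0.0
    return len(covered_buggy_lines(mrt, coverage, buggy)) / len(reference)


# Toy data: two buggy lines in "parse"; test_d never touches them.
coverage = {
    "test_a": {("parse", 1), ("parse", 2)},
    "test_b": {("parse", 2), ("parse", 3)},
    "test_c": {("parse", 3), ("parse", 4)},
    "test_d": {("render", 10)},
}
buggy_lines = {("parse", 2), ("parse", 3)}
gt = {t for t, lines in coverage.items() if lines & buggy_lines}  # test_a, test_b, test_c
mrt = {"test_a", "test_d"}

print(precision(mrt, gt))                                # 0.5
print(coverage_recall(mrt, gt, coverage, buggy_lines))   # 0.5
```

Here one of the two selected tests is in the ground truth (precision 0.5), and the selected tests reach one of the two buggy lines that the ground-truth tests reach (coverage recall 0.5).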

6. Broader Implications and Future Directions

TestPrune addresses the context bottleneck in LLM-based debugging by bridging “old” regression knowledge and “new” issue contexts. Its fine-grained minimization keeps regression suites small enough to fit agentic debugging tools, supports rapid test execution, and increases the reliability of patch validation. The method is modular and orthogonal to the repair pipeline, so it can improve diverse pipelines without altering their design.

Future work could expand upon:

  • Enhanced LLM prompting strategies for more granular suspicious method identification.
  • Adaptive minimization algorithms for larger, more heterogeneous codebases.
  • Integration with automated patch suggestion, where minimized rehearsal and validation accelerate the repair cycle.

TestPrune’s approach foregrounds the potential for more targeted, cost-effective, and high-precision regression suite utilization in modern software maintenance.
