Test Case Minimization (TCM) Techniques

Updated 2 July 2026

Test Case Minimization (TCM) is the process of reducing the size and cost of test suites while preserving essential coverage and fault detection.
TCM employs diverse methods such as set cover, greedy heuristics, ILP, reinforcement learning, and quantum annealing to optimize test selection.
TCM techniques enhance efficiency in program testing, circuit diagnosis, and specification validation, yielding measurable improvements in runtime and fault detection rates.

Test Case Minimization (TCM) is the process of reducing the cardinality or total execution cost of a test suite while preserving coverage, fault detection, or other validation properties under given constraints. TCM spans program testing, circuit diagnosis, specification validation, proof assistants, and AI-generated test regimes, and is addressed via combinatorial, logical, heuristic, history-based, and quantum/annealing-based algorithms. Research in arXiv literature details both black-box and white-box TCM for code, models, and requirements, and extends TCM to settings with multi-criteria objectives, scalable embeddings, and hardware acceleration.

1. Formal Problem Definitions and Core Objectives

All TCM approaches specify a core minimization objective subject to coverage or detection constraints. In canonical code-based TCM, given a test suite $T = \{t_1,\dots,t_n\}$ and a system artifact (code, requirement set, specification), the aim is to select a subset $T' \subseteq T$ with $|T'| \leq B \cdot |T|$ for a budget $B \in (0,1]$, such that $T'$ preserves (a) structural coverage (statements, transitions, requirements), (b) known-fault detection, and/or (c) test diversity, depending on the setting (Asif et al., 25 May 2026, Pan et al., 26 May 2025, Gu et al., 2024, Swain et al., 2012).

For coverage-based TCM, the classical set-cover model is dominant: $\text{minimize } |T'| \qquad \text{subject to } \bigcup_{t \in T'} Covered(t) = \mathcal{C}$, where $Covered(t)$ is the set of features (e.g., statements) exercised by $t$, and $\mathcal{C}$ is the full set of features.
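The set-cover model is NP-hard in general, so it is usually approximated. The following is a minimal sketch of the standard greedy approximation, assuming coverage is available as a mapping from test name to the set of features it exercises. The function name `greedy_set_cover` and the sample data are illustrative, not taken from any cited work.

```python
from typing import Dict, Hashable, List, Set


def greedy_set_cover(covered: Dict[str, Set[Hashable]]) -> List[str]:
    """Select tests until every feature covered by the full suite is covered.

    `covered[t]` plays the role of Covered(t); the union over all tests
    plays the role of the feature set C.
    """
    remaining = set().union(*covered.values())  # features still uncovered
    selected: List[str] = []
    while remaining:
        # Pick the test with the largest marginal gain; iterating over
        # sorted names makes tie-breaking deterministic.
        best = max(sorted(covered), key=lambda t: len(covered[t] & remaining))
        selected.append(best)
        remaining -= covered[best]
    return selected


if __name__ == "__main__":
    suite = {
        "t1": {1, 2, 3},
        "t2": {3, 4},
        "t3": {4, 5},
        "t4": {1, 5},
    }
    print(greedy_set_cover(suite))
```

The greedy choice does not guarantee a minimum-cardinality suite, only full coverage with a logarithmic approximation bound, which is why exact formulations such as ILP remain relevant.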

In requirement-based TCM, given requirements $R = \{r_1,\dots,r_m\}$ and a traceability relation linking each test case to the requirements it exercises, the problem becomes selecting $T' \subseteq T$ under the hard constraint of full requirement coverage (Pan et al., 26 May 2025).

Multi-criteria TCM generalizes this by enforcing full statement/fault coverage while adding a similarity/dissimilarity regularization term to the objective: each pair of tests contributes its pairwise similarity score whenever both tests of the pair are selected (Gu et al., 2024).

Boolean or logical TCM reframes the problem in terms of entailment among test results under the system requirements, computing an optimal sequence of test cases where redundant tests are dropped if their outcome can be inferred from prior results (Morciniec et al., 2016).

2. Dominant Methodological Paradigms

Set Cover and Greedy Heuristics

Early TCM research primarily models the problem as a set cover instance. For example, in circuit diagnosis, a test-fault matrix for all possible faults is constructed; a greedy algorithm selects tests maximizing the product of the number of zeros and the number of ones in each test row, yielding a reduced set of core patterns out of exponentially many possibilities (Thamarai et al., 2010). In state-based software, node/transition/path coverage is computed and minimized via a greedy hitting-set approach (Swain et al., 2012).
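One way to read the zeros-times-ones criterion is as a balance score: a test row that splits the faults evenly distinguishes the most fault pairs. The sketch below applies that score greedily within groups of not-yet-distinguished faults. The refinement loop, the function name `select_diagnostic_tests`, and the sample matrix are assumptions made for illustration and are not the procedure of the cited work.

```python
from typing import Dict, List, Tuple


def select_diagnostic_tests(matrix: Dict[str, Tuple[int, ...]]) -> List[str]:
    """matrix[test] is a row of 0/1 responses, one entry per fault."""
    n_faults = len(next(iter(matrix.values())))
    groups = [list(range(n_faults))]  # faults not yet told apart
    chosen: List[str] = []

    def score(test: str) -> int:
        # Sum over groups of (#ones * #zeros): the number of fault pairs
        # inside each group that this test would separate.
        row = matrix[test]
        total = 0
        for group in groups:
            ones = sum(row[f] for f in group)
            total += ones * (len(group) - ones)
        return total

    while any(len(g) > 1 for g in groups):
        best = max(sorted(matrix), key=score)
        if score(best) == 0:
            break  # remaining faults cannot be distinguished by any test
        chosen.append(best)
        row = matrix[best]
        # Split every group by the response of the chosen test.
        groups = [
            part
            for g in groups
            for part in ([f for f in g if row[f] == 0], [f for f in g if row[f] == 1])
            if part
        ]
    return chosen


if __name__ == "__main__":
    rows = {
        "p1": (0, 0, 1, 1),
        "p2": (0, 1, 0, 1),
        "p3": (1, 1, 1, 0),
        "p4": (0, 0, 0, 0),
    }
    print(select_diagnostic_tests(rows))
```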

Similarity-Based and Evolutionary Search

Similarity-based black-box TCM dispenses with code instrumentation and instead uses syntactic, lexical, or semantic representations of test code. Modern approaches embed test cases using pre-trained LLMs (CodeBERT, UniXcoder), compute pairwise similarities (cosine, Euclidean), and use genetic algorithms (GA) to select a diverse subset, typically by minimizing intra-set similarity (Pan et al., 2023). AST-based similarity with evolutionary search (ATM) leverages multi-view structural measures (top-down and bottom-up common subtrees, edit distance) within GA or NSGA-II optimization (Pan et al., 2022).
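The fitness that such a search minimizes can be sketched as the mean pairwise cosine similarity of the selected tests. In the example below, an exhaustive search over subsets of size `k` stands in for the genetic algorithm, which is only feasible for toy inputs. The embedding vectors are made-up placeholders rather than real CodeBERT or UniXcoder outputs, and all function names are illustrative.

```python
import math
from itertools import combinations
from typing import Dict, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def intra_set_similarity(subset: Sequence[str], emb: Dict[str, Sequence[float]]) -> float:
    """Mean pairwise similarity inside a candidate subset (lower is more diverse)."""
    pairs = list(combinations(subset, 2))
    if not pairs:
        return 0.0
    return sum(cosine(emb[a], emb[b]) for a, b in pairs) / len(pairs)


def most_diverse_subset(emb: Dict[str, Sequence[float]], k: int) -> Tuple[str, ...]:
    """Exhaustive stand-in for the GA: try every subset of size k."""
    return min(
        combinations(sorted(emb), k),
        key=lambda s: intra_set_similarity(s, emb),
    )


if __name__ == "__main__":
    embeddings = {"a": (1.0, 0.0), "b": (0.9, 0.1), "c": (0.0, 1.0)}
    print(most_diverse_subset(embeddings, 2))
```

A GA replaces the exhaustive `min` with a population of candidate subsets that are recombined and mutated under the same fitness, which is what makes the approach usable on suites with thousands of tests.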

Change-Proneness and Temporal Risk

Change-proneness-based minimization (CTM, MCTM, TRTM) uses version-control history to guide test selection. At the class or method level, change extent or change frequency is used as a proxy for fault-proneness. TRTM further applies exponential temporal attenuation to capture recency: more recent code changes contribute more to risk scores, improving detection rates in dynamic code bases (Asif et al., 25 May 2026, Siam et al., 13 May 2026).
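Exponential temporal attenuation can be illustrated with a generic decay-weighted sum over a class's change history. The exact TRTM scoring formula is not reproduced here, so the form of the sum, the `decay` parameter, and the use of day-valued timestamps are assumptions chosen for illustration.

```python
import math
from typing import Iterable, Tuple


def temporal_risk(
    changes: Iterable[Tuple[float, float]],
    now: float,
    decay: float = 0.01,
) -> float:
    """Decay-weighted change score for one code unit.

    `changes` holds (day, extent) pairs, where `extent` could be lines
    changed or simply 1 per commit. Each change is weighted by
    exp(-decay * age), so recent changes dominate the score.
    """
    return sum(extent * math.exp(-decay * (now - day)) for day, extent in changes)


if __name__ == "__main__":
    # Same change size, different ages: the recent one scores higher.
    print(temporal_risk([(100, 10)], now=100))
    print(temporal_risk([(0, 10)], now=100))
```

Setting `decay` to zero recovers plain change extent or frequency, which makes the contrast with non-temporal change-proneness scoring easy to see.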

ILP, RL, and Quantum Annealing

Large-scale, multi-constraint TCM instances are formalized as ILPs with auxiliary binary variables, supporting composition of hard constraints (coverage, detection) and soft objectives (similarity/diversity) (Gu et al., 2024). When ILPs become intractable, reinforcement learning (e.g., PPO) is used to greedily build subsets, guided by bipartite embeddings and reward structures encoding coverage and diversity. Quantum-guided models formulate TCM as QUBO, supporting efficient annealing (classical or quantum) for fast, global search in large spaces (Zhang et al., 19 Nov 2025, Wang et al., 2023).
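To make the QUBO idea concrete, the sketch below encodes a simplified problem: choose exactly `k` tests while minimizing total pairwise similarity. The cardinality constraint is folded into the objective as a quadratic penalty, and a brute-force search stands in for the annealer. This is not the formulation of any cited paper; the penalty weight, the function names, and the similarity matrix are illustrative assumptions.

```python
from itertools import product
from typing import Dict, List, Tuple

Qubo = Dict[Tuple[int, int], float]


def build_qubo(sim: List[List[float]], k: int, penalty: float) -> Qubo:
    """QUBO for: minimize sum_{i<j} sim[i][j] x_i x_j + penalty * (sum_i x_i - k)^2.

    Expanding the penalty with x_i^2 = x_i gives linear terms
    penalty * (1 - 2k) and pairwise terms 2 * penalty (the constant
    penalty * k^2 is dropped because it does not affect the argmin).
    """
    n = len(sim)
    q: Qubo = {}
    for i in range(n):
        q[(i, i)] = penalty * (1 - 2 * k)
        for j in range(i + 1, n):
            q[(i, j)] = sim[i][j] + 2 * penalty
    return q


def energy(q: Qubo, x: Tuple[int, ...]) -> float:
    """Evaluate the QUBO objective for a binary assignment x."""
    return sum(w * x[i] * x[j] for (i, j), w in q.items())


def brute_force_minimum(q: Qubo, n: int) -> Tuple[int, ...]:
    """Exhaustive stand-in for an annealer; only viable for tiny n."""
    return min(product((0, 1), repeat=n), key=lambda x: energy(q, x))


if __name__ == "__main__":
    similarity = [
        [0.0, 0.9, 0.1],
        [0.9, 0.0, 0.2],
        [0.1, 0.2, 0.0],
    ]
    print(brute_force_minimum(build_qubo(similarity, k=2, penalty=10.0), 3))
```

The penalty weight must be large enough that violating the cardinality constraint always costs more than any similarity saving, which is one reason penalty tuning is a recurring concern in QUBO-based TCM.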

3. Aggregation, Scoring, and Selection Mechanisms

A unified structure in modern TCM research is the decoupling of (i) feature scoring and (ii) test-case aggregation/selection:

Feature Scoring: Each code unit is assigned a score measuring risk, change-proneness, or similarity to other units. CP and temporal risk models use event-based scoring derived from commit history, classifying code as more or less fault-prone (Asif et al., 25 May 2026).
Test Case Aggregation: For a test case $t$ that invokes a set of classes, the scores of those classes are aggregated via a statistical summary (arithmetic mean, geometric mean, harmonic mean, median) to derive a single test score (Asif et al., 25 May 2026, Siam et al., 13 May 2026).
Selection: Ranked test cases are selected up to a fixed budget (at most $B \cdot |T|$ tests), or until special constraints (e.g., coverage completeness) are met (Pan et al., 26 May 2025).
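The three stages above can be sketched in a few lines using Python's `statistics` module. The class scores, the test-to-class mapping, and the function name `rank_and_select` are hypothetical, and ties are broken by test name purely for determinism.

```python
import statistics
from typing import Callable, Dict, List, Sequence

AGGREGATORS: Dict[str, Callable[[Sequence[float]], float]] = {
    "arithmetic": statistics.mean,
    "geometric": statistics.geometric_mean,  # requires positive scores
    "harmonic": statistics.harmonic_mean,    # requires non-negative scores
    "median": statistics.median,
}


def rank_and_select(
    test_to_classes: Dict[str, List[str]],
    class_score: Dict[str, float],
    budget: float,
    how: str = "geometric",
) -> List[str]:
    """Aggregate per-class scores into test scores and keep the top tests.

    `budget` is the fraction B in (0, 1]; at most floor(B * |T|) tests
    are returned, highest score first.
    """
    aggregate = AGGREGATORS[how]
    test_score = {
        test: aggregate([class_score[c] for c in classes])
        for test, classes in test_to_classes.items()
    }
    ranked = sorted(test_score, key=lambda t: (-test_score[t], t))
    return ranked[: int(budget * len(ranked))]


if __name__ == "__main__":
    invoked = {"t1": ["A", "B"], "t2": ["C"], "t3": ["A"], "t4": ["B", "C"]}
    scores = {"A": 4.0, "B": 1.0, "C": 3.0}
    print(rank_and_select(invoked, scores, budget=0.5))
```

Swapping the `how` argument changes only the aggregation stage, which mirrors the decoupling of scoring from selection described above.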

Genetic algorithms and ILP/RL-based search are similarly parameterized by diversity or central tendency of the test feature scores, with selection and mutation operators specifically designed to enforce budget, diversity, and constraint satisfaction (Pan et al., 2023, Gu et al., 2024, Pan et al., 26 May 2025).

4. Evaluation Metrics and Empirical Results

TCM effectiveness is reported using multiple domain-specific metrics:

Accuracy: $\text{Accuracy} = |F_{T'}| / |F_T|$, where $F_{T'}$ is the set of fault-revealing tests retained and $F_T$ is the set of fault-revealing tests in the original suite (Asif et al., 25 May 2026, Siam et al., 13 May 2026).
Fault Detection Rate (FDR): $\text{FDR} = \frac{1}{|V|} \sum_{v \in V} d_v$, where $d_v = 1$ if any fault-revealing test is preserved in the reduced suite for version $v$ and $d_v = 0$ otherwise (Asif et al., 25 May 2026, Siam et al., 13 May 2026, Pan et al., 2023).
Coverage Retention: Fraction of statements/requirements/edges/nodes still covered.
Reduction Ratio: Proportion of tests eliminated; reduction up to 99% in circuit TCM is achievable (Thamarai et al., 2010).
Cost/Runtime: Minimization time per version, scalability wrt test-suite size or commit history; RL/ILP and QUBO-based methods yield substantial improvements in large-scale settings (Gu et al., 2024, Wang et al., 2023, Zhang et al., 19 Nov 2025).
Statistical Significance: Nonparametric tests (Wilcoxon, Fisher’s exact test) and effect sizes (odds ratios, Cliff’s $\delta$) are used for controlled comparisons (Asif et al., 25 May 2026, Siam et al., 13 May 2026).
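The first two metrics can be computed directly from sets of test names. The sketch below reads accuracy as the fraction of fault-revealing tests that survive minimization, and FDR as the fraction of versions for which at least one fault-revealing test survives. The function names and the sample data are illustrative.

```python
from typing import List, Set, Tuple


def accuracy(fault_revealing: Set[str], reduced: Set[str]) -> float:
    """Fraction of the original fault-revealing tests kept in the reduced suite."""
    return len(fault_revealing & reduced) / len(fault_revealing)


def fault_detection_rate(versions: List[Tuple[Set[str], Set[str]]]) -> float:
    """Fraction of versions whose reduced suite keeps a fault-revealing test.

    Each entry is (fault_revealing_tests, reduced_suite) for one version.
    """
    detected = sum(1 for revealing, reduced in versions if revealing & reduced)
    return detected / len(versions)


if __name__ == "__main__":
    print(accuracy({"a", "b"}, {"a", "c"}))
    print(fault_detection_rate([({"a"}, {"a", "x"}), ({"b"}, {"c"})]))
```

FDR is the more forgiving of the two, since a version counts as detected as long as one revealing test remains, which is consistent with FDR being at least as high as accuracy in the table that follows.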

Table: Representative results at 50% minimization budget (Accuracy / FDR / Runtime):

Method	Accuracy	FDR	Time (min/version)
TRTM	0.72	0.75	0.82
CTM	0.66	0.69	1.04
MCTM	0.93	0.94	0.98
ATM	0.67	0.81	45.8
LTM	0.71	0.84	2.60

These empirical results are consistent across multi-project, multi-version studies on Defects4J and related datasets (Asif et al., 25 May 2026, Siam et al., 13 May 2026, Pan et al., 2023, Gu et al., 2024).

5. Specialized Domains and Extensions

Requirement-Based and Natural-Language TCM

RTM explicitly addresses requirement-traceable, natural-language test suites. Minimization enforces strict requirement-coverage constraints, employing vector-space embeddings (TF–IDF, USE, LongT5, Titan) and multiple GA initialization heuristics; the most effective configurations achieved an FDR of 0.86 at 50% budget with TF–IDF and cosine similarity. The effect of redundancy level is analyzed as a critical determinant of achievable FDR (Pan et al., 26 May 2025).

Logical, Deductive, and Specification-Oriented TCM

Logical TCM derives full inter-test inferences by quantifier elimination and resolution on Boolean encodings of requirements and test-case outcomes, supporting the automatic construction of test plans that maximize drop rate via optimal ordering (satisfying the maximum number of inferred redundancies per user expectation). Optimality holds in the sense of Horn-clause implication (Morciniec et al., 2016).

For specification narrowing, TCM is cast as the problem of synthesizing a minimal distinguishing test suite for candidate specifications; an optimal PM-SAT encoding minimizes suite cardinality, but its constraint count grows with the number of candidate specifications, so it scales only to moderate numbers of candidates (dozens) (Cunha et al., 24 Nov 2025).

Test-Case Reduction for Debugging and Proof Assistants

Delta debugging (ddmin), hierarchical delta debugging (HDD), and hoisting variants support syntactic and tree-structured test-case reduction, applicable to failure-inducing programs and proof scripts. Extensions leveraging syntactic/semantic models accelerate reduction (28–57% faster) while preserving near-optimal minimal size (Gharachorlu et al., 2021, Vince et al., 2021, Gross et al., 2022).
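For reference, a simplified ddmin over a flat list can be sketched as follows. It tests each chunk and its complement in turn and refines granularity when neither fails, which is a common compact variant rather than a line-by-line transcription of the original algorithm. The predicate `fails` is a hypothetical oracle that returns True when the candidate input still triggers the failure.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def ddmin(items: Sequence[T], fails: Callable[[List[T]], bool]) -> List[T]:
    """Shrink a failure-inducing input to a 1-minimal failing sublist."""
    current = list(items)
    assert fails(current), "the starting input must reproduce the failure"
    n = 2  # current number of chunks (granularity)
    while len(current) >= 2:
        chunk = max(1, len(current) // n)
        subsets = [current[i:i + chunk] for i in range(0, len(current), chunk)]
        reduced = False
        for idx, subset in enumerate(subsets):
            complement = [x for j, s in enumerate(subsets) if j != idx for x in s]
            if fails(subset):
                # A single chunk suffices: restart at coarse granularity.
                current, n, reduced = subset, 2, True
                break
            if fails(complement):
                # Dropping this chunk keeps the failure.
                current, n, reduced = complement, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(current):
                break  # already at single-element granularity
            n = min(len(current), n * 2)
    return current


if __name__ == "__main__":
    # The failure needs both 3 and 6 to be present.
    print(ddmin(list(range(8)), lambda s: 3 in s and 6 in s))
```

Hierarchical variants apply the same loop level by level on a syntax tree instead of a flat list, which avoids generating syntactically invalid candidates.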

Domain-specific minimizers for gate-level netlists (DAGs) employ structure-aware reductions (e.g., RemovePI, RemovePO, SubstituteGate) to eliminate superfluous subcircuits, outperforming ddmin in both oracle-call count and runtime (Lee et al., 2022).

Quantum and Reinforcement-Learning-Based Optimization

Quantum annealing (QA) and RL accelerate large TCM instances via QUBO formulations and maskable-PPO agents, respectively, maintaining constraint satisfaction and optimizing diversity at scale. RL agents trained on bipartite embeddings and reward functions matching the ILP objective scale in runtime to large numbers of tests and edges (Gu et al., 2024), while bootstrap QUBO/QA delivers quantum-classical co-optimization for minimal, high-quality suites, especially for AI-generated test regimes (Wang et al., 2023, Zhang et al., 19 Nov 2025).

6. Limitations, Threats to Validity, and Future Directions

Current TCM techniques are subject to the following limitations:

Construct Validity: Version-control or Git-derived metrics may include non-functional code changes (refactors, formatting) that do not map directly to increased fault-proneness (Siam et al., 13 May 2026). Defects4J and similar regimes contain single-fault benchmarks, while multi-fault and dynamic-bug settings pose unresolved challenges.
Internal Validity: Static dependency analysis (call graphs) may miss dynamic behaviors (reflection, late binding, dependency injection) (Asif et al., 25 May 2026). Genetic algorithms are stochastic over runs; reproducibility requires averaging, and results for LLM-based/embedding-based methods depend on tokenization and model vocabulary (Pan et al., 2023).
External Validity: Most current empirical results are for Java and Defects4J; cross-language generalization, impact on industrial, multi-fault repositories, and adaptation to other platforms (e.g., Python, C#, safety-critical systems) are explicitly named as future work (Asif et al., 25 May 2026, Siam et al., 13 May 2026, Pan et al., 26 May 2025).
Computational Scaling: ILP/PM-SAT optimal methods are exponential in the number of candidates or test cases, limiting practical use to moderate instances (Cunha et al., 24 Nov 2025). As suite size or constraint complexity grows, RL and quantum-based solvers become more attractive, but current quantum hardware is constrained in coupler connectivity and noise (Wang et al., 2023, Zhang et al., 19 Nov 2025).
Aggregation Instability: Certain statistical aggregators (e.g., Harmonic Mean in black-box risk scoring) show instability across projects; geometric mean is reported as the most robust across empirical regimes (Asif et al., 25 May 2026).

Ongoing research addresses:

Adaptive, project- or context-specific temporal parameters in risk modeling (Asif et al., 25 May 2026).
Statement- or block-level change-proneness and hybrid models fusing risk with similarity (Siam et al., 13 May 2026, Pan et al., 2023).
Multi-criteria, multi-objective TCM incorporating requirements, functional, and non-functional objectives (Gu et al., 2024, Pan et al., 26 May 2025).
Integration of dynamic coverage information and cross-repository scaling.
Automatic parameter tuning (e.g., penalty-weights in QUBOs, embedding hyperparameters in RL agents) and further acceleration with sparse data structures.

7. Summary Table: Leading Approaches and Empirical Characteristics

Approach	Highlights	Median/Mean FDR	Time per version	Key Strengths
TRTM (Temporal risk, black-box, call-graph)	Temporal decay, class-level change, black-box	0.75	0.82 min	Scalable, 20% faster than CTM
MCTM (Method change-proneness, call-graph)	Method-level, static analysis	0.94	0.98 min	High FDR, scalable
ATM (AST/GA, code similarity)	AST similarity, GA	0.81	45.8 min	Precise, slow
LTM (LLM embedding, GA, black-box)	Code LMs, cosine div.	0.84	2.60 min	Fast, LLM-based, scalable
RTM (requirement-nlp/GAs, coverage const.)	NL test, TF–IDF/GA, coverage	0.86 (50% budget)	-	Coverage & diversity
TripRL (ILP + RL, multi-crit. coverage/sim.)	ILP, RL for scale, embeddings, mutation score	1.00 (coverage/fault)	<47 min (large suite)	Linear scaling, high diversity
BootQA (QUBO, quantum/classical, real props)	Bootstrap, QPU, multi-metric	-	0.17–3.66 s	Large-scale, hardware accel.

Empirical detail: TRTM, MCTM, LTM, and ATM were all evaluated on Defects4J/Java; RTM on industrial NL test suites; TripRL scales to large test suites while maintaining 100% coverage and fault retention (Asif et al., 25 May 2026, Siam et al., 13 May 2026, Pan et al., 2023, Pan et al., 26 May 2025, Gu et al., 2024).

Test Case Minimization is a data- and constraint-driven optimization problem with efficient practical algorithms now available for a range of settings: coverage-driven, requirement-annotated, or code-similarity-based; time-aware, history-aware, or risk-aware selection; and scalable embedding-based or quantum/reinforcement hybrid approaches for large test regimes. Integration of structural, historical, and semantic features remains active research, as does scaling for massive test suites, multi-fault coverage, and increasingly complex software supply chains.