
Test Input Prioritization

Updated 27 December 2025
  • Test input prioritization is a systematic process that orders tests to maximize early fault detection using criteria like APFD and cost constraints.
  • It employs diverse methods including coverage-based, statistical, and machine learning approaches to optimize test execution order.
  • This strategy accelerates feedback in continuous integration and neural network validations, reducing time-to-failure and labeling costs.

Test input prioritization, or test case prioritization (TCP), is the methodological process of ordering test cases (or test inputs) such that those with the highest potential to expose faults, regressions, or crucial application behavior are executed first. The overarching objective is to maximize early feedback under constraints on time, budget, or labeling effort, particularly in large-scale regression testing, continuous integration (CI), and machine learning system validation. This article comprehensively presents the formal foundations, algorithmic methods, objective and evaluation metrics, representative classes of prioritization strategies across domains (software testing, neural network validation, cyber-physical systems), and current empirical best practices.

1. Formal Problem Definition and Objectives

Test input prioritization is typically formalized as follows: given a set of tests or inputs T = \{t_1, t_2, \ldots, t_n\} and a system under test, the goal is to find a permutation \pi of T such that a desired objective function is optimized. Canonical objectives include maximizing the Average Percentage of Faults Detected (APFD), minimizing the Average Time to First Failure, or optimizing position-weighted surrogate coverage or uncertainty criteria. In most instances, one seeks to observe faults, coverage, or uncertainty as early as possible in the execution order.

Typical formalizations include:

  • APFD:

\mathrm{APFD} = 1 - \frac{1}{nm} \sum_{i=1}^{m} TF_i + \frac{1}{2n}

where n is the number of test cases, m the number of faults, and TF_i the position in the ordering of the first test revealing fault i (a short computation sketch is given after this list).

  • Time- or cost-constrained variants, e.g., cost-aware APFD (APFD_c), which weights detection by execution cost.
  • Multi-objective functions: Incorporate combinations such as execution time, model coverage, and uncertainty exposure, as in UncerPrio's F_6(T) = (\mathrm{PET}(T), -\mathrm{PTR}(T), -\mathrm{AUM}(T), -\mathrm{ANU}(T)) (Zhang et al., 2023).
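
As a concrete illustration, the APFD formula above can be computed directly from a test ordering and a record of which tests reveal which faults. The following is a minimal sketch; the test names and fault mapping are hypothetical.

```python
def apfd(order, faults_detected_by):
    """Compute APFD for a given test ordering.

    order: list of test names in execution order (length n).
    faults_detected_by: dict mapping each fault id to the set of
        tests that reveal it (m faults in total).
    """
    n = len(order)
    m = len(faults_detected_by)
    position = {test: i + 1 for i, test in enumerate(order)}  # 1-based positions
    # TF_i: position of the first test in the ordering that reveals fault i
    tf_sum = sum(
        min(position[t] for t in revealing_tests)
        for revealing_tests in faults_detected_by.values()
    )
    return 1 - tf_sum / (n * m) + 1 / (2 * n)

# Hypothetical example: 4 tests, 2 faults
faults = {"f1": {"t3"}, "f2": {"t1", "t4"}}
print(apfd(["t1", "t2", "t3", "t4"], faults))  # f2 found at pos 1, f1 at pos 3 -> 0.625
print(apfd(["t3", "t1", "t2", "t4"], faults))  # f1 at pos 1, f2 at pos 2 -> 0.75
```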

2. Coverage-Based and Structural Prioritization

A foundational class of methods relies on program structural coverage data to guide prioritization. Algorithms are generally greedy:

  • Total and Additional Greedy Algorithms:
    • Total: Rank by number of code elements (statements, branches) covered (Beena et al., 2013).
    • Additional: Repeatedly select the test covering the most as-yet-uncovered elements (the "additional" coverage) (Li et al., 2022); a minimal sketch of this heuristic appears after this list.
    • Accelerated Greedy Additional (AGA) leverages optimized data structures (forward/inverse indices) and empirical iteration bounds to reduce prioritization complexity to O(kmn) with k small (Li et al., 2022).
  • Combinatorial Coverage (CCCP):

Considers not just single code units but λ-wise combinations of covered code units. At each step, the test contributing the most new λ-wise combinations is selected (Huang et al., 2020). Empirically, λ = 2 yields superior APFD over all baselines at a cost comparable to the greedy additional algorithm.

  • Feature-Oriented Prioritization:

For highly configurable systems (HCSs), tests are mapped statically to code features (e.g., via preprocessor directives). Test priority is proportional to the number of features covered, ensuring early detection of failures attributable to changed features (Mendonça et al., 2024).
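
Returning to the "additional" greedy heuristic referenced above, the strategy can be sketched in a few lines. The coverage data here are hypothetical; real implementations (e.g., AGA) add the index structures needed to scale to large suites.

```python
def additional_greedy(coverage):
    """Order tests by the 'additional' greedy heuristic.

    coverage: dict mapping test name -> set of covered code elements.
    Repeatedly picks the test covering the most not-yet-covered elements;
    when no remaining test adds new coverage, the covered set is reset so
    the remaining tests are still ordered (standard practice).
    """
    remaining = dict(coverage)
    covered, order = set(), []
    while remaining:
        gains = {t: len(remaining[t] - covered) for t in remaining}
        best = max(gains, key=gains.get)
        if gains[best] == 0:
            if covered:
                covered = set()          # reset once current coverage is saturated
                continue
            order.extend(remaining)      # tests covering nothing: append in any order
            break
        order.append(best)
        covered |= remaining.pop(best)
    return order

# Hypothetical statement coverage for four tests
cov = {"t1": {1, 2, 3}, "t2": {3, 4}, "t3": {1, 2}, "t4": {5}}
print(additional_greedy(cov))  # ['t1', 't2', 't4', 't3']
```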

3. Statistical, Heuristic, and Machine Learning Approaches

In modern practice, prioritization frameworks frequently use statistical histories, diversity heuristics, and machine-learned ranking models:

  • History-Based and Diversity-Based Methods:
    • History (HBTP): Tests weighted by recency/frequency of past failures with decaying weights. Even a single-build history improves APFD over random by 10–20 points (Haghighatkhah et al., 2018).
    • Diversity (DBTP): Use Normalized Compression Distance (NCD) or related metrics on test implementations. NCD-Multiset is optimal for "cold start" settings (no history), and hybrid schemes (history within diversity clusters) achieve up to 81% APFD (Haghighatkhah et al., 2018); a compression-based NCD sketch follows this list.
  • Neural and Metric Learning Models:
    • Embedding Networks: Learn vector representations of code files and tests; ranking is by inner-product similarity between embeddings for changed files and tests (Lousada et al., 2020). NNE-TCP achieves APTD ≃ 0.70, exceeding random or static baselines.
    • Regression DNNs: DeepOrder predicts real-valued priorities from test execution/failure history, outperforming reinforcement-learning-based (RETECS) and history-based pipelines on large-scale CI logs in both NAPFD and time-effectiveness (Sharif et al., 2021).
    • Reinforcement Learning: RETECS formulates prioritization as an MDP, learning from rewards informed by test verdict histories and durations, and achieves high NAPFD after relatively brief online training periods (Spieker et al., 2018).
  • Ensemble and Rank Aggregation:
    • EnTP aggregates multiple ranked lists from 16 standalone heuristics using diversity-based selection (top-75% by Kendall-tau distance) and then consensus (Borda, Kemeny-Young, harmonic/median) (Mondal et al., 2024).
    • This ensemble consensus outperforms all single heuristics by 2–4 points in APFD/APFD_c and is robust to code change heterogeneity.
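
To make the diversity-based idea concrete, the sketch below scores pairwise dissimilarity of test sources with the Normalized Compression Distance (using zlib compressed sizes) and produces a "cold start" ordering by greedily picking the test farthest from those already selected. The test contents are hypothetical, and this is a simplification of the NCD-Multiset variant used in the cited work.

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance approximated via zlib compressed sizes."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

def diversity_order(tests: dict) -> list:
    """Greedy farthest-first ordering by pairwise NCD (no failure history needed)."""
    names = list(tests)
    order = [names[0]]                       # arbitrary seed test
    rest = set(names[1:])
    while rest:
        # pick the test whose minimum distance to already-chosen tests is largest
        nxt = max(rest, key=lambda t: min(ncd(tests[t], tests[s]) for s in order))
        order.append(nxt)
        rest.remove(nxt)
    return order

# Hypothetical test implementations
suite = {
    "test_login": b"def test_login(): assert login('u', 'p')",
    "test_login_bad": b"def test_login_bad(): assert not login('u', 'x')",
    "test_report": b"def test_report(): assert render_report(data) == expected",
}
print(diversity_order(suite))
```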

4. Domain-Specific and Uncertainty-Aware Strategies

Complex domains and systems necessitate specialized prioritization criteria:

  • Uncertainty-Aware Multi-Objective TCP:
    • In cyber-physical system (CPS) testing, test inputs are annotated with uncertainty metadata. Multi-objective search algorithms (SPEA2, NSGA-II, etc.) optimize jointly for test cost, model coverage, and uncertainty measures such as the average uncertainty metric and the number of uncertainties observed (Zhang et al., 2023).
    • Empirically, jointly optimizing PET, PTR, AUM, and ANU yields the highest rate of early uncertainty observation.
  • Input Prioritization for ML/NNS:
    • For neural-network testing, input prioritization leverages internal "sentiment" metrics: confidence (softmax entropy), uncertainty (MC dropout), and surprise (distance-based activation adequacy) (Byun et al., 2019). All three substantially reduce the labeling cost required for early fault detection (APFD 75–95%).
    • Replication studies show that vanilla softmax, entropy, and Gini impurity (DeepGini) perform equivalently and dominate coverage-based methods for test input prioritization and active learning (Weiss et al., 2022); a minimal scoring sketch follows this list.
  • Prioritizing Adversarial Inputs (LBT):
    • Learning-Based Testing (LBT) trains a surrogate model to mimic the model under test. Mutation sensitivity, measured by surrogate prediction flips, prioritizes adversarial candidates; selection is determined via a sequential hypothesis test (Wald's SPRT) (Rahman et al., 28 Sep 2025). LBT consistently outperforms white-box and confidence/uncertainty-based baselines in early adversarial fault detection.
  • Graph Structured Data and GNNs (GraphRank):
    • For GNNs, GraphRank fuses model-aware (entropy, margin) and model-agnostic (node features, degree) attributes, aggregates over local neighborhoods, and prioritizes using an XGBoost ranker. Labeling is performed iteratively in rounds (Yang et al., 20 Dec 2025). This approach improves average test relative coverage by 2–7 percentage points over the best alternative baselines.
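
A minimal sketch of the confidence-style metrics discussed above: given a model's softmax outputs on unlabeled inputs, predictive entropy and Gini impurity (as in DeepGini) yield per-input scores, and inputs are prioritized in descending score order. The probabilities below are hypothetical.

```python
import numpy as np

def entropy_scores(probs: np.ndarray) -> np.ndarray:
    """Predictive entropy per input; higher means a less confident prediction."""
    eps = 1e-12
    return -np.sum(probs * np.log(probs + eps), axis=1)

def gini_scores(probs: np.ndarray) -> np.ndarray:
    """DeepGini-style impurity 1 - sum(p_c^2); higher means less confident."""
    return 1.0 - np.sum(probs ** 2, axis=1)

def prioritize(probs: np.ndarray, metric=gini_scores) -> np.ndarray:
    """Return input indices ordered from least to most confident prediction."""
    return np.argsort(-metric(probs))

# Hypothetical softmax outputs for 3 inputs over 4 classes
p = np.array([
    [0.97, 0.01, 0.01, 0.01],   # confident
    [0.40, 0.35, 0.15, 0.10],   # ambiguous -> ranked first
    [0.70, 0.20, 0.05, 0.05],
])
print(prioritize(p))  # [1 2 0]
```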

5. Specialized Prioritization: Process, Model-Based, and Fuzzy Logic Approaches

  • Model-Based Process Test Prioritization (PPT):
    • Given a weighted directed multigraph with priorities ("business value") on edges, the Prioritized Process Test algorithm generates test sets that maximize early coverage of high-priority transitions under a user-specified coverage depth (TDL, test-depth-level) (Bures et al., 2019); a simplified path-selection sketch follows this list.
  • Fuzzy Inference Systems:
    • Fuzzy logic TCP incorporates linguistic variables (e.g., execution time, failure rate) and expert-derived rules, producing priority scores for each test via Mamdani inference and centroid defuzzification. Heuristics such as “promotion” for recently updated tests are applied post-hoc (Karatayev et al., 2024). Experiments confirm equivalence to manual expert ordering for unique defect detection.
  • Genetic Algorithms for Scenario-Based TCP:
    • Paths from UML activity or state-chart diagrams are encoded as chromosomes. Fitness combines node information-flow metrics and stack-based weights, focusing GA search on the most structurally complex or deeply embedded control paths (Sharma et al., 2014).
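
To illustrate the flavor of priority-driven path selection referenced above (this is not the published PPT algorithm, only a simplified sketch under assumed data structures), the code below enumerates start-to-end paths in a hypothetical process graph and greedily selects paths so that high-priority, not-yet-covered transitions are exercised early.

```python
def all_paths(graph, start, end, seen=None):
    """Enumerate simple start->end paths as tuples of edge ids (small graphs only).

    graph: dict mapping node -> list of (next_node, edge_id) pairs.
    """
    seen = seen or {start}
    if start == end:
        yield ()
        return
    for nxt, edge_id in graph.get(start, []):
        if nxt in seen:
            continue
        for tail in all_paths(graph, nxt, end, seen | {nxt}):
            yield (edge_id,) + tail

def prioritized_paths(graph, priority, start, end):
    """Greedily pick paths by total priority of still-uncovered edges (set-cover style)."""
    candidates = list(all_paths(graph, start, end))
    covered, selected = set(), []
    while True:
        gains = [sum(priority[e] for e in p if e not in covered) for p in candidates]
        if not gains or max(gains) == 0:
            return selected
        best = candidates[gains.index(max(gains))]
        selected.append(best)
        covered.update(best)

# Hypothetical order-processing workflow: edges labelled with business-value priorities
graph = {
    "start": [("pay", "e_checkout")],
    "pay": [("done", "e_card"), ("done", "e_invoice")],
}
priority = {"e_checkout": 1, "e_card": 5, "e_invoice": 2}
for p in prioritized_paths(graph, priority, "start", "done"):
    print(p)
# -> ('e_checkout', 'e_card') first, then ('e_checkout', 'e_invoice')
```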

6. Lightweight and Language-Specific Prioritization Schemes

  • Static Lexical IR-Based TCP:
    • In dynamic languages (e.g., Python), static analysis or coverage collection may be infeasible. Instead, IR-style methods (TF-IDF, BM25) score lexical similarity between the change region and test code (Mattis et al., 2020); empirically weighted schemes (TF-PREC) further boost effectiveness. These methods cut time-to-failure by 2–10× versus random order and require only milliseconds for typical regression suites; a TF-IDF similarity sketch is shown after this list.
  • Application Context and Tool Support:
    • Many TCP algorithms are designed for seamless CI pipeline integration, often with support for time or resource budgets, streaming code changes (enabling online or batch adaptation), and parallel execution groups (EnTP consensus) (Mondal et al., 2024).
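
The lexical idea above can be sketched with an off-the-shelf TF-IDF vectorizer: the changed code region is compared against each test's source, and tests are ordered by descending cosine similarity. The file contents below are hypothetical; the cited work additionally explores BM25 and empirically tuned weightings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def lexical_order(change_text: str, test_sources: dict) -> list:
    """Rank tests by TF-IDF cosine similarity to the changed code region."""
    names = list(test_sources)
    corpus = [change_text] + [test_sources[n] for n in names]
    tfidf = TfidfVectorizer(token_pattern=r"[A-Za-z_]\w+").fit_transform(corpus)
    sims = cosine_similarity(tfidf[0], tfidf[1:]).ravel()
    return [n for _, n in sorted(zip(sims, names), reverse=True)]

# Hypothetical change diff and test bodies
change = "def parse_config(path): return yaml.safe_load(open(path))"
tests = {
    "test_parse_config": "def test_parse_config(): assert parse_config('a.yml')",
    "test_render_html": "def test_render_html(): assert render_html(page)",
}
print(lexical_order(change, tests))  # test_parse_config ranked first
```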

7. Evaluation Methodologies and Empirical Findings

Prioritization methods are benchmarked empirically using metrics such as APFD/APFD_c or the Average Percentage of Transition Detection (APTD) in CI settings. Empirical analysis spans application domains (compiled, script-based, ML, cyber-physical) and subject sizes from hundreds to millions of tests (see NNE-TCP's 4,000-test industrial case, or GraphRank's million-node GNN benchmarks).

Key results:

  • Accelerated greedy/greedy-additional, combinatorial coverage, and ensemble rank aggregation remain dominant for traditional regression testing.
  • Softmax-based uncertainty metrics are overwhelmingly effective and computationally superior for neural-input prioritization, routinely outperforming neuron/surprise-coverage methods.
  • Model-driven or learning-based methods (DeepOrder, GraphRank, LBT) unlock advances for CI-scale or ML system settings, supporting incremental adaptation, non-trivial histories, and expensive oracles.

References

  • (Beena et al., 2013) Code Coverage Based Test Case Selection and Prioritization
  • (Li et al., 2022) AGA: An Accelerated Greedy Additional Algorithm for Test Case Prioritization
  • (Sharif et al., 2021) DeepOrder: Deep Learning for Test Case Prioritization in Continuous Integration Testing
  • (Zhang et al., 2023) Uncertainty-Aware Test Prioritization: Approaches and Empirical Evaluation
  • (Yang et al., 20 Dec 2025) Toward Efficient Testing of Graph Neural Networks via Test Input Prioritization
  • (Lousada et al., 2020) Neural Network Embeddings for Test Case Prioritization
  • (Haghighatkhah et al., 2018) Test Prioritization in Continuous Integration Environments
  • (Byun et al., 2019) Input Prioritization for Testing Neural Networks
  • (Weiss et al., 2022) Simple Techniques Work Surprisingly Well for Neural Network Test Prioritization and Active Learning (Replicability Study)
  • (Huang et al., 2020) Regression Test Case Prioritization by Code Combinations Coverage
  • (Bures et al., 2019) Prioritized Process Test: An Alternative to Current Process Testing Strategies
  • (Mondal et al., 2024) On Rank Aggregating Test Prioritizations
  • (Rahman et al., 28 Sep 2025) Learning-Based Testing for Deep Learning: Enhancing Model Robustness with Adversarial Input Prioritization
  • (Mendonça et al., 2024) Feature-oriented Test Case Selection and Prioritization During the Evolution of Highly-Configurable Systems
  • (Karatayev et al., 2024) Fuzzy Inference System for Test Case Prioritization in Software Testing
  • (Sharma et al., 2014) Applying Genetic Algorithm for Prioritization of Test Case Scenarios Derived from UML Diagrams
  • (Mattis et al., 2020) Lightweight Lexical Test Prioritization for Immediate Feedback

Test input prioritization is a rigorously formalized and diversely instantiated research domain. While coverage-based greedy methods and, for neural models, uncertainty-based metrics remain the most empirically validated approaches, ongoing research focuses on domain-specific tailoring, ensemble aggregation, and sample- and label-efficient active prioritization under the constraints imposed by modern software development and the deployment of increasingly opaque ML systems.
