Cost-optimal Sequential Testing via Doubly Robust Q-learning

Published 13 Apr 2026 in stat.ML, cs.AI, cs.LG, and math.ST | (2604.11165v2)

Abstract: Clinical decision-making often involves selecting tests that are costly, invasive, or time-consuming, motivating individualized, sequential strategies for what to measure and when to stop ascertaining. We study the problem of learning cost-optimal sequential decision policies from retrospective data, where test availability depends on prior results, inducing informative missingness. Under a sequential missing-at-random mechanism, we develop a doubly robust Q-learning framework for estimating optimal policies. The method introduces path-specific inverse probability weights that account for heterogeneous test trajectories and satisfy a normalization property conditional on the observed history. By combining these weights with auxiliary contrast models, we construct orthogonal pseudo-outcomes that enable unbiased policy learning when either the acquisition model or the contrast model is correctly specified. We establish oracle inequalities for the stage-wise contrast estimators, along with convergence rates, regret bounds, and misclassification rates for the learned policy. Simulations demonstrate improved cost-adjusted performance over weighted and complete-case baselines, and an application to a prostate cancer cohort study illustrates how the method reduces testing cost without compromising predictive accuracy.


Authors (6)

Summary

The paper presents COST-Q, a framework for cost-optimal sequential testing that uses doubly robust Q-learning to achieve unbiased policy estimation even under model misspecification.
It combines doubly robust pseudo-outcomes, path-specific inverse probability weights, and backward induction to balance predictive accuracy against testing cost in adaptive clinical settings.
Simulation and real-world results show reduced predictive loss and enhanced specificity, demonstrating COST-Q’s effectiveness in personalized diagnostic pathways.

Cost-Optimal Sequential Testing via Doubly Robust Q-learning: A Technical Synthesis

Introduction

This paper presents COST-Q ("Cost-optimal Sequential Testing via Doubly Robust Q-learning"), a framework for personalized, history-dependent, cost-sensitive sequential test selection. It focuses on retrospective clinical datasets where missingness arises adaptively from prior clinician decisions. The key innovation is a doubly robust Q-learning formulation: policy estimation remains unbiased as long as at least one of the two nuisance models, the acquisition propensity model or the contrast model, is correctly specified. The framework combines path-specific inverse probability weighting (IPW) with orthogonal pseudo-outcomes, yielding theoretical guarantees and improved performance in simulations and real-world data.

Problem Formulation and Statistical Framework

The sequential decision process involves an initial baseline feature block $X_0$, which is always available at no cost, and $M$ optional test blocks. Actions are represented as sequences $(S_1, \ldots, S_M)$, and the valid decision space respects constraints (e.g., tests cannot be repeated). Each information state $s$ corresponds to a terminal feature set and a cumulative cost $C_s$.
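To make the state space concrete, here is a minimal sketch that enumerates ordered acquisition paths under the "no repeated test" constraint and attaches a cumulative cost to each. The test names and cost values are hypothetical and chosen only for illustration; the paper's actual constraint set may be more restrictive.

```python
from itertools import permutations

def enumerate_paths(test_costs):
    """Map every ordered acquisition path (no test repeated) to its cumulative cost.

    The empty path () is the state where only the baseline block X_0 is used.
    """
    tests = sorted(test_costs)
    paths = {}
    for length in range(len(tests) + 1):
        for path in permutations(tests, length):
            paths[path] = sum(test_costs[t] for t in path)
    return paths

# Hypothetical costs for M = 2 optional tests.
costs = {"blood": 1.0, "urine": 2.0}
paths = enumerate_paths(costs)
for path, cost in paths.items():
    print(path, cost)
```

Paths that end with the same set of tests, such as `("blood", "urine")` and `("urine", "blood")`, share the same terminal feature set and cumulative cost but differ in the history along which they were reached.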

Due to adaptive acquisition (e.g., additional tests are ordered based on prior results), missingness is informative and must be modeled explicitly. The analysis is built on a sequential Missing at Random (MAR) assumption: at each stage, the acquisition decision depends only on previously observed data. Formally, $S_j \perp (X_{-O_j}, Y) \mid \text{Observed History}$, where $O_j$ denotes the set of observed tests up to stage $j$.

The task is to learn a policy $\mathbf{d}$ that, at each stage, selects the next test (or terminates) to minimize the expected sum of predictive loss plus cumulative acquisition cost. This is formalized via stage-wise cost-augmented loss and Bellman recursion, with the optimal policy characterized by contrast functions capturing the expected utility of acquiring further information.
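The recursion can be illustrated with a population-level toy in which the expected predictive loss of stopping at each information state is assumed known. This is a simplification: the actual method conditions on each patient's observed history and estimates these quantities from data. All numbers and names below are hypothetical.

```python
from functools import lru_cache

def make_solver(stop_loss, test_costs):
    """Return value(state) -> (optimal expected loss-plus-future-cost, best action).

    stop_loss[state] is the expected predictive loss if prediction is made at
    `state` (a frozenset of acquired tests); test_costs[t] is the cost of test t.
    """
    all_tests = frozenset(test_costs)

    @lru_cache(maxsize=None)
    def value(state):
        best_val, best_act = stop_loss[state], "stop"
        for t in sorted(all_tests - state):
            # Continuation value: pay for test t, then act optimally afterwards.
            cont = test_costs[t] + value(state | {t})[0]
            # The stage contrast is stop_loss[state] - cont; acquire when positive.
            if cont < best_val:
                best_val, best_act = cont, t
        return best_val, best_act

    return value

fs = frozenset
stop_loss = {fs(): 0.50, fs({"a"}): 0.30, fs({"b"}): 0.32, fs({"a", "b"}): 0.25}
test_costs = {"a": 0.10, "b": 0.10}
value = make_solver(stop_loss, test_costs)
root_value, root_action = value(fs())
after_a_value, after_a_action = value(fs({"a"}))
print(root_action, round(root_value, 2), after_a_action)
```

In this toy, acquiring test `a` first is worthwhile, but the further gain from `b` does not cover its cost, so the optimal path stops after `a`.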

Doubly Robust Q-learning Algorithm

COST-Q operationalizes policy learning via a backward, stage-wise Q-learning procedure built upon doubly robust pseudo-outcomes. At each stage:

Pseudo-outcomes are constructed using path-specific inverse probability weights (from the estimated acquisition model) combined with auxiliary contrast predictions. These have the form $\Phi = \Delta(Z) + w(V; \pi)[T - \Delta(Z)]$, where $Z$ is the conditioning variable, $T$ is the partial label, and $w(V; \pi)$ is the normalized path-specific importance weight.
Double robustness is ensured: the estimator is unbiased if either the acquisition (propensity) model or the contrast model is correctly specified.
Cross-fitting is used to avoid overfitting and bias, leveraging $K$-fold sample splits so that nuisance parameter estimation is conducted independently of pseudo-outcome construction.
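The double robustness property can be checked numerically in a single-stage toy. The sketch below uses a simple inverse-probability weight (acquired indicator divided by the propensity, which has conditional mean one when the propensity is correct) in place of the paper's path-specific weight. The data-generating process, the function names, and the deliberately wrong models are all assumptions made for illustration.

```python
import random

def dr_pseudo_outcome(delta_hat, weight, partial_label):
    """Phi = Delta(Z) + w * (T - Delta(Z)); the label only matters when w > 0."""
    return delta_hat + weight * (partial_label - delta_hat)

def mean_pseudo_outcome(n, delta_model, propensity_model, seed=0):
    """Average pseudo-outcome over n simulated subjects (toy single-stage setup)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        z = rng.random()                      # conditioning variable Z ~ U(0, 1)
        acquired = rng.random() < 0.3 + 0.5 * z   # true acquisition mechanism
        # The partial label T is only observed when the test was acquired.
        t = 1.0 + 2.0 * z + rng.gauss(0.0, 1.0) if acquired else 0.0
        w = 1.0 / propensity_model(z) if acquired else 0.0
        total += dr_pseudo_outcome(delta_model(z), w, t)
    return total / n

true_delta = lambda z: 1.0 + 2.0 * z   # true contrast, so E[Delta(Z)] = 2
true_pi = lambda z: 0.3 + 0.5 * z      # true propensity
wrong_delta = lambda z: 0.0            # misspecified contrast model
wrong_pi = lambda z: 0.5               # misspecified propensity model

n = 20000
est_pi_ok = mean_pseudo_outcome(n, wrong_delta, true_pi)
est_delta_ok = mean_pseudo_outcome(n, true_delta, wrong_pi)
est_both_wrong = mean_pseudo_outcome(n, wrong_delta, wrong_pi)
print(est_pi_ok, est_delta_ok, est_both_wrong)
```

With either nuisance model correct, the average pseudo-outcome should land near the true mean contrast of 2. With both wrong, the estimate is biased, which is the failure mode double robustness does not protect against.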

Backward induction proceeds from the final stage (full data acquired) to the initial state. At each step, doubly robust pseudo-outcomes are constructed and then regressed to estimate the relevant conditional contrasts. The theoretical advantage is consistent estimation of stage-wise contrasts and the associated policy rules despite complex, data-dependent missingness.
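Within each backward step, the pseudo-outcomes fed to the regression are built from nuisance models fit on other folds. A minimal sketch of that cross-fitting loop follows; the helper names, the row layout `(z, acquired, t)`, and the constant nuisance models are hypothetical stand-ins for whatever learners are actually used.

```python
import random

def cross_fit(rows, fit_nuisance, make_pseudo_outcome, k=5, seed=0):
    """Build one pseudo-outcome per row using nuisances fit on the other folds."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    pseudo = [None] * len(rows)
    for held_out in folds:
        held = set(held_out)
        training_rows = [rows[i] for i in order if i not in held]
        nuisance = fit_nuisance(training_rows)   # never sees held-out rows
        for i in held_out:
            pseudo[i] = make_pseudo_outcome(rows[i], nuisance)
    return pseudo

def fit_constant_nuisance(training_rows):
    """Toy nuisances: a constant propensity and a constant contrast prediction."""
    observed = [t for (_, acquired, t) in training_rows if acquired]
    pi_hat = len(observed) / len(training_rows)
    delta_hat = sum(observed) / len(observed)
    return pi_hat, delta_hat

def pseudo_from_row(row, nuisance):
    """Phi = Delta + w * (T - Delta) with w = acquired / pi_hat."""
    _, acquired, t = row
    pi_hat, delta_hat = nuisance
    w = 1.0 / pi_hat if acquired else 0.0
    return delta_hat + w * (t - delta_hat)

rows = [(0.1, True, 1.2), (0.4, False, 0.0), (0.5, True, 2.1),
        (0.7, True, 2.4), (0.8, False, 0.0), (0.9, True, 2.9)]
pseudo = cross_fit(rows, fit_constant_nuisance, pseudo_from_row, k=3)
print(pseudo)
```

The resulting list would then be regressed on the stage's conditioning variables to estimate the contrast, and the procedure repeated one stage earlier.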

Statistical Guarantees

Theoretical analysis provides nonasymptotic oracle inequalities for the contrast estimators, showing that estimation error is bounded by the sum of the oracle regression risk and a second-order nuisance estimation bias term. Specifically, the following properties are established:

Stage-wise double robustness: Consistency is achieved if either the pathwise propensities or contrast models converge at suitable rates.
Policy regret: The expected regret of the learned policy is bounded by a linear function of the aggregate stage-wise estimation error.
Misclassification bounds: Stage-wise misclassification probabilities are shown to be sublinear under margin conditions.

These rates hold for nonparametric or machine learning regressors as long as stability and cross-fitting are respected, and require only modest smoothness and positivity conditions.
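The link between contrast estimation error and regret can be seen in a single-stage simplification, which is a standard plug-in argument and not the paper's multi-stage statement. The symbols $H$ (observed history), $d^*(H) = \mathbf{1}\{\Delta(H) > 0\}$, and $\hat d(H) = \mathbf{1}\{\hat\Delta(H) > 0\}$ are introduced here for illustration only:

$$
\text{Regret}(\hat d) = \mathbb{E}\big[\,|\Delta(H)|\,\mathbf{1}\{\hat d(H) \neq d^*(H)\}\big] \le \mathbb{E}\big[\,|\hat\Delta(H) - \Delta(H)|\,\big].
$$

The inequality holds because the two rules disagree only when $\hat\Delta(H)$ and $\Delta(H)$ fall on opposite sides of zero, in which case $|\Delta(H)| \le |\hat\Delta(H) - \Delta(H)|$. Under a margin condition, disagreement is additionally confined to histories where $|\Delta(H)|$ is small, which is what allows sharper regret and misclassification rates.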

Simulation Experiments

Simulation studies evaluated COST-Q against BOWL, Only-Complete, and one-shot policies under both correct and misspecified nuisance models at varying sample sizes $n$.

Figure 1: COST-Q delivers superior or competitive loss performance under three nuisance-model scenarios in simulation Scenario 1, especially in moderate-to-large samples.

Figure 2: Simulation Scenario 2 further demonstrates robust performance by COST-Q under both correct and misspecified settings.

Key empirical findings include:

Under correct specification, COST-Q matches or outperforms benchmarks as the sample size $n$ increases.
Under nuisance misspecification (erroneous acquisition or contrast models), COST-Q's double robustness yields clear performance advantages (e.g., loss reduction, lower prediction error).
The benefits derive primarily from improved predictive accuracy rather than merely test cost reduction, indicating effective learning of the acquisition–prediction tradeoff.

Application to Prostate Cancer Diagnostic Pathways

COST-Q is applied to the NCI-EDRN PCA3 prostate cancer diagnostic cohort, where the objective is to deploy blood and urine biomarkers in conjunction with baseline clinical risk ($X_0$) for effective biopsy recommendations. Adaptive, sequential testing policies are compared on out-of-sample test cases using both cost-augmented loss and discrimination metrics.

COST-Q attains the lowest total loss among data-driven policies, with higher specificity (up to ~60%) at high recall thresholds relative to always-test-all or always-stop references.
COST-Q yields flexible, path-dependent testing strategies (some individuals stop early, others receive one or both expensive tests), dynamically adapting to the initial risk profile.
Figure 3: The distribution of terminal paths under COST-Q stratified by baseline risk, showing adaptive selection of blood and urine tests.

Tables from the manuscript corroborate these findings for prediction loss, average cost, AUC, specificity, and G-mean.
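For readers unfamiliar with the discrimination metrics, the sketch below computes specificity at a fixed recall target and the G-mean (the geometric mean of sensitivity and specificity). The scores and labels are made-up toy values, not data from the cohort, and the thresholding rule is one reasonable convention among several.

```python
import math

def specificity_at_recall(scores, labels, target_recall):
    """Return (threshold, recall, specificity, g_mean) at the highest threshold
    whose recall meets the target; a case is flagged when score >= threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < thr and y == 0)
        recall = tp / n_pos
        if recall >= target_recall:
            spec = tn / n_neg
            return thr, recall, spec, math.sqrt(recall * spec)
    raise ValueError("target recall is not attainable")

# Toy risk scores with binary outcomes (1 = disease present).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
thr, recall, spec, g_mean = specificity_at_recall(scores, labels, 0.75)
print(thr, recall, spec, g_mean)
```

Scanning thresholds from high to low returns the most specific operating point that still satisfies the recall requirement, since specificity can only fall as the threshold is lowered.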

Implications and Future Directions

COST-Q provides a rigorous solution for sequential, cost-sensitive feature/test selection under complex missingness in retrospective datasets. The main methodological implication is that double robustness and path-specific normalization are critical for unbiased, low-regret learning in clinical or other high-stakes settings where full data are rarely available. Practically, the approach is architecture-agnostic, can be optimized for explicit test budgets, and adapts to protocol constraints (e.g., permitted acquisition paths).

Open directions include:

Extending to patient- and context-specific cost structures.
Scaling to larger numbers of diagnostic actions.
Integrating noncoarsened, censored, or unknown missing data mechanisms typical of broader healthcare or RL domains.

Conclusion

COST-Q bridges causal inference, reinforcement learning, and cost-sensitive prediction for clinical decision support. By explicitly modeling the acquisition mechanism and leveraging doubly robust pseudo-outcomes with backward induction, the framework realizes individualized, cost-efficient testing strategies that reduce measurement burden while maintaining or enhancing predictive accuracy. This represents a substantial improvement for observational datasets where test availability is contingent on prior outcomes and decisions.
