
Online Learning of Optimal Sequential Testing Policies (2509.03707v1)

Published 3 Sep 2025 in cs.LG

Abstract: This paper studies an online learning problem that seeks optimal testing policies for a stream of subjects, each of whom can be evaluated through a sequence of candidate tests drawn from a common pool. We refer to this problem as the Online Testing Problem (OTP). Although conducting every candidate test for a subject provides more information, it is often preferable to select only a subset when tests are correlated and costly, and make decisions with partial information. If the joint distribution of test outcomes were known, the problem could be cast as a Markov Decision Process (MDP) and solved exactly. In practice, this distribution is unknown and must be learned online as subjects are tested. When a subject is not fully tested, the resulting missing data can bias estimates, making the problem fundamentally harder than standard episodic MDPs. We prove that the minimax regret must scale at least as $\Omega(T^{\frac{2}{3}})$, in contrast to the $\Theta(\sqrt{T})$ rate in episodic MDPs, revealing the difficulty introduced by missingness. This elevated lower bound is then matched by an Explore-Then-Commit algorithm whose cumulative regret is $\tilde{O}(T^{\frac{2}{3}})$ for both discrete and Gaussian distributions. To highlight the consequence of missingness-dependent rewards in OTP, we study a variant called the Online Cost-sensitive Maximum Entropy Sampling Problem, where rewards are independent of missing data. This structure enables an iterative-elimination algorithm that achieves $\tilde{O}(\sqrt{T})$ regret, breaking the $\Omega(T^{\frac{2}{3}})$ lower bound for OTP. Numerical results confirm our theory in both settings. Overall, this work deepens the understanding of the exploration--exploitation trade-off under missing data and guides the design of efficient sequential testing policies.
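The Explore-Then-Commit result mentioned in the abstract can be pictured as a simple two-phase loop: fully test roughly the first $T^{2/3}$ subjects so the joint outcome distribution can be estimated without missingness bias, then commit to the policy of the MDP built from that estimate for all remaining subjects. The sketch below is a hypothetical illustration under assumed interfaces (a subject object exposing run_all_tests, run_policy, and reward, plus placeholder estimate_joint_distribution and solve_mdp_policy helpers); it is not the paper's actual algorithm or code.

```python
import numpy as np

# Hypothetical sketch of an Explore-Then-Commit scheme for the Online
# Testing Problem (OTP). Helper functions and the subject interface are
# illustrative stand-ins, not the paper's implementation.

def estimate_joint_distribution(full_outcomes):
    """Placeholder: estimate the joint test-outcome distribution from fully
    tested subjects (e.g., empirical frequencies in the discrete case, or a
    sample mean and covariance in the Gaussian case)."""
    return np.asarray(full_outcomes)

def solve_mdp_policy(dist_hat):
    """Placeholder: solve the sequential-testing MDP induced by the estimated
    distribution (e.g., by backward induction over test histories)."""
    return {"distribution": dist_hat}

def explore_then_commit(subjects, T):
    """Fully test ~T^(2/3) subjects, then commit to the estimated MDP policy."""
    n_explore = int(np.ceil(T ** (2.0 / 3.0)))  # exploration length behind the T^(2/3) regret
    full_outcomes, total_reward = [], 0.0
    policy = None

    for t, subject in enumerate(subjects):
        if t < n_explore:
            # Exploration: run every candidate test so no outcomes are missing,
            # avoiding the estimation bias caused by partially tested subjects.
            outcomes = subject.run_all_tests()
            full_outcomes.append(outcomes)
            total_reward += subject.reward(outcomes)
        else:
            if policy is None:
                # Commit once: estimate the distribution, then fix the policy.
                policy = solve_mdp_policy(estimate_joint_distribution(full_outcomes))
            # Exploitation: select tests sequentially under the committed policy,
            # accepting partial information for the remaining subjects.
            outcomes = subject.run_policy(policy)
            total_reward += subject.reward(outcomes)

    return total_reward
```

The length of the exploration phase balances the cost of exhaustive testing against the bias-free estimate it yields; choosing it on the order of $T^{2/3}$ is what produces the $\tilde{O}(T^{2/3})$ cumulative regret cited in the abstract.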
