Derandomization regret bounds for PROWL

Establish finite-sample upper bounds on the derandomization regret V*(d_Q) − V*(T(Q)) within the PAC-Bayesian Reward-Certified Outcome Weighted Learning (PROWL) framework, where d_Q denotes the randomized Gibbs policy induced by a posterior Q and T(Q) is a deterministic policy derived from Q. Specifically, extend margin-based and disintegrated PAC-Bayesian techniques to value-based objectives with inverse-propensity weighting and joint nuisance learning in order to quantify the performance gap incurred when replacing the randomized Gibbs policy by a deterministic rule.
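
To fix ideas, the following is one plausible instantiation of these objects for binary treatments $A \in \{-1,+1\}$; the notation is assumed here, not taken from the source. Sampling a decision function $h$ from the posterior $Q$ and acting with $h(X)$ gives the Gibbs policy, and one natural deterministic rule is the sign of the posterior-mean decision function:
\[
V^\ast(d_Q) = \mathbb{E}_{h \sim Q}\big[V^\ast(h)\big],
\qquad
T(Q)(x) = \operatorname{sign}\big(\mathbb{E}_{h \sim Q}[h(x)]\big),
\]
so the quantity to be bounded is the derandomization regret $\Delta(Q) = V^\ast(d_Q) - V^\ast(T(Q))$.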

Background

The PROWL framework provides PAC-Bayesian lower bounds on the latent target value of randomized Gibbs policies. Clinical deployment, however, typically requires a deterministic treatment rule, so the certified randomized policy must ultimately be replaced by a deterministic one.
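
The exact form of the PROWL certificate is not reproduced here; as a hedged illustration, a McAllester/Maurer-style bound applied to a bounded inverse-propensity-weighted value estimate $\hat V_n$ would read: with probability at least $1-\delta$ over the sample, simultaneously for all posteriors $Q$,
\[
V^\ast(d_Q) \;\ge\; \mathbb{E}_{h \sim Q}\big[\hat V_n(h)\big]
\;-\; B \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln(2\sqrt{n}/\delta)}{2n}},
\]
where $P$ is the prior and $B$ bounds the IPW pseudo-outcomes (e.g., $B = R_{\max}/\pi_{\min}$ when propensities are bounded below by $\pi_{\min}$). The actual PROWL bound must additionally account for jointly learned nuisances, which this sketch ignores.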

Bounding the loss incurred during this derandomization step requires techniques that handle value-based objectives with inverse-propensity weighting and joint nuisance learning, suggesting extensions of PAC-Bayesian margin and disintegration tools.
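
To see why margin-based tools are natural here (again under the assumed binary-treatment notation above, not the paper's argument), write $C(x) = \mathbb{E}[R \mid X = x, A = +1] - \mathbb{E}[R \mid X = x, A = -1]$ for the treatment contrast and $m_Q(x) = \mathbb{E}_{h \sim Q}[h(x)] \in [-1,1]$ for the posterior margin. Since $V^\ast(h) = \mathbb{E}\big[\mu(X,-1)\big] + \mathbb{E}\big[\mathbf{1}\{h(X)=+1\}\,C(X)\big]$ with $\mu(x,a) = \mathbb{E}[R \mid X = x, A = a]$, the regret decomposes pointwise as
\[
V^\ast(d_Q) - V^\ast(T(Q))
= \tfrac{1}{2}\,\mathbb{E}\Big[C(X)\big(m_Q(X) - \operatorname{sign}(m_Q(X))\big)\Big]
\;\le\; \tfrac{1}{2}\,\mathbb{E}\Big[\lvert C(X)\rvert\,\mathbf{1}\{C(X)\,m_Q(X) \le 0\}\Big],
\]
because the integrand is nonpositive wherever $C(X)$ and $m_Q(X)$ share a sign. The regret is thus controlled by the contrast-weighted probability that the posterior margin points the wrong way, which is precisely the kind of event that margin-based and disintegrated PAC-Bayesian bounds certify; the open difficulty is doing so when $C$ is accessed only through IPW-weighted outcomes and estimated nuisances.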

References

First, while Theorem~\ref{thm:exact-pac-bayes} strictly certifies the expected value of the randomized Gibbs policy $V^\ast(d_Q)$, clinical applications typically demand a deterministic rule $T(Q)$. Bounding the derandomization regret $V^\ast(d_Q)-V^\ast(T(Q))$ is an open challenge, requiring the extension of margin-based and disintegrated PAC-Bayesian tools \citep{germain2015risk,biggs2022margins,viallard2024disintegration} to handle value-based objectives with inverse-propensity weighting and joint nuisance learning.

PAC-Bayesian Reward-Certified Outcome Weighted Learning (2604.01946 - Ishikawa et al., 2 Apr 2026) in Section 6 (Discussion)