Derandomization regret bounds for PROWL
Establish finite-sample upper bounds on the derandomization regret $V^*(d_Q) - V^*(T(Q))$ within the PAC-Bayesian Reward-Certified Outcome Weighted Learning (PROWL) framework, where $d_Q$ denotes the randomized Gibbs policy induced by a posterior $Q$ and $T(Q)$ is a deterministic policy derived from $Q$. Specifically, extend margin-based and disintegrated PAC-Bayesian techniques to value-based objectives with inverse-propensity weighting and joint nuisance learning, in order to quantify the performance gap incurred when the randomized Gibbs policy is replaced by a deterministic rule.
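For concreteness, one natural formalization of these objects is sketched below; the outcome model $m$, the action space $\mathcal{A}$, and the identification assumption are illustrative choices on our part, not fixed by the problem statement. Writing $d_Q(a \mid x) = \mathbb{E}_{h \sim Q}\big[\mathbf{1}\{h(x) = a\}\big]$ for the Gibbs policy and $T(Q)(x) = \arg\max_{a \in \mathcal{A}} d_Q(a \mid x)$ for its majority-vote derandomization, and assuming the value is identified as $V^*(\pi) = \mathbb{E}\big[\sum_{a} \pi(a \mid X)\, m(X, a)\big]$ with $m(x, a) = \mathbb{E}[Y \mid X = x, A = a]$, linearity of expectation gives the pointwise decomposition
\[
V^*(d_Q) - V^*(T(Q)) \;=\; \mathbb{E}\Big[\sum_{a \in \mathcal{A}} \big(d_Q(a \mid X) - \mathbf{1}\{a = T(Q)(X)\}\big)\, m(X, a)\Big],
\]
so the gap is driven by covariates at which the posterior vote $d_Q(\cdot \mid x)$ is nearly uniform, precisely the margin quantity that margin-based PAC-Bayesian bounds control.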
References
While Theorem~\ref{thm:exact-pac-bayes} certifies only the expected value of the randomized Gibbs policy $V^*(d_Q)$, clinical applications typically demand a deterministic rule $T(Q)$. Bounding the derandomization regret $V^*(d_Q) - V^*(T(Q))$ remains an open challenge: it requires extending margin-based and disintegrated PAC-Bayesian tools \citep{germain2015risk,biggs2022margins,viallard2024disintegration} to handle value-based objectives with inverse-propensity weighting and joint nuisance learning.
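As a sanity-check companion to the decomposition above, the following minimal sketch contrasts inverse-propensity-weighted (IPW) value estimates of a Gibbs policy and its majority vote on synthetic logged bandit data. Everything here, including the function names, the uniform logging propensities, and the stand-in posterior, is an illustrative assumption, not part of PROWL.
\begin{verbatim}
import numpy as np

def ipw_value(policy_probs, rewards, propensities):
    """IPW (Horvitz-Thompson) value estimate from logged data.

    policy_probs[i] = pi(A_i | X_i) for the policy being evaluated;
    propensities[i] = e(A_i | X_i) for the logging policy.
    """
    return np.mean(policy_probs / propensities * rewards)

def gibbs_and_vote_probs(posterior_draws, actions, n_actions):
    """Monte Carlo Gibbs policy d_Q and majority vote T(Q) at logged actions.

    posterior_draws: M arrays of shape (n,), the actions each posterior
    hypothesis h ~ Q assigns to the n logged covariates.
    """
    votes = np.stack(posterior_draws)                      # (M, n)
    vote_freq = np.stack([(votes == a).mean(axis=0)        # d_Q(a | x_i)
                          for a in range(n_actions)])      # (n_actions, n)
    gibbs = vote_freq[actions, np.arange(len(actions))]    # d_Q(A_i | x_i)
    t_q = vote_freq.argmax(axis=0)                         # T(Q)(x_i)
    det = (t_q == actions).astype(float)                   # 1{T(Q)(x_i)=A_i}
    return gibbs, det

rng = np.random.default_rng(0)
n, M, n_actions = 5000, 200, 2
X = rng.normal(size=(n, 3))
A = rng.integers(0, n_actions, size=n)                     # uniform logging
Y = (A == (X[:, 0] > 0)).astype(float) + 0.1 * rng.normal(size=n)
prop = np.full(n, 1.0 / n_actions)

# Stand-in "posterior": thresholds jittered around the true boundary.
draws = [(X[:, 0] > rng.normal(scale=0.3)).astype(int) for _ in range(M)]

g, d = gibbs_and_vote_probs(draws, A, n_actions)
print("IPW V(d_Q) =", round(ipw_value(g, Y, prop), 3),
      "| IPW V(T(Q)) =", round(ipw_value(d, Y, prop), 3))
\end{verbatim}
Substituting estimated propensities and posterior draws from an actual PROWL fit for the synthetic stand-ins would give an empirical view of the regret that the sought finite-sample bounds are meant to control.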