Convergence guarantees for Delightful Policy Gradient

Establish formal convergence guarantees for the Delightful Policy Gradient (DG) update rule in reinforcement learning, specifying conditions under which DG converges and characterizing its limiting behavior.

Background

The paper introduces Delightful Policy Gradient (DG), which gates each per-sample policy-gradient term by a sigmoid of the product of the advantage and the action's surprisal (its negative log-probability under the policy), thereby changing both the variance and the expected update direction relative to the standard policy gradient.
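As a concrete illustration, below is a minimal sketch of the gated update, assuming the gate multiplies each per-sample REINFORCE term and that surprisal means -log pi(a|s). The function name, the choice to detach the gate, and the toy setup are illustrative assumptions, not the authors' reference implementation.

# Sketch of the DG gating idea (assumptions: gate = sigmoid(A * surprisal)
# applied per sample, surprisal = -log pi(a|s); names are illustrative).
import torch

def dg_policy_loss(log_probs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Gated policy-gradient surrogate loss for one batch.

    log_probs:  log pi_theta(a_t | s_t) for the sampled actions, shape (B,)
    advantages: advantage estimates A_t, shape (B,)
    """
    surprisal = -log_probs.detach()               # -log pi(a|s), used only as a weight
    gate = torch.sigmoid(advantages * surprisal)  # in (0, 1); carries no gradient here
    # Standard surrogate -A * log pi, reweighted per sample by the gate.
    return -(gate * advantages * log_probs).mean()

# Toy usage: a categorical policy over 4 actions with random advantages.
logits = torch.randn(8, 4, requires_grad=True)
actions = torch.randint(0, 4, (8,))
log_probs = torch.log_softmax(logits, dim=-1)[torch.arange(8), actions]
advantages = torch.randn(8)
loss = dg_policy_loss(log_probs, advantages)
loss.backward()  # gradients flow through log_probs only, as in REINFORCE

Because the gate depends on the advantage and on the policy itself, the expected gradient is a state-action-dependent reweighting of the vanilla policy gradient, which is plausibly why standard policy-gradient convergence arguments do not carry over unchanged.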

Although DG shows empirical improvements across MNIST contextual bandits, token-reversal sequence modeling with Transformers, and continuous control, the authors note that they have not established formal convergence guarantees, leaving DG's theoretical foundations open.
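One way to make the open problem precise (a sketch under the assumptions above, not a result from the paper) is to write the expected DG update as a gated vector field and ask how its zeros relate to stationary points of the expected return J(theta):

% Expected DG update direction; gate form assumed from the paper's
% verbal description, with surprisal u_theta(s,a) = -log pi_theta(a|s).
\[
  g_{\mathrm{DG}}(\theta)
  \;=\;
  \mathbb{E}_{(s,a)\sim\pi_\theta}\!\left[
    \sigma\!\big(A^{\pi_\theta}(s,a)\,u_\theta(s,a)\big)\,
    A^{\pi_\theta}(s,a)\,\nabla_\theta \log \pi_\theta(a\mid s)
  \right],
  \qquad
  u_\theta(s,a) \;=\; -\log \pi_\theta(a\mid s).
\]

A convergence guarantee would then specify conditions (e.g., step sizes, smoothness, bounded advantages) under which stochastic iterates following this field converge, and characterize the limit set {theta : g_DG(theta) = 0} relative to the stationary points of J(theta).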

References

Formal convergence guarantees remain open, as does the question of how far this mechanism transfers to sparse-reward settings, offline RL, and large-scale transformer training and RLHF.

Delightful Policy Gradient (2603.14608 - Osband, 15 Mar 2026) in Section 8 (Conclusion)