Implicit Constraint-Aware Off-Policy Correction for Offline Reinforcement Learning (2506.14058v1)
Abstract: Offline reinforcement learning promises policy improvement from logged interaction data alone, yet state-of-the-art algorithms remain vulnerable to value overestimation and to violations of domain knowledge such as monotonicity or smoothness. We introduce implicit constraint-aware off-policy correction, a framework that embeds structural priors directly inside every Bellman update. The key idea is to compose the optimal Bellman operator with a proximal projection onto a convex constraint set, which produces a new operator that (i) remains a $\gamma$-contraction, (ii) possesses a unique fixed point, and (iii) enforces the prescribed structure exactly. A differentiable optimization layer solves the projection; implicit differentiation supplies gradients for deep function approximators at a cost comparable to implicit Q-learning. On a synthetic Bid-Click auction, where the true value is provably monotone in the bid, our method eliminates all monotonicity violations and outperforms conservative Q-learning and implicit Q-learning in return, regret, and sample efficiency.
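To make the projection step concrete, the following is a minimal sketch of a differentiable proximal projection under stated assumptions: Q-values evaluated on a discrete bid grid, a monotonicity constraint along the bid axis, and the cvxpylayers library for implicit differentiation. The grid size and all names are illustrative, not the paper's implementation.

```python
# Minimal sketch (assumptions): project a batch of Bellman targets onto the
# set of bid-monotone value vectors with a differentiable QP layer.
# The discrete bid grid, cvxpylayers, and all names below are illustrative;
# the abstract only specifies a proximal projection onto a convex constraint
# set whose gradients are obtained by implicit differentiation.
import cvxpy as cp
import torch
from cvxpylayers.torch import CvxpyLayer

N_BIDS = 16  # assumed discretization of the bid axis

# Euclidean (proximal) projection: argmin_z 0.5*||z - q||^2 subject to z
# being non-decreasing in the bid index, i.e. z[i+1] >= z[i].
q = cp.Parameter(N_BIDS)                 # unprojected Q-values for one state
z = cp.Variable(N_BIDS)                  # projected, bid-monotone Q-values
problem = cp.Problem(
    cp.Minimize(0.5 * cp.sum_squares(z - q)),
    [cp.diff(z) >= 0],
)
project = CvxpyLayer(problem, parameters=[q], variables=[z])

# Inside a Bellman update one would project the targets before regression;
# gradients flow through the projection via implicit differentiation.
bellman_targets = torch.randn(32, N_BIDS, requires_grad=True)  # batch of targets
monotone_targets, = project(bellman_targets)                   # batched solve
loss = monotone_targets.pow(2).mean()    # stand-in for the TD regression loss
loss.backward()                          # d(loss)/d(bellman_targets) via the layer
```

Composing the optimal Bellman operator with such a projection keeps the update a $\gamma$-contraction because Euclidean projection onto a convex set is non-expansive, which is consistent with the fixed-point claim in the abstract.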