ODICE: Revealing the Mystery of Distribution Correction Estimation via Orthogonal-gradient Update (2402.00348v1)
Abstract: In this study, we investigate DIstribution Correction Estimation (DICE) methods, an important line of work in offline reinforcement learning (RL) and imitation learning (IL). DICE-based methods impose state-action-level behavior constraints, an ideal choice for offline learning. However, they typically perform much worse than current state-of-the-art (SOTA) methods that use only action-level behavior constraints. After revisiting DICE-based methods, we find that there exist two gradient terms when learning the value function using a true-gradient update: the forward gradient (taken on the current state) and the backward gradient (taken on the next state). Using the forward gradient alone closely resembles many offline RL methods and can therefore be regarded as applying an action-level constraint. However, directly adding the backward gradient may weaken or cancel out the effect of the forward gradient when the two gradients have conflicting directions. To resolve this issue, we propose a simple yet effective modification that projects the backward gradient onto the normal plane of the forward gradient, resulting in an orthogonal-gradient update, a new learning rule for DICE-based methods. Through thorough theoretical analysis, we find that the projected backward gradient brings state-level behavior regularization, which reveals the mystery of DICE-based methods: the value-learning objective does try to impose a state-action-level constraint, but it must be used in a corrected way. Through toy examples and extensive experiments on complex offline RL and IL tasks, we demonstrate that DICE-based methods using orthogonal-gradient updates (O-DICE) achieve SOTA performance and great robustness.
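For intuition, below is a minimal sketch (in PyTorch) of the projection step described in the abstract: the backward gradient is projected onto the normal plane of the forward gradient by subtracting its component along the forward gradient, and the result is combined with the forward gradient. The function name, the epsilon guard, and the trade-off weight `eta` are illustrative assumptions, not the paper's implementation.

```python
import torch

def project_orthogonal(forward_grad: torch.Tensor, backward_grad: torch.Tensor) -> torch.Tensor:
    """Project backward_grad onto the normal plane of forward_grad,
    i.e. remove its component along forward_grad."""
    gf = forward_grad.flatten()
    gb = backward_grad.flatten()
    # Scalar projection coefficient; the small epsilon guards against a zero forward gradient.
    coef = torch.dot(gb, gf) / (torch.dot(gf, gf) + 1e-12)
    return (gb - coef * gf).view_as(backward_grad)

# Hypothetical usage: combine the forward gradient with the projected
# backward gradient to form the orthogonal-gradient update direction.
forward_grad = torch.randn(10)
backward_grad = torch.randn(10)
eta = 1.0  # illustrative trade-off weight (assumption)
update_direction = forward_grad + eta * project_orthogonal(forward_grad, backward_grad)
```

Because the projected backward gradient is orthogonal to the forward gradient by construction, adding it cannot cancel the forward gradient's contribution, which matches the motivation stated in the abstract.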