ODICE: Revealing the Mystery of Distribution Correction Estimation via Orthogonal-gradient Update (2402.00348v1)

Published 1 Feb 2024 in cs.LG and cs.AI

Abstract: In this study, we investigate DIstribution Correction Estimation (DICE) methods, an important line of work in offline reinforcement learning (RL) and imitation learning (IL). DICE-based methods impose a state-action-level behavior constraint, which is an ideal choice for offline learning. However, they typically perform much worse than current state-of-the-art (SOTA) methods that use only an action-level behavior constraint. After revisiting DICE-based methods, we find that two gradient terms arise when learning the value function with a true-gradient update: the forward gradient (taken on the current state) and the backward gradient (taken on the next state). Using the forward gradient alone closely resembles many offline RL methods and can thus be regarded as applying an action-level constraint. However, directly adding the backward gradient may degrade or cancel out its effect when the two gradients have conflicting directions. To resolve this issue, we propose a simple yet effective modification that projects the backward gradient onto the normal plane of the forward gradient, resulting in the orthogonal-gradient update, a new learning rule for DICE-based methods. We conduct thorough theoretical analyses and find that the projected backward gradient brings state-level behavior regularization, which reveals the mystery of DICE-based methods: the value-learning objective does try to impose a state-action-level constraint, but it needs to be used in a corrected way. Through toy examples and extensive experiments on complex offline RL and IL tasks, we demonstrate that DICE-based methods using orthogonal-gradient updates (O-DICE) achieve SOTA performance and great robustness.
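
To make the projection step concrete, the following NumPy sketch applies one orthogonal-gradient update to a flat parameter vector, given precomputed forward and backward gradients. The function name, the step size `lr`, the weighting coefficient `eta`, and the toy vectors are illustrative assumptions; the actual O-DICE rule operates on the gradients of the DICE value-learning objective described in the paper, which are not reproduced here.

```python
import numpy as np

def orthogonal_gradient_update(theta, grad_forward, grad_backward,
                               lr=3e-4, eta=1.0, eps=1e-8):
    """One value-parameter update with the backward gradient projected onto
    the normal plane of the forward gradient (a sketch, not the authors'
    implementation)."""
    # Component of the backward gradient parallel to the forward gradient.
    parallel = (np.dot(grad_backward, grad_forward)
                / (np.dot(grad_forward, grad_forward) + eps)) * grad_forward
    # Orthogonal projection: remove the parallel (possibly conflicting) part.
    grad_backward_perp = grad_backward - parallel
    # Gradient step using the forward gradient plus the orthogonalized
    # backward gradient, weighted by eta (an assumed hyperparameter).
    return theta - lr * (grad_forward + eta * grad_backward_perp)

# Toy usage: the two gradients conflict along the first coordinate, so the
# projection strips that conflicting component before the update is applied.
theta = np.zeros(3)
g_fwd = np.array([1.0, 0.0, 0.0])
g_bwd = np.array([-0.5, 2.0, 0.0])
theta = orthogonal_gradient_update(theta, g_fwd, g_bwd)
```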
