Out-of-Distribution Adaptation in Offline RL: Counterfactual Reasoning via Causal Normalizing Flows (2405.03892v1)

Published 6 May 2024 in cs.LG and cs.AI

Abstract: Despite the notable successes of Reinforcement Learning (RL), the prevalent use of an online learning paradigm prevents its widespread adoption, especially in hazardous or costly scenarios. Offline RL has emerged as an alternative solution, learning from pre-collected static datasets. However, this offline learning introduces a new challenge known as distributional shift, degrading performance when the policy is evaluated on scenarios that are Out-Of-Distribution (OOD) from the training dataset. Most existing offline RL methods address this issue by regularizing policy learning to the support of the given dataset. However, such regularization overlooks high-reward regions that may exist beyond the dataset's support. This motivates exploring novel offline learning techniques that can improve beyond the data support without compromising policy performance, potentially by learning causation (cause-and-effect) instead of correlation from the dataset. In this paper, we propose the MOOD-CRL (Model-based Offline OOD-Adapting Causal RL) algorithm, which aims to address the challenge of extrapolation for offline policy training through causal inference instead of policy-regularizing methods. Specifically, a Causal Normalizing Flow (CNF) is developed to learn the transition and reward functions for data generation and augmentation in offline policy evaluation and training. Based on the data-invariant, physics-based qualitative causal graph and the observational data, we develop a novel learning scheme for the CNF to learn the quantitative structural causal model. As a result, the CNF gains predictive and counterfactual reasoning capabilities for sequential decision-making tasks, revealing a high potential for OOD adaptation. Our CNF-based offline RL approach is validated through empirical evaluations, outperforming model-free and model-based methods by a significant margin.
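
To make the counterfactual reasoning the abstract refers to concrete, here is a minimal sketch of the abduction-action-prediction recipe on which such reasoning rests. It is not the paper's CNF: it hand-specifies a one-dimensional affine transition mechanism (the function names and the coefficients `A_S`, `A_A`, `SIGMA` are illustrative assumptions, not learned from data), but an invertible affine mechanism is the simplest instance of a flow-based structural causal model, so the same three steps carry over when the mechanism is a trained flow.

```python
# Toy causal structure over (state s, action a, next state s'):  s -> s',  a -> s'.
# The transition mechanism is affine and invertible in its exogenous noise u, so
# abduction amounts to inverting the map and prediction to pushing noise forward.
# All coefficients below are hypothetical and chosen only for illustration.

A_S, A_A, SIGMA = 0.9, 0.5, 0.1   # assumed transition coefficients


def forward(s, a, u):
    """Structural mechanism: s' = A_S*s + A_A*a + SIGMA*u."""
    return A_S * s + A_A * a + SIGMA * u


def abduct(s, a, s_next):
    """Abduction: invert the mechanism to recover the exogenous noise u
    consistent with the observed transition."""
    return (s_next - A_S * s - A_A * a) / SIGMA


def counterfactual(s, a_factual, s_next_factual, a_counter):
    """Counterfactual query: what would s' have been under action a_counter,
    holding the realized exogenous noise fixed (abduction, action, prediction)?"""
    u = abduct(s, a_factual, s_next_factual)   # abduction
    return forward(s, a_counter, u)            # action + prediction


# An observed transition, e.g. drawn from an offline dataset (made-up numbers).
s, a, s_next = 1.0, 0.2, 1.03
print(counterfactual(s, a, s_next, a_counter=0.8))  # counterfactual next state
```

In the learned setting, `forward` and `abduct` would correspond to the forward and inverse passes of the trained causal flow, and the same pattern would be applied per variable along the causal ordering rather than to a single scalar transition.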
