
Beyond OOD State Actions: Supported Cross-Domain Offline Reinforcement Learning (2306.12755v1)

Published 22 Jun 2023 in cs.LG

Abstract: Offline reinforcement learning (RL) aims to learn a policy using only pre-collected, fixed data. Although it avoids the time-consuming online interactions of standard RL, it faces challenges from out-of-distribution (OOD) state-actions and often suffers from data inefficiency during training. While many efforts have been devoted to addressing OOD state-actions, the latter issue (data inefficiency) has received little attention in offline RL. To address this, this paper proposes cross-domain offline RL, which assumes the offline data incorporate additional source-domain data collected under different transition dynamics (environments) and expects these data to improve offline data efficiency. In this setting, we identify a new challenge of OOD transition dynamics, beyond the common OOD state-actions issue, that arises when utilizing cross-domain offline data. We then propose our method, BOSA, which employs two support-constrained objectives to address both OOD issues. Through extensive experiments in the cross-domain offline RL setting, we demonstrate that BOSA can greatly improve offline data efficiency: using only 10% of the target data, BOSA achieves 74.4% of the performance of state-of-the-art offline RL trained on 100% of the target data. Additionally, we show that BOSA can be effortlessly plugged into model-based offline RL and noising data-augmentation techniques (used for generating source-domain data), which naturally avoids potential dynamics mismatch between the target-domain data and the newly generated source-domain data.
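
The abstract describes BOSA's two support-constrained objectives only at a high level. As a rough illustration of what such constraints can look like in practice (not the authors' implementation), the Python sketch below masks out actions that fall outside the estimated support of the dataset's behavior policy, and filters source-domain transitions by their likelihood under a dynamics model fit on target-domain data. The ConditionalGaussian density model, the threshold eps, and the specific loss forms are all assumptions made for this example.

import torch
import torch.nn as nn

class ConditionalGaussian(nn.Module):
    """Simple density model p(y | x); used below for the behavior policy
    pi_beta(a | s) and, with x = (s, a), for the target-domain dynamics
    p(s' | s, a). This is an illustrative stand-in, not BOSA's model."""
    def __init__(self, x_dim, y_dim, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(x_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, y_dim)
        self.log_std = nn.Linear(hidden, y_dim)

    def log_prob(self, x, y):
        h = self.trunk(x)
        std = self.log_std(h).clamp(-5.0, 2.0).exp()
        return torch.distributions.Normal(self.mu(h), std).log_prob(y).sum(-1)

def supported_actor_loss(actor, critic, behavior, states, eps=-5.0):
    """Support-constrained policy improvement (illustrative): maximize Q only
    for actions whose estimated behavior-policy log-density exceeds the
    hypothetical threshold `eps`, so the actor is not pushed toward
    out-of-support (OOD) actions. `actor(states)` is assumed to return
    actions; `critic` is assumed to take the concatenated state-action."""
    actions = actor(states)
    in_support = (behavior.log_prob(states, actions) > eps).float().detach()
    q = critic(torch.cat([states, actions], dim=-1)).squeeze(-1)
    return -(in_support * q).mean()

def supported_transition_mask(target_dynamics, states, actions, next_states, eps=-5.0):
    """Support-constrained data sharing (illustrative): keep a source-domain
    transition only if a dynamics model fit on target-domain data assigns it
    sufficient likelihood, filtering out OOD transition dynamics."""
    x = torch.cat([states, actions], dim=-1)
    return (target_dynamics.log_prob(x, next_states) > eps).float()

In BOSA itself the two support constraints are combined with the offline RL objectives detailed in the paper; the sketch only shows how a support threshold can gate the actor update and the cross-domain data selection.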
