A2PO: Towards Effective Offline Reinforcement Learning from an Advantage-aware Perspective (2403.07262v4)
Abstract: Offline reinforcement learning aims to learn an effective agent policy from offline datasets without online interaction, typically by imposing conservative constraints that keep the learned policy within the support of the behavior policies so as to tackle the out-of-distribution problem. However, existing works often suffer from a constraint-conflict issue when the offline dataset is collected from multiple behavior policies: different behavior policies may exhibit inconsistent actions with distinct returns across the state space. To remedy this issue, recent advantage-weighted methods prioritize samples with high advantage values for agent training, but inevitably ignore the diversity of the behavior policies. In this paper, we introduce a novel Advantage-Aware Policy Optimization (A2PO) method that explicitly constructs advantage-aware policy constraints for offline learning under mixed-quality datasets. Specifically, A2PO employs a conditional variational auto-encoder (CVAE) to disentangle the action distributions of the intertwined behavior policies by modeling the advantage values of all training data as conditional variables. The agent can then follow these disentangled action-distribution constraints to optimize an advantage-aware policy toward high advantage values. Extensive experiments on both single-quality and mixed-quality datasets from the D4RL benchmark demonstrate that A2PO outperforms its counterparts. Our code is available at https://github.com/Plankson/A2PO
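The core mechanism described above, a conditional VAE that takes the advantage value as its conditional variable so that actions from behavior policies of different quality land in separate conditional modes, can be sketched in a few lines of PyTorch. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: the class name `AdvantageConditionedVAE`, the network widths, latent size, and loss weighting are all illustrative choices, and the scalar advantage condition would in practice come from a learned critic (e.g., a normalized Q(s, a) − V(s)) rather than the placeholder used here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdvantageConditionedVAE(nn.Module):
    """Sketch of a CVAE modeling p(a | s, xi), with the advantage value xi
    as the conditional variable (illustrative sizes, not the paper's)."""

    def __init__(self, state_dim, action_dim, latent_dim=16, hidden_dim=256):
        super().__init__()
        # Encoder q(z | s, a, xi): the scalar advantage is concatenated as input.
        self.encoder = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.mu_head = nn.Linear(hidden_dim, latent_dim)
        self.logstd_head = nn.Linear(hidden_dim, latent_dim)
        # Decoder p(a | z, s, xi): reconstructs actions in [-1, 1].
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + state_dim + 1, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim), nn.Tanh(),
        )
        self.latent_dim = latent_dim

    def forward(self, state, action, adv):
        h = self.encoder(torch.cat([state, action, adv], dim=-1))
        mu, logstd = self.mu_head(h), self.logstd_head(h).clamp(-4.0, 2.0)
        z = mu + logstd.exp() * torch.randn_like(mu)  # reparameterization trick
        recon = self.decoder(torch.cat([z, state, adv], dim=-1))
        return recon, mu, logstd

    def loss(self, state, action, adv):
        recon, mu, logstd = self(state, action, adv)
        recon_loss = F.mse_loss(recon, action)
        # Closed-form KL(q(z | s, a, xi) || N(0, I)).
        kl = (-0.5 * (1 + 2 * logstd - mu.pow(2) - (2 * logstd).exp())).mean()
        return recon_loss + 0.5 * kl  # 0.5 is an assumed KL weight

    @torch.no_grad()
    def decode(self, state, adv, z=None):
        # Conditioning on a high advantage value steers sampling toward
        # the high-return behavior modes disentangled during training.
        if z is None:
            z = torch.randn(state.shape[0], self.latent_dim,
                            device=state.device).clamp(-0.5, 0.5)
        return self.decoder(torch.cat([z, state, adv], dim=-1))

# Usage sketch with placeholder data; in practice adv would come from
# a learned critic, e.g. a normalized Q(s, a) - V(s).
s, a = torch.randn(32, 17), torch.rand(32, 6) * 2 - 1
adv = torch.randn(32, 1).tanh()
vae = AdvantageConditionedVAE(state_dim=17, action_dim=6)
vae.loss(s, a, adv).backward()
high_adv_actions = vae.decode(s, torch.ones(32, 1))
```

At policy-optimization time, decoding with a high advantage condition biases the sampled actions toward the high-return modes of the mixed dataset, which is what lets the constraint stay advantage-aware rather than uniformly conservative.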
- Q-Transformer: Scalable offline reinforcement learning via autoregressive Q-functions. arXiv preprint arXiv:2309.10150, 2023.
- Decision transformer: Reinforcement learning via sequence modeling. Advances in Neural Information Processing Systems, 34:15084–15097, 2021.
- BAIL: Best-action imitation learning for batch deep reinforcement learning. Advances in Neural Information Processing Systems, 33:18353–18363, 2020.
- LAPO: Latent-variable advantage-weighted policy optimization for offline reinforcement learning. Advances in Neural Information Processing Systems, 35:36902–36913, 2022.
- Uncertainty-aware model-based offline reinforcement learning for automated driving. IEEE Robotics and Automation Letters, 8(2):1167–1174, 2023.
- RvS: What is essential for offline RL via supervised learning? arXiv preprint arXiv:2112.10751, 2021.
- D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
- A minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems, 34:20132–20145, 2021.
- Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, 2018.
- Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, 2019.
- CIRS: Bursting filter bubbles by counterfactual interactive recommender system. ACM Transactions on Information Systems, 42(1):1–27, 2023.
- ACT: Empowering decision transformer with dynamic programming via advantage conditioning. arXiv preprint arXiv:2309.05915, 2023.
- Why so pessimistic? Estimating uncertainties for offline RL through ensembles, and why their independence matters. Advances in Neural Information Processing Systems, 35:18267–18281, 2022.
- Confidence-conditioned value functions for offline reinforcement learning. arXiv preprint arXiv:2212.04607, 2022.
- Harnessing mixed offline reinforcement learning datasets via trajectory weighting. arXiv preprint arXiv:2306.13085, 2023.
- Beyond uniform sampling: Offline reinforcement learning with imbalanced datasets. In Annual Conference on Neural Information Processing Systems, 2023.
- On the sample complexity of vanilla model-based offline reinforcement learning with dependent samples. arXiv preprint arXiv:2303.04268, 2023.
- Offline reinforcement learning with implicit q-learning. arXiv preprint arXiv:2110.06169, 2021.
- Stabilizing off-policy Q-learning via bootstrapping error reduction. Advances in Neural Information Processing Systems, 32, 2019.
- Conservative Q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020.
- Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
- Emergent agentic transformer from chain of hindsight experience. arXiv preprint arXiv:2305.16554, 2023.
- Mildly conservative Q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 35:1711–1724, 2022.
- AWAC: Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359, 2020.
- PyTorch: An imperative style, high-performance deep learning library. In Annual Conference on Neural Information Processing Systems, 2019.
- Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.
- Weighted policy constraints for offline reinforcement learning. In AAAI Conference on Artificial Intelligence, 2023.
- Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1):88–97, 1991.
- A survey on offline reinforcement learning: Taxonomy, review, and open problems. IEEE Transactions on Neural Networks and Learning Systems, 2023.
- Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, 2014.
- Fast offline policy optimization for large scale recommendation. In AAAI Conference on Artificial Intelligence, 2023.
- Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Deterministic policy gradient algorithms. In International Conference on Machine Learning, 2014.
- Offline RL with realistic datasets: Heteroskedasticity and support constraints. arXiv preprint arXiv:2211.01052, 2022.
- Learning structured output representation using deep conditional generative models. Advances in Neural Information Processing Systems, 28, 2015.
- Reinforcement learning: An introduction. MIT Press, 2018.
- Learning from good trajectories in offline multi-agent reinforcement learning. In AAAI Conference on Artificial Intelligence, 2023.
- DASCO: Dual-generator adversarial support constrained offline reinforcement learning. Advances in Neural Information Processing Systems, 35:38937–38949, 2022.
- Leveraging offline data in online reinforcement learning. In International Conference on Machine Learning, 2023.
- Train once, get a family: State-adaptive balances for offline-to-online reinforcement learning. arXiv preprint arXiv:2310.17966, 2023.
- Diffusion policies as an expressive policy class for offline reinforcement learning. arXiv preprint arXiv:2208.06193, 2022.
- Supported policy optimization for offline reinforcement learning. Advances in Neural Information Processing Systems, 35:31278–31291, 2022.
- The in-sample softmax for offline reinforcement learning. In International Conference on Learning Representations, 2022.
- Offline RL with no OOD actions: In-sample learning via implicit value regularization. arXiv preprint arXiv:2303.15810, 2023.
- Q-learning decision transformer: Leveraging dynamic programming for conditional sequence modelling in offline RL. In International Conference on Machine Learning, pp. 38989–39007, 2023.
- Model-based offline policy optimization with adversarial network. arXiv preprint arXiv:2309.02157, 2023.
- MOPO: Model-based offline policy optimization. Advances in Neural Information Processing Systems, 33:14129–14142, 2020.
- Semi-supervised offline reinforcement learning with action-free trajectories. In International Conference on Machine Learning, 2023.
- PLAS: Latent action space for offline reinforcement learning. In Conference on Robot Learning, 2021.
- Behavior proximal policy optimization. arXiv preprint arXiv:2302.11312, 2023.