Constrained Decision Transformer for Offline Safe Reinforcement Learning (2302.07351v2)
Abstract: Safe reinforcement learning (RL) trains a constraint-satisfying policy by interacting with the environment. We aim to tackle a more challenging problem: learning a safe policy from an offline dataset. We study the offline safe RL problem from a novel multi-objective optimization perspective and propose the $\epsilon$-reducible concept to characterize problem difficulty. The inherent trade-off between safety and task performance inspires us to propose the constrained decision transformer (CDT) approach, which can dynamically adjust this trade-off during deployment. Extensive experiments show the advantages of the proposed method in learning an adaptive, safe, robust, and high-reward policy. CDT outperforms its variants and strong offline safe RL baselines by a large margin with the same hyperparameters across all tasks, while retaining zero-shot adaptation to different constraint thresholds, making our approach more suitable for real-world RL under constraints. The code is available at https://github.com/liuzuxin/OSRL.
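To make the deployment-time trade-off adjustment concrete, below is a minimal sketch (not the authors' implementation) of how a decision-transformer-style policy can condition action prediction on both a reward return-to-go and a cost return-to-go, so that the safety/performance trade-off is steered at test time simply by changing the target returns. The module names, dimensions, and tokenization scheme are illustrative assumptions; see https://github.com/liuzuxin/OSRL for the actual code.

```python
# Minimal sketch of return- and cost-conditioned sequence modeling (assumed details).
import torch
import torch.nn as nn


class MinimalCDT(nn.Module):
    def __init__(self, state_dim, act_dim, embed_dim=128, n_layers=3, n_heads=4):
        super().__init__()
        # Separate embeddings per token type: reward-to-go, cost-to-go, state, action.
        self.embed_rtg = nn.Linear(1, embed_dim)
        self.embed_ctg = nn.Linear(1, embed_dim)
        self.embed_state = nn.Linear(state_dim, embed_dim)
        self.embed_action = nn.Linear(act_dim, embed_dim)
        self.embed_time = nn.Embedding(4096, embed_dim)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.predict_action = nn.Linear(embed_dim, act_dim)

    def forward(self, rtg, ctg, states, actions, timesteps):
        # rtg, ctg: (B, T, 1); states: (B, T, state_dim); actions: (B, T, act_dim); timesteps: (B, T)
        B, T = states.shape[0], states.shape[1]
        t_emb = self.embed_time(timesteps)
        # Interleave (rtg, ctg, state, action) tokens for each timestep.
        tokens = torch.stack(
            [
                self.embed_rtg(rtg) + t_emb,
                self.embed_ctg(ctg) + t_emb,
                self.embed_state(states) + t_emb,
                self.embed_action(actions) + t_emb,
            ],
            dim=2,
        ).reshape(B, 4 * T, -1)
        # Causal mask: each token only attends to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(4 * T)
        h = self.transformer(tokens, mask=mask)
        # Predict the action from the state token of each timestep (index 2 in each group of 4).
        state_tokens = h.reshape(B, T, 4, -1)[:, :, 2]
        return self.predict_action(state_tokens)


# At deployment, the same trained weights can be steered toward different
# trade-offs by picking the initial target reward return and cost budget.
model = MinimalCDT(state_dim=8, act_dim=2)
rtg = torch.full((1, 5, 1), 300.0)   # desired reward return-to-go
ctg = torch.full((1, 5, 1), 10.0)    # desired cost return-to-go (constraint threshold)
states = torch.zeros(1, 5, 8)
actions = torch.zeros(1, 5, 2)
timesteps = torch.arange(5).unsqueeze(0)
print(model(rtg, ctg, states, actions, timesteps).shape)  # torch.Size([1, 5, 2])
```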