Constrained Decision Transformer for Offline Safe Reinforcement Learning (2302.07351v2)

Published 14 Feb 2023 in cs.LG, cs.AI, and cs.RO

Abstract: Safe reinforcement learning (RL) trains a constraint satisfaction policy by interacting with the environment. We aim to tackle a more challenging problem: learning a safe policy from an offline dataset. We study the offline safe RL problem from a novel multi-objective optimization perspective and propose the $\epsilon$-reducible concept to characterize problem difficulties. The inherent trade-offs between safety and task performance inspire us to propose the constrained decision transformer (CDT) approach, which can dynamically adjust the trade-offs during deployment. Extensive experiments show the advantages of the proposed method in learning an adaptive, safe, robust, and high-reward policy. CDT outperforms its variants and strong offline safe RL baselines by a large margin with the same hyperparameters across all tasks, while keeping the zero-shot adaptation capability to different constraint thresholds, making our approach more suitable for real-world RL under constraints. The code is available at https://github.com/liuzuxin/OSRL.


Summary

  • The paper introduces the Constrained Decision Transformer, reframing offline safe RL as a multi-objective optimization problem to balance safety and rewards.
  • It introduces the ε-reducibility concept to characterize dataset difficulty and uses return-conditioned sequence modeling to adjust constraint thresholds at deployment without retraining.
  • Empirical results on robotic locomotion tasks demonstrate CDT's superior performance in adhering to safety constraints while optimizing rewards.

Overview of Constrained Decision Transformer for Offline Safe Reinforcement Learning

This paper addresses offline safe reinforcement learning (RL): learning policies that satisfy safety constraints directly from an offline dataset, without active interaction with the environment. The authors propose the Constrained Decision Transformer (CDT), a methodological framework that views the problem through the lens of multi-objective optimization (MOO).

Contributions and Methodology

CDT handles the trade-off between safety and task performance by drawing on the notion of Pareto efficiency from multi-objective optimization. The authors make three primary contributions:

  1. Multi-Objective Optimization Perspective: The authors introduce the concept of ε-reducible datasets to categorize the difficulty of offline safe RL tasks. The concept characterizes a dataset by its Pareto and Inverse Pareto frontiers, making explicit the inherent trade-off between achieving high rewards and staying within predefined cost thresholds (see the Pareto-frontier sketch after this list).
  2. Dynamic Constraint Adaptation: CDT employs a return-conditioned sequence modeling framework that allows the safety-performance trade-off to be adjusted across varying constraint thresholds at deployment time (see the rollout sketch after this list). This adaptability is a significant advance, since prior approaches required fixing the constraint threshold before training, which limits generalization to new thresholds.
  3. Integration of Stochastic Policies: Incorporating stochastic policies with entropy regularization into CDT yields empirical benefits, particularly in handling out-of-distribution actions and in robustness to approximation errors, a flexibility that deterministic counterparts lack (see the policy-head sketch after this list).
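
To make the first contribution concrete, the sketch below computes an empirical Pareto frontier over per-trajectory (cost return, reward return) pairs from an offline dataset; informally, a dataset is more favorable the more reward its frontier retains at low cost. This is a minimal illustration only: the function name and the toy dataset are assumptions, not the paper's definition or code.

```python
import numpy as np

def pareto_frontier(costs, rewards):
    """Indices of trajectories on the (low-cost, high-reward) Pareto frontier.

    A trajectory is kept if no other trajectory achieves a higher reward
    return at an equal or lower cost return.
    """
    costs = np.asarray(costs, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    order = np.argsort(costs)            # sweep from cheapest to most expensive
    frontier, best_reward = [], -np.inf
    for i in order:
        if rewards[i] > best_reward:     # strictly improves reward at this cost
            frontier.append(int(i))
            best_reward = rewards[i]
    return frontier

# Toy per-trajectory cumulative (cost, reward) returns.
costs = [5.0, 12.0, 20.0, 8.0, 30.0]
rewards = [100.0, 180.0, 170.0, 150.0, 190.0]
idx = pareto_frontier(costs, rewards)
print([(costs[i], rewards[i]) for i in idx])
# [(5.0, 100.0), (8.0, 150.0), (12.0, 180.0), (30.0, 190.0)]
```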
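
The return-conditioned rollout behind the second contribution can be sketched as follows: at deployment the policy is conditioned on a target reward return and a target cost return (the constraint budget), and both targets are updated as rewards and costs are observed. The model and environment interfaces here are hypothetical and chosen for illustration; they do not reproduce the authors' implementation.

```python
from collections import deque

def cdt_rollout(model, env, target_reward, target_cost, context_len=10, max_steps=1000):
    """Roll out a return-and-cost-conditioned sequence model (CDT-style).

    Assumed interfaces (hypothetical, for illustration only):
      - model.predict_action(states, actions, rewards_to_go, costs_to_go) -> action,
        consuming at most the last `context_len` tokens of each modality.
      - env.reset() -> state; env.step(action) -> (state, reward, cost, done, info),
        i.e. a safety-gym-style environment that also reports a per-step cost.
    """
    states, actions = deque(maxlen=context_len), deque(maxlen=context_len)
    rtg, ctg = deque(maxlen=context_len), deque(maxlen=context_len)

    state = env.reset()
    reward_to_go, cost_to_go = float(target_reward), float(target_cost)
    total_reward = total_cost = 0.0

    for _ in range(max_steps):
        states.append(state)
        rtg.append(reward_to_go)
        ctg.append(cost_to_go)

        # The target-cost token is what enables zero-shot adaptation: changing
        # `target_cost` at deployment changes behaviour without retraining.
        action = model.predict_action(list(states), list(actions),
                                      list(rtg), list(ctg))
        actions.append(action)

        state, reward, cost, done, _ = env.step(action)
        total_reward += reward
        total_cost += cost

        # Decrement the remaining reward target; clip the remaining cost budget
        # at zero so the conditioning token stays in a sensible range.
        reward_to_go -= reward
        cost_to_go = max(cost_to_go - cost, 0.0)
        if done:
            break

    return total_reward, total_cost
```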
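
Finally, the stochastic policy of the third contribution can be illustrated with a diagonal-Gaussian action head trained on dataset actions with an entropy bonus. This is a generic PyTorch sketch with a fixed entropy weight, not the paper's exact head or regularization schedule.

```python
import torch
import torch.nn as nn

class GaussianActionHead(nn.Module):
    """Diagonal-Gaussian action head with an entropy-regularized training loss."""

    def __init__(self, hidden_dim: int, act_dim: int, entropy_weight: float = 0.1):
        super().__init__()
        self.mu = nn.Linear(hidden_dim, act_dim)
        self.log_std = nn.Linear(hidden_dim, act_dim)
        self.entropy_weight = entropy_weight

    def dist(self, h: torch.Tensor) -> torch.distributions.Normal:
        log_std = self.log_std(h).clamp(-5.0, 2.0)   # keep the std numerically sane
        return torch.distributions.Normal(self.mu(h), log_std.exp())

    def loss(self, h: torch.Tensor, target_actions: torch.Tensor) -> torch.Tensor:
        pi = self.dist(h)
        nll = -pi.log_prob(target_actions).sum(-1).mean()   # fit dataset actions
        entropy = pi.entropy().sum(-1).mean()                # encourage stochasticity
        return nll - self.entropy_weight * entropy

# Usage with hypothetical shapes: transformer hidden states and dataset actions.
head = GaussianActionHead(hidden_dim=128, act_dim=6)
h, a = torch.randn(32, 128), torch.randn(32, 6)
print(head.loss(h, a).item())
```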

Experimental Evaluation

The CDT framework was evaluated on a range of challenging robotic locomotion tasks. CDT outperformed several strong baselines, including constrained optimization methods and pessimism-based offline RL techniques, across all evaluated domains, improving both constraint satisfaction and reward. Moreover, CDT exhibited zero-shot adaptation, responding to different cost thresholds at deployment without the retraining that prior methods require.

Implications and Future Directions

This research underscores the power of combining sequence modeling and multi-objective optimization to address significant limitations in offline safe RL. The innovations introduced by CDT can potentially extend beyond RL, influencing broader fields where safety constraints intersect with performance optimization.

Looking forward, further research could explore theoretical guarantees surrounding CDT's safety performance and its potential application in more complex, dynamic environments. Moreover, addressing the computational demands associated with its Transformer-based architecture is a practical research direction for broader real-world applicability.

In conclusion, the paper advances the understanding of constrained decision-making in machine learning by offering a method that reconciles the safety-performance trade-off and adapts to varying constraints without exhaustive retraining. Such developments hold promise not only for reinforcement learning but also for a wide range of applications that require reliable decision-making under uncertainty and constraints.
