Critic-Guided Decision Transformer for Offline Reinforcement Learning (2312.13716v1)
Abstract: Recent advancements in offline reinforcement learning (RL) have underscored the capabilities of Return-Conditioned Supervised Learning (RCSL), a paradigm that learns the action distribution based on target returns for each state in a supervised manner. However, prevailing RCSL methods largely focus on deterministic trajectory modeling, disregarding stochastic state transitions and the diversity of future trajectory distributions. A fundamental challenge arises from the inconsistency between the sampled returns within individual trajectories and the expected returns across multiple trajectories. Fortunately, value-based methods offer a solution by leveraging a value function to approximate the expected returns, thereby addressing the inconsistency effectively. Building upon these insights, we propose a novel approach, termed the Critic-Guided Decision Transformer (CGDT), which combines the predictability of long-term returns from value-based methods with the trajectory modeling capability of the Decision Transformer. By incorporating a learned value function, known as the critic, CGDT ensures a direct alignment between the specified target returns and the expected returns of actions. This integration bridges the gap between the deterministic nature of RCSL and the probabilistic characteristics of value-based methods. Empirical evaluations on stochastic environments and D4RL benchmark datasets demonstrate the superiority of CGDT over traditional RCSL methods. These results highlight the potential of CGDT to advance the state of the art in offline RL and extend the applicability of RCSL to a wide range of RL tasks.
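The abstract describes CGDT only at a high level. As a rough illustration of the stated idea (aligning the expected return of a predicted action, as judged by a learned critic, with the specified target return), the sketch below combines a return-conditioned action-prediction loss with a critic-alignment term. The MLP policy and critic stand-ins, the squared-error form of both terms, and the loss weights are assumptions made for readability and are not taken from the paper, which uses a Decision Transformer backbone and its own objective.

```python
# Minimal sketch of a critic-guided, return-conditioned training step.
# Assumption: simple MLPs replace the actual Decision Transformer and critic;
# the loss form and weights are illustrative, not the paper's exact objective.
import torch
import torch.nn as nn

state_dim, action_dim, hidden = 11, 3, 64

# Policy conditioned on (state, target return): stand-in for the return-conditioned sequence model.
policy = nn.Sequential(nn.Linear(state_dim + 1, hidden), nn.ReLU(), nn.Linear(hidden, action_dim))
# Critic Q(s, a) approximating the expected return of an action (assumed learned from the offline data).
critic = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

def cgdt_style_loss(states, actions, target_returns, bc_weight=1.0, critic_weight=0.5):
    """Trajectory-modeling (RCSL) term plus a critic-alignment term (illustrative)."""
    pred_actions = policy(torch.cat([states, target_returns], dim=-1))
    bc_loss = ((pred_actions - actions) ** 2).mean()                      # imitate dataset actions
    expected_returns = critic(torch.cat([states, pred_actions], dim=-1))  # expected return of predicted action
    align_loss = ((expected_returns - target_returns) ** 2).mean()        # match expected vs. target return
    return bc_weight * bc_loss + critic_weight * align_loss

# Toy batch showing one training step.
s = torch.randn(32, state_dim)
a = torch.randn(32, action_dim)
g = torch.randn(32, 1)  # target return (return-to-go) tokens

loss = cgdt_style_loss(s, a, g)
opt.zero_grad()
loss.backward()
opt.step()
```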
Authors: Yuanfu Wang, Chao Yang, Ying Wen, Yu Liu, Yu Qiao