Decision Transformer: Reinforcement Learning via Sequence Modeling
The paper "Decision Transformer: Reinforcement Learning via Sequence Modeling" introduces an innovative framework which abstracts Reinforcement Learning (RL) as a sequence modeling problem. This novel approach leverages the flexibility and scalability of the Transformer architecture, which has gained substantial success in LLMing and image generation, to address RL challenges.
Key Contributions
The primary contribution of this paper is the Decision Transformer, an architecture that recasts the RL problem as conditional sequence modeling. Unlike traditional RL methods that fit value functions or compute policy gradients, the Decision Transformer outputs actions with a causally masked Transformer. By conditioning an autoregressive model on the desired return (return-to-go), past states, and actions, the Decision Transformer predicts future actions that achieve the desired return. This allows the framework to sidestep the bootstrapping mechanisms that conventional RL relies on for credit assignment, which are often unstable.
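Concretely, following the paper's notation, a trajectory is represented as an interleaved sequence of returns-to-go, states, and actions, where the return-to-go at timestep t is the sum of rewards from t to the end of the episode:

```latex
\tau = \left(\hat{R}_1, s_1, a_1,\ \hat{R}_2, s_2, a_2,\ \dots,\ \hat{R}_T, s_T, a_T\right),
\qquad
\hat{R}_t = \sum_{t'=t}^{T} r_{t'}
```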
Methodology
The essence of the Decision Transformer lies in its treatment of the RL problem. The approach can be broken down into several key steps:
- Trajectory Representation:
- Each trajectory is represented as an interleaved sequence of (return-to-go, state, action) tokens. To train the model, this sequence is fed into the Transformer after applying modality-specific linear embeddings and positional encodings.
- Training:
- Training is performed offline: minibatches of fixed-length subsequences are sampled from a dataset of logged trajectories, and the model is trained with a supervised loss to predict each action from the preceding returns-to-go, states, and actions, reducing RL to a sequence prediction task (see the training sketch after this list).
- Evaluation:
- The Decision Transformer is evaluated on well-established RL benchmarks such as Atari, OpenAI Gym, and the Key-to-Door task. It shows significant advantages in scenarios where long-term credit assignment is critical.
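To make these steps concrete, here is a minimal training sketch in PyTorch. The module layout, hyperparameters, and minibatch format are illustrative assumptions rather than the authors' released implementation; it shows the returns-to-go computation, modality-specific embeddings with a per-timestep embedding, causal masking over the interleaved (return-to-go, state, action) tokens, and a supervised action-prediction loss for continuous actions.

```python
import torch
import torch.nn as nn


def returns_to_go(rewards):
    """Return-to-go at each timestep: the sum of rewards from t to the end."""
    rtg = torch.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t].item()
        rtg[t] = running
    return rtg


class DecisionTransformer(nn.Module):
    def __init__(self, state_dim, act_dim, d_model=128, n_layers=3, n_heads=4, max_timestep=1024):
        super().__init__()
        # Modality-specific linear embeddings plus a learned per-timestep embedding.
        self.embed_rtg = nn.Linear(1, d_model)
        self.embed_state = nn.Linear(state_dim, d_model)
        self.embed_action = nn.Linear(act_dim, d_model)
        self.embed_timestep = nn.Embedding(max_timestep, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.predict_action = nn.Linear(d_model, act_dim)

    def forward(self, rtg, states, actions, timesteps):
        # rtg: (B, K, 1), states: (B, K, state_dim), actions: (B, K, act_dim), timesteps: (B, K)
        t_emb = self.embed_timestep(timesteps)
        r = self.embed_rtg(rtg) + t_emb
        s = self.embed_state(states) + t_emb
        a = self.embed_action(actions) + t_emb
        # Interleave tokens as (R_1, s_1, a_1, R_2, s_2, a_2, ...).
        B, K, D = r.shape
        tokens = torch.stack([r, s, a], dim=2).reshape(B, 3 * K, D)
        # Causal mask: each token attends only to earlier tokens in the sequence.
        mask = torch.triu(
            torch.full((3 * K, 3 * K), float("-inf"), device=tokens.device), diagonal=1
        )
        hidden = self.transformer(tokens, mask=mask)
        # Predict action a_t from the hidden state at the state token s_t.
        state_hidden = hidden.reshape(B, K, 3, D)[:, :, 1]
        return self.predict_action(state_hidden)


def train_step(model, optimizer, rtg, states, actions, timesteps):
    """One supervised step on a sampled minibatch (continuous actions, MSE loss)."""
    pred = model(rtg, states, actions, timesteps)
    loss = ((pred - actions) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```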
Evaluation and Results
The Decision Transformer is tested against state-of-the-art model-free offline RL baselines such as Conservative Q-Learning (CQL), as well as behavior cloning variants. It matches or surpasses the performance of these baselines in a range of settings:
- Atari Games:
The Decision Transformer achieves high normalized scores on games like Pong and Breakout, demonstrating an ability to handle high-dimensional observations and complex credit assignment challenges.
- OpenAI Gym:
In continuous control tasks from the D4RL benchmark, the Decision Transformer outperforms or matches the baselines in a majority of tasks, showcasing its robustness on problems requiring fine-grained control.
- Sparse Reward Tasks:
The Key-to-Door environment illustrates the model's proficiency in long-term credit assignment, an area where conventional TD-learning methods struggle.
Discussion and Implications
Comparison with Behavior Cloning
A notable comparison is with Percentile Behavior Cloning (%BC), which clones only the top fraction of trajectories in the dataset, ranked by return. While %BC can be effective, particularly in data-rich settings, the Decision Transformer often matches or surpasses it by using every trajectory to inform its sequence model, thereby improving its generalization; a sketch of the %BC filtering step follows.
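As a rough illustration of the %BC baseline, the selection step keeps only the highest-return trajectories and then runs ordinary behavior cloning on them. The trajectory dictionary format and the `pct` default below are assumptions for this sketch, not the paper's implementation.

```python
def percentile_bc_filter(trajectories, pct=0.10):
    """Keep the top `pct` fraction of trajectories ranked by episode return.

    `trajectories` is assumed to be a list of dicts with 'rewards', 'states',
    and 'actions' keys; the retained subset is then used as an ordinary
    supervised (behavior cloning) dataset mapping states to logged actions.
    """
    ranked = sorted(trajectories, key=lambda tr: sum(tr["rewards"]), reverse=True)
    keep = max(1, int(len(trajectories) * pct))
    return ranked[:keep]
```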
Robustness to Delayed Rewards
One of the compelling findings is the model's robustness to delayed rewards. Traditional RL methods that rely on dense reward signals can falter in environments where rewards are sparse or delayed. The Decision Transformer, however, maintains its performance through return-to-go conditioning: at evaluation time it is conditioned on a target return that is decremented by each observed reward as the episode unfolds, making it far less dependent on dense reward signals. A rollout sketch is shown below.
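Here is a minimal sketch of this return-conditioned rollout. The `model.get_action` helper and the Gym-style `env` interface are assumptions for illustration; the essential point is that the conditioning return-to-go is reduced by each observed reward.

```python
def evaluate_episode(model, env, target_return):
    """Roll out one episode while conditioning the model on a target return."""
    state = env.reset()
    rtg = target_return                      # return-to-go used for conditioning
    states, actions, rtgs, rewards = [state], [], [rtg], []
    done = False
    while not done:
        # The model autoregressively predicts the next action from the
        # history of returns-to-go, states, and actions seen so far.
        action = model.get_action(rtgs, states, actions)
        state, reward, done, _ = env.step(action)
        rtg -= reward                        # decrement target by observed reward
        states.append(state)
        actions.append(action)
        rtgs.append(rtg)
        rewards.append(reward)
    return sum(rewards)
```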
Future Directions
The results from this paper suggest multiple avenues for future research. Transformer-based sequence modeling could be extended to also model state dynamics, i.e., to predict how states evolve along a trajectory, which would support model-based RL applications. Additionally, the Decision Transformer could draw on the rich literature on self-supervised learning, which may improve its pretraining and adaptability.
Conclusion
In summary, the Decision Transformer presents a significant shift in RL methodology by framing it as a sequence modeling problem. Through extensive evaluations, it demonstrates competitive or superior performance relative to traditional RL methods across diverse benchmarks. This work bridges the gap between sequence modeling advancements and RL, proposing a robust and scalable framework for future RL research.