An Analysis of "StARformer: Transformer with State-Action-Reward Representations for Visual Reinforcement Learning"
The paper introduces an innovative approach to visual reinforcement learning (RL): StARformer, a transformer architecture designed to explicitly model state-action-reward (StAR) representations. The design introduces a Markovian-like inductive bias intended to improve long-term sequence modeling. The architecture is split into two main components, the Step Transformer and the Sequence Transformer, which together capture both the short-term and long-term dependencies present in reinforcement learning trajectories.
Methodology Overview
The StARformer architecture fundamentally restructures the way transformer models are applied in the context of reinforcement learning. The methodology involves several key components:
- State-Action-Reward Encoding: The Step Transformer attends over state-action-reward tokens within short temporal windows, using ViT-like patch embeddings that retain spatial granularity. This captures the strong local interactions induced by stepwise causal dependencies.
- Sequential Processing: The Sequence Transformer models long-range dependencies by attending over a combined input: the aggregated StAR representations produced by the Step Transformer, interleaved with global state representations drawn from the entire sequence of frames.
- Combined State Encoding: A dual encoding is employed: the Step Transformer uses ViT patches for fine-grained state-action correspondence, while the Sequence Transformer uses convolutional features that summarize each frame at a high level. This interplay combines detailed short-term learning with effective long-range contextual modeling, as sketched below.
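The interplay between the two transformers can be illustrated with a short sketch. The code below is a minimal PyTorch approximation of the idea, assuming 84x84 grayscale frames, a 6x6 patch grid, an 18-action discrete space, mean-pooled step summaries, and a simple interleaving of step summaries with convolutional frame features; all layer sizes, the pooling, and the fusion scheme are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal PyTorch sketch of the two-stream StARformer idea.
# Shapes, layer sizes, pooling, and the interleaved fusion are
# illustrative assumptions, not the authors' exact implementation.
import torch
import torch.nn as nn


class StepTransformer(nn.Module):
    """Self-attention over the tokens of a single step:
    ViT-style image patches plus embedded action and reward."""

    def __init__(self, d=192, patch=14, img=84, n_actions=18, n_layers=2, n_heads=3):
        super().__init__()
        n_patches = (img // patch) ** 2
        self.patch_embed = nn.Conv2d(1, d, kernel_size=patch, stride=patch)
        self.act_embed = nn.Embedding(n_actions, d)      # discrete action set (assumed size)
        self.rew_embed = nn.Linear(1, d)
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 2, d))
        layer = nn.TransformerEncoderLayer(d, n_heads, 4 * d, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, frame, action, reward):
        # frame: (B, 1, 84, 84), action: (B,), reward: (B, 1)
        p = self.patch_embed(frame).flatten(2).transpose(1, 2)   # (B, 36, d) patch tokens
        a = self.act_embed(action).unsqueeze(1)                  # (B, 1, d)
        r = self.rew_embed(reward).unsqueeze(1)                  # (B, 1, d)
        out = self.encoder(torch.cat([a, r, p], dim=1) + self.pos)
        return out.mean(dim=1)                                   # one StAR summary per step


class SequenceTransformer(nn.Module):
    """Causal self-attention over the trajectory, interleaving per-step
    StAR summaries with coarse convolutional whole-frame features."""

    def __init__(self, d=192, n_actions=18, n_layers=4, n_heads=3, max_steps=30):
        super().__init__()
        self.conv_state = nn.Sequential(                         # holistic frame encoder
            nn.Conv2d(1, 32, 8, 4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(d),
        )
        self.pos = nn.Parameter(torch.zeros(1, 2 * max_steps, d))
        layer = nn.TransformerEncoderLayer(d, n_heads, 4 * d, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.action_head = nn.Linear(d, n_actions)               # next-action prediction

    def forward(self, star_tokens, frames):
        # star_tokens: (B, T, d) from the Step Transformer; frames: (B, T, 1, 84, 84)
        B, T = star_tokens.shape[:2]
        s = self.conv_state(frames.flatten(0, 1)).view(B, T, -1)  # (B, T, d)
        x = torch.stack([star_tokens, s], dim=2).flatten(1, 2)    # interleave -> (B, 2T, d)
        x = x + self.pos[:, : 2 * T]
        mask = torch.triu(torch.full((2 * T, 2 * T), float("-inf")), diagonal=1)
        h = self.encoder(x, mask=mask)
        return self.action_head(h[:, 1::2])                       # predict from state slots: (B, T, A)


# Tiny usage example on random data.
B, T = 2, 10
frames = torch.randn(B, T, 1, 84, 84)
actions = torch.randint(0, 18, (B, T))
rewards = torch.randn(B, T, 1)

step_tf, seq_tf = StepTransformer(), SequenceTransformer()
star = torch.stack(
    [step_tf(frames[:, t], actions[:, t], rewards[:, t]) for t in range(T)], dim=1
)                                                                 # (B, T, d)
action_logits = seq_tf(star, frames)                              # (B, T, 18)
```

The key design choice this sketch tries to convey is the separation of concerns: the Step Transformer only sees tokens from one step, enforcing the Markovian-like bias, while the causal mask in the Sequence Transformer restricts each position to past context when modeling the long-horizon structure.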
Experimental Evaluation
The paper evaluates StARformer across various reinforcement learning environments, including Atari and the DeepMind Control Suite, under both offline RL and imitation learning protocols. A comparison with existing state-of-the-art methods such as Decision Transformer (DT) shows that StARformer achieves superior results.
- Performance on Different Sequence Lengths: A key finding is how StARformer scales with input sequence length: its performance not only holds but improves as sequences grow longer, demonstrating efficient long-term dependency modeling where methods like DT tend to struggle at longer sequence lengths.
- Attention Mechanism Efficacy: Attention maps from the trained model show that it aligns actions with task-relevant visual features, such as the paddle and ball locations in the Breakout environment. These visualizations support the claim that the model captures the spatial cues that matter for action selection; a rough recipe for producing such maps is sketched below.
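One way to reproduce this kind of visualization is to project the Step Transformer's patch-level attention back onto the frame. The snippet below is a rough, hypothetical recipe, assuming an attention matrix has already been extracted (e.g. via forward hooks) with token order [action, reward, 36 patches] on a 6x6 grid; the token layout and normalization are assumptions for illustration, not the paper's exact procedure.

```python
# Hypothetical sketch: turn per-step attention over image patches into a
# spatial heatmap. Assumes attn has shape (n_heads, n_tokens, n_tokens)
# with token order [action, reward, patch_1 ... patch_36] on a 6x6 grid.
import torch
import torch.nn.functional as F


def action_attention_heatmap(attn, grid=6, frame_size=84):
    w = attn.mean(dim=0)[0, 2:]                    # action token's attention to each patch
    heat = w.reshape(1, 1, grid, grid)
    heat = F.interpolate(heat, size=(frame_size, frame_size),
                         mode="bilinear", align_corners=False)
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
    return heat.squeeze()                          # (84, 84) map to overlay on the frame
```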
Implications for Future Research
StARformer's methodological innovation opens new avenues for reinforcement learning research:
- Exploration of Markovian Modeling: By explicitly modeling strong local connections, StARformer suggests the value of integrating more nuanced Markovian structure into transformer architectures for RL tasks.
- Balanced Representation Fusion: The successful merging of short-term detailed and long-term abstract representations indicates potential benefits for other sequence prediction tasks, not limited to reinforcement learning.
- Broader Applicability of Transformer-based RL: Given the results, StARformer-like architectures could be extended to domains where sequential and spatial precision jointly determine performance, such as autonomous driving or robotics.
Conclusion
StARformer presents a robust framework that merges detailed modeling of immediate causal interactions with long-horizon sequence modeling for reinforcement learning. By bridging short-term precision and long-term context, the architecture advances the state of the art in sequential decision-making from visual inputs. Researchers can build on this model to further improve the interpretability and efficacy of RL agents in increasingly complex environments.