
Drama: Mamba-Enabled Model-Based Reinforcement Learning Is Sample and Parameter Efficient (2410.08893v4)

Published 11 Oct 2024 in cs.LG, cs.AI, and cs.RO

Abstract: Model-based reinforcement learning (RL) offers a solution to the data inefficiency that plagues most model-free RL algorithms. However, learning a robust world model often requires complex and deep architectures, which are computationally expensive and challenging to train. Within the world model, sequence models play a critical role in accurate predictions, and various architectures have been explored, each with its own challenges. Currently, recurrent neural network (RNN)-based world models struggle with vanishing gradients and capturing long-term dependencies. Transformers, on the other hand, suffer from the quadratic memory and computational complexity of self-attention mechanisms, scaling as $O(n^2)$, where $n$ is the sequence length. To address these challenges, we propose a state space model (SSM)-based world model, Drama, specifically leveraging Mamba, that achieves $O(n)$ memory and computational complexity while effectively capturing long-term dependencies and enabling efficient training with longer sequences. We also introduce a novel sampling method to mitigate the suboptimality caused by an incorrect world model in the early training stages. Combining these techniques, Drama achieves a normalised score on the Atari100k benchmark that is competitive with other state-of-the-art (SOTA) model-based RL algorithms, using only a 7 million-parameter world model. Drama is accessible and trainable on off-the-shelf hardware, such as a standard laptop. Our code is available at https://github.com/realwenlongwang/Drama.git.

Authors (5)
  1. Wenlong Wang (77 papers)
  2. Ivana Dusparic (37 papers)
  3. Yucheng Shi (30 papers)
  4. Ke Zhang (264 papers)
  5. Vinny Cahill (3 papers)

Summary

Mamba-Enabled Model-Based Reinforcement Learning

The paper "Drama: Mamba-Enabled Model-Based Reinforcement Learning Is Sample and Parameter Efficient" explores a novel approach to addressing the limitations associated with model-based reinforcement learning (RL). Model-based RL has been shown to offer solutions to data inefficiency, which hinders the effectiveness of model-free RL techniques. However, the complexity of learning accurate world models has posed challenges, especially with recurrent neural network (RNN) and transformer-based architectures. This paper presents a state space model (SSM) based on Mamba, which claims enhanced efficiency and robustness over traditional models.

Key Contributions

  1. Introduction of DRAMA: DRAMA utilizes Mamba-based SSMs, notably Mamba-2, to achieve O(n) memory and computational complexity (a minimal recurrence sketch illustrating this appears after this list). This helps in capturing long-term dependencies effectively. The authors demonstrate DRAMA's competitiveness on the Atari100k benchmark with only 7 million trainable parameters, highlighting its efficiency.
  2. Comparison of Mamba Variants: By comparing Mamba-1 and Mamba-2 within the DRAMA framework, the authors show that Mamba-2 delivers superior performance, even though it trades some expressive power for greater training efficiency.
  3. Dynamic Frequency-Based Sampling (DFS): The paper introduces DFS, a method to mitigate the suboptimality caused by flawed world models in the early training stages. This technique adaptively samples transitions, improving the robustness and efficiency of the learning process; a hedged sketch appears below.
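
To ground the O(n) claim in contribution 1, the following is a minimal diagonal linear state space recurrence of the kind Mamba builds on. It is an illustrative toy, not the authors' implementation; the tensor names, shapes, and random parameters are assumptions. Each step updates h_t = A ⊙ h_{t-1} + B x_t and reads out y_t = C · h_t, so a length-n sequence is processed in a single pass with O(n) time and memory, unlike the O(n^2) pairwise scores of self-attention.

```python
# Minimal diagonal linear SSM scan (illustrative only; names and shapes are assumptions).
# Each step costs O(1), so a length-n sequence costs O(n) time and memory.
import torch

def ssm_scan(x, A, B, C):
    """x: (batch, seq_len, d_in); A, B, C: per-channel SSM parameters of shape (d_in, d_state)."""
    batch, seq_len, d_in = x.shape
    d_state = A.shape[-1]
    h = torch.zeros(batch, d_in, d_state)      # hidden state per input channel
    ys = []
    for t in range(seq_len):                   # single pass over the sequence
        u = x[:, t, :].unsqueeze(-1)           # (batch, d_in, 1)
        h = A * h + B * u                      # h_t = A ⊙ h_{t-1} + B x_t
        y = (C * h).sum(-1)                    # y_t = C · h_t, shape (batch, d_in)
        ys.append(y)
    return torch.stack(ys, dim=1)              # (batch, seq_len, d_in)

# Toy usage with random parameters.
d_in, d_state = 8, 16
A = torch.rand(d_in, d_state) * 0.9            # stable decay factors
B = torch.randn(d_in, d_state)
C = torch.randn(d_in, d_state)
y = ssm_scan(torch.randn(2, 32, d_in), A, B, C)
print(y.shape)                                 # torch.Size([2, 32, 8])
```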
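The summary does not spell out DFS in detail, so the sketch below shows one way frequency-based replay sampling can be realized: transitions that have been drawn less often receive proportionally more weight, which counteracts over-training on early, model-error-prone data. The buffer class, weighting rule, and temperature parameter are assumptions for illustration, not the authors' exact formulation.

```python
# Hedged sketch of frequency-aware replay sampling (illustrative; DRAMA's DFS
# weighting may differ).
import numpy as np

class FrequencyAwareBuffer:
    def __init__(self, capacity, temperature=1.0):
        self.storage = []                # stored transitions
        self.counts = []                 # how often each transition has been sampled
        self.capacity = capacity
        self.temperature = temperature   # assumed knob: higher favors rare transitions more

    def add(self, transition):
        if len(self.storage) >= self.capacity:
            self.storage.pop(0)
            self.counts.pop(0)
        self.storage.append(transition)
        self.counts.append(0)

    def sample(self, batch_size):
        counts = np.asarray(self.counts, dtype=np.float64)
        # Weight rarely-sampled transitions more heavily.
        weights = 1.0 / (1.0 + counts) ** self.temperature
        probs = weights / weights.sum()
        idx = np.random.choice(len(self.storage), size=batch_size, p=probs)
        for i in idx:
            self.counts[i] += 1
        return [self.storage[i] for i in idx]
```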

Methodology

The authors model the environment as a Partially Observable Markov Decision Process (POMDP), in which the agent receives high-dimensional observations and selects actions to maximize expected cumulative reward. The DRAMA framework pairs a variational autoencoder, which compresses observations into compact latents, with a dynamics model powered by Mamba-based SSMs. This combination allows for efficient sequence modeling without the computational burden of transformers or RNNs, enabling DRAMA to train on longer sequences.
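As a rough illustration of this structure, the skeleton below pairs a convolutional encoder with a sequence core and per-step heads for the next latent, reward, and continuation. The module names, layer sizes, and the plain linear core are assumptions made for brevity; DRAMA's actual world model uses a full VAE objective and stacked Mamba-2 blocks for the dynamics.

```python
# Rough skeleton of an SSM-style world model (illustrative; not DRAMA's code).
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    def __init__(self, latent_dim=128, action_dim=6, hidden_dim=256):
        super().__init__()
        # Encoder: compresses image observations into compact latents
        # (DRAMA uses a variational autoencoder; a plain conv encoder stands in here).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(latent_dim),
        )
        # Dynamics core: in DRAMA this is a stack of Mamba-2 blocks with O(n)
        # cost in sequence length; a single linear layer stands in here.
        self.dynamics = nn.Linear(latent_dim + action_dim, hidden_dim)
        # Heads predicting per-step quantities from the dynamics features.
        self.next_latent_head = nn.Linear(hidden_dim, latent_dim)
        self.reward_head = nn.Linear(hidden_dim, 1)
        self.continue_head = nn.Linear(hidden_dim, 1)

    def forward(self, obs_seq, action_seq):
        """obs_seq: (batch, seq, 3, H, W); action_seq: (batch, seq, action_dim)."""
        b, t = obs_seq.shape[:2]
        latents = self.encoder(obs_seq.flatten(0, 1)).view(b, t, -1)
        feats = torch.relu(self.dynamics(torch.cat([latents, action_seq], dim=-1)))
        return (self.next_latent_head(feats),
                self.reward_head(feats),
                torch.sigmoid(self.continue_head(feats)))
```

In the actual framework, replacing the linear core with Mamba-2 blocks keeps the per-step cost constant as training sequences grow, which is what makes longer rollouts affordable on modest hardware.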

Results and Implications

The experimental results on the Atari100k benchmark indicate that DRAMA achieves a normalized score comparable to state-of-the-art methods with significantly fewer parameters. This is especially notable in environments with dense rewards and long-term dependencies.

  • Performance Evaluations: DRAMA excels in games requiring prediction of complex dynamics, indicating that Mamba-2 effectively captures long-range dependencies. However, it encounters challenges in environments with sparse rewards or sophisticated exploration needs.
  • Model Efficiency: The authors demonstrate the sample and parameter efficiency of Mamba-enabled RL models. This efficiency could enable applications in resource-constrained settings, such as embedded systems or real-world robotic tasks.

Future Directions

The research opens avenues for further exploration of Mamba-based architectures in RL, particularly in tasks demanding long-horizon planning and dynamic adjustments to models. Additionally, investigating the intersection of SSMs with multitask and meta-learning could yield significant advancements.

Understanding how Mamba can bolster more informed exploration strategies in RL is another promising direction. This paper demonstrates the potential of using structured state space models in model-based RL, suggesting that these methods could eventually address longstanding challenges in the field, such as improving exploration efficacy and policy learning in complex environments.

Overall, this paper provides a substantial contribution to the ongoing conversation about the capabilities and evolution of model-based RL algorithms. By emphasizing efficiency and robustness, it lays a foundation for future work that aims to harness these strengths for broader, more dynamic applications in artificial intelligence.
