Overview of TD-MPC2: Scalable, Robust World Models for Continuous Control
The paper presents TD-MPC2, a model-based reinforcement learning (RL) algorithm that substantially improves on the original TD-MPC. TD-MPC2 prioritizes scalability, robustness, and simplicity of the learning process, achieving strong performance across 104 continuous control tasks without per-task hyperparameter tuning. This marks a substantial step toward generalist embodied agents trained on diverse tasks with a single set of hyperparameters.
Key Contributions
TD-MPC2 is built around several key improvements over its predecessor, TD-MPC:
- Algorithmic Enhancements:
- TD-MPC2 improves robustness through refined architectural choices and normalization techniques that stabilize training, notably Simplicial Normalization (SimNorm), which keeps latent state representations bounded by projecting groups of latent dimensions onto probability simplices (see the SimNorm sketch after this list).
- Its architecture scales effectively to 317 million parameters, supporting extensive multitask RL settings.
- Task Diversity Accommodation:
- The algorithm handles a variety of tasks across different domains, embodiments, and action spaces, maintaining consistent performance across all without task-specific tuning.
- Learnable task embeddings condition the model on task identity, enabling TD-MPC2 to capture task semantics and share structure across tasks in multitask learning (see the embedding sketch after this list).
- Planning and Control Improvements:
- By adopting a stochastic maximum entropy policy as its policy prior, TD-MPC2 remains robust to variation in action spaces and avoids the limitations of a deterministic policy.
- It integrates Model Predictive Control (MPC) with this policy prior to optimize action sequences adaptively, planning over trajectories in latent space (see the planner sketch after this list).
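To make SimNorm concrete, here is a minimal PyTorch sketch. Following the paper's description, the latent vector is split into fixed-size groups and a softmax is applied within each group, so each group lies on a probability simplex; the group size of 8 matches the paper's default, while the function name and shape handling are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def simnorm(z: torch.Tensor, group_size: int = 8) -> torch.Tensor:
    """Simplicial Normalization (SimNorm), sketched after TD-MPC2.

    Splits the latent vector into fixed-size groups and applies a
    softmax within each group, so every group lies on a probability
    simplex. This bounds the latent state and stabilizes training.
    """
    batch, dim = z.shape
    assert dim % group_size == 0, "latent dim must be divisible by group size"
    z = z.view(batch, dim // group_size, group_size)
    z = F.softmax(z, dim=-1)  # project each group onto a simplex
    return z.view(batch, dim)

# Usage: apply as the final activation of the encoder, e.g.
# z = simnorm(encoder(obs))
```

Because each group sums to 1, the norm of the latent state stays bounded regardless of network scale, which is one reason training remains stable as the model grows.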
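The learnable task embeddings can be sketched in a similar spirit. The hypothetical module below assumes, as in the paper, that each task's embedding has its norm constrained to at most 1 and is concatenated with the inputs of the model components; the class name, the `embed_dim=96` default, and the concatenation interface are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TaskEmbedding(nn.Module):
    """Hypothetical sketch of TD-MPC2-style learnable task embeddings."""

    def __init__(self, num_tasks: int, embed_dim: int = 96):
        super().__init__()
        self.embed = nn.Embedding(num_tasks, embed_dim)

    def task_vec(self, task_id: torch.Tensor) -> torch.Tensor:
        e = self.embed(task_id)
        # Constrain the l2-norm of each embedding to at most 1.
        return e / e.norm(dim=-1, keepdim=True).clamp(min=1.0)

    def forward(self, x: torch.Tensor, task_id: torch.Tensor) -> torch.Tensor:
        # Condition a model input on task identity by concatenation.
        return torch.cat([x, self.task_vec(task_id)], dim=-1)
```

Because the embeddings are learned jointly with the world model, semantically related tasks can end up with similar embeddings, which supports sharing structure across tasks.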
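Finally, a simplified sketch of the MPPI-style planning loop that combines MPC with the policy prior. This is not the paper's exact implementation: the `world_model.dynamics`/`reward`/`value` and `policy.sample` interfaces, the fixed discount, and all hyperparameter defaults are assumptions chosen for readability.

```python
import torch

@torch.no_grad()
def plan(z, world_model, policy, action_dim, horizon=3, num_samples=512,
         num_pi_samples=24, num_elites=64, iterations=6, temperature=0.5):
    """Simplified MPPI-style planner in the spirit of TD-MPC2.

    z: current latent state, shape (1, latent_dim). Returns the first
    action of the best plan. Interfaces and defaults are illustrative.
    """
    # Gaussian over action sequences (the paper warm-starts the mean
    # from the previous time step's solution; zeros here for brevity).
    mean = torch.zeros(horizon, action_dim)
    std = torch.full((horizon, action_dim), 2.0)

    # A few trajectories sampled from the stochastic policy prior.
    pi_actions = torch.empty(horizon, num_pi_samples, action_dim)
    z_pi = z.expand(num_pi_samples, -1)
    for t in range(horizon):
        pi_actions[t] = policy.sample(z_pi)
        z_pi = world_model.dynamics(z_pi, pi_actions[t])

    for _ in range(iterations):
        # Sample candidate action sequences and append the prior's.
        noise = torch.randn(horizon, num_samples, action_dim)
        actions = (mean.unsqueeze(1) + std.unsqueeze(1) * noise).clamp(-1, 1)
        actions = torch.cat([actions, pi_actions], dim=1)

        # Score candidates: unroll latent dynamics, accumulate predicted
        # rewards, and bootstrap with the terminal value estimate.
        num_cands = actions.shape[1]
        returns = torch.zeros(num_cands)
        z_t, discount = z.expand(num_cands, -1), 1.0
        for t in range(horizon):
            returns += discount * world_model.reward(z_t, actions[t])
            z_t = world_model.dynamics(z_t, actions[t])
            discount *= 0.99  # the paper uses task-dependent discounts
        returns += discount * world_model.value(z_t)

        # Refit the Gaussian to the exponentially weighted elites.
        elite_idx = returns.topk(num_elites).indices
        elite_returns = returns[elite_idx]
        elite_actions = actions[:, elite_idx]
        weights = torch.softmax(elite_returns / temperature, dim=0)
        mean = (weights[None, :, None] * elite_actions).sum(dim=1)
        var = (weights[None, :, None]
               * (elite_actions - mean.unsqueeze(1)) ** 2).sum(dim=1)
        std = var.sqrt().clamp(0.05, 2.0)

    return mean[0]  # execute the first action, then replan (MPC)
```

Mixing policy-prior samples into the candidate pool keeps planning effective even when the Gaussian search distribution is poorly initialized, while the value bootstrap lets a short horizon (here 3 steps) account for long-term returns.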
Experimental Validation
The paper validates TD-MPC2 across a comprehensive suite of tasks spanning domains such as DMControl, Meta-World, and MyoSuite. Notably, the algorithm stands out on complex locomotion and multi-object manipulation tasks, underscoring its robustness.
- TD-MPC2 consistently surpasses strong model-free (e.g., SAC) and model-based (e.g., DreamerV3) RL baselines. The experiments further indicate that, with an appropriately scaled architecture, TD-MPC2 achieves improved data efficiency and training stability.
Implications and Future Directions
The implications of TD-MPC2 are substantial for both practical applications and theoretical foundations in RL:
- Practical Utility: By ensuring robust multitask learning and scalability, TD-MPC2 can cater to real-world applications requiring diverse skill sets, making it particularly valuable for robotics where tasks may vary widely.
- Theoretical Advancements: The design choices, such as SimNorm and task embeddings, highlight new directions for RL research, emphasizing the importance of stability and robustness in model architectures.
Future developments could extend TD-MPC2 to discrete action spaces, aligning with the goal of universally applicable RL strategies. Additionally, exploring how implicit world models can be combined with broader cognitive capabilities could open avenues for more integrated AI systems.
In conclusion, TD-MPC2 represents a significant stride in the evolution of scalable and robust world models for RL, suggesting promising pathways for future algorithms and applications. Its demonstrated effectiveness across a wide spectrum of tasks underscores the potential of well-designed model-based approaches to meet the demands of current and future RL challenges.