Temporal Difference Learning for Model Predictive Control (2203.04955v2)

Published 9 Mar 2022 in cs.LG and cs.RO

Abstract: Data-driven model predictive control has two key advantages over model-free methods: a potential for improved sample efficiency through model learning, and better performance as computational budget for planning increases. However, it is both costly to plan over long horizons and challenging to obtain an accurate model of the environment. In this work, we combine the strengths of model-free and model-based methods. We use a learned task-oriented latent dynamics model for local trajectory optimization over a short horizon, and use a learned terminal value function to estimate long-term return, both of which are learned jointly by temporal difference learning. Our method, TD-MPC, achieves superior sample efficiency and asymptotic performance over prior work on both state and image-based continuous control tasks from DMControl and Meta-World. Code and video results are available at https://nicklashansen.github.io/td-mpc.

The paper "Temporal Difference Learning for Model Predictive Control" presents TD-MPC, a framework that combines temporal difference (TD) learning with model predictive control (MPC) for reinforcement learning (RL) in continuous control tasks. It draws on the strengths of both model-based and model-free methods to improve sample efficiency and asymptotic performance.

Technical Contributions

The authors introduce a task-oriented latent dynamics model used for local trajectory optimization over a short horizon, paired with a terminal value function that estimates long-term return. Both components are learned jointly by TD learning, a departure from methods that train the model to predict full states or images. Because the model only has to capture what is predictive of reward, the approach can be more sample efficient, and the short horizon limits the compounding errors that model inaccuracies cause over long rollouts.
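
The interplay between the short-horizon model rollout and the terminal value can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`rollout_return`, `plan`), the toy one-dimensional latent state, and the stand-in `dynamics`, `reward`, `q_value`, and `policy` functions are all assumptions made for the example, and plain random shooting is swapped in for the paper's sampling-based trajectory optimizer to keep the sketch short.

```python
import random


def rollout_return(z, actions, dynamics, reward, q_value, policy, gamma=0.99):
    """Discounted model reward over the horizon plus a bootstrapped terminal value."""
    total, discount = 0.0, 1.0
    for a in actions:
        total += discount * reward(z, a)   # predicted reward at this step
        z = dynamics(z, a)                 # step the latent model forward
        discount *= gamma
    # The terminal value stands in for everything beyond the planning horizon.
    return total + discount * q_value(z, policy(z))


def plan(z0, dynamics, reward, q_value, policy,
         horizon=5, num_samples=256, gamma=0.99, seed=0):
    """Random-shooting MPC: score sampled action sequences and keep the best one."""
    rng = random.Random(seed)
    best_score, best_actions = float("-inf"), None
    for _ in range(num_samples):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        score = rollout_return(z0, actions, dynamics, reward, q_value, policy, gamma)
        if score > best_score:
            best_score, best_actions = score, actions
    # Only the first action is executed; planning is repeated at the next step.
    return best_actions[0], best_score


# Hypothetical stand-ins for the learned components (toy 1-D latent state).
def dynamics(z, a):
    return z + 0.5 * a


def reward(z, a):
    return -z * z


def policy(z):
    return max(-1.0, min(1.0, -2.0 * z))


def q_value(z, a):
    return -10.0 * (z + 0.5 * a) ** 2


first_action, score = plan(1.0, dynamics, reward, q_value, policy)
```

In this sketch, shrinking `horizon` shifts more of the return estimate onto `q_value`, which is the trade-off the terminal value function is meant to make affordable.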

Methodology

The framework is underpinned by:

Model Predictive Control (MPC): Trajectory optimization is performed over a short planning horizon, which keeps the computational cost manageable. A terminal value function accounts for return beyond the horizon, so short-horizon plans can still approximate long-horizon behavior.
Task-Oriented Latent Dynamics (TOLD): Unlike previous approaches that predict entire states or observations, TOLD models only the task-relevant parts of the environment, namely those predictive of reward, and so avoids spending capacity on irrelevant environmental features.
Temporal Difference Learning: The latent dynamics model and the terminal state-action value function are optimized jointly with a TD objective, so the value function used at the end of each planned trajectory is trained together with the model that produces it.
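
The TD objective in the last item can be illustrated with a one-step target. The following is a minimal sketch, assuming scalar values and hypothetical helper names (`td_target`, `td_loss`); the training objective in the paper is richer than this single term, and the sketch only shows how a value estimate is regressed toward a bootstrapped target.

```python
def td_target(reward, next_q, gamma=0.99, done=False):
    """One-step TD target: r + gamma * Q(z', a'), with no bootstrap at terminal steps."""
    return reward + (0.0 if done else gamma * next_q)


def td_loss(q_pred, reward, next_q, gamma=0.99, done=False):
    """Squared TD error; the target is treated as a fixed number, not differentiated."""
    return (q_pred - td_target(reward, next_q, gamma, done)) ** 2


# Hypothetical values: the current estimate already matches the target, so the loss is zero.
loss = td_loss(q_pred=2.0, reward=1.0, next_q=2.0, gamma=0.5)
```

In a learned system, `q_pred` and `next_q` would come from the value network evaluated on latent states, with gradients flowing only through `q_pred`.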

Experimental Evaluation

The paper reports results on continuous control tasks from the DMControl and Meta-World benchmarks. TD-MPC outperforms model-free (e.g., SAC) and model-based (e.g., Dreamer) baselines in both sample efficiency and asymptotic performance, including on the high-dimensional Dog and Humanoid tasks. It also remains competitive in image-based RL without hyperparameters tailored to those tasks.

Implications

Combining a latent dynamics model with a terminal value function, both trained by TD learning, points toward more efficient RL algorithms that scale across diverse control tasks. This is particularly useful when compute or samples are limited, which addresses key limitations of traditional model-based methods.

Future Directions

The paper points to several directions for future work: improving exploration in TD-MPC, narrowing the performance gap between the learned policy and planning, and exploring architectural changes that improve model learning. Applying the approach beyond continuous control, or to richer sensory inputs, could broaden its impact.

In conclusion, the paper combines MPC and TD learning into a method for continuous control that is both computationally feasible and sample efficient.

Authors

Nicklas Hansen
Xiaolong Wang
Hao Su
