Actor-Critic Model Predictive Control: A Synthesis of RL and MPC
The paper "Actor-Critic Model Predictive Control" proposes a novel framework that synergistically combines model-free reinforcement learning (RL) and model predictive control (MPC) to enhance robotics control systems. This approach is particularly aimed at leveraging the complementary strengths of both methods: RL’s proficiency in optimizing flexible reward structures and MPC’s robustness in online replanning.
Overview of the Framework
The core contribution of this work is a differentiable MPC embedded within an actor-critic RL architecture. The actor uses the MPC to optimize short-horizon predictive actions, enabling robust short-term adjustments, while the critic network accounts for the long-term consequences of those actions, yielding a more balanced control strategy.
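To make the architecture concrete, the following is a minimal, hypothetical PyTorch sketch of the idea. It approximates the differentiable MPC actor by unrolling a few gradient steps of a short-horizon trajectory optimization through a toy linear model, and pairs it with an ordinary value-network critic. The class names, dimensions, toy dynamics, and cost parameterization are illustrative assumptions made for exposition, not the paper's implementation.

```python
import torch
import torch.nn as nn


class DifferentiableMPCActor(nn.Module):
    """Actor that plans a short action sequence by unrolled gradient descent
    on a learnable quadratic cost and returns the first planned action."""

    def __init__(self, state_dim, action_dim, horizon=5, inner_steps=10, inner_lr=0.1):
        super().__init__()
        self.horizon, self.inner_steps, self.inner_lr = horizon, inner_steps, inner_lr
        self.action_dim = action_dim
        # Learnable cost terms the outer RL loop adapts (illustrative choice).
        self.goal = nn.Parameter(torch.zeros(state_dim))
        self.log_q = nn.Parameter(torch.zeros(state_dim))   # state cost weights
        self.log_r = nn.Parameter(torch.zeros(action_dim))  # action cost weights
        # Toy linear model x' = A x + B u standing in for the real dynamics.
        self.register_buffer("A", torch.eye(state_dim))
        self.register_buffer("B", 0.1 * torch.randn(state_dim, action_dim))

    def _plan_cost(self, x0, actions):
        """Roll the model forward over the horizon and sum quadratic costs."""
        q, r = self.log_q.exp(), self.log_r.exp()
        x, cost = x0, torch.zeros(())
        for u in actions:
            x = self.A @ x + self.B @ u
            cost = cost + (q * (x - self.goal) ** 2).sum() + (r * u ** 2).sum()
        return cost

    def forward(self, x0):
        # Unrolled inner optimization keeps the plan differentiable with
        # respect to the cost parameters, so outer losses can flow back.
        actions = torch.zeros(self.horizon, self.action_dim, requires_grad=True)
        for _ in range(self.inner_steps):
            cost = self._plan_cost(x0, actions)
            (grad,) = torch.autograd.grad(cost, actions, create_graph=True)
            actions = actions - self.inner_lr * grad
        return actions[0]  # receding horizon: execute only the first action


class Critic(nn.Module):
    """Ordinary state-value MLP estimating long-horizon return."""

    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)


if __name__ == "__main__":
    actor, critic = DifferentiableMPCActor(state_dim=4, action_dim=2), Critic(4)
    x = torch.randn(4)
    u = actor(x)   # first action of the short-horizon plan
    v = critic(x)  # long-horizon value estimate
    print(u.shape, v.shape)
```

In the paper's framework, the critic supplies the long-horizon learning signal while the MPC handles short-horizon planning; the sketch only illustrates the key mechanism, namely that gradients can flow through an unrolled planner back into its learnable cost, so a standard actor-critic objective can train it end to end.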
This methodology is validated on a quadcopter platform across several high-level tasks, both in simulation and in real-world experiments. The results show that the framework runs in real time, learns complex behaviors, and retains MPC's predictive structure, which helps it handle scenarios outside the training distribution.
Numerical Results and Claims
The paper underscores the system's ability to execute agile flight tasks while remaining robust to disturbances and generalizing to novel conditions. The comparative studies show:
- Actor-Critic MPC (AC-MPC) significantly outperforms the baseline actor-critic policy with an MLP actor (AC-MLP) under unseen disturbances such as strong wind, achieving an 83.33% success rate in these adverse conditions.
- AC-MPC demonstrates improved success rates in completing challenging tracks compared to a conventional tracking MPC, particularly in experiments that introduce variations in initial conditions.
- The architecture exhibits robust sim-to-real transfer, requiring no additional tuning when transitioning from simulated environments to real-world applications.
Implications and Future Directions
The integration of MPC within RL opens several avenues for improving the robustness and adaptability of learned control policies in robotics. Retaining model-based predictive capabilities is a substantial advantage in environments where unmodeled disturbances would otherwise degrade performance.
From a theoretical perspective, the synthesis of short-term and long-term decision-making via an actor-critic approach may be applicable to other domains requiring dynamic adaptability. This work paves the way for further exploration of modular control architectures, where learning and model-based strategies are not mutually exclusive but rather synergistically integrated.
Potential future developments could extend the framework to more complex dynamics and constraints. Improving the computational efficiency of differentiable MPC solvers will also be critical for applying the approach to a wider range of robotic systems.
Moreover, exploring how this integrated framework can be applied to different robotic platforms and environments will be crucial for validating its generalizability and robustness in diverse operational contexts.