Mirror Descent Policy Optimization (2005.09814v5)

Published 20 May 2020 in cs.LG, cs.AI, and stat.ML

Abstract: Mirror descent (MD), a well-known first-order method in constrained convex optimization, has recently been shown as an important tool to analyze trust-region algorithms in reinforcement learning (RL). However, there remains a considerable gap between such theoretically analyzed algorithms and the ones used in practice. Inspired by this, we propose an efficient RL algorithm, called {\em mirror descent policy optimization} (MDPO). MDPO iteratively updates the policy by {\em approximately} solving a trust-region problem, whose objective function consists of two terms: a linearization of the standard RL objective and a proximity term that restricts two consecutive policies to be close to each other. Each update performs this approximation by taking multiple gradient steps on this objective function. We derive {\em on-policy} and {\em off-policy} variants of MDPO, while emphasizing important design choices motivated by the existing theory of MD in RL. We highlight the connections between on-policy MDPO and two popular trust-region RL algorithms: TRPO and PPO, and show that explicitly enforcing the trust-region constraint is in fact {\em not} a necessity for high performance gains in TRPO. We then show how the popular soft actor-critic (SAC) algorithm can be derived by slight modifications of off-policy MDPO. Overall, MDPO is derived from the MD principles, offers a unified approach to viewing a number of popular RL algorithms, and performs better than or on-par with TRPO, PPO, and SAC in a number of continuous control tasks. Code is available at \url{https://github.com/manantomar/Mirror-Descent-Policy-Optimization}.

Authors (4)
  1. Manan Tomar (14 papers)
  2. Lior Shani (16 papers)
  3. Yonathan Efroni (38 papers)
  4. Mohammad Ghavamzadeh (97 papers)
Citations (73)

Summary

  • The paper introduces MDPO, a novel reinforcement learning algorithm that approximates trust-region problems using mirror descent for stable policy updates.
  • It demonstrates a unique link between on-policy trust-region methods like TRPO/PPO and off-policy SAC through minor modifications in the MDPO framework.
  • Empirical results across continuous control tasks show MDPO rivals or surpasses TRPO, PPO, and SAC, highlighting its competitive efficacy.

Mirror Descent Policy Optimization

The paper introduces a novel reinforcement learning (RL) algorithm termed Mirror Descent Policy Optimization (MDPO). This algorithm is derived from mirror descent (MD), a first-order method in constrained convex optimization that has gained traction as a tool for analyzing trust-region algorithms within reinforcement learning contexts.

Methodological Contributions

MDPO iteratively updates the policy by approximately solving a trust-region problem. The objective of this problem combines two components: a linearization of the standard RL objective and a proximity term that keeps consecutive policies close to each other. The proximity term prevents abrupt policy shifts and thereby promotes stable learning. Rather than solving each trust-region problem exactly, MDPO takes multiple gradient steps on this objective at every iteration; the on-policy update is sketched below.
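In symbols, the on-policy update described above takes roughly the following form, where rho_{pi_k} denotes the state distribution induced by the current policy pi_k, A^{pi_k} its advantage function, and t_k a step size; the notation is assumed from standard trust-region formulations, and the paper should be consulted for the precise statement.

```latex
\pi_{k+1} \;\leftarrow\; \arg\max_{\pi \in \Pi}\;
\mathbb{E}_{s \sim \rho_{\pi_k}}\!\Big[
  \mathbb{E}_{a \sim \pi(\cdot|s)}\big[ A^{\pi_k}(s,a) \big]
  \;-\; \tfrac{1}{t_k}\,\mathrm{KL}\big( \pi(\cdot|s)\,\|\,\pi_k(\cdot|s) \big)
\Big]
```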

The authors present both on-policy and off-policy variants of MDPO, emphasizing design choices motivated by the existing theory of MD in RL. Notably, the paper draws connections between on-policy MDPO and the well-established trust-region algorithms Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO): rather than enforcing a hard KL constraint as in TRPO or clipping the importance-sampling ratio as in PPO, MDPO penalizes the KL divergence directly in its objective. The paper shows that explicitly enforcing the trust-region constraint is not necessary for the large performance gains attributed to TRPO, a finding with notable implications for RL algorithm design. A minimal sketch of the resulting multi-step update appears below.
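The following PyTorch-style sketch illustrates one on-policy MDPO iteration under these assumptions: `policy` and `old_policy` map a batch of states to `torch.distributions` objects, `advantages` are precomputed estimates, and the function name and interface are hypothetical rather than the authors' implementation (the official code is linked in the abstract).

```python
import torch

def mdpo_on_policy_update(policy, old_policy, optimizer, states, actions,
                          advantages, step_size_tk, num_sgd_steps=10):
    """One MDPO iteration: several SGD steps on the surrogate
    E[ratio * A] - (1/t_k) * KL(pi_theta || pi_theta_k),
    with the previous policy pi_theta_k held fixed throughout the inner loop."""
    with torch.no_grad():
        old_dist = old_policy(states)
        old_log_probs = old_dist.log_prob(actions)

    for _ in range(num_sgd_steps):
        dist = policy(states)
        log_probs = dist.log_prob(actions)

        # Linearized RL objective via importance sampling on the collected actions.
        ratio = torch.exp(log_probs - old_log_probs)
        surrogate = (ratio * advantages).mean()

        # KL proximity term to the fixed previous policy
        # (no hard constraint as in TRPO, no ratio clipping as in PPO).
        kl = torch.distributions.kl_divergence(dist, old_dist).mean()

        loss = -(surrogate - kl / step_size_tk)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```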

Relation to Other Algorithms

An intriguing aspect of this work is that the popular Soft Actor-Critic (SAC) algorithm can be derived through slight modifications of off-policy MDPO, essentially by replacing the KL proximity term to the previous policy with an entropy regularizer. This connection not only validates MDPO's foundational principles but also establishes it as a unifying lens through which several popular RL algorithms can be understood.
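As a rough illustration of this connection, the sketch below contrasts an off-policy MDPO-style policy loss with the SAC policy loss, assuming reparameterizable policies; all interfaces (`policy`, `old_policy`, `q_net`) and the exact scaling by the step size t_k are illustrative assumptions rather than the paper's precise formulation.

```python
import torch

def off_policy_mdpo_policy_loss(policy, old_policy, q_net, states, step_size_tk):
    """Illustrative off-policy MDPO-style policy loss: minimize
    (1/t_k) * [log pi_theta(a|s) - log pi_theta_k(a|s)] - Q(s, a)
    with a ~ pi_theta drawn via the reparameterization trick.
    `old_policy` is a frozen copy of the previous policy; only `policy` is updated."""
    dist = policy(states)
    actions = dist.rsample()          # reparameterized sample; gradients flow through a
    log_pi = dist.log_prob(actions)
    old_log_pi = old_policy(states).log_prob(actions)
    q_values = q_net(states, actions)
    return ((log_pi - old_log_pi) / step_size_tk - q_values).mean()

def sac_policy_loss(policy, q_net, states, alpha):
    """SAC policy loss for comparison: the KL proximity term to the previous
    policy is replaced by an entropy regularizer (alpha * log pi_theta)."""
    dist = policy(states)
    actions = dist.rsample()
    log_pi = dist.log_prob(actions)
    q_values = q_net(states, actions)
    return (alpha * log_pi - q_values).mean()
```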

Empirical Results

The empirical evaluation shows that MDPO performs better than or on par with TRPO, PPO, and SAC across a number of continuous control tasks. These findings position MDPO as a competitive addition to the RL algorithm suite for continuous control.

Implications and Future Directions

MDPO's development has both theoretical and practical implications. Its foundation in mirror descent highlights the utility of optimization principles in structuring RL algorithms, a perspective that encourages further exploration of optimization theory in RL and may yield new approaches or refinements of existing methods.

Future explorations could investigate the integration of MDPO with other RL frameworks, or extend its application to domains beyond continuous control. The adaptability and theoretical rigor of MDPO may render it suitable for diverse application scenarios, and continued research may enhance its versatility and efficiency.

Overall, MDPO provides a noteworthy advancement in RL by elegantly synthesizing optimization techniques with practical algorithm design, yielding insights that could influence the trajectory of RL research and methodology development.
