
Adaptive Trust Region Policy Optimization: Global Convergence and Faster Rates for Regularized MDPs (1909.02769v2)

Published 6 Sep 2019 in cs.LG, math.OC, and stat.ML

Abstract: Trust region policy optimization (TRPO) is a popular and empirically successful policy search algorithm in Reinforcement Learning (RL) in which a surrogate problem, that restricts consecutive policies to be 'close' to one another, is iteratively solved. Nevertheless, TRPO has been considered a heuristic algorithm inspired by Conservative Policy Iteration (CPI). We show that the adaptive scaling mechanism used in TRPO is in fact the natural "RL version" of traditional trust-region methods from convex analysis. We first analyze TRPO in the planning setting, in which we have access to the model and the entire state space. Then, we consider sample-based TRPO and establish $\tilde O(1/\sqrt{N})$ convergence rate to the global optimum. Importantly, the adaptive scaling mechanism allows us to analyze TRPO in regularized MDPs for which we prove fast rates of $\tilde O(1/N)$, much like results in convex optimization. This is the first result in RL of better rates when regularizing the instantaneous cost or reward.

Citations (167)

Summary

  • The paper proves global convergence for unregularized TRPO and faster $\tilde{O}(1/N)$ rates for regularized MDPs, bridging theoretical rigor with practical RL.
  • It demonstrates that TRPO's adaptive scaling mirrors convex optimization trust regions and establishes sample-based convergence competitive with Conservative Policy Iteration (CPI).
  • Regularization is shown to accelerate convergence but introduces potential bias, underscoring practical relevance for large-scale RL tasks with entropy considerations.

Adaptive Trust Region Policy Optimization: Convergence Analysis

In this paper, the authors analyze Trust Region Policy Optimization (TRPO), a prominent policy search algorithm in reinforcement learning (RL), grounding it in convex optimization and proving its convergence properties. The focus is on establishing both global convergence and faster rates for regularized Markov Decision Processes (MDPs).

Algorithmic Foundations and Analysis

TRPO Overview:

The paper begins by demystifying TRPO, traditionally regarded as a heuristic inspired by Conservative Policy Iteration, and shows that its adaptive scaling mechanism is the natural RL counterpart of classical trust-region methods from convex optimization. TRPO optimizes policies by iteratively solving surrogate problems that keep consecutive policies close to one another; this proximity is enforced by an adaptively scaled Bregman distance, mirroring the role of proximity terms in classical trust-region and mirror-descent methods.
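
Schematically, and hedging on notation (the paper works state-wise, in a cost-minimization convention, and with its own constants), each iteration solves a linearized objective plus a scaled Bregman proximity term:

```latex
% Illustrative form of the per-iteration surrogate; B_omega is the Bregman
% distance generated by a strongly convex omega, and t_k is the adaptive step size.
\pi_{k+1} \in \arg\min_{\pi}
  \Big\{ \langle \nabla V^{\pi_k}, \pi - \pi_k \rangle
         + \tfrac{1}{t_k}\, B_{\omega}(\pi, \pi_k) \Big\},
\qquad
B_{\omega}(\pi, \pi_k) = \omega(\pi) - \omega(\pi_k)
  - \langle \nabla \omega(\pi_k),\, \pi - \pi_k \rangle .
```

Choosing $\omega$ as the negative entropy makes $B_\omega$ the KL divergence and recovers the familiar KL-penalized TRPO surrogate, with $t_k$ scaling the proximity term adaptively.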

Adaptive Mechanism:

TRPO's central feature, the adaptive scaling of its proximity term, is effectively the RL adaptation of trust-region strategies: it trades off policy improvement against keeping successive policies close. This matters both in the uniform (planning) setting, where the model and the entire state space are available, and in the sample-based setting, where only limited access to the state space is practical.
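
With the KL divergence as the Bregman distance, the surrogate above has a closed-form solution at every state, which is what keeps the adaptive update simple. The tabular sketch below is illustrative only (reward-maximization convention, known Q-values); it is not the paper's pseudocode.

```python
import numpy as np

def kl_proximal_update(pi_k, q_k, t_k):
    """One adaptive-TRPO-style update for a tabular policy (illustrative sketch).

    With the Bregman distance chosen as the KL divergence, the linearized
    objective plus proximity term has the exponentiated-gradient solution
        pi_{k+1}(a|s)  proportional to  pi_k(a|s) * exp(t_k * Q^{pi_k}(s, a)),
    applied independently at every state.

    pi_k : (S, A) array, current policy; rows sum to 1.
    q_k  : (S, A) array, Q-values of pi_k.
    t_k  : adaptive step size for iteration k.
    """
    logits = np.log(np.clip(pi_k, 1e-12, 1.0)) + t_k * q_k  # mirror-descent step in log space
    logits -= logits.max(axis=1, keepdims=True)             # numerical stabilization
    pi_next = np.exp(logits)
    return pi_next / pi_next.sum(axis=1, keepdims=True)     # renormalize each state's row
```

For example, `kl_proximal_update(pi, q, t_k=1.0 / np.sqrt(k + 1.0))` implements a diminishing-step schedule of the kind analyzed in the unregularized case; the paper's exact step sizes include problem-dependent constants.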

Convergence Results

Convergence Rates:

The paper delivers a rigorous analysis, establishing a $\tilde O(1/\sqrt{N})$ convergence rate to the global optimum for unregularized TRPO, akin to classical rates for mirror descent on convex problems. More significantly, for regularized MDPs, which add entropy regularization or other strongly convex cost adjustments, the rate improves to $\tilde O(1/N)$, echoing the faster rates obtained for strongly convex problems in convex optimization.
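
Schematically, and assuming the standard mirror-descent-style step-size choices (the paper's exact statements, including best-iterate versus last-iterate guarantees, constants, and logarithmic factors, differ in detail), the two regimes look as follows:

```latex
% Unregularized: diminishing step sizes, convex-style rate.
t_k \propto \tfrac{1}{\sqrt{k}}, \qquad
  \min_{k \le N} \big( V^{\pi_k} - V^{\pi^*} \big)
  \;\le\; \tilde O\!\big(\tfrac{1}{\sqrt{N}}\big).

% Regularized (lambda-strongly convex regularizer): strongly-convex-style rate.
t_k \propto \tfrac{1}{\lambda k}, \qquad
  \min_{k \le N} \big( V_\lambda^{\pi_k} - V_\lambda^{\pi^*_\lambda} \big)
  \;\le\; \tilde O\!\big(\tfrac{1}{\lambda N}\big).
```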

Regularization is shown to accelerate convergence but introduces a potential bias: the optimum of the regularized MDP generally differs from that of the unregularized one. This is a critical extension, providing the first formal evidence in RL of faster rates obtained by regularizing the instantaneous cost or reward.
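
For concreteness, one common convention for the regularized objective is sketched below (entropy regularization shown as a single instance; the paper allows general strongly convex regularizers and states its own definitions):

```latex
% Regularized value in the cost-minimization convention, with omega the
% negative entropy; lambda > 0 controls the strength of the regularization.
V_{\lambda}^{\pi}(s) \;=\; \mathbb{E}^{\pi}\!\Big[ \sum_{t \ge 0} \gamma^{t}
  \big( c(s_t, a_t) + \lambda\, \omega(\pi(\cdot \mid s_t)) \big) \;\Big|\; s_0 = s \Big],
\qquad
\omega(\pi(\cdot \mid s)) = \sum_{a} \pi(a \mid s) \log \pi(a \mid s).
```

The minimizer $\pi^*_\lambda$ of $V_\lambda$ need not coincide with the unregularized optimum $\pi^*$, which is the source of the bias, while the strong convexity contributed by $\omega$ is what enables the $\tilde O(1/N)$ rate.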

Sample-Based TRPO:

The analysis then extends to sample-based TRPO, the regime of interest when the state space is too large for exact, model-based updates. Here the update is carried out on sampled states with estimated values, and the resulting convergence guarantees are comparable to those of Conservative Policy Iteration (CPI) without requiring CPI's explicit policy-improvement steps. This preserves the practical flexibility that made TRPO popular in empirical RL.
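
A minimal sketch of one sample-based iteration is shown below. The environment hooks `sample_state` and `rollout_q` are hypothetical placeholders for a restart distribution and a Monte-Carlo Q-estimate; the paper's actual procedure and estimators differ in detail.

```python
import numpy as np

def sample_based_trpo_iter(pi_k, t_k, sample_state, rollout_q, n_samples=64):
    """One sample-based TRPO-style iteration (illustrative sketch).

    Instead of sweeping the entire state space, states are drawn from a restart
    distribution and Q^{pi_k} is estimated by rollouts; the closed-form
    KL-proximal update is applied only at the sampled states.
    """
    pi_next = pi_k.copy()
    for _ in range(n_samples):
        s = sample_state()                       # s ~ restart distribution (hypothetical hook)
        q_hat = rollout_q(s, pi_k)               # (A,) Monte-Carlo estimate of Q^{pi_k}(s, .)
        logits = np.log(np.clip(pi_k[s], 1e-12, 1.0)) + t_k * q_hat
        logits -= logits.max()                   # numerical stabilization
        p = np.exp(logits)
        pi_next[s] = p / p.sum()                 # exponentiated-gradient step at state s
    return pi_next
```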

Implications and Speculations

Theoretical and Practical Impact:

The theoretical contribution bridges TRPO's empirical success with rigorous foundations. Practically, the analyzed variants are relevant to large-scale RL tasks where computational efficiency is paramount and entropy regularization is already common practice. The established sample complexities set the stage for practical implementations, balancing theoretical guarantees with real-world feasibility.

Future Directions:

The paper raises several directions for further exploration: linear convergence, impossibility (lower-bound) results for RL rates, and deeper analyses of regularization in RL. In particular, it prompts a reassessment of heuristics in RL algorithms and a closer look at how adaptive proximity terms balance optimization rigor with empirical efficacy.

In summary, the paper offers a clear account of TRPO's algorithmic structure and convergence properties, grounding its adaptive mechanism in convex optimization and setting the stage for further theoretical and practical developments in reinforcement learning.