- The paper proposes a continuous-time policy mirror descent algorithm with entropy annealing to address non-convexity in stochastic exit time control problems.
- The methodology involves updating a parameterized Gibbs policy based on the gradient of an entropy-regularized value function, with the regularization strength controlled by a scheduler.
- Key results show that the flow converges exponentially fast when the entropy level is held fixed, and converges to the solution of the unregularized problem at rate O(1/S) (discrete action spaces) or O(1/√S) (general action spaces) when the entropy decays polynomially in the gradient-flow time S.
This paper investigates the convergence of policy gradient methods for stochastic exit time control problems, focusing on the impact of entropy regularization. The authors propose a continuous-time policy mirror descent algorithm that updates the policy based on the gradient of an entropy-regularized value function, with the regularization strength adjusted over time. The core idea is to balance the benefits of entropy regularization in smoothing the optimization landscape against the bias it introduces.
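To fix ideas, a schematic form of the entropy-regularized objective is sketched below. The notation (running cost f, terminal cost g, reference measure μ, regularization strength ρ) is illustrative and does not reproduce the paper's exact formulation.

```latex
% Schematic entropy-regularized exit time objective (illustrative notation, not the paper's exact setup).
% X^\pi: controlled diffusion under relaxed policy \pi;  \tau_{\mathcal{O}}: first exit time from the domain \mathcal{O};
% \mu: reference measure on the action space A;  \rho > 0: entropy-regularization strength set by the scheduler.
V^{\pi}_{\rho}(x)
  = \mathbb{E}\Bigg[
      \int_{0}^{\tau_{\mathcal{O}}}
        \Big( \int_{A} f\big(X^{\pi}_t, a\big)\, \pi(\mathrm{d}a \mid X^{\pi}_t)
              + \rho \, \mathrm{KL}\big(\pi(\cdot \mid X^{\pi}_t) \,\|\, \mu\big) \Big)\, \mathrm{d}t
      + g\big(X^{\pi}_{\tau_{\mathcal{O}}}\big)
    \;\Bigg|\; X^{\pi}_0 = x \Bigg]
```

Annealing corresponds to letting ρ = ρ(s) decrease along the gradient-flow time s: a larger ρ smooths the landscape early on, while its decay removes the bias in the limit.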
Here's a breakdown of the key elements:
- Problem: The paper addresses the challenge of non-convexity in policy optimization for continuous-time control problems, which makes it difficult to guarantee convergence of policy gradient methods.
- Approach: The authors use entropy regularization to modify the optimization landscape. They formulate a policy mirror descent algorithm in continuous time and space, where the policy is updated based on the gradient of an entropy-regularized value function. The entropy regularization strength is controlled by a scheduler.
- Methodology: The analysis focuses on exit time relaxed control problems, where a stochastic process is controlled until it exits a given domain. The policy is a Gibbs (softmax-type) policy, and the algorithm updates the feature function that induces it along the mirror descent flow; a minimal discretized sketch of this update follows below.
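The snippet below is a minimal, time-discretized sketch of such an update loop for a finite state and action space. All names (entropy_scheduler, gibbs_policy, mirror_descent_step) are hypothetical, the gradient is a placeholder, and the exact placement of the temperature in the Gibbs parameterization follows the paper and is only indicated schematically here; the actual algorithm is a continuous-time flow driven by the true gradient of the regularized value function.

```python
import numpy as np

def entropy_scheduler(s, rho0=1.0, p=1.0):
    """Polynomially decaying entropy strength, rho(s) = rho0 / (1 + s)**p (illustrative choice)."""
    return rho0 / (1.0 + s) ** p

def gibbs_policy(psi, rho):
    """Gibbs/softmax policy induced by the feature function psi at temperature rho."""
    logits = psi / rho
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def mirror_descent_step(psi, grad_V, step):
    """Explicit-Euler step of the mirror descent flow on the dual (feature) variable."""
    return psi - step * grad_V

# Illustrative annealing loop over gradient-flow time s.
rng = np.random.default_rng(0)
n_states, n_actions, ds = 4, 3, 0.1
psi = np.zeros((n_states, n_actions))
for k in range(200):
    s = k * ds
    rho = entropy_scheduler(s)
    # Placeholder: the actual method uses the gradient of the rho-regularized value
    # function with respect to the policy features, obtained from the control problem.
    grad_V = rng.standard_normal((n_states, n_actions))
    psi = mirror_descent_step(psi, grad_V, ds)
    policy = gibbs_policy(psi, rho)  # current Gibbs policy at temperature rho
```

The design point the sketch illustrates is that the update acts on the features in the dual space, while the scheduler controls how sharply those features are mapped back to a policy at each instant.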
- Main Results:
- The mirror descent flow with a continuous scheduler admits a unique solution when the state process has non-degenerate noise.
- With a fixed entropy level, the dynamics converge exponentially to the optimal solution of the regularized problem.
- When the entropy level decays at appropriate polynomial rates, the flow converges to the solution of the unregularized problem at a rate of O(1/S) for discrete action spaces and, under suitable conditions, at a rate of O(1/√S) for general action spaces, where S is the gradient flow time.
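Written schematically (the constants C, λ and the exact error measures are illustrative), the two regimes read:

```latex
% Fixed entropy level \rho: exponential convergence to the regularized optimum.
V^{\pi_S}_{\rho} - V^{\pi^{\star}_{\rho}}_{\rho} \;\le\; C\, e^{-\lambda S}
% Polynomially decaying \rho(\cdot): convergence to the unregularized optimum.
V^{\pi_S} - V^{\star} \;=\;
\begin{cases}
  O(1/S) & \text{discrete action space},\\
  O(1/\sqrt{S}) & \text{general action space (under suitable conditions)}.
\end{cases}
```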
- Key Techniques:
- Error decomposition: The suboptimality is split into an optimization error (distance to the regularized optimum) and a regularization bias (gap between the regularized and unregularized optima); see the schematic after this list.
- Lyapunov function: A Kullback-Leibler divergence is used as a Lyapunov function to analyze convergence.
- Performance difference lemma: This lemma is used to relate the difference in value functions to the difference in policies.
- Asymptotic expansions: Asymptotic expansions of the regularized Hamiltonians are used to derive decay rates for the regularization bias.
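A schematic rendering of the first two techniques is given below, in the illustrative notation used earlier; the precise divergence and its arguments follow the paper, and the form shown is the generic choice in mirror descent analyses.

```latex
% Error decomposition at flow time S with entropy level \rho(S) (schematic):
V^{\pi_S} - V^{\star}
  = \underbrace{\big( V^{\pi_S} - V^{\pi^{\star}_{\rho(S)}} \big)}_{\text{optimization error}}
  + \underbrace{\big( V^{\pi^{\star}_{\rho(S)}} - V^{\star} \big)}_{\text{regularization bias}}
% Generic KL-type Lyapunov function driving the optimization-error bound:
\mathcal{L}(s) = \mathrm{KL}\big( \pi^{\star}_{\rho} \,\big\|\, \pi_s \big)
```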
- Significance: The paper provides a theoretical foundation for understanding how entropy regularization aids policy optimization in continuous-time control problems. The convergence rate analysis gives concrete guidance on choosing the regularization strength and its annealing schedule so as to balance optimization speed against regularization bias.
- Novelty:
- It presents a convergence analysis of policy gradient methods with dynamically annealed entropy regularization for continuous-time control problems.
- It provides explicit convergence rates depending on the choice of entropy scheduler, balancing the optimization error and the regularization bias.
- It introduces a weak formulation of the control problem, allowing for measurable policies and broadening the applicability of the analysis.
In essence, the paper provides a rigorous mathematical framework for understanding and optimizing entropy-regularized policy gradient methods in continuous-time stochastic control, with specific results on convergence rates and scheduler design.