Mirror Ascent in RL & Optimization
- Mirror ascent is a first-order optimization framework that uses Bregman divergences induced by strictly convex mirror maps for robust policy updates in RL and optimization.
- It underpins methods like SPMA and MoMA, offering convergence guarantees and efficient updates in both online and offline reinforcement learning settings.
- The approach integrates various policy parameterizations and bridges traditional projection methods with modern natural policy gradients to enhance robustness and efficiency.
Mirror ascent is a fundamental framework in stochastic optimization and reinforcement learning (RL) that generalizes projected gradient methods through the use of Bregman divergences induced by strictly convex mirror maps. In modern RL, mirror ascent methods enable robust and efficient policy updates in both online and offline paradigms, supporting discrete and continuous action spaces, linear and non-linear parameterizations, and adversarially robust offline learning. This article surveys the mathematical structure of mirror ascent, its principal instantiations, convergence guarantees, and practical algorithmic variants as documented in recent research.
1. Mathematical Structure and Mirror Maps
Mirror ascent is a first-order optimization method for constrained problems, defined over a convex feasible set $\mathcal{X} \subseteq \mathbb{R}^d$. The method replaces Euclidean projections with Bregman projections derived from a strictly convex "mirror map" $\Phi$. For maximizing an objective $f$, the iterative update is
$$x_{t+1} = \arg\max_{x \in \mathcal{X}} \left\{ \eta \, \langle \nabla f(x_t), x \rangle - D_{\Phi}(x, x_t) \right\},$$
where $D_{\Phi}(x, y) = \Phi(x) - \Phi(y) - \langle \nabla \Phi(y), x - y \rangle$ is the Bregman divergence and $\eta > 0$ is the step-size. In the context of policy optimization, the mirror map is typically the (weighted) negative entropy or log-sum-exp function, inducing a Kullback-Leibler (KL) divergence.
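As a minimal sketch (all function names here are illustrative, not from the cited papers), the generic update can be instantiated on the probability simplex with the negative-entropy mirror map, where the Bregman projection reduces to a multiplicative "exponentiated gradient" step:

```python
import numpy as np

def bregman_div(phi, phi_grad, x, y):
    """D_Phi(x, y) = Phi(x) - Phi(y) - <grad Phi(y), x - y>."""
    return phi(x) - phi(y) - phi_grad(y) @ (x - y)

def mirror_ascent_step(x, grad, eta):
    """One mirror ascent step on the simplex with the negative-entropy
    mirror map Phi(x) = sum_i x_i log x_i: the dual-space gradient step
    plus Bregman projection reduces to exponentiated gradient."""
    w = x * np.exp(eta * grad)
    return w / w.sum()

# Negative entropy and its gradient; its Bregman divergence is the KL divergence.
neg_entropy = lambda x: np.sum(x * np.log(x))
neg_entropy_grad = lambda x: np.log(x) + 1.0

# Maximize the linear objective f(x) = <r, x> over the simplex;
# the maximizer puts all mass on the best coordinate.
r = np.array([0.1, 0.5, 0.2])
x = np.ones(3) / 3
for _ in range(500):
    x = mirror_ascent_step(x, r, eta=0.5)
```

For the negative-entropy mirror map, `bregman_div` coincides with the KL divergence, which is why KL-regularized policy updates are mirror ascent steps in disguise.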
Tabular policies for Markov Decision Processes (MDPs) represent the policy as a matrix of logits $z \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{A}|}$, with one row $z_s$ per state $s$ and the softmax mapping $\pi_z(\cdot \mid s) = \mathrm{softmax}(z_s)$. The mirror map
$$\Phi(z) = \sum_{s} d^{\pi}(s) \, \log \sum_{a} \exp(z_{s,a})$$
is the log-sum-exp function, typically weighted by the discounted state-occupancy measure $d^{\pi}$. The corresponding Bregman divergence is a weighted average KL divergence between the policies induced by the two logit matrices (Asad et al., 2024).
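A quick numerical check (illustrative code, not from the paper) confirms the per-state identity behind this construction: the Bregman divergence of the log-sum-exp mirror map between two logit vectors equals the KL divergence between the softmax policies they induce:

```python
import numpy as np

def lse(z):
    """Numerically stable log-sum-exp."""
    m = z.max()
    return m + np.log(np.exp(z - m).sum())

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def bregman_lse(x, y):
    """D_LSE(x, y) = LSE(x) - LSE(y) - <softmax(y), x - y>."""
    return lse(x) - lse(y) - softmax(y) @ (x - y)

rng = np.random.default_rng(0)
x, y = rng.normal(size=4), rng.normal(size=4)
p, q = softmax(y), softmax(x)
kl = np.sum(p * np.log(p / q))   # KL(softmax(y) || softmax(x))
```

Summing this per-state identity with weights $d^{\pi}(s)$ yields the weighted average KL divergence of the tabular mirror map.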
2. Policy Mirror Ascent in Reinforcement Learning
Mirror ascent arises naturally in several policy gradient algorithms. A primary example is the Softmax Policy Mirror Ascent (SPMA) method, which expresses the policy update in the logit (dual) space:
$$z_{t+1}(s, a) = z_t(s, a) + \eta \, A^{\pi_t}(s, a),$$
where $A^{\pi_t}$ is the advantage function of the current policy $\pi_t$. For small enough step-size $\eta$, this update preserves normalization without an explicit projection.
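A toy sketch of this logit-space update (the dimensions and advantage values below are made up for illustration; in practice the advantage is re-estimated at every iteration rather than held fixed):

```python
import numpy as np

def softmax_rows(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def spma_step(z, advantage, eta):
    """Gradient step in the dual (logit) space; applying the softmax
    afterwards recovers a normalized policy with no explicit projection."""
    return z + eta * advantage

# Hypothetical 2-state, 3-action problem with fixed advantage estimates.
z = np.zeros((2, 3))
A = np.array([[1.0, -0.5, -0.5],
              [-0.2, 0.4, -0.2]])
for _ in range(100):
    z = spma_step(z, A, eta=0.1)
pi = softmax_rows(z)
```

The resulting policy concentrates, state by state, on the action with the largest advantage, and each row remains a valid distribution throughout.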
The mirror ascent update can be described equivalently as gradient ascent in the dual variables, followed by mirror mapping to the primal policy space. In model-based settings, such as offline RL, the mirror ascent step can be performed after adversarial (conservative) value evaluation to guarantee robustness over a confidence set of plausible transition models (Hong et al., 2024).
Policy mirror ascent methods can be efficiently extended to parametric policy classes:
- Log-linear policies: $\pi_\theta(a \mid s) \propto \exp(\theta^\top \phi(s, a))$ for a feature map $\phi$, where updates are performed via a convex softmax classification problem.
- Non-linear function approximation: $\pi_\theta(a \mid s) \propto \exp(f_\theta(s, a))$ for a non-linear function $f_\theta$ (e.g., a neural network), where KL surrogate minimization is used for projection.
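The log-linear case can be sketched as follows (function and variable names are illustrative): one mirror ascent iteration reduces to minimizing a convex KL surrogate between a target distribution and the parametric softmax, which is exactly a softmax classification problem in $\theta$:

```python
import numpy as np

def log_linear_policy(theta, features):
    """pi_theta(a|s) proportional to exp(theta . phi(s, a)); features: (S, A, d)."""
    logits = features @ theta
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def kl_surrogate_step(theta, features, target_pi, lr):
    """One gradient step on sum_s KL(target_pi(.|s) || pi_theta(.|s)),
    a convex problem in theta. Gradient: sum_{s,a} (pi - target) * phi(s, a)."""
    pi = log_linear_policy(theta, features)
    grad = np.einsum('sa,sad->d', pi - target_pi, features)
    return theta - lr * grad

rng = np.random.default_rng(1)
features = rng.normal(size=(3, 2, 2))                      # 3 states, 2 actions, 2 features
target = log_linear_policy(rng.normal(size=2), features)   # realizable target policy

theta = np.zeros(2)
ce_before = -np.sum(target * np.log(log_linear_policy(theta, features)))
for _ in range(500):
    theta = kl_surrogate_step(theta, features, target, lr=0.05)
ce_after = -np.sum(target * np.log(log_linear_policy(theta, features)))
```

Because the surrogate is convex in $\theta$, plain gradient descent with a small step-size monotonically reduces the cross-entropy toward the target.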
3. Convergence Analysis and Theoretical Guarantees
For tabular MDPs, SPMA achieves a linear rate of convergence under appropriate conditions:
$$J(\pi^*) - J(\pi_{t+1}) \le (1 - c)\,\big(J(\pi^*) - J(\pi_t)\big),$$
where the contraction factor $c \in (0, 1)$ depends on the minimum state-occupancy and the Q-value gap. Selecting the step-size $\eta$ below an explicit threshold guarantees contraction, resulting in $O(\log(1/\epsilon))$ iterations for $\epsilon$-optimality (Asad et al., 2024).
In the log-linear parameterization, mirror ascent converges to a neighborhood determined by statistical estimation and function-approximation errors:
$$J(\pi^*) - J(\pi_T) \le (1 - c)^T \big(J(\pi^*) - J(\pi_0)\big) + O(\epsilon_{\mathrm{stat}} + \epsilon_{\mathrm{approx}}).$$
If the feature class is Bellman-complete and optimization is exact, the residual vanishes and linear-rate global convergence is recovered.
For non-linear policies, convergence is to a stationary point or neighborhood, under conditions such as the Polyak-Łojasiewicz (PL) inequality or over-parametrization.
In offline model-based RL, the MoMA (Model-based Mirror Ascent) algorithm provides a suboptimality guarantee that explicitly quantifies model, policy-optimization, and approximator errors:
$$J(\pi^*) - J(\hat{\pi}) \lesssim \epsilon_{\mathrm{model}} + \epsilon_{\mathrm{opt}} + \epsilon_{\mathrm{approx}},$$
assuming realizability and partial coverage (Hong et al., 2024).
4. Algorithmic Implementations and Practical Details
Mirror ascent-type methods lend themselves to scalable algorithmic instantiations, varying by the choice of policy parameterization and access to dynamics:
| Setting | Update Rule/Algorithmic Step | Computational Cost |
|---|---|---|
| Tabular SPMA | Logit-space update $z_{t+1} = z_t + \eta A^{\pi_t}$ | $O(|\mathcal{S}||\mathcal{A}|)$ per step |
| Log-linear SPMA | Inner-loop KL projection via softmax regression | One convex regression solve per iteration |
| MoMA (offline RL) | Primal-dual policy evaluation + Bregman-prox policy update | Polynomial in the relevant problem parameters |
In MoMA, policy evaluation is formulated as a minimization over a confidence set of models, often solved by primal-dual gradient dynamics. The policy update is a regularized maximization with respect to a strongly convex potential, frequently chosen as the negative entropy (yielding KL divergence regularization).
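In MoMA itself, the confidence set is general and the inner minimization runs primal-dual gradient dynamics; as a simplified illustration (a finite candidate set replaces the continuous confidence region, and all names are hypothetical), conservative evaluation picks the worst-case model:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma):
    """Exact tabular policy evaluation: solve (I - gamma * P_pi) v = r_pi.
    P: (S, A, S) transitions, R: (S, A) rewards, pi: (S, A) policy."""
    S = R.shape[0]
    P_pi = np.einsum('sa,saz->sz', pi, P)   # state-to-state kernel under pi
    r_pi = np.einsum('sa,sa->s', pi, R)     # expected per-state reward under pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def pessimistic_value(models, R, pi, gamma, rho):
    """Worst-case value of pi over a finite set of candidate transition
    models, under initial state distribution rho."""
    return min(rho @ policy_evaluation(P, R, pi, gamma) for P in models)
```

By construction the pessimistic value lower-bounds the value of the policy under every candidate model, which is the property the robustness guarantees rely on.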
For practical tractability, value estimates are fit via regression on regularized target values, and policies are updated via a convex program (closed-form for softmax policies with a negative-entropy regularizer).
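That closed form can be sketched directly (variable names illustrative): with a negative-entropy potential, the regularized maximization $\max_\pi \langle q, \pi \rangle - \frac{1}{\eta}\mathrm{KL}(\pi \,\|\, \pi_{\text{old}})$ over the simplex has a multiplicative solution:

```python
import numpy as np

def kl_prox_update(pi_old, q, eta):
    """argmax_pi  <q, pi> - (1/eta) * KL(pi || pi_old)  over the simplex:
    pi_new proportional to pi_old * exp(eta * q)."""
    w = pi_old * np.exp(eta * q)
    return w / w.sum()

def objective(pi, pi_old, q, eta):
    """The KL-regularized objective being maximized."""
    return q @ pi - np.sum(pi * np.log(pi / pi_old)) / eta
```

The exponential reweighting is exactly the softmax/KL special case of the Bregman-proximal step: values act multiplicatively on the previous policy, and $\eta$ controls how far the update is allowed to move.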
5. Empirical Performance and Benchmarks
Experiments on benchmark domains demonstrate that mirror ascent methods yield strong empirical performance in both online and offline RL:
- Tabular and high-dimensional RL: SPMA matches or outperforms established policy optimization baselines including MDPO, PPO, and both regularized and constrained TRPO on Atari (discrete, CNN) and MuJoCo (continuous, MLP) benchmarks (Asad et al., 2024).
- Comparison with softmax policy gradient: in theory, SPMA converges exponentially faster than softmax policy gradient and its accelerated variants, and it matches or exceeds their empirical returns.
- Offline RL (MoMA): MoMA achieves state-of-the-art or competitive performance under partial coverage, outperforming approaches that restrict policy classes or lack robust model estimation steps (Hong et al., 2024).
An important point is that, for SPMA, an explicit normalization step is often unnecessary: the softmax mapping is invariant to the per-state constant shifts the mirror ascent update introduces, which simplifies implementation and improves computational efficiency.
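This invariance is easy to verify directly (illustrative snippet): adding any constant to all logits of a state leaves the softmax policy unchanged, so the dual-space update never breaks normalization.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([0.3, -1.2, 2.0])
shifted = z + 5.0   # any constant per-state shift leaves the policy unchanged
```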
6. Extensions, Limitations, and Future Directions
Extensions of mirror ascent include its application with general function approximation, in non-convex policy spaces, and in adversarial or robust settings. For instance, MoMA achieves theoretical guarantees while employing unrestricted policy classes and general function approximators in conjunction with uncertainty-calibrated model learning (Hong et al., 2024).
In continuous action spaces, policy mirror ascent faces subtle challenges related to unbounded score functions and persistent bias. Heavy-tailed policy parameterizations can mitigate some biases but may induce instability; stabilizing such updates via mirror ascent-type schemes and gradient tracking is an active direction (Bedi et al., 2022).
Open avenues include:
- Adaptive step-size selection
- Off-policy mirror ascent
- Theoretical analysis for highly non-linear or over-parameterized regimes
- Robustness to sampling and model misspecification
A plausible implication is that the efficiency and robustness of policy optimization crucially depend on the choice of mirror map and its compatibility with function approximation and statistical estimation settings.
7. Comparison and Relationship to Other Methods
Mirror ascent unifies the perspective on various policy gradient methods:
- Natural Policy Gradient (NPG): Interpreted as mirror ascent with the Fisher information as a metric on the probability simplex, matching SPMA in convergence rates but differing in update structure and requirements for compatible functions (Asad et al., 2024).
- Regularized Policy Optimization (TRPO, PPO, MDPO): Regularization terms such as KL-divergence are special cases of Bregman divergences induced by mirror maps; these methods can often be derived as mirror descent or ascent in the policy space.
- Softmax Policy Gradient (SPG): A special case of (Euclidean) gradient ascent, which exhibits sublinear convergence, in contrast to the linear convergence of SPMA.
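A one-state (bandit) experiment illustrates this contrast. Rewards and step-size below are arbitrary choices for illustration, and exact gradients are used, so the difference is purely one of update geometry: SPG scales the advantage by the current action probability, while the mirror ascent step applies the advantage directly in logit space.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

r = np.array([1.0, 0.8, 0.2])   # hypothetical bandit rewards
eta, T = 0.5, 300

z_pg = np.zeros(3)   # softmax policy gradient: Euclidean ascent on logits
z_ma = np.zeros(3)   # mirror-ascent-style update: raw advantage in logit space
for _ in range(T):
    p = softmax(z_pg)
    z_pg = z_pg + eta * p * (r - p @ r)   # exact gradient of p @ r w.r.t. logits
    q = softmax(z_ma)
    z_ma = z_ma + eta * (r - q @ r)       # dual-space advantage step

subopt_pg = r.max() - softmax(z_pg) @ r
subopt_ma = r.max() - softmax(z_ma) @ r
```

Because the probability factor in the SPG update vanishes as suboptimal actions lose mass, SPG slows itself down (sublinear decay), whereas the advantage step keeps shrinking the suboptimality geometrically.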
The connection with model-based RL via MoMA illustrates mirror ascent's role in algorithmic stability and the ability to leverage unrestricted policy spaces in offline, adversarially robust contexts (Hong et al., 2024).
References:
- "Fast Convergence of Softmax Policy Mirror Ascent" (Asad et al., 2024)
- "MoMA: Model-based Mirror Ascent for Offline Reinforcement Learning" (Hong et al., 2024)
- "On the Hidden Biases of Policy Mirror Ascent in Continuous Action Spaces" (Bedi et al., 2022)