Online Policy Mirror Descent

Updated 4 July 2026

OPMD is an online optimization framework that applies mirror descent in policy space by balancing a linearized objective and a Bregman divergence to a reference policy.
Its methodology uses mirror maps such as the squared Euclidean norm and negative entropy, which yield near-optimal regret guarantees in online convex optimization and dimension-free convergence rates in reinforcement learning.
OPMD underpins diverse algorithms in discounted RL, self-play, and dynamic settings, unifying policy updates through a geometric perspective that compares policies via first-order surrogates.

Online Policy Mirror Descent (OPMD) denotes the use of online mirror-descent updates in policy space: at each iteration, a policy is updated by optimizing a linearized payoff or loss plus a Bregman divergence to a reference policy. The label is not uniform across the literature. Closely related formulations appear as policy mirror descent (PMD) in discounted Markov decision processes, as optimistic online mirror descent in self-play preference games, and as adjacent constructions such as Dynamic Mirror Descent or DMD-MPC that are explicitly not OPMD in the stationary-policy sense (Liu et al., 23 Sep 2025, Zhang et al., 24 Feb 2025, Hall et al., 2013, Samineni, 2021). At the level of online convex optimization, Mirror Descent can always achieve a nearly optimal regret guarantee for a broad class of problems, which explains why its policy-space descendants have become a central abstraction for policy optimization (Srebro et al., 2011).

1. Conceptual scope and terminological boundaries

In the most direct reinforcement-learning usage, OPMD is the repeated application of mirror descent to a policy $\pi$, usually state by state on $\Delta(A)$, with a policy-value or action-value surrogate playing the role of the linearized objective. The resulting update is a policy-space analogue of Online Mirror Descent rather than ordinary Euclidean gradient descent. Several papers are explicit that they do not use the term “OPMD” even when the mathematical structure is close. “On the Convergence of Policy Mirror Descent with Temporal Difference Evaluation” studies PMD in discounted RL and emphasizes that it is highly relevant to an OPMD interpretation even though it is not an adversarial-online-regret paper (Liu et al., 23 Sep 2025). “Improving LLM General Preference Alignment via Optimistic Online Mirror Descent” is closer to a literal OPMD formulation because it performs optimistic KL-regularized self-play updates directly in policy space (Zhang et al., 24 Feb 2025). By contrast, “Online Optimization in Dynamic Environments” studies Dynamic Mirror Descent in online convex optimization and states plainly that it is not “Online Policy Mirror Descent” in the RL or policy-optimization sense (Hall et al., 2013).

A recurrent source of confusion is that “mirror descent for control” does not always mean policy-space mirror descent over stationary policies. In “Policy Search using Dynamic Mirror Descent MPC for Model Free Off Policy RL”, the inner loop is mirror descent over a finite-horizon control distribution $\pi_\eta$, not over a stationary policy class $\pi_\theta(a\mid s)$; the paper therefore treats DMD-MPC as related to OPMD only indirectly (Samineni, 2021). This boundary matters because the geometry, guarantees, and comparator classes differ substantially between policy-space PMD, predictive dynamic mirror descent, and receding-horizon control-distribution updates.

2. Canonical update rules and mirror geometry

The canonical PMD update in discounted RL is statewise mirror descent on the action simplex: $\pi_{k+1}(\cdot|s) = \arg\max_{p\in\Delta(A)} \left\{ \eta_k \,\langle p,Q^k(s,\cdot)\rangle - D_h\!\big(p,\pi_k(\cdot|s)\big) \right\},$ where the Bregman divergence is

$D_h(p,p')=h(p)-h(p')-\langle \nabla h(p'),\,p-p'\rangle.$

This is the precise PMD template used in TD-PMD, with $Q^k$ supplied either by exact evaluation or by a TD critic (Liu et al., 23 Sep 2025).
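As a worked instance of the Bregman divergence definition (an illustration added here, not taken from the cited paper), take $h$ to be negative entropy, $h(p)=\sum_a p_a\log p_a$, so that $\nabla h(p')_a=\log p'_a+1$. Substituting into the definition gives:

```latex
\begin{aligned}
D_h(p,p') &= \sum_a p_a\log p_a-\sum_a p'_a\log p'_a-\sum_a(\log p'_a+1)(p_a-p'_a)\\
          &= \sum_a p_a\log\frac{p_a}{p'_a}-\sum_a p_a+\sum_a p'_a\\
          &= \mathrm{KL}(p\,\Vert\,p') \qquad\text{whenever } \sum_a p_a=\sum_a p'_a=1 .
\end{aligned}
```

The last step uses only that both arguments lie on the simplex; off the simplex the two linear terms do not cancel and the divergence is the generalized KL divergence.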

Two mirror maps are canonical. With the Euclidean mirror map $h(p)=\frac12\|p\|_2^2$, PMD reduces to projected-gradient or projected-Q-ascent style updates. With negative entropy $h(p)=\sum_{a\in A}p_a\log p_a$, PMD yields KL-regularized or exponentiated-gradient updates, and TD-NPG emerges as the entropic instance (Liu et al., 23 Sep 2025). The same entropic geometry appears in self-play preference learning, where

$\psi(\pi)=\sum_y \pi(y)\log \pi(y), \qquad KL(\pi_1\Vert \pi_2)=D_\psi(\pi_1,\pi_2),$

so the mirror map is negative entropy and the induced Bregman divergence is exactly KL divergence (Zhang et al., 24 Feb 2025).
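The two canonical mirror maps lead to closed-form per-state updates, which can be sketched as follows. This is a minimal illustration, not code from any of the cited papers: the function names `project_simplex`, `pmd_step_euclidean`, and `pmd_step_entropy` are hypothetical, the action values `q_s` are made-up numbers, and the entropic step assumes the current policy has strictly positive entries.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def pmd_step_euclidean(pi_s, q_s, eta):
    """argmax_p eta*<p, q_s> - 0.5*||p - pi_s||^2 over the simplex."""
    return project_simplex(pi_s + eta * q_s)

def pmd_step_entropy(pi_s, q_s, eta):
    """argmax_p eta*<p, q_s> - KL(p || pi_s): multiplicative-weights form."""
    logits = np.log(pi_s) + eta * q_s   # requires pi_s > 0 entrywise
    logits -= logits.max()              # numerical stabilization only
    w = np.exp(logits)
    return w / w.sum()

# Illustrative inputs (assumed values, one state with three actions).
pi_s = np.array([0.5, 0.3, 0.2])
q_s = np.array([1.0, 2.0, 0.0])
p_euc = pmd_step_euclidean(pi_s, q_s, eta=0.1)
p_ent = pmd_step_entropy(pi_s, q_s, eta=0.1)
print(p_euc, p_ent)
```

Both steps move probability mass toward the action with the largest value; the Euclidean step can place exactly zero mass on an action, whereas the entropic step keeps every action strictly positive.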

A second structural distinction concerns policy space versus parameter space. “A Novel Framework for Policy Mirror Descent with General Parameterization and Linear Convergence” defines a Bregman projected policy class by

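$\Pi(\mathcal{F}^\Theta)=\Big\{\pi^\theta:\ \pi^\theta(\cdot|s)=\arg\min_{p\in\Delta(A)} D_h\big(p,\nabla h^*(\eta f^\theta(s,\cdot))\big),\ \theta\in\Theta\Big\},$

where $\mathcal{F}^\Theta=\{f^\theta:\theta\in\Theta\}$ is a parameterized function class, $h^*$ is the convex conjugate of $h$, and $\eta>0$ is a scaling parameter, so that a parameterized score function $f^\theta$ acts as an approximate dual iterate and the actual policy is obtained by mirror projection (Alfano et al., 2023). “Convergence of Policy Mirror Descent Beyond Compatible Function Approximation” adds that the natural geometry is not a global Euclidean norm on the policy vector, but a local norm induced by the occupancy measure of the current policy,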

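$\|u\|_{\pi}^2=\mathbb{E}_{s\sim\mu^{\pi}}\big[\|u(s,\cdot)\|^2\big],$

with policy-level Bregman divergence

$D_h^{\pi}(\pi',\pi'')=\mathbb{E}_{s\sim\mu^{\pi}}\big[D_h\big(\pi'(\cdot|s),\pi''(\cdot|s)\big)\big],$

where $\mu^{\pi}$ denotes the state-occupancy measure of the current policy $\pi$ and $\|\cdot\|$ is a chosen norm on the action space.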

This occupancy-weighted geometry is one of the paper’s main theoretical contributions and is explicitly presented as the correct non-Euclidean structure for PMD over general policy classes (Sherman et al., 16 Feb 2025).
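An occupancy-weighted divergence of this kind can be computed directly in a small tabular MDP. The sketch below is an illustration under stated assumptions rather than the paper's construction: it assumes the discounted state-occupancy measure $\mu^\pi=(1-\gamma)\rho_0^\top(I-\gamma P_\pi)^{-1}$, uses KL as the per-state Bregman divergence, and the names `state_occupancy`, `kl_rows`, and `occupancy_weighted_kl` as well as the random MDP are hypothetical.

```python
import numpy as np

def state_occupancy(P, pi, rho0, gamma):
    """Discounted state-occupancy measure of pi.

    P[s, a, t] is the probability of moving from s to t under action a,
    pi[s, a] is the policy, rho0 the initial state distribution.
    """
    P_pi = np.einsum("sa,sat->st", pi, P)          # state-to-state kernel under pi
    n = P.shape[0]
    # mu^T = (1 - gamma) * rho0^T (I - gamma * P_pi)^{-1}
    return (1.0 - gamma) * np.linalg.solve((np.eye(n) - gamma * P_pi).T, rho0)

def kl_rows(p, q):
    """Row-wise KL(p[s] || q[s]) for strictly positive policies."""
    return np.sum(p * (np.log(p) - np.log(q)), axis=1)

def occupancy_weighted_kl(P, pi_ref, pi_a, pi_b, rho0, gamma):
    """E_{s ~ mu^{pi_ref}} [ KL(pi_a(.|s) || pi_b(.|s)) ]."""
    mu = state_occupancy(P, pi_ref, rho0, gamma)
    return float(mu @ kl_rows(pi_a, pi_b))

# Small random MDP (assumed data): 3 states, 2 actions.
rng = np.random.default_rng(0)
P = rng.random((3, 2, 3))
P /= P.sum(axis=2, keepdims=True)
rho0 = np.full(3, 1.0 / 3.0)
pi_ref = np.full((3, 2), 0.5)
pi_a = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])

mu = state_occupancy(P, pi_ref, rho0, gamma=0.9)
d_same = occupancy_weighted_kl(P, pi_ref, pi_a, pi_a, rho0, 0.9)
d_diff = occupancy_weighted_kl(P, pi_ref, pi_a, pi_ref, rho0, 0.9)
print(mu, d_same, d_diff)
```

The weighting by `mu` is what makes the geometry local: states that the reference policy rarely visits contribute little to the divergence, however different the two policies are there.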

3. RL formulations, TD critics, and convergence guarantees

The tabular discounted-MDP theory for PMD is now comparatively sharp. In TD-PMD, the actor step remains the PMD update above, while the critic is updated by one-step temporal difference evaluation, that is, by a single application of the Bellman evaluation operator $\mathcal{T}^{\pi_{k+1}}$ of the updated policy: $Q^{k+1}=\mathcal{T}^{\pi_{k+1}}Q^{k}.$ The central result is that, given exact policy evaluations, TD-PMD retains dimension-free $O(1/T)$ sublinear convergence for any constant step size and any initialization, and also admits dimension-free $\gamma$-rate linear convergence under adaptive step sizes (Liu et al., 23 Sep 2025). The same paper further shows policy-domain convergence for two common instances: TD-PQA enjoys finite-iteration policy convergence, while TD-NPG converges in policy space to an optimal policy (Liu et al., 23 Sep 2025).
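The actor-critic interplay described here can be sketched on a small random MDP. This is a hedged illustration, not the paper's algorithm listing: it assumes the entropic (NPG-style) mirror map, a known model, and a critic update that applies the Bellman evaluation operator of the new policy once in expectation. The names `td_npg`, `policy_value`, `optimal_value`, and all numerical settings are hypothetical.

```python
import numpy as np

def td_npg(P, r, gamma, eta, num_iters):
    """Entropic PMD actor with a one-step expected TD critic (tabular sketch)."""
    n_s, n_a = r.shape
    logits = np.zeros((n_s, n_a))                # uniform initial policy
    Q = np.zeros((n_s, n_a))                     # arbitrary critic initialization
    for _ in range(num_iters):
        # Actor: pi_{k+1}(a|s) proportional to pi_k(a|s) * exp(eta * Q^k(s, a)).
        logits = logits + eta * Q
        logits -= logits.max(axis=1, keepdims=True)
        pi = np.exp(logits)
        pi /= pi.sum(axis=1, keepdims=True)
        # Critic: one application of the Bellman evaluation operator of pi_{k+1}.
        V = np.sum(pi * Q, axis=1)
        Q = r + gamma * P @ V
    return pi, Q

def policy_value(P, r, gamma, pi):
    """Exact state values of pi via a linear solve."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = np.sum(pi * r, axis=1)
    return np.linalg.solve(np.eye(P.shape[0]) - gamma * P_pi, r_pi)

def optimal_value(P, r, gamma, iters=2000):
    """Optimal state values via value iteration."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        V = np.max(r + gamma * P @ V, axis=1)
    return V

# Random MDP (assumed data): 4 states, 3 actions, rewards in [0, 1].
rng = np.random.default_rng(0)
P = rng.random((4, 3, 4))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((4, 3))
gamma = 0.8

pi, Q = td_npg(P, r, gamma, eta=1.0, num_iters=2000)
gap = float(np.max(optimal_value(P, r, gamma) - policy_value(P, r, gamma, pi)))
print(gap)
```

The critic never solves for the value of the current policy; it only takes one backup per iteration, so the policy and the value estimate improve together.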

The inexact setting is treated under a generative model. There, TD-PMD achieves a last-iterate sample complexity

[Equation not recoverable from the source: last-iterate sample-complexity bound for TD-PMD.]

improving the previous PMD dependence on $1/(1-\gamma)$ by one factor (Liu et al., 23 Sep 2025). This is still not an online-regret theorem; it is a discounted-RL convergence theorem in which the PMD update is interpreted as a repeated online-style mirror step on the policy simplex.

A later extension removes the generative-model assumption and studies PMD with TD learning under online Markov data. “Policy Mirror Descent with Temporal Difference Learning: Sample Complexity under Online Markov Data” analyzes Expected TD-PMD and Approximate TD-PMD, both of which use the standard PMD actor update

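$\pi_{k+1}(\cdot|s) = \arg\max_{p\in\Delta(A)} \left\{ \eta_k \,\langle p,Q^k(s,\cdot)\rangle - D_h\!\big(p,\pi_k(\cdot|s)\big) \right\},$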

but update the critic from a single evolving Markov chain rather than from explicit high-accuracy evaluation (Li et al., 30 Dec 2025). Under a small enough constant policy step size, the algorithms achieve $\tilde{O}(\varepsilon^{-2})$ sample complexity for average-time $\varepsilon$-optimality; with adaptive policy step sizes, the sample complexity improves to $O(\varepsilon^{-2})$ for last-iterate $\varepsilon$-optimality (Li et al., 30 Dec 2025). This paper is therefore one of the clearest analyses of OPMD-like actor updates with realistic TD critics and dependent data.

4. General parameterization, occupancy geometry, and contextual coupling

Mirror descent is simplest when the policy class is the full product simplex, because the update decomposes state by state. The main theoretical difficulty for modern policy optimization is that practical actors are usually standalone parameterized policies rather than policies induced implicitly from critic values. “A Novel Framework for Policy Mirror Descent with General Parameterization and Linear Convergence” addresses this by lifting PMD to a parameterized score class $\mathcal{F}^\Theta=\{f^\theta:\theta\in\Theta\}$, fitting

$\theta_{k+1}\in\arg\min_{\theta\in\Theta}\big\|f^{\theta}-Q^{k}-\eta_k^{-1}\nabla h(\pi_k)\big\|^2_{L_2(v_k)},$

where $v_k$ is a sampling distribution over state-action pairs, and then projecting back to policy space via the Bregman-projected class $\Pi(\mathcal{F}^\Theta)$ (Alfano et al., 2023). The resulting AMPO framework yields the first linear-convergence guarantee for a policy-gradient-based method involving general parameterization, and with shallow ReLU networks it gives sample complexity

[Equation not recoverable from the source: AMPO sample-complexity bound for shallow ReLU networks.]

up to a nonvanishing error floor (Alfano et al., 2023).

A complementary line replaces strong closure assumptions by a weaker structural condition. “Convergence of Policy Mirror Descent Beyond Compatible Function Approximation” introduces variational gradient dominance (VGD),

[Equation not recoverable from the source: definition of the VGD condition.]

and analyzes approximate PMD in a local occupancy-weighted norm induced by the current policy (Sherman et al., 16 Feb 2025). The paper treats PMD as a form of smooth non-convex optimization in non-Euclidean space and derives convergence to the best-in-class policy, not necessarily to the globally optimal policy, under general convex policy classes (Sherman et al., 16 Feb 2025). This is a direct attempt to preserve mirror-descent structure under function approximation without the usual compatible-function-approximation machinery.

The remaining obstacle is what “Beyond State-Wise Mirror Descent: Offline Policy Optimization with Parametric Policies” calls contextual coupling. When the actor is a shared-parameter policy $\pi_\theta(a\mid s)$, changing $\theta$ changes $\pi_\theta(\cdot\mid s)$ for many states at once, so the feasible set is no longer a product of independent per-state simplices. The paper identifies contextual coupling as the core difficulty when extending mirror descent to parameterized policies, and argues that connecting mirror descent to natural policy gradient yields new analyses, guarantees, and algorithmic insights (Li et al., 27 Feb 2026). In the terminology of OPMD, this is the point where state-wise policy-space mirror descent and parameter-space policy optimization decisively diverge.
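Contextual coupling is easy to see numerically. The sketch below is an illustration, not code from the cited paper: it compares a tabular softmax policy with independent logits per state against a linear softmax policy whose parameters are shared across states. The feature tensor `phi`, the helper names, and the perturbation size are all assumptions made for the example.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def tabular_policy(theta):
    """theta[s, a]: independent logits for every state (product of simplices)."""
    return softmax(theta)

def linear_policy(theta, phi):
    """theta[d] shared across states; phi[s, a, d] are state-action features."""
    return softmax(phi @ theta)

def states_changed(pi_old, pi_new, tol=1e-12):
    """Number of states whose action distribution moved."""
    return int(np.sum(np.abs(pi_new - pi_old).max(axis=1) > tol))

n_s, n_a, d = 6, 3, 2
rng = np.random.default_rng(0)
phi = rng.normal(size=(n_s, n_a, d))        # assumed random features

# Tabular: perturb a single coordinate of the parameter.
theta_tab = np.zeros((n_s, n_a))
theta_tab_new = theta_tab.copy()
theta_tab_new[0, 0] += 0.5
tab_changed = states_changed(tabular_policy(theta_tab), tabular_policy(theta_tab_new))

# Shared parameters: perturb a single coordinate of the parameter.
theta_lin = np.zeros(d)
theta_lin_new = theta_lin.copy()
theta_lin_new[0] += 0.5
lin_changed = states_changed(linear_policy(theta_lin, phi), linear_policy(theta_lin_new, phi))

print(tab_changed, lin_changed)
```

In the tabular case one coordinate touches one state, so a per-state mirror step is well defined; with shared parameters the same one-coordinate change moves the distribution at every state, which is why the update can no longer be solved state by state.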

5. Optimism, self-play, and game-theoretic OPMD

One of the clearest direct instantiations of OPMD appears in general-preference LLM alignment. “Improving LLM General Preference Alignment via Optimistic Online Mirror Descent” formulates alignment as a two-player zero-sum game over response distributions,

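$\max_{\pi_1}\min_{\pi_2}\ J(\pi_1,\pi_2),\qquad J(\pi_1,\pi_2)=\mathbb{E}_{y_1\sim\pi_1,\;y_2\sim\pi_2}\big[\mathbb{P}(y_1\succ y_2)\big],$

and measures equilibrium error by the duality gap

$\mathrm{DualGap}(\pi)=\max_{\pi_1}J(\pi_1,\pi)-\min_{\pi_2}J(\pi,\pi_2).$

The optimistic policy-space update is

[Equation not recoverable from the source: optimistic KL-regularized policy update.]

with a predictor term supplying the optimistic hint (Zhang et al., 24 Feb 2025).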

This construction is policy mirror descent with an optimistic hint in KL geometry, and the theoretical gain is explicit: for the average policy $\bar\pi$ after $T$ iterations,

$\mathrm{DualGap}(\bar\pi)=O(T^{-1}),$

improving the $O(T^{-1/2})$ duality-gap rate of the non-optimistic OMD baseline (Zhang et al., 24 Feb 2025). The same paper also emphasizes that the theory is in policy space, not parameter space, and that the practical update can be implemented by minimizing a direct preference loss over sampled winner–loser pairs rather than by fitting a Bradley–Terry reward model (Zhang et al., 24 Feb 2025).
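The effect of the optimistic hint can be explored on a toy zero-sum matrix game, which stands in here for the preference game. This is a minimal sketch under stated assumptions, not the paper's alignment algorithm: it uses the standard two-sequence optimistic mirror descent with negative entropy, takes the previous gradient as the predictor, and the payoff matrix, initial strategies, step size, and function names are all hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    w = np.exp(z)
    return w / w.sum()

def duality_gap(A, x, y):
    """Duality gap of (x, y) in the game max_x min_y x^T A y."""
    return float(np.max(A @ y) - np.min(A.T @ x))

def self_play(A, x0, y0, eta, T, optimistic):
    """Entropic (optimistic) mirror descent in self-play; returns average strategies."""
    zx, zy = np.log(x0), np.log(y0)                 # log-weights of the secondary iterates
    mx, my = np.zeros_like(x0), np.zeros_like(y0)   # predictors (previous gradients)
    x_sum, y_sum = np.zeros_like(x0), np.zeros_like(y0)
    for _ in range(T):
        # Played strategies: secondary iterate, shifted by the hint if optimistic.
        x = softmax(zx + eta * mx) if optimistic else softmax(zx)
        y = softmax(zy + eta * my) if optimistic else softmax(zy)
        gx, gy = A @ y, -(A.T @ x)                  # payoff gradients for each player
        zx, zy = zx + eta * gx, zy + eta * gy       # mirror step on the secondary iterates
        mx, my = gx, gy                             # next hint: the gradient just observed
        x_sum += x
        y_sum += y
    return x_sum / T, y_sum / T

# Rock-paper-scissors payoff for the row player (assumed toy game).
A = np.array([[0.0, -1.0, 1.0],
              [1.0, 0.0, -1.0],
              [-1.0, 1.0, 0.0]])
x0 = np.array([0.6, 0.3, 0.1])
y0 = np.array([0.2, 0.3, 0.5])

gap_plain = duality_gap(A, *self_play(A, x0, y0, eta=0.1, T=5000, optimistic=False))
gap_opt = duality_gap(A, *self_play(A, x0, y0, eta=0.1, T=5000, optimistic=True))
print(gap_plain, gap_opt)
```

Printing both gaps for increasing `T` gives a quick empirical view of how the averaged strategies approach equilibrium with and without the hint.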

This game-theoretic formulation helps clarify a broader point. In single-agent discounted RL, PMD is usually analyzed as a repeated regularized improvement step; in self-play or general-preference games, the same mirror structure becomes a no-regret or optimistic no-regret dynamic over policy distributions. OPMD is therefore not tied to one feedback model. What persists across formulations is the geometry: a policy, a mirror map, and a first-order surrogate.

6. Regularizer choice, approximation error, and adaptive geometry

The numerical behavior of OPMD depends sharply on the mirror map. “The Hidden Cost of Approximation in Online Mirror Descent” studies inexact OMD and shows that when the regularizer is uniformly smooth, the excess regret caused by an additive error in each mirror step is

[Equation not recoverable from the source: excess-regret bound as a function of the mirror-step error.]

and that this dependence is tight (Schlisselberg et al., 27 Nov 2025). For barrier regularizers over the simplex and its subsets, the paper identifies a sharp separation: negative entropy requires exponentially small errors to avoid linear regret, whereas log-barrier and Tsallis regularizers remain robust even when the errors are only polynomial (Schlisselberg et al., 27 Nov 2025). On the full simplex under stochastic losses, negative entropy regains robustness, but the paper also proves that this property does not extend to arbitrary subsets of the simplex (Schlisselberg et al., 27 Nov 2025). A plausible implication is that entropy-regularized OPMD over occupancy-measure polytopes may inherit the same fragility, whereas barrier- or Tsallis-based geometries may be numerically safer.

A second instability arises from dynamic learning rates. “Online mirror descent and dual averaging: keeping pace in the dynamic case” shows that classical OMD can suffer linear regret under nonconstant step sizes, even in simplex settings such as prediction with expert advice, while dual averaging does not (Fang et al., 2020). The proposed fix is stabilization, producing dual-stabilized OMD with regret

[Equation not recoverable from the source: regret bound for dual-stabilized OMD.]

under strong convexity (Fang et al., 2020). In simplex/interior settings, dual-stabilized OMD and dual averaging can even generate the same iterates (Fang et al., 2020). Since entropy-regularized policies live on simplices, this stabilization issue is structurally relevant to OPMD even though the theorem itself is an OCO result.
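The relationship between OMD and dual averaging on the simplex can be checked numerically. The sketch below is an illustration under stated assumptions, not the stabilized algorithm of the cited paper: it implements plain entropic OMD and plain entropic dual averaging from a uniform start, with made-up random losses and hypothetical function names, and compares their iterates under a constant and a decaying step-size schedule.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def omd_entropy(losses, etas):
    """Entropic OMD: x_{t+1} proportional to x_t * exp(-eta_t * g_t)."""
    logits = np.zeros(losses.shape[1])
    iterates = []
    for g, eta in zip(losses, etas):
        logits = logits - eta * g           # each gradient keeps its own step size
        iterates.append(softmax(logits))
    return np.array(iterates)

def dual_averaging_entropy(losses, etas):
    """Entropic dual averaging: x_{t+1} proportional to exp(-eta_t * sum_{s<=t} g_s)."""
    cumulative = np.zeros(losses.shape[1])
    iterates = []
    for g, eta in zip(losses, etas):
        cumulative = cumulative + g
        iterates.append(softmax(-eta * cumulative))  # current step size rescales all past gradients
    return np.array(iterates)

rng = np.random.default_rng(0)
T, d = 50, 4
losses = rng.random((T, d))                 # assumed random linear losses

eta_const = np.full(T, 0.3)
eta_decay = 1.0 / np.sqrt(np.arange(1, T + 1))

same_const = np.allclose(omd_entropy(losses, eta_const),
                         dual_averaging_entropy(losses, eta_const))
same_decay = np.allclose(omd_entropy(losses, eta_decay),
                         dual_averaging_entropy(losses, eta_decay))
print(same_const, same_decay)
```

With a constant step size the two methods coincide on the simplex interior, while a time-varying step size makes them diverge because OMD never revisits the weights it gave to earlier gradients. That asymmetry is the mechanism behind the instability discussed above.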

A third issue is that the best geometry may be unknown. “Improved Regret Guarantees for Online Mirror Descent using a Portfolio of Mirror Maps” proves that mirror maps based on block norms adapt better to sparsity than previous $L_p$ interpolations, and constructs online convex optimization instances where block-norm mirror maps achieve a polynomial improvement over both online projected gradient descent (OPGD) and online exponentiated gradient (OEG) for sparse losses (Gupta et al., 13 Feb 2026). When the sparsity level is unknown, the paper shows that naively switching between OEG and OPGD can incur linear regret, then proposes a multiplicative-weights meta-algorithm over a family of block norms with regret

[Equation not recoverable from the source: regret bound for the multiplicative-weights meta-algorithm.]

and an adaptive guarantee within a stated factor of the best block norm in hindsight (Gupta et al., 13 Feb 2026). For OPMD, this suggests that the mirror map itself may need to be selected online rather than fixed a priori.
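The idea of selecting a geometry online can be illustrated with a generic Hedge (multiplicative-weights) learner placed on top of two base OMD learners. This is only a sketch under stated assumptions and is not the block-norm meta-algorithm of the cited paper: the portfolio contains just OPGD and OEG on the simplex, the losses are random linear losses in $[0,1]$, and `run_portfolio`, `project_simplex`, and the step sizes are hypothetical choices.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def run_portfolio(losses, eta_base, eta_meta):
    """Hedge over two base learners (OPGD and OEG) with linear losses."""
    T, d = losses.shape
    x_pgd = np.full(d, 1.0 / d)             # Euclidean-geometry learner
    x_eg = np.full(d, 1.0 / d)              # entropic-geometry learner
    w = np.array([0.5, 0.5])                # meta-weights over the two geometries
    cum_base = np.zeros(2)
    cum_meta = 0.0
    for g in losses:
        base_losses = np.array([x_pgd @ g, x_eg @ g])
        # The meta-iterate is w[0]*x_pgd + w[1]*x_eg; its linear loss is w @ base_losses.
        cum_meta += float(w @ base_losses)
        cum_base += base_losses
        # Meta update: multiplicative weights on the base learners' losses.
        w = w * np.exp(-eta_meta * base_losses)
        w /= w.sum()
        # Base updates: one mirror step in each geometry.
        x_pgd = project_simplex(x_pgd - eta_base * g)
        x_eg = x_eg * np.exp(-eta_base * g)
        x_eg /= x_eg.sum()
    return cum_meta, cum_base

rng = np.random.default_rng(0)
T, d = 500, 10
losses = rng.random((T, d))                 # assumed losses with entries in [0, 1]
eta_meta = np.sqrt(8.0 * np.log(2.0) / T)   # standard Hedge tuning for two experts

cum_meta, cum_base = run_portfolio(losses, eta_base=0.1, eta_meta=eta_meta)
hedge_slack = np.sqrt(T * np.log(2.0) / 2.0)
print(cum_meta, cum_base, hedge_slack)
```

Because the losses are linear, the mixture's loss equals the weighted average of the base losses, so the classical Hedge bound guarantees that the combined learner is never worse than the better geometry by more than `hedge_slack`. Mixing iterates in this way is different from hard switching between learners, which is the strategy the paper shows can fail.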

7. Adjacent formulations: dynamic prediction, MPC, and value-space mirror descent

Dynamic Mirror Descent in OCO augments the mirror step with a predictive map $\Phi_t$: $\tilde\theta_{t+1}=\arg\min_{\theta\in\Theta}\big\{\eta_t\langle\nabla\ell_t(\hat\theta_t),\theta\rangle+D_h(\theta,\hat\theta_t)\big\},\quad \hat\theta_{t+1}=\Phi_t(\tilde\theta_{t+1}).$ Its regret against a comparator sequence $\theta_1,\dots,\theta_T$ depends on the model-mismatch quantity

$V_\Phi=\sum_{t=1}^{T-1}\big\|\theta_{t+1}-\Phi_t(\theta_t)\big\|,$

yielding $O\big(\sqrt{T}\,(1+V_\Phi)\big)$-type guarantees (Hall et al., 2013). The paper is explicit that this is Dynamic Mirror Descent, not Online Policy Mirror Descent. Still, it provides a clean template for predictive or model-based mirror updates that is often conceptually adjacent to OPMD.
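The benefit of a predictive map is visible in a simple tracking problem. The following sketch is an illustration under stated assumptions, not the paper's experimental setup: it uses Euclidean geometry on an unconstrained domain (so the mirror step is a plain gradient step), a target that rotates by a fixed angle each round, quadratic losses, and a predictive map equal to the true rotation. The function names and constants are hypothetical.

```python
import numpy as np

def rotation(alpha):
    c, s = np.cos(alpha), np.sin(alpha)
    return np.array([[c, -s], [s, c]])

def cumulative_loss(T, eta, alpha, use_dynamics):
    """Track a rotating target with losses 0.5 * ||theta - target_t||^2."""
    R = rotation(alpha)
    target = np.array([1.0, 0.0])
    theta = np.zeros(2)
    total = 0.0
    for _ in range(T):
        total += 0.5 * float(np.sum((theta - target) ** 2))
        grad = theta - target
        theta_tilde = theta - eta * grad          # Euclidean mirror step
        # Dynamic variant: push the iterate through the predictive map.
        theta = R @ theta_tilde if use_dynamics else theta_tilde
        target = R @ target                       # environment moves on
    return total

loss_dmd = cumulative_loss(T=200, eta=0.5, alpha=0.1, use_dynamics=True)
loss_static = cumulative_loss(T=200, eta=0.5, alpha=0.1, use_dynamics=False)
print(loss_dmd, loss_static)
```

When the predictive map matches the true dynamics, the tracking error contracts every round, whereas the static update keeps lagging behind the target by a fixed amount. A mismatched map would sit between these two extremes, which is what the model-mismatch term in the regret bound quantifies.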

DMD-MPC is closer to control than to policy optimization. In that setting, the optimization variable is the parameter $\eta$ of a finite-horizon control distribution $\pi_\eta$, and the inner-loop update is

[Equation not recoverable from the source: DMD-MPC inner-loop update of the control-distribution parameter.]

usually with KL geometry over an exponential-family action-sequence distribution (Samineni, 2021). The paper stresses that this is not standard OPMD over stationary policies; it is mirror descent over MPC control distributions embedded inside a hybrid model-based/model-free RL architecture (Samineni, 2021).

A value-space analogue has recently appeared in “Value Mirror Descent for Reinforcement Learning”. VMD and its stochastic variant SVMD integrate mirror descent into value iteration, attain

[Equation not recoverable from the source: sample-complexity bound for general convex regularizers.]

sample complexity for general convex regularizers and

[Equation not recoverable from the source: sample-complexity bound under a strongly convex regularizer.]

under a strongly convex regularizer, and, crucially, establish bounded or convergent Bregman divergence between the generated and optimal policies (Jia et al., 7 Apr 2026). The paper presents this policy-stability property as important for enabling effective online (continual) learning following offline training (Jia et al., 7 Apr 2026). That does not make VMD an OPMD method, but it gives a precise offline-to-online bridge: mirror-descent value methods can produce policy iterates that are already controlled in the same Bregman geometries used by policy-space mirror descent.

OPMD therefore sits within a broader mirror-descent landscape. Its most direct forms are policy-space PMD and optimistic policy-space OMD; its closest relatives include dynamic predictive mirror descent, mirror-descent MPC over control-sequence distributions, and value-space mirror methods that preserve policy divergence structure. The central unifying theme is not the name but the geometry: a first-order surrogate, a Bregman divergence, and an update defined in the space where policies are compared.