
Mirror Descent with Bregman Divergence

Updated 1 January 2026
  • Mirror descent with Bregman divergence is a geometry-aware optimization framework that leverages strictly convex mirror maps to respect problem structures like sparsity and manifold constraints.
  • The method uses dual gradient steps and Bregman projections to ensure robust convergence, achieving sublinear to linear rates depending on the convexity properties of the objective.
  • It underpins a wide range of applications, from machine learning and statistics to control and reinforcement learning, and supports distributed and primal-dual optimization settings.

Mirror descent with Bregman divergence is a general first-order optimization framework that extends classical gradient descent by leveraging non-Euclidean geometries, as defined by strictly convex "mirror maps." The essence of mirror descent is the utilization of Bregman divergence—generated by a mirror map—to measure proximity and dictate update steps, enabling algorithms to respect problem structure such as sparsity, simplex constraints, and manifold geometry. This approach yields robust convergence guarantees over a wide family of domains, supports primal-dual and distributed settings, and underpins a vast array of applications in statistics, machine learning, control, and reinforcement learning.

1. Definition and General Framework

Let $h: \mathsf{dom}(h) \to \mathbb{R}$ be a strictly convex, differentiable function with open domain, referred to as the "mirror map." Given an optimization problem $\min_{x \in X} f(x)$ over a closed convex set $X \subset \mathbb{R}^n$, the Bregman divergence associated with $h$ is defined as

$$D_h(p, q) = h(p) - h(q) - \langle \nabla h(q), p - q \rangle$$

where $p, q \in X \cap \mathrm{dom}(h)$. $D_h$ is nonnegative, convex in its first argument, and recovers the squared Euclidean distance for $h(x) = \frac{1}{2}\Vert x\Vert_2^2$. The mirror descent update is specified by:

  • Dual step: $\nabla h(y^{t+1}) = \nabla h(x^t) - \eta_t \nabla f(x^t)$
  • Primal projection: $x^{t+1} = \arg\min_{x \in X}\{ D_h(x, y^{t+1}) \}$ or, equivalently,

$$x^{t+1} = \arg\min_{x \in X}\left\{ \langle \nabla f(x^t), x - x^t \rangle + \frac{1}{\eta_t} D_h(x, x^t)\right\}$$

This framework respects the geometry of $X$ by encoding it into the choice of $h$, and supports a variety of optimization landscapes (Raskutti et al., 2013).
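As a sanity check on the two-step update, a minimal sketch (not from the source; the box constraint, quadratic objective, and step size are illustrative assumptions) shows that with the Euclidean mirror map $h(x) = \frac{1}{2}\Vert x\Vert_2^2$, where $\nabla h$ is the identity, mirror descent reduces to ordinary projected gradient descent:

```python
import numpy as np

# Illustrative problem: minimize f(x) = ||x - b||^2 / 2 over the box X = [0, 1]^n.
b = np.array([1.5, -0.5, 0.3])
grad_f = lambda x: x - b
eta = 0.5

x = np.zeros(3)
for _ in range(100):
    # Dual step with h(x) = ||x||^2 / 2, so grad h is the identity map.
    y = x - eta * grad_f(x)
    # Bregman projection onto X; for this h it is the Euclidean projection (clipping).
    x = np.clip(y, 0.0, 1.0)

print(x)  # approaches the Euclidean projection of b onto the box: [1.0, 0.0, 0.3]
```

Swapping in a different mirror map changes only the dual step and the projection, which is the sense in which the geometry is "encoded in $h$."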

2. Specialization to Probability Simplex: Entropic Mirror Descent

On the probability simplex $\Delta^{n-1} = \{x \in \mathbb{R}^n_{>0}: \sum_i x_i = 1\}$, the canonical mirror map is the negative entropy $h(x) = \sum_i x_i \log x_i$. For this choice,

$$D_h(p, q) = \sum_i p_i \log\left(\frac{p_i}{q_i}\right)$$

which is simply the Kullback-Leibler divergence when $p, q \in \Delta^{n-1}$. The Bregman projection onto the simplex reduces to normalization, $\mathrm{proj}_\Delta(y) = y / (1^{\top} y)$, and the mirror descent update becomes

$$y^{t+1} = x^t \odot \exp(-\eta_t \nabla f(x^t)), \qquad x^{t+1} = \frac{y^{t+1}}{1^{\top} y^{t+1}}$$

This entropic form arises naturally in learning probability distributions, portfolio optimization, and boosting (Halder, 2018).
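The multiplicative update and normalization can be sketched directly (a minimal illustration, not from the source; the quadratic objective with minimizer $p$ in the simplex interior and the step size are assumptions chosen for demonstration):

```python
import numpy as np

# Illustrative objective: f(x) = ||x - p||^2 / 2, with minimizer p inside the simplex.
p = np.array([0.5, 0.3, 0.2])
grad_f = lambda x: x - p
eta = 1.0

x = np.full(3, 1.0 / 3.0)             # start at the simplex barycenter
for _ in range(500):
    y = x * np.exp(-eta * grad_f(x))  # multiplicative (dual) step
    x = y / y.sum()                   # Bregman (KL) projection = normalization

print(x)  # converges to p = [0.5, 0.3, 0.2]
```

Note that the iterates stay strictly positive by construction, so the simplex constraint never needs an explicit projection beyond normalization.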

3. Variational Principle and Fixed Point Structure

Mirror descent is variationally equivalent to minimizing a composite objective of the form

$$\Phi(x) = D_{KL}(x \parallel c) + H(1-x)$$

where $D_{KL}$ is the Kullback-Leibler divergence to a reference vector $c$ ("influence"), and $H(1-x)$ is the "extropy" (entropy of the complement). The fixed point $x^*$ of mirror descent (e.g., the DeGroot–Friedkin map) solves

$$x^* = \arg\min_{x \in \Delta^{n-1}} \Phi(x)$$

Strict convexity ensures existence and uniqueness of $x^*$, and standard Lyapunov arguments establish convergence (Halder, 2018).

4. Dual Geometry and Natural Gradient Descent

Mirror descent can be viewed as gradient descent in the dual Riemannian geometry, with the metric tensor given by the Hessian $\nabla^2 h$. The Legendre transform $h^*$ defines the dual geometry, and the update in dual coordinates corresponds to natural gradient descent:

$$\nabla h(\theta_{t+1}) = \nabla h(\theta_t) - \eta_t \nabla f(\theta_t)$$

which, via the chain rule, becomes steepest descent on the dual manifold (Raskutti et al., 2013). In exponential families, mirror descent with negative entropy achieves asymptotic statistical efficiency, attaining the Cramér–Rao lower bound for parameter estimation.
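A concrete instance of this duality (a sketch under the assumption of the negative-entropy mirror on the positive orthant, without the simplex constraint): here $\nabla h(x) = \log x + 1$, the Legendre conjugate is $h^*(\theta) = \sum_i e^{\theta_i - 1}$, and $\nabla h^*$ inverts $\nabla h$, so the dual-coordinate update is an additive gradient step followed by the pull-back $\nabla h^*$:

```python
import numpy as np

# Negative entropy h(x) = sum_i x_i log x_i on the positive orthant.
grad_h = lambda x: np.log(x) + 1.0          # gradient of the mirror map
grad_h_star = lambda th: np.exp(th - 1.0)   # gradient of the Legendre conjugate h*

x = np.array([0.2, 1.0, 3.0])
# grad h* inverts grad h: mapping to dual coordinates and back is the identity.
assert np.allclose(grad_h_star(grad_h(x)), x)

# One dual-coordinate step for the linear objective f(x) = <c, x>.
c = np.array([1.0, -1.0, 0.5])
eta = 0.1
theta = grad_h(x) - eta * c                  # additive step in the dual space
x_next = grad_h_star(theta)                  # primal update: x * exp(-eta * c)
assert np.allclose(x_next, x * np.exp(-eta * c))
```

The last assertion makes the geometry visible: an additive step in dual coordinates is a multiplicative step in primal coordinates.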

5. Convergence Analysis and Lyapunov Perspective

Mirror descent admits rigorous convergence guarantees:

  • Sublinear $O(1/k)$ convergence for convex objectives with constant step size
  • Linear (geometric) rate $O(q^k)$ for strongly convex objectives, where $q$ depends on the strong convexity of $h$ and $f$
  • For mirror descent steps, the Bregman divergence $D_h(x^*, x^k)$ serves as a Lyapunov function
  • Integral Quadratic Constraint (IQC) analyses show that the Bregman Lyapunov function is a special case of Popov-criterion storage functions, enabling tight rates via matrix inequalities (Li et al., 2022, Li et al., 2023).
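The Lyapunov property can be checked numerically (an illustrative sketch, not from the cited analyses; the quadratic objective, its minimizer $x^* = p$ in the simplex interior, and the step size are assumptions): along an entropic mirror descent trajectory, $D_h(x^*, x^k)$ decreases monotonically.

```python
import numpy as np

# Entropic mirror descent on f(x) = ||x - p||^2 / 2 over the simplex; here x* = p.
p = np.array([0.5, 0.3, 0.2])
eta = 0.5

def kl(a, b):
    # Bregman divergence of negative entropy restricted to the simplex (KL divergence).
    return float(np.sum(a * np.log(a / b)))

x = np.full(3, 1.0 / 3.0)
lyapunov = [kl(p, x)]                 # track D_h(x*, x^k) along the trajectory
for _ in range(50):
    y = x * np.exp(-eta * (x - p))    # dual step
    x = y / y.sum()                   # Bregman projection (normalization)
    lyapunov.append(kl(p, x))

# The Bregman divergence to the minimizer never increases.
assert all(a >= b for a, b in zip(lyapunov, lyapunov[1:]))
```

The monotone decrease relies on the step size being compatible with the relative smoothness of $f$ with respect to $h$; an oversized step can break it.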

6. Practical Applications and Specialized Algorithms

Mirror descent with Bregman divergence is foundational in:

  • Composite, distributed, and online optimization, where geometry-aware updates outperform Euclidean approaches (Yuan et al., 2020, Chen et al., 2021)
  • Policy optimization in reinforcement learning, where PMD-style updates guarantee finite-step optimality and adapt to geometry-inducing divergences (Lin et al., 2022)
  • Stochastic control, both with vector-valued and measure-valued actions. Relative smoothness and strong convexity with respect to DhD_h provide linear or exponential rates, depending on regularization (Sethi et al., 3 Jun 2025, Kerimkulov et al., 2024)
  • Implicit regularization in separable data: choice of the mirror map directly affects margin bounds and learning behavior (Li et al., 2021)
  • Optimization over curved manifolds and norm-constrained sets: dual-norm mirror descent and generalized logarithmic mirrors extend the method to non-Euclidean settings, often yielding closed-form projection-free updates (Nock et al., 2016, Cichocki, 8 Jun 2025)
  • Statistical learning in exponential families, phase retrieval, optimal transport (Sinkhorn algorithm as a mirror descent with KL divergence) (Godeme et al., 2022, 2002.03758)
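The optimal-transport connection in the last bullet can be illustrated with a minimal Sinkhorn iteration (a standard textbook sketch, not taken from the cited papers; the cost matrix, marginals, and regularization level are illustrative): each alternating scaling is a KL/Bregman projection onto one marginal-constraint set.

```python
import numpy as np

# Entropic optimal transport: minimize <C, P> + eps * KL-regularization
# subject to row marginal r and column marginal c.
rng = np.random.default_rng(0)
C = rng.random((4, 5))                 # illustrative cost matrix
r = np.full(4, 1 / 4)                  # row marginal
c = np.full(5, 1 / 5)                  # column marginal
eps = 0.5                              # entropic regularization strength

K = np.exp(-C / eps)                   # Gibbs kernel
u = np.ones(4)
v = np.ones(5)
for _ in range(500):
    u = r / (K @ v)                    # KL projection onto the row-marginal set
    v = c / (K.T @ u)                  # KL projection onto the column-marginal set

P = u[:, None] * K * v[None, :]        # recovered transport plan
assert np.allclose(P.sum(axis=1), r) and np.allclose(P.sum(axis=0), c)
```

Viewed this way, Sinkhorn is exactly alternating Bregman projections under the KL divergence, which is the mirror-descent reading referenced above.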

7. Algorithmic Templates and Implementation Considerations

The prototypical mirror descent algorithm is:

for t in range(1, T + 1):
    g_t = grad_f(x_t)                    # g_t = ∇f(x_t)
    dual = grad_h(x_t) - eta_t * g_t     # dual (mirror) step: ∇h(x_t) − η_t g_t
    x_t = grad_h_inv(dual)               # pull back via (∇h)^{-1}, plus a Bregman projection if X is constrained

  • The efficiency of mirror descent depends on the choice of $h$ and the tractability of inverting $\nabla h$.
  • For simplex domains, exponentiated-gradient and its generalizations (Tempesta logarithms, Tsallis/Kaniadakis mirrors) provide closed-form updates and enable domain adaptation via hyperparameters (Cichocki, 8 Jun 2025).
  • For distributed and non-smooth optimization, Bregman damping and ergodic gap analysis yield $O(1/t)$ rates (for saddle point/constrained problems) (Chen et al., 2021).

8. Theoretical and Empirical Insights

Mirror descent unifies proximal, primal-dual, and natural gradient frameworks. The geometry is entirely governed by the mirror map and its Bregman divergence, providing both interpretability and flexibility. Analysis via IQC and Lyapunov methods confirms the tightness of classical rates and allows systematic extension to advanced settings—stochastic, distributed, measure-valued, nonconvex—maintaining robust guarantees (Li et al., 2023, Fatkhullin et al., 2024). Proper tuning of the underlying mirror geometry yields optimal statistical efficiency and domain-adaptive regularization.

Summary Table: Core Components

| Component | Definition / Role | Classical Case |
|---|---|---|
| Mirror map $h(\cdot)$ | Strictly convex, differentiable potential encoding the geometry of $X$ | $h(x) = \frac{1}{2}\Vert x\Vert_2^2$ |
| Bregman divergence | $D_h(p, q) = h(p) - h(q) - \langle \nabla h(q), p - q \rangle$ | Squared Euclidean distance |
| Dual step | $\nabla h(y^{t+1}) = \nabla h(x^t) - \eta_t \nabla f(x^t)$ | Additive (Euclidean) update |
| Primal projection | $x^{t+1} = \arg\min_{x \in X} D_h(x, y^{t+1})$ | Standard Euclidean projection |
| Typical geometry | Simplex (entropy), orthant, manifold, dual-norm, measure space | $\mathbb{R}^n$ |
| Convergence rate | $O(1/k)$ (convex); $O(q^k)$ (strongly convex); exponential (strong regularizer) | Same under Euclidean geometry |

The generality, geometry-awareness, and provable efficiency of mirror descent with Bregman divergence position it as a cornerstone method in modern convex, stochastic, distributed, and nonconvex optimization.
