
Mirror Descent with Bregman Divergence

Updated 1 January 2026
  • Mirror descent with Bregman divergence is a geometry-aware optimization framework that leverages strictly convex mirror maps to respect problem structures like sparsity and manifold constraints.
  • The method uses dual gradient steps and Bregman projections to ensure robust convergence, achieving sublinear to linear rates depending on the convexity properties of the objective.
  • It underpins a wide range of applications, from machine learning and statistics to control and reinforcement learning, and supports distributed and primal-dual optimization settings.

Mirror descent with Bregman divergence is a general first-order optimization framework that extends classical gradient descent by leveraging non-Euclidean geometries, as defined by strictly convex "mirror maps." The essence of mirror descent is the utilization of Bregman divergence—generated by a mirror map—to measure proximity and dictate update steps, enabling algorithms to respect problem structure such as sparsity, simplex constraints, and manifold geometry. This approach yields robust convergence guarantees over a wide family of domains, supports primal-dual and distributed settings, and underpins a vast array of applications in statistics, machine learning, control, and reinforcement learning.

1. Definition and General Framework

Let $h: \mathsf{dom}(h) \to \mathbb{R}$ be a strictly convex, differentiable function with open domain, referred to as the "mirror map." Given an optimization problem $\min_{x \in X} f(x)$ over a closed convex set $X \subset \mathbb{R}^n$, the Bregman divergence associated with $h$ is defined as

$$D_h(p, q) = h(p) - h(q) - \langle \nabla h(q), p - q \rangle$$

where $p, q \in X \cap \mathrm{dom}(h)$. $D_h$ is nonnegative, convex in its first argument, and recovers the squared Euclidean distance for $h(x) = \frac{1}{2}\Vert x\Vert_2^2$. The mirror descent update is specified by:

  • Dual step: $\nabla h(y^{t+1}) = \nabla h(x^t) - \eta_t \nabla f(x^t)$
  • Primal projection: $x^{t+1} = \arg\min_{x \in X}\{ D_h(x, y^{t+1}) \}$ or, equivalently,

$$x^{t+1} = \arg\min_{x \in X}\left\{ \langle \nabla f(x^t), x - x^t \rangle + \frac{1}{\eta_t} D_h(x, x^t)\right\}$$

This framework respects the geometry of $X$ by encoding it into the choice of $h$, and supports a variety of optimization landscapes (Raskutti et al., 2013).
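As a sanity check on the two-step update, a minimal sketch (not from the source; the box constraint, quadratic objective, and step size are illustrative assumptions) shows that with the Euclidean mirror map $h(x) = \frac{1}{2}\Vert x\Vert_2^2$, where $\nabla h$ is the identity, mirror descent reduces to ordinary projected gradient descent:

```python
import numpy as np

# Illustrative problem: minimize f(x) = ||x - b||^2 / 2 over the box X = [0, 1]^n.
b = np.array([1.5, -0.5, 0.3])
grad_f = lambda x: x - b
eta = 0.5

x = np.zeros(3)
for _ in range(100):
    # Dual step with h(x) = ||x||^2 / 2, so grad h is the identity map.
    y = x - eta * grad_f(x)
    # Bregman projection onto X; for this h it is the Euclidean projection (clipping).
    x = np.clip(y, 0.0, 1.0)

print(x)  # approaches the Euclidean projection of b onto the box: [1.0, 0.0, 0.3]
```

Swapping in a different mirror map changes only the dual step and the projection, which is the sense in which the geometry is "encoded in $h$."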

2. Specialization to Probability Simplex: Entropic Mirror Descent

On the probability simplex $\Delta^{n-1} = \{x \in \mathbb{R}^n_{>0}: \sum_i x_i = 1\}$, the canonical mirror map is the negative entropy $h(x) = \sum_i x_i \log x_i$. For this choice,

$$D_h(p, q) = \sum_i p_i \log\left(\frac{p_i}{q_i}\right)$$

which is simply the Kullback-Leibler divergence when $p, q \in \Delta^{n-1}$. The Bregman projection onto the simplex reduces to normalization, $\mathrm{proj}_\Delta(y) = y / (1^{\top} y)$, and the mirror descent update becomes

$$y^{t+1} = x^t \odot \exp(-\eta_t \nabla f(x^t)), \qquad x^{t+1} = \frac{y^{t+1}}{1^{\top} y^{t+1}}$$

This entropic form arises naturally in learning probability distributions, portfolio optimization, and boosting (Halder, 2018).
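The multiplicative update and normalization can be sketched directly (a minimal illustration, not from the source; the quadratic objective with minimizer $p$ in the simplex interior and the step size are assumptions chosen for demonstration):

```python
import numpy as np

# Illustrative objective: f(x) = ||x - p||^2 / 2, with minimizer p inside the simplex.
p = np.array([0.5, 0.3, 0.2])
grad_f = lambda x: x - p
eta = 1.0

x = np.full(3, 1.0 / 3.0)             # start at the simplex barycenter
for _ in range(500):
    y = x * np.exp(-eta * grad_f(x))  # multiplicative (dual) step
    x = y / y.sum()                   # Bregman (KL) projection = normalization

print(x)  # converges to p = [0.5, 0.3, 0.2]
```

Note that the iterates stay strictly positive by construction, so the simplex constraint never needs an explicit projection beyond normalization.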

3. Variational Principle and Fixed Point Structure

Mirror descent is variationally equivalent to minimizing a composite objective of the form

$$\Phi(x) = D_{KL}(x \parallel c) + H(1-x)$$

where $D_{KL}$ is the Kullback-Leibler divergence to a reference vector $c$ ("influence"), and $H(1-x)$ is the "extropy" (entropy of the complement). The fixed point $x^*$ of mirror descent (e.g., the DeGroot–Friedkin map) solves

$$x^* = \arg\min_{x \in \Delta^{n-1}} \Phi(x)$$

Strict convexity ensures existence and uniqueness of $x^*$, and standard Lyapunov arguments establish convergence (Halder, 2018).

4. Dual Geometry and Natural Gradient Descent

Mirror descent can be viewed as gradient descent in the dual Riemannian geometry, with the metric tensor given by the Hessian $\nabla^2 h$. The Legendre transform $h^*$ defines the dual geometry, and the update in dual coordinates corresponds to natural gradient descent:

$$\nabla h(\theta_{t+1}) = \nabla h(\theta_t) - \eta_t \nabla f(\theta_t)$$

which, via the chain rule, becomes steepest descent on the dual manifold (Raskutti et al., 2013). In exponential families, mirror descent with negative entropy achieves asymptotic statistical efficiency, attaining the Cramér–Rao lower bound for parameter estimation.
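A concrete instance of this duality (a sketch under the assumption of the negative-entropy mirror on the positive orthant, without the simplex constraint): here $\nabla h(x) = \log x + 1$, the Legendre conjugate is $h^*(\theta) = \sum_i e^{\theta_i - 1}$, and $\nabla h^*$ inverts $\nabla h$, so the dual-coordinate update is an additive gradient step followed by the pull-back $\nabla h^*$:

```python
import numpy as np

# Negative entropy h(x) = sum_i x_i log x_i on the positive orthant.
grad_h = lambda x: np.log(x) + 1.0          # gradient of the mirror map
grad_h_star = lambda th: np.exp(th - 1.0)   # gradient of the Legendre conjugate h*

x = np.array([0.2, 1.0, 3.0])
# grad h* inverts grad h: mapping to dual coordinates and back is the identity.
assert np.allclose(grad_h_star(grad_h(x)), x)

# One dual-coordinate step for the linear objective f(x) = <c, x>.
c = np.array([1.0, -1.0, 0.5])
eta = 0.1
theta = grad_h(x) - eta * c                  # additive step in the dual space
x_next = grad_h_star(theta)                  # primal update: x * exp(-eta * c)
assert np.allclose(x_next, x * np.exp(-eta * c))
```

The last assertion makes the geometry visible: an additive step in dual coordinates is a multiplicative step in primal coordinates.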

5. Convergence Analysis and Lyapunov Perspective

Mirror descent admits rigorous convergence guarantees:

  • Sublinear $O(1/k)$ convergence for convex objectives with constant step size
  • Linear (geometric) rate $O(q^k)$ for strongly convex objectives, where $q$ depends on the strong convexity of $h$ and $f$
  • For mirror descent steps, the Bregman divergence $D_h(x^*, x^k)$ serves as a Lyapunov function
  • Integral Quadratic Constraint (IQC) analyses show that the Bregman Lyapunov function is a special case of Popov-criterion storage functions, enabling tight rates via matrix inequalities (Li et al., 2022, Li et al., 2023).
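The Lyapunov property can be checked numerically (an illustrative sketch, not from the cited analyses; the quadratic objective, its minimizer $x^* = p$ in the simplex interior, and the step size are assumptions): along an entropic mirror descent trajectory, $D_h(x^*, x^k)$ decreases monotonically.

```python
import numpy as np

# Entropic mirror descent on f(x) = ||x - p||^2 / 2 over the simplex; here x* = p.
p = np.array([0.5, 0.3, 0.2])
eta = 0.5

def kl(a, b):
    # Bregman divergence of negative entropy restricted to the simplex (KL divergence).
    return float(np.sum(a * np.log(a / b)))

x = np.full(3, 1.0 / 3.0)
lyapunov = [kl(p, x)]                 # track D_h(x*, x^k) along the trajectory
for _ in range(50):
    y = x * np.exp(-eta * (x - p))    # dual step
    x = y / y.sum()                   # Bregman projection (normalization)
    lyapunov.append(kl(p, x))

# The Bregman divergence to the minimizer never increases.
assert all(a >= b for a, b in zip(lyapunov, lyapunov[1:]))
```

The monotone decrease relies on the step size being compatible with the relative smoothness of $f$ with respect to $h$; an oversized step can break it.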

6. Practical Applications and Specialized Algorithms

Mirror descent with Bregman divergence is foundational in:

  • Composite, distributed, and online optimization, where geometry-aware updates outperform Euclidean approaches (Yuan et al., 2020, Chen et al., 2021)
  • Policy optimization in reinforcement learning, where PMD-style updates guarantee finite-step optimality and adapt to geometry-inducing divergences (Lin et al., 2022)
  • Stochastic control, both with vector-valued and measure-valued actions. Relative smoothness and strong convexity with respect to DhD_h provide linear or exponential rates, depending on regularization (Sethi et al., 3 Jun 2025, Kerimkulov et al., 2024)
  • Implicit regularization in separable data: choice of the mirror map directly affects margin bounds and learning behavior (Li et al., 2021)
  • Optimization over curved manifolds and norm-constrained sets: dual-norm mirror descent and generalized logarithmic mirrors extend the method to non-Euclidean settings, often yielding closed-form projection-free updates (Nock et al., 2016, Cichocki, 8 Jun 2025)
  • Statistical learning in exponential families, phase retrieval, optimal transport (Sinkhorn algorithm as a mirror descent with KL divergence) (Godeme et al., 2022, 2002.03758)
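The optimal-transport connection in the last bullet can be illustrated with a minimal Sinkhorn iteration (a standard textbook sketch, not taken from the cited papers; the cost matrix, marginals, and regularization level are illustrative): each alternating scaling is a KL/Bregman projection onto one marginal-constraint set.

```python
import numpy as np

# Entropic optimal transport: minimize <C, P> + eps * KL-regularization
# subject to row marginal r and column marginal c.
rng = np.random.default_rng(0)
C = rng.random((4, 5))                 # illustrative cost matrix
r = np.full(4, 1 / 4)                  # row marginal
c = np.full(5, 1 / 5)                  # column marginal
eps = 0.5                              # entropic regularization strength

K = np.exp(-C / eps)                   # Gibbs kernel
u = np.ones(4)
v = np.ones(5)
for _ in range(500):
    u = r / (K @ v)                    # KL projection onto the row-marginal set
    v = c / (K.T @ u)                  # KL projection onto the column-marginal set

P = u[:, None] * K * v[None, :]        # recovered transport plan
assert np.allclose(P.sum(axis=1), r) and np.allclose(P.sum(axis=0), c)
```

Viewed this way, Sinkhorn is exactly alternating Bregman projections under the KL divergence, which is the mirror-descent reading referenced above.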

7. Algorithmic Templates and Implementation Considerations

The prototypical mirror descent algorithm is:

for t in range(1, T + 1):
    g_t = grad_f(x_t)                    # g_t = ∇f(x_t)
    dual = grad_h(x_t) - eta_t * g_t     # dual (mirror) step: ∇h(x_t) − η_t g_t
    x_t = grad_h_inv(dual)               # pull back via (∇h)^{-1}, plus a Bregman projection if X is constrained

  • The efficiency of mirror descent depends on the choice of $h$ and the tractability of inverting $\nabla h$.
  • For simplex domains, exponentiated-gradient and its generalizations (Tempesta logarithms, Tsallis/Kaniadakis mirrors) provide closed-form updates and enable domain adaptation via hyperparameters (Cichocki, 8 Jun 2025).
  • For distributed and non-smooth optimization, Bregman damping and ergodic gap analysis yield $O(1/t)$ rates (for saddle point/constrained problems) (Chen et al., 2021).

8. Theoretical and Empirical Insights

Mirror descent unifies proximal, primal-dual, and natural gradient frameworks. The geometry is entirely governed by the mirror map and its Bregman divergence, providing both interpretability and flexibility. Analysis via IQC and Lyapunov methods confirms the tightness of classical rates and allows systematic extension to advanced settings—stochastic, distributed, measure-valued, nonconvex—maintaining robust guarantees (Li et al., 2023, Fatkhullin et al., 2024). Proper tuning of the underlying mirror geometry yields optimal statistical efficiency and domain-adaptive regularization.

Summary Table: Core Components

| Component | Definition / Role | Classical Case |
|---|---|---|
| Mirror map $h(\cdot)$ | Strictly convex, differentiable potential encoding the geometry of $X$ | $h(x) = \frac{1}{2}\Vert x\Vert_2^2$ |
| Bregman divergence | $D_h(p, q) = h(p) - h(q) - \langle \nabla h(q), p - q \rangle$ | Squared Euclidean distance |
| Dual step | $\nabla h(y^{t+1}) = \nabla h(x^t) - \eta_t \nabla f(x^t)$ | Additive (Euclidean) update |
| Primal projection | $x^{t+1} = \arg\min_{x \in X} D_h(x, y^{t+1})$ | Standard Euclidean projection |
| Typical geometry | Simplex (entropy), orthant, manifold, dual-norm, measure space | $\mathbb{R}^n$ |
| Convergence rate | $O(1/k)$ (convex); $O(q^k)$ (strongly convex); exponential (strong regularizer) | Same under Euclidean geometry |

The generality, geometry-awareness, and provable efficiency of mirror descent with Bregman divergence position it as a cornerstone method in modern convex, stochastic, distributed, and nonconvex optimization.
