Bregman Proximal Gradient Descent

Updated 11 May 2026

Bregman Proximal Gradient Descent is a first-order optimization method that replaces the Euclidean quadratic regularizer with a general Bregman divergence to solve composite minimization problems.
It employs a Bregman-proximal mapping framework, guaranteeing sublinear or linear convergence under relative smoothness and specific Bregman growth conditions.
Variants including stochastic, inertial, and block-coordinate extensions extend its applicability to large-scale, structured, and nonconvex optimization challenges in machine learning and signal processing.

Bregman Proximal Gradient Descent (BPGD) is a first-order optimization algorithm for solving composite minimization problems that generalizes classical proximal gradient methods by replacing the Euclidean quadratic regularization with a general Bregman divergence. This extension enables efficient optimization for both convex and nonconvex objective functions, particularly when the differentiable part of the objective lacks global Lipschitz continuity. BPGD provides rigorous convergence guarantees under the "relative smoothness" condition, supports a wide class of geometries through the choice of Legendre kernel functions, and unifies multiple schemes—including mirror descent, proximal gradient, and NoLips methods—under a single framework (Zhang et al., 2017, Zhou et al., 2015).

1. Mathematical Foundations: Bregman Distance, Relative Smoothness, and Problem Setting

Let $w : \mathbb{R}^n \rightarrow (-\infty, +\infty]$ denote a Legendre function, that is, a proper, closed, strictly convex function that is differentiable on the open set $\operatorname{int}\,\operatorname{dom}w \ne \emptyset$. The associated Bregman distance is

$D_w(x, y) = w(x) - w(y) - \langle \nabla w(y), x - y \rangle,$

which satisfies $D_w(x, y) \ge 0$ with equality iff $x = y$, but is in general asymmetric.
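
The definition above can be evaluated directly. The following is a minimal sketch, not taken from the cited papers: the helper name `bregman` and the two example points are illustrative assumptions. It computes $D_w$ for the quadratic kernel and for the Boltzmann-Shannon entropy, and shows that the entropy-induced distance changes when its arguments are swapped.

```python
import math

def bregman(w, grad_w, x, y):
    """D_w(x, y) = w(x) - w(y) - <grad w(y), x - y> for vectors given as lists."""
    inner = sum(g * (xi - yi) for g, xi, yi in zip(grad_w(y), x, y))
    return w(x) - w(y) - inner

# Quadratic kernel w(x) = 0.5 * ||x||^2: D_w is half the squared Euclidean distance.
quad = (lambda x: 0.5 * sum(v * v for v in x), lambda x: list(x))

# Entropy kernel w(x) = sum x_i log x_i on the positive orthant:
# D_w is the (generalized) Kullback-Leibler divergence.
entropy = (lambda x: sum(v * math.log(v) for v in x),
           lambda x: [1.0 + math.log(v) for v in x])

x, y = [0.2, 0.8], [0.5, 0.5]   # hypothetical points on the probability simplex
d_quad = bregman(*quad, x, y)
d_xy = bregman(*entropy, x, y)
d_yx = bregman(*entropy, y, x)
print(d_quad, d_xy, d_yx)
```

For the quadratic kernel the two orderings coincide; for the entropy kernel they do not, which is the asymmetry noted above.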

BPGD aims to solve composite problems of the form

$\min_{x \in Q} \Phi(x) := f(x) + g(x)$

with

$Q = \operatorname{dom}w$, a closed convex set with nonempty interior,
$f$ convex and continuously differentiable on $\operatorname{int}\,\operatorname{dom}w$,
$g$ proper, closed, convex, and possibly nonsmooth, with $\operatorname{dom}g \cap \operatorname{int}\,\operatorname{dom}w \ne \emptyset$.

Relative smoothness replaces the usual Lipschitz gradient requirement: $f$ is $L$-smooth relative to $w$ if

$f(x) \le f(y) + \langle \nabla f(y), x - y \rangle + L\, D_w(x, y) \quad \text{for all } x \in \operatorname{dom}w,\ y \in \operatorname{int}\,\operatorname{dom}w$

(Zhang et al., 2017, Zhou et al., 2015, Al-Shabili et al., 2022).
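
A small numerical check can make the condition concrete. The sketch below uses a one-dimensional example chosen for illustration, not drawn from the cited papers: $f(x) = x^4/4$ has no globally Lipschitz gradient, yet it is $1$-smooth relative to the assumed kernel $w(x) = x^2/2 + x^4/4$, because $w - f$ is convex. The same test with the Euclidean kernel and $L = 1$ fails.

```python
def f(x):
    return 0.25 * x ** 4

def df(x):
    return x ** 3

def w(x):            # assumed polynomial kernel, chosen so that w - f is convex
    return 0.5 * x ** 2 + 0.25 * x ** 4

def dw(x):
    return x + x ** 3

def D(kernel, dkernel, x, y):
    """One-dimensional Bregman distance induced by `kernel`."""
    return kernel(x) - kernel(y) - dkernel(y) * (x - y)

def descent_gap(x, y, L, kernel, dkernel):
    """Upper model minus f(x); nonnegative when the relative-smoothness inequality holds."""
    return f(y) + df(y) * (x - y) + L * D(kernel, dkernel, x, y) - f(x)

grid = [i / 2.0 for i in range(-20, 21)]
holds_quartic = all(descent_gap(x, y, 1.0, w, dw) >= -1e-9
                    for x in grid for y in grid)
holds_euclid = all(descent_gap(x, y, 1.0, lambda t: 0.5 * t * t, lambda t: t) >= -1e-9
                   for x in grid for y in grid)
print(holds_quartic, holds_euclid)
```

With the polynomial kernel the gap equals $\tfrac{1}{2}(x - y)^2$, so the inequality holds everywhere on the grid; with the quadratic kernel it is violated as soon as $|x|$ is moderately large.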

2. Algorithmic Framework and Iteration

The canonical BPGD iteration is

$x_{k+1} = \arg\min_{x \in Q} \left\{ g(x) + \langle \nabla f(x_k), x - x_k \rangle + \frac{1}{\alpha_k} D_w(x, x_k) \right\},$

where the step-size $\alpha_k$ must satisfy $0 < \alpha_k \le 1/L$ for sublinear or linear convergence, depending on additional assumptions. In operator-theoretic terms, this step is a Bregman-proximal mapping: $x_{k+1} = (\nabla w + \alpha_k \partial g)^{-1}\big(\nabla w(x_k) - \alpha_k \nabla f(x_k)\big).$ Special cases include:

Euclidean proximal gradient method: $w(x) = \tfrac{1}{2}\|x\|_2^2$, recovering standard forward-backward splitting.
NoLips/mirror descent: $g \equiv 0$ or $g$ as indicator of a convex set; the iteration reduces to Bregman mirror descent (Zhang et al., 2017, Zhou et al., 2015, Al-Shabili et al., 2022, Benning et al., 2017).
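
As an illustration of the mirror-descent special case, the sketch below runs the iteration with the entropy kernel and $g$ equal to the indicator of the probability simplex, where the update has the closed multiplicative form $x_{k+1,i} \propto x_{k,i} \exp(-\alpha \nabla_i f(x_k))$. The objective $f(x) = \tfrac{1}{2}\|x - t\|^2$, the target `target`, and the function name `bpgd_simplex` are assumptions made for the example. The step $\alpha = 1$ relies on $f$ being $1$-smooth relative to the entropy on the simplex, which follows from Pinsker's inequality.

```python
import math

def bpgd_simplex(grad_f, x0, step, iters):
    """Entropy-kernel BPGD with g = indicator of the probability simplex."""
    x = list(x0)
    for _ in range(iters):
        g = grad_f(x)
        # Minimizer of <g, x> + (1/step) * KL(x || x_k) over the simplex:
        # multiply by exp(-step * g) and renormalize.
        z = [xi * math.exp(-step * gi) for xi, gi in zip(x, g)]
        total = sum(z)
        x = [zi / total for zi in z]
    return x

target = [0.7, 0.2, 0.1]   # hypothetical minimizer, chosen inside the simplex
grad_f = lambda x: [xi - ti for xi, ti in zip(x, target)]
x = bpgd_simplex(grad_f, [1 / 3, 1 / 3, 1 / 3], step=1.0, iters=200)
print(x)
```

No projection routine is needed: the entropy geometry keeps every iterate strictly positive and the normalization enforces the sum constraint.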

Block-wise and two-reference variants, as in B2B methods (Gao et al., 2019), decouple the Bregman geometry to obtain closed-form projections in high-dimensional or structured problems.

3. Convergence Theory: Sublinear, Linear, and KL Rates

Sublinear convergence

Under relative smoothness (no strong convexity, no growth condition), for step-size $\alpha_k = 1/L$ and any $x \in \operatorname{dom}w$,

$\Phi(x_k) - \Phi(x) \le \frac{L\, D_w(x, x_0)}{k}.$

This generalizes the $O(1/k)$ rate of classical proximal gradient descent to the Bregman setting (Zhang et al., 2017).
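
The $O(1/k)$ behavior can be observed on a small Poisson-type inverse problem with the Burg kernel $w(x) = -\sum_j \log x_j$ and $g \equiv 0$. The sketch below is an illustration under stated assumptions: the matrix `A` and the point `x_true` are made-up data, the counts are noiseless so that the optimal value is $0$, and the constant $L = \|b\|_1$ is the relative-smoothness bound commonly used for this objective with the Burg kernel, taken here as an assumption. The script compares each objective value with the bound $L\,D_w(x^\star, x_0)/k$.

```python
import math

A = [[1.0, 2.0], [3.0, 1.0], [1.0, 1.0]]   # hypothetical nonnegative measurement matrix
x_true = [1.0, 2.0]
b = [sum(a * v for a, v in zip(row, x_true)) for row in A]   # noiseless data b = A x_true
L = sum(b)   # assumed relative-smoothness constant with respect to the Burg kernel

def f(x):
    """Kullback-Leibler data term sum_i b_i log(b_i / (Ax)_i) + (Ax)_i - b_i."""
    Ax = [sum(a * v for a, v in zip(row, x)) for row in A]
    return sum(bi * math.log(bi / axi) + axi - bi for bi, axi in zip(b, Ax))

def grad_f(x):
    Ax = [sum(a * v for a, v in zip(row, x)) for row in A]
    return [sum(row[j] * (1.0 - bi / axi) for row, bi, axi in zip(A, b, Ax))
            for j in range(len(x))]

def burg_distance(x, y):
    """D_w(x, y) for w(x) = -sum_j log x_j."""
    return sum(xi / yi - math.log(xi / yi) - 1.0 for xi, yi in zip(x, y))

def bpgd_burg(x0, step, iters):
    x, history = list(x0), []
    for _ in range(iters):
        g = grad_f(x)
        # Solves grad w(x_new) = grad w(x) - step * g, i.e. 1/x_new = 1/x + step * g.
        x = [xi / (1.0 + step * xi * gi) for xi, gi in zip(x, g)]
        history.append(f(x))
    return x, history

x0 = [5.0, 0.5]
x_final, history = bpgd_burg(x0, 1.0 / L, 200)
bound0 = L * burg_distance(x_true, x0)
print(history[0], history[-1], bound0 / len(history))
```

The update is closed-form and keeps the iterates in the positive orthant, even though the gradient of `f` blows up near the boundary.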

Linear convergence (Bregman-distance growth condition)

If, in addition, the following Bregman growth condition holds: $\Phi(x) - \Phi^* \ge \mu \inf_{z \in X^*} D_w(z, x)$ with $\mu > 0$, where $\Phi^*$ denotes the optimal value and $X^*$ the optimal set, then (for $0 < \alpha \le 1/L$) the Lyapunov function

$\Gamma_\alpha(x) := \Phi(x) - \Phi^* + \frac{1}{\alpha} \inf_{z \in X^*} D_w(z, x)$

contracts as

$\Gamma_\alpha(x_{k+1}) \le \frac{1}{1 + \alpha\mu}\, \Gamma_\alpha(x_k),$

implying $Q$-linear convergence of $\Gamma_\alpha(x_k)$ and hence linear convergence in both function value and Bregman distance to the optimal set (Zhang et al., 2017).
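
A contraction of this kind can be checked numerically in the simplest setting. The sketch below makes several assumptions for illustration: the Euclidean kernel, $g \equiv 0$, a diagonal quadratic $f(x) = \tfrac{1}{2}\sum_i d_i x_i^2$ (so the growth constant is $\mu = \min_i d_i$, the relative-smoothness constant is $L = \max_i d_i$, and the unique minimizer is the origin), and a Lyapunov function of the form optimality gap plus $1/\alpha$ times the Bregman distance to the minimizer. Under these assumptions each step should shrink the Lyapunov value by a factor of at most $1/(1 + \alpha\mu)$.

```python
def bpgd_quadratic(d, x0, step, iters):
    """Gradient steps on f(x) = 0.5 * sum(d_i * x_i^2) with the Euclidean kernel and g = 0."""
    xs = [list(x0)]
    for _ in range(iters):
        x = xs[-1]
        xs.append([xi - step * di * xi for xi, di in zip(x, d)])
    return xs

d = [1.0, 4.0, 10.0]          # hypothetical curvature values
L, mu = max(d), min(d)
alpha = 1.0 / L

def lyapunov(x):
    gap = 0.5 * sum(di * xi * xi for di, xi in zip(d, x))   # Phi(x) - Phi*, with Phi* = 0
    dist = 0.5 * sum(xi * xi for xi in x)                   # D_w(x*, x) with x* = 0
    return gap + dist / alpha

xs = bpgd_quadratic(d, [3.0, -2.0, 1.0], alpha, 30)
rho = 1.0 / (1.0 + alpha * mu)
ratios = [lyapunov(xs[k + 1]) / lyapunov(xs[k]) for k in range(len(xs) - 1)]
print(max(ratios), rho)
```

In this example the observed ratios stay below the predicted factor, which is a worst-case bound rather than the exact rate.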

Extension to nonconvex objectives and KL inequality

For general nonconvex settings, convergence to stationary points and finite-length property of the sequence are obtained under the Kurdyka–Łojasiewicz (KL) property for appropriate Lyapunov functions. Rates determined by the KL exponent yield sublinear or local linear convergence (Zhang et al., 2019, Mukkamala et al., 2020, Ochs et al., 2017).

4. Variants and Extensions: Stochastic, Inertial, Block-wise, and Multilevel

Stochastic BPGD: Replaces the gradient with a stochastic estimator and leverages variance reduction; under kernel-conditioning, achieves $O(\sqrt{n}\,\epsilon^{-1})$ sample complexity matching lower bounds for nonconvex finite-sum objectives (Zhang, 2024, Ding et al., 2023).
Inertial/Extrapolated BPGD: Adds a Nesterov-type extrapolation step $y_k = x_k + \beta_k (x_k - x_{k-1})$ with adaptive $\beta_k$, provably accelerating convergence over non-inertial BPGD (Zhang et al., 2019).
Block-Coordinate and Two-Reference BPGD: Alternates updates in blocks using Bregman divergences adapted per block, shown to accelerate and admit closed-form updates in matrix factorization and NMF (Gao et al., 2019, Mukkamala et al., 2019).
Multilevel BPGD: For large-scale problems with hierarchical structure, the ML-BPGD framework recursively constructs lower-dimensional surrogates and achieves global linear rates under Bregman PL-type conditions (Elshiaty et al., 4 Jun 2025).
Delay-tolerant Distributed BPGD: Supports totally asynchronous parallelization and is robust to unbounded communication delays (Chraibi et al., 2024).
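
To make the stochastic variant concrete, here is a minimal sketch of plain stochastic BPGD with the entropy kernel on the probability simplex. It does not include variance reduction and is not the algorithm of the cited papers. The synthetic least-squares data, the seeds, and the names `stochastic_bpgd_simplex` and `x_star` are assumptions. The data are generated without noise, so every sampled loss is minimized at `x_star`; with features in $[0, 1]$ and step at most $1$, the divergence $\mathrm{KL}(x^\star \| x_k)$ then cannot increase along the run.

```python
import math
import random

def kl(p, q):
    """Kullback-Leibler divergence between two points of the simplex."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def stochastic_bpgd_simplex(samples, x0, step, iters, seed=0):
    """One sampled least-squares term per iteration, entropy kernel, simplex constraint."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(iters):
        a, y = samples[rng.randrange(len(samples))]
        residual = sum(ai * xi for ai, xi in zip(a, x)) - y
        # Stochastic gradient of 0.5 * (<a, x> - y)^2 is residual * a.
        z = [xi * math.exp(-step * residual * ai) for xi, ai in zip(x, a)]
        total = sum(z)
        x = [zi / total for zi in z]
    return x

data_rng = random.Random(1)
x_star = [0.6, 0.3, 0.1]                     # hypothetical ground truth
samples = []
for _ in range(20):
    a = [data_rng.random() for _ in x_star]  # features in [0, 1)
    samples.append((a, sum(ai * xi for ai, xi in zip(a, x_star))))

x0 = [1 / 3, 1 / 3, 1 / 3]
x_hat = stochastic_bpgd_simplex(samples, x0, step=0.5, iters=2000)
print(x_hat, kl(x_star, x0), kl(x_star, x_hat))
```

Replacing the sampled gradient with a variance-reduced estimator is the step that the cited analyses address.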

5. Applications and Empirical Performance

BPGD and its stochastic/inertial variants have been reported to converge faster than Euclidean or generic baselines in the following problem classes:

Problem Class	Bregman Kernel	Key Outcomes
Poisson linear inverse	Burg entropy	Robust to non-Lipschitz $\nabla f$; closed-form updates
Phase retrieval, robust stats	Quartic/polynomial	Outperforms generic methods in iteration/time
Deep linear networks/NMF	Structured kernel	Avoids variable bias; closed-form updates; faster convergence
Variational inference	log-partition	Mirror steps = moment projection; monotonic decrease
Image reconstruction, games	KL, Tsallis	Superior scalability; matches classical mirror-descent

Plug-and-play priors and deep-unfolding frameworks exploiting BPGD (PnP-BPGM) outperform Euclidean analogues in Poisson image deblurring and other tasks (Al-Shabili et al., 2022). BPGD admits step sizes and kernels tailored to domain constraints and fits well in structured or non-Euclidean spaces (Briceño-Arias et al., 12 Jun 2025).

6. Practical Considerations: Choice of Bregman Kernel, Step-size, and Regularization

Bregman kernel selection

Quadratic: $w(x) = \tfrac{1}{2}\|x\|_2^2$, Euclidean geometry.
KL/entropy: $w(x) = \sum_i x_i \log x_i$, for simplex or probability constraints.
Burg: $w(x) = -\sum_i \log x_i$, for positive-orthant constraints.
Problem-specific polynomials: To handle non-Lipschitz objectives, e.g., deep nets.
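
In code, a separable kernel is conveniently described by its gradient (the mirror map) and the inverse of that gradient. The sketch below is an illustrative assumption about how such a table could be organized, not an API from any library. It applies the unconstrained update $x^+ = (\nabla w)^{-1}(\nabla w(x) - \alpha \nabla f(x))$, valid for $g \equiv 0$, coordinate-wise for the three standard kernels above. For the Burg kernel the update is only defined while the argument of the inverse map stays negative.

```python
import math

# Each entry: (gradient of the kernel, inverse of that gradient), applied per coordinate.
KERNELS = {
    "quadratic": (lambda x: x, lambda u: u),                                 # w = x^2 / 2
    "entropy": (lambda x: 1.0 + math.log(x), lambda u: math.exp(u - 1.0)),   # w = x log x, x > 0
    "burg": (lambda x: -1.0 / x, lambda u: -1.0 / u),                        # w = -log x, x > 0, u < 0
}

def mirror_step(kernel, x, grad, step):
    """x_new = (grad w)^{-1}(grad w(x) - step * grad), coordinate-wise, for g = 0."""
    dw, dw_inv = KERNELS[kernel]
    return [dw_inv(dw(xi) - step * gi) for xi, gi in zip(x, grad)]

x, grad = [0.5, 2.0], [1.0, -0.25]   # hypothetical point and gradient
for name in KERNELS:
    print(name, mirror_step(name, x, grad, step=0.1))
```

The quadratic kernel gives an additive step, the entropy kernel a multiplicative one, and the Burg kernel a harmonic-type update; the latter two keep positive iterates positive.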

Step-size control

Fixed step: For known $L$, take $\alpha_k = 1/L$.
Line-search / backtracking: When $L$ is unknown, backtrack to enforce local descent in Bregman divergence (Zhou et al., 2015, Benning et al., 2017).
Adaptive rules: In block or stochastic variants, adaptive $\alpha_k$ maintains contraction (Zhang, 2024).
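
The backtracking rule can be sketched as follows, assuming the entropy kernel on the positive orthant and $g \equiv 0$. The step is halved until the Bregman upper model dominates $f$ at the trial point, that is, until $f(x^+) \le f(x) + \langle \nabla f(x), x^+ - x \rangle + \tfrac{1}{\alpha} D_w(x^+, x)$. The objective, the initial step, and the shrink factor are illustrative assumptions. This simple version never increases the step again.

```python
import math

def entropy_distance(x, y):
    """D_w(x, y) for w(x) = sum_i x_i log x_i on the positive orthant."""
    return sum(xi * math.log(xi / yi) - xi + yi for xi, yi in zip(x, y))

def backtracking_bpgd(f, grad_f, x0, step0=1.0, shrink=0.5, iters=50):
    x, step, values = list(x0), step0, [f(x0)]
    for _ in range(iters):
        g = grad_f(x)
        while True:
            # Closed-form entropy-kernel step for the current trial step size.
            trial = [xi * math.exp(-step * gi) for xi, gi in zip(x, g)]
            linear = sum(gi * (ti - xi) for gi, ti, xi in zip(g, trial, x))
            # Accept once the Bregman upper model dominates f at the trial point.
            if f(trial) <= f(x) + linear + entropy_distance(trial, x) / step + 1e-12:
                break
            step *= shrink
        x = trial
        values.append(f(x))
    return x, values

c = [3.0, 0.5]   # hypothetical positive target
f = lambda x: 0.5 * sum((xi - ci) ** 2 for xi, ci in zip(x, c))
grad_f = lambda x: [xi - ci for xi, ci in zip(x, c)]
x_bt, values = backtracking_bpgd(f, grad_f, [1.0, 1.0], step0=4.0)
print(x_bt, values[0], values[-1])
```

Because the trial point minimizes the accepted model, each accepted step cannot increase the objective, so no knowledge of a global constant $L$ is required.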

Regularization structure

BPGD supports nonsmooth convex $g$ (e.g., $\ell_1$ norms, indicator functions), which are handled through proximal operators with respect to the Bregman kernel.
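
For $g(x) = \lambda \|x\|_1$ the Bregman-proximal step has a closed form under the two most common kernels, as sketched below. The function names are hypothetical. With the quadratic kernel the step is a gradient step followed by soft-thresholding; with the entropy kernel on the positive orthant, $\|x\|_1 = \sum_i x_i$, so the penalty simply shifts the gradient inside the multiplicative update.

```python
import math

def soft_threshold(v, tau):
    """Euclidean prox of tau * ||.||_1, applied entrywise."""
    return [math.copysign(max(abs(vi) - tau, 0.0), vi) for vi in v]

def bpg_step_quadratic_l1(x, grad, step, lam):
    # Quadratic kernel: forward gradient step, then soft-thresholding.
    return soft_threshold([xi - step * gi for xi, gi in zip(x, grad)], step * lam)

def bpg_step_entropy_l1(x, grad, step, lam):
    # Entropy kernel on the positive orthant: minimize
    # lam * sum(x) + <grad, x> + (1/step) * D_w(x, x_k) coordinate-wise.
    return [xi * math.exp(-step * (gi + lam)) for xi, gi in zip(x, grad)]

print(bpg_step_quadratic_l1([3.0, -0.5, 1.0], [0.0, 0.0, 0.0], 1.0, 1.0))
print(bpg_step_entropy_l1([1.0, 2.0], [0.0, 0.0], 1.0, 0.5))
```

The Euclidean step can set coordinates exactly to zero, whereas the entropy step only shrinks them toward zero while keeping them positive.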

7. Connections to Existing Algorithms and Theoretical Unification

The BPGD framework subsumes:

Classical proximal gradient (when $w$ is quadratic).
Mirror descent/NoLips (when $g \equiv 0$, general $w$).
Incremental and aggregated methods (as in the PLIAG scheme), distributed and block-coordinate algorithms (Zhang et al., 2017, Chraibi et al., 2024, Gao et al., 2019).
Proximal point and model-based minimization schemes, including the "MAP" property, which extends $L$-smooth-adaptivity to composite nonsmooth models (Mukkamala et al., 2020).

The framework establishes convergence theory in regimes far beyond Euclidean settings, covering nonconvex, nonsmooth, and large-scale optimization, and aligns with modern needs in machine learning, signal processing, and variational inference.

References:

(Zhang et al., 2017) Proximal-Like Incremental Aggregated Gradient Method with Linear Convergence under Bregman Distance Growth Conditions
(Zhou et al., 2015) A Simple Convergence Analysis of Bregman Proximal Gradient Algorithm
(Al-Shabili et al., 2022) Bregman Plug-and-Play Priors
(Benning et al., 2017) Choose your path wisely: gradient descent in a Bregman distance framework
(Gao et al., 2019) Leveraging Two Reference Functions in Block Bregman Proximal Gradient Descent for Non-convex and Non-Lipschitz Problems
(Zhang et al., 2019) Bregman Proximal Gradient Algorithm with Extrapolation for a class of Nonconvex Nonsmooth Minimization Problems
(Mukkamala et al., 2020) Global Convergence of Model Function Based Bregman Proximal Minimization Algorithms
(Ochs et al., 2017) Non-smooth Non-convex Bregman Minimization: Unification and new Algorithms
(Chraibi et al., 2024) Delay-tolerant distributed Bregman proximal algorithms
(Briceño-Arias et al., 12 Jun 2025) Bregman proximal gradient method for linear optimization under entropic constraints
(Zhang, 2024) Stochastic Bregman Proximal Gradient Method Revisited: Kernel Conditioning and Painless Variance Reduction
(Ding et al., 2023) Nonconvex Stochastic Bregman Proximal Gradient Method with Application to Deep Learning
(Mukkamala et al., 2019) Beyond Alternating Updates for Matrix Factorization with Inertial Bregman Proximal Gradient Algorithms
(Elshiaty et al., 4 Jun 2025) Multilevel Bregman Proximal Gradient Descent
(Guilmeau et al., 2022) Regularized Rényi divergence minimization through Bregman proximal gradient algorithms
(Mukkamala et al., 2019) Bregman Proximal Framework for Deep Linear Neural Networks