
Bregman Proximal Gradient Descent

Updated 11 May 2026
  • Bregman Proximal Gradient Descent is a first-order optimization method that replaces the Euclidean quadratic regularizer with a general Bregman divergence to solve composite minimization problems.
  • It employs a Bregman-proximal mapping framework, guaranteeing sublinear or linear convergence under relative smoothness and specific Bregman growth conditions.
  • Variants including stochastic, inertial, and block-coordinate extensions extend its applicability to large-scale, structured, and nonconvex optimization challenges in machine learning and signal processing.

Bregman Proximal Gradient Descent (BPGD) is a first-order optimization algorithm for solving composite minimization problems that generalizes classical proximal gradient methods by replacing the Euclidean quadratic regularization with a general Bregman divergence. This extension enables efficient optimization for both convex and nonconvex objective functions, particularly when the differentiable part of the objective lacks global Lipschitz continuity. BPGD provides rigorous convergence guarantees under the "relative smoothness" condition, supports a wide class of geometries through the choice of Legendre kernel functions, and unifies multiple schemes—including mirror descent, proximal gradient, and NoLips methods—under a single framework (Zhang et al., 2017, Zhou et al., 2015).

1. Mathematical Foundations: Bregman Distance, Relative Smoothness, and Problem Setting

Let $w : \mathbb{R}^n \rightarrow (-\infty, +\infty]$ denote a Legendre function: proper, closed, strictly convex, and differentiable on an open domain $\operatorname{int}\,\operatorname{dom} w \ne \emptyset$. The associated Bregman distance is

$$D_w(x, y) = w(x) - w(y) - \langle \nabla w(y), x - y \rangle,$$

which satisfies $D_w(x, y) \ge 0$ with equality iff $x = y$, but in general is asymmetric.
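The definition can be evaluated directly for a few common kernels. The following is a minimal sketch in plain Python; the helper name `bregman`, the three kernels, and the sample points are illustrative choices, not part of the cited works. It shows that the quadratic kernel gives a symmetric distance while the entropy kernel does not.

```python
import math

def bregman(w, grad_w, x, y):
    """D_w(x, y) = w(x) - w(y) - <grad w(y), x - y> for vectors given as lists."""
    inner = sum(g * (xi - yi) for g, xi, yi in zip(grad_w(y), x, y))
    return w(x) - w(y) - inner

# Three common Legendre kernels and their gradients.
def quad(x): return 0.5 * sum(v * v for v in x)
def quad_grad(x): return list(x)

def neg_entropy(x): return sum(v * math.log(v) for v in x)
def neg_entropy_grad(x): return [math.log(v) + 1.0 for v in x]

def burg(x): return -sum(math.log(v) for v in x)
def burg_grad(x): return [-1.0 / v for v in x]

x, y = [0.2, 0.8], [0.5, 0.5]
d_quad = bregman(quad, quad_grad, x, y)                  # equals 0.5 * ||x - y||^2
d_kl_xy = bregman(neg_entropy, neg_entropy_grad, x, y)   # KL(x || y) on the simplex
d_kl_yx = bregman(neg_entropy, neg_entropy_grad, y, x)   # KL(y || x), a different value
d_burg = bregman(burg, burg_grad, x, y)                  # Itakura-Saito-type distance
```

Swapping the arguments changes the entropy-based value, which is the asymmetry noted above.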

BPGD aims to solve composite problems of the form

$$\min_{x \in Q} \Phi(x) := f(x) + g(x)$$

with

  • $Q = \operatorname{dom} w$, a closed convex set with nonempty interior,
  • $f$ convex and continuously differentiable on $\operatorname{int}\,\operatorname{dom} w$,
  • $g$ proper, closed, convex, and possibly nonsmooth, with $\operatorname{dom} g \cap \operatorname{int}\,\operatorname{dom} w \ne \emptyset$.

Relative smoothness replaces the usual Lipschitz gradient requirement: $f$ is $L$-smooth relative to $w$ if

$$f(x) \le f(y) + \langle \nabla f(y), x - y \rangle + L\, D_w(x, y) \quad \text{for all } x \in \operatorname{dom} w,\ y \in \operatorname{int}\,\operatorname{dom} w$$

(Zhang et al., 2017, Zhou et al., 2015, Al-Shabili et al., 2022).
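As a worked illustration (an example chosen here, not taken from the cited papers), consider $f(x) = \tfrac{1}{4}x^4$ on $\mathbb{R}$. Its derivative is not globally Lipschitz, yet $f$ is $1$-smooth relative to the kernel $w(x) = \tfrac{1}{4}x^4 + \tfrac{1}{2}x^2$, because $w - f$ is convex:

```latex
\begin{aligned}
D_f(x, y) &:= f(x) - f(y) - f'(y)(x - y), \\
D_w(x, y) - D_f(x, y) &= D_{w - f}(x, y)
   = \tfrac{1}{2}x^2 - \tfrac{1}{2}y^2 - y(x - y)
   = \tfrac{1}{2}(x - y)^2 \ge 0, \\
\Longrightarrow\quad f(x) &\le f(y) + f'(y)(x - y) + 1 \cdot D_w(x, y)
   \quad \text{for all } x, y \in \mathbb{R}.
\end{aligned}
```

More generally, convexity of $L w - f$ on $\operatorname{int}\,\operatorname{dom} w$ is equivalent to the relative smoothness inequality with constant $L$.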

2. Algorithmic Framework and Iteration

The canonical BPGD iteration is

$$x^{k+1} = \arg\min_{x \in Q} \left\{ g(x) + \langle \nabla f(x^k), x - x^k \rangle + \frac{1}{\alpha_k} D_w(x, x^k) \right\},$$

where the step-size $\alpha_k$ must satisfy $0 < \alpha_k \le 1/L$ for sublinear or linear convergence, depending on additional assumptions. In operator-theoretic terms, this step is a Bregman-proximal mapping: $x^{k+1} = T_{\alpha_k}(x^k)$, where $T_{\alpha}(y) := \arg\min_{x \in Q} \{ g(x) + \langle \nabla f(y), x - y \rangle + \alpha^{-1} D_w(x, y) \}$. Special cases include:

  • Euclidean proximal gradient method: $w(x) = \tfrac{1}{2}\|x\|_2^2$, recovering standard forward-backward splitting.
  • NoLips/mirror descent: $g \equiv 0$ or $g$ the indicator of a convex set; the iteration reduces to Bregman mirror descent (Zhang et al., 2017, Zhou et al., 2015, Al-Shabili et al., 2022, Benning et al., 2017).
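The iteration can be made concrete for one geometry. The following is a minimal sketch, assuming the negative-entropy kernel and $g$ equal to the indicator of the probability simplex, in which case the Bregman-proximal step reduces to a multiplicative update. The function name `bpgd_simplex`, the quadratic objective, and the vector `c` are illustrative assumptions. A step of 1 is admissible here because, by Pinsker's inequality, this quadratic is 1-smooth relative to the entropy on the simplex.

```python
import math

def bpgd_simplex(grad_f, x0, step, iters):
    """BPGD with the negative-entropy kernel and g = indicator of the simplex.

    For this kernel the Bregman-proximal step has the closed form
    x_i <- x_i * exp(-step * grad_i) / Z (a multiplicative update).
    """
    x = list(x0)
    history = [list(x)]
    for _ in range(iters):
        g = grad_f(x)
        shift = min(g)  # a constant shift cancels in the normalization; avoids overflow
        unnorm = [xi * math.exp(-step * (gi - shift)) for xi, gi in zip(x, g)]
        z = sum(unnorm)
        x = [u / z for u in unnorm]
        history.append(list(x))
    return x, history

# Illustrative objective: f(x) = 0.5 * ||x - c||^2 with c inside the simplex.
c = [0.2, 0.3, 0.5]

def f(x):
    return 0.5 * sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def grad_f(x):
    return [xi - ci for xi, ci in zip(x, c)]

x_final, history = bpgd_simplex(grad_f, [1.0 / 3] * 3, step=1.0, iters=200)
values = [f(x) for x in history]  # objective along the iterates
```

Every iterate stays strictly inside the simplex without an explicit projection, which is the practical appeal of matching the kernel to the constraint set.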

Block-wise and two-reference variants, as in B2B methods (Gao et al., 2019), decouple the Bregman geometry to obtain closed-form projections in high-dimensional or structured problems.

3. Convergence Theory: Sublinear, Linear, and KL Rates

Sublinear convergence

Under relative smoothness (no strong convexity, no growth condition), for step-size $\alpha_k = 1/L$,

$$\Phi(x^k) - \Phi(x) \le \frac{L\, D_w(x, x^0)}{k} \quad \text{for all } x \in Q,\ k \ge 1.$$

This generalizes the $O(1/k)$ rate of classical proximal gradient descent to the Bregman setting (Zhang et al., 2017).
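A small numerical check can illustrate the $O(1/k)$ behaviour. The sketch below assumes the one-dimensional objective $f(x) = \tfrac{1}{4}x^4$ with $g = 0$ and the polynomial kernel $w(x) = \tfrac{1}{4}x^4 + \tfrac{1}{2}x^2$, for which $L = 1$; these choices, the helper `solve_cubic`, and the starting point are illustrative. It compares the optimality gap with the standard bound $L\,D_w(x^*, x^0)/k$ at the minimizer $x^* = 0$.

```python
def solve_cubic(r, newton_iters=60):
    """Solve u**3 + u = r for its unique real root with Newton's method."""
    u = r  # from this start the Newton iterates approach the root monotonically
    for _ in range(newton_iters):
        u -= (u ** 3 + u - r) / (3 * u * u + 1)
    return u

def f(x):
    return 0.25 * x ** 4

def f_prime(x):
    return x ** 3

def w(x):
    return 0.25 * x ** 4 + 0.5 * x ** 2

def w_prime(x):
    return x ** 3 + x

def bregman_w(x, y):
    return w(x) - w(y) - w_prime(y) * (x - y)

L = 1.0
x0 = 2.0
x, gaps, bounds = x0, [], []
for k in range(1, 501):
    # Bregman step with g = 0: w'(x_new) = w'(x) - (1/L) * f'(x)
    x = solve_cubic(w_prime(x) - f_prime(x) / L)
    gaps.append(f(x))                           # f* = 0 at x* = 0
    bounds.append(L * bregman_w(0.0, x0) / k)   # L * D_w(x*, x0) / k
```

Because $f$ is not strongly convex at the origin, the gap decays only sublinearly in this example, staying below the bound at every iteration.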

Linear convergence (Bregman-distance growth condition)

If, in addition, the following Bregman growth condition holds: $\Phi(x) - \Phi^* \ge \mu \inf_{z \in X^*} D_w(z, x)$ for all $x \in Q$, with $\mu > 0$, where $\Phi^*$ is the optimal value and $X^*$ the optimal set, then (for $\alpha_k = 1/L$) the Lyapunov function

$$\Gamma(x) := \Phi(x) - \Phi^* + L \inf_{z \in X^*} D_w(z, x)$$

contracts as

$$\Gamma(x^{k+1}) \le \frac{L}{L + \mu}\, \Gamma(x^k),$$

implying $R$-linear convergence in both function value and Bregman distance to the optimal set (Zhang et al., 2017).
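One way a contraction of this kind can arise is sketched below for convex $f$ and step-size $1/L$, assuming the Bregman projection onto the optimal set $X^*$ is attained. The symbols $\mu$ (growth constant), $d$, $z^k$, and $\Gamma$ are notation chosen for this sketch, and the constants in the cited analysis may differ.

```latex
\begin{aligned}
&\text{Three-point inequality, for all } u \in Q: \\
&\qquad \Phi(x^{k+1}) \le \Phi(u) + L\, D_w(u, x^k) - L\, D_w(u, x^{k+1}). \\
&\text{Choose } u = z^k \in \arg\min_{z \in X^*} D_w(z, x^k)
  \text{ and write } d(x) := \min_{z \in X^*} D_w(z, x): \\
&\qquad \Phi(x^{k+1}) - \Phi^* + L\, d(x^{k+1})
  \le \Phi(x^{k+1}) - \Phi^* + L\, D_w(z^k, x^{k+1}) \le L\, d(x^k). \\
&\text{Growth condition } \Phi(x^k) - \Phi^* \ge \mu\, d(x^k): \\
&\qquad \Gamma(x^k) := \Phi(x^k) - \Phi^* + L\, d(x^k) \ge (\mu + L)\, d(x^k). \\
&\text{Combining the last two lines:} \\
&\qquad \Gamma(x^{k+1}) \le L\, d(x^k) \le \frac{L}{L + \mu}\, \Gamma(x^k).
\end{aligned}
```

Since $\Gamma$ dominates both the optimality gap and $L\,d$, a geometric decay of $\Gamma$ carries over to each of them.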

Extension to nonconvex objectives and KL inequality

For general nonconvex settings, convergence to stationary points and finite-length property of the sequence are obtained under the Kurdyka–Łojasiewicz (KL) property for appropriate Lyapunov functions. Rates determined by the KL exponent yield sublinear or local linear convergence (Zhang et al., 2019, Mukkamala et al., 2020, Ochs et al., 2017).

4. Variants and Extensions: Stochastic, Inertial, Block-wise, and Multilevel

  • Stochastic BPGD: Replaces the gradient with a stochastic estimator and leverages variance reduction; under kernel-conditioning, achieves $O(\sqrt{n}\,\varepsilon^{-1})$ sample complexity, matching lower bounds for nonconvex finite-sum objectives (Zhang, 2024, Ding et al., 2023).
  • Inertial/Extrapolated BPGD: Adds a Nesterov-type extrapolation step $y^k = x^k + \beta_k (x^k - x^{k-1})$ with adaptive $\beta_k$, provably accelerating convergence over non-inertial BPGD (Zhang et al., 2019, Zhang et al., 2019).
  • Block-Coordinate and Two-Reference BPGD: Alternates updates in blocks using Bregman divergences adapted per block, shown to accelerate and admit closed-form updates in matrix factorization and NMF (Gao et al., 2019, Mukkamala et al., 2019).
  • Multilevel BPGD: For large-scale problems with hierarchical structure, the ML-BPGD framework recursively constructs lower-dimensional surrogates and achieves global linear rates under Bregman PL-type conditions (Elshiaty et al., 2025).
  • Delay-tolerant Distributed BPGD: Supports totally asynchronous parallelization and is robust to unbounded communication delays (Chraibi et al., 2024).
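The stochastic variant can be sketched in its simplest form. The code below is a minimal illustration, assuming a plain mini-batch gradient estimator (no variance reduction), the negative-entropy kernel on the probability simplex, and a diminishing step $1/\sqrt{k}$. The function name, the synthetic least-squares data, and `x_true` are hypothetical and chosen so that every component function is minimized at the same point.

```python
import math
import random

def kl(x, y):
    """KL divergence between two points of the simplex."""
    return sum(xi * math.log(xi / yi) for xi, yi in zip(x, y))

def stochastic_bpgd_simplex(rows, targets, x0, iters, batch=2, seed=0):
    """Mini-batch BPGD for f(x) = (1/n) * sum_i 0.5 * (a_i . x - b_i)^2 on the simplex."""
    rng = random.Random(seed)
    n, d = len(rows), len(x0)
    x = list(x0)
    for k in range(1, iters + 1):
        idx = rng.sample(range(n), batch)
        # Gradient estimate averaged over the sampled components.
        g = [0.0] * d
        for i in idx:
            r = sum(a * xi for a, xi in zip(rows[i], x)) - targets[i]
            for j in range(d):
                g[j] += r * rows[i][j] / batch
        step = 1.0 / math.sqrt(k)   # diminishing step (an assumed schedule)
        shift = min(g)              # cancels in the normalization below
        unnorm = [xi * math.exp(-step * (gj - shift)) for xi, gj in zip(x, g)]
        z = sum(unnorm)
        x = [u / z for u in unnorm]
    return x

x_true = [0.6, 0.3, 0.1]
rows = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
targets = [sum(a * t for a, t in zip(row, x_true)) for row in rows]
x_start = [1.0 / 3] * 3
x_hat = stochastic_bpgd_simplex(rows, targets, x_start, iters=300)
```

Variance-reduced estimators of the kind analysed in the cited works would replace the plain mini-batch gradient `g`; the mirror step itself is unchanged.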

5. Applications and Empirical Performance

BPGD and its stochastic/inertial variants have demonstrated superior convergence in:

| Problem Class | Bregman Kernel | Key Outcomes |
| --- | --- | --- |
| Poisson linear inverse | Burg entropy | Robust to non-Lipschitz $\nabla f$; closed-form updates |
| Phase retrieval, robust stats | Quartic/polynomial | Outperforms generic methods in iteration/time |
| Deep linear networks/NMF | Structured kernel | Avoids variable bias; closed-form updates; faster convergence |
| Variational inference | Log-partition | Mirror steps = moment projection; monotonic decrease |
| Image reconstruction, games | KL, Tsallis | Superior scalability; matches classical mirror-descent |

Plug-and-play priors and deep-unfolding frameworks exploiting BPGD (PnP-BPGM) outperform Euclidean analogues in Poisson image deblurring and other tasks (Al-Shabili et al., 2022). BPGD admits step sizes and kernels tailored to domain constraints and fits well in structured or non-Euclidean spaces (Briceño-Arias et al., 2025).
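The Poisson row of the table can be illustrated with a small sketch. It assumes the objective $f(x) = \sum_i [(Ax)_i - b_i \log (Ax)_i]$ with $g = 0$ and the Burg kernel $w(x) = -\sum_j \log x_j$, for which $f$ is $L$-smooth relative to $w$ with $L = \|b\|_1$. The matrix `A`, the data `b`, and the function names are made up for the example.

```python
import math

def poisson_objective(A, b, x):
    """f(x) = sum_i [(Ax)_i - b_i * log((Ax)_i)], a Poisson log-likelihood up to constants."""
    total = 0.0
    for row, bi in zip(A, b):
        ax = sum(a * xj for a, xj in zip(row, x))
        total += ax - bi * math.log(ax)
    return total

def poisson_grad(A, b, x):
    g = [0.0] * len(x)
    for row, bi in zip(A, b):
        ax = sum(a * xj for a, xj in zip(row, x))
        coeff = 1.0 - bi / ax
        for j, a in enumerate(row):
            g[j] += a * coeff
    return g

def burg_bpgd(A, b, x0, iters):
    """BPGD with the Burg kernel and g = 0.

    Solving grad w(x_new) = grad w(x) - step * grad f(x) coordinate-wise gives
    x_new_j = x_j / (1 + step * x_j * grad_j).
    """
    step = 1.0 / sum(b)  # 1 / L with L = ||b||_1
    x = list(x0)
    values = [poisson_objective(A, b, x)]
    for _ in range(iters):
        g = poisson_grad(A, b, x)
        x = [xj / (1.0 + step * xj * gj) for xj, gj in zip(x, g)]
        values.append(poisson_objective(A, b, x))
    return x, values

A = [[1.0, 0.5], [0.2, 1.0], [1.0, 1.0]]
b = [2.0, 1.0, 3.0]
x_final, values = burg_bpgd(A, b, [1.0, 1.0], iters=300)
```

The update is closed-form and keeps every coordinate positive, even though the gradient of $f$ blows up near the boundary of the orthant.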

6. Practical Considerations: Choice of Bregman Kernel, Step-size, and Regularization

Bregman kernel selection

  • Quadratic: $w(x) = \tfrac{1}{2}\|x\|_2^2$, Euclidean geometry.
  • KL/entropy: $w(x) = \sum_i x_i \log x_i$, for simplex or probability constraints.
  • Burg: $w(x) = -\sum_i \log x_i$, for positive-orthant constraints.
  • Problem-specific polynomials: To handle non-Lipschitz objectives, e.g., deep nets.

Step-size control

  • Fixed step: For known $L$, take a constant $\alpha_k = \alpha \le 1/L$.
  • Line-search / backtracking: When $L$ is unknown, backtrack to enforce local descent in Bregman divergence (Zhou et al., 2015, Benning et al., 2017).
  • Adaptive rules: In block or stochastic variants, an adaptive $\alpha_k$ maintains contraction (Zhang, 2024).
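The backtracking rule can be sketched as follows. This is a minimal illustration, assuming $g = 0$, the entropy kernel $w(x) = \sum_i (x_i \log x_i - x_i)$ on the positive orthant, and a doubling search on a local estimate `L` until the relative-smoothness inequality holds at the trial point. The least-squares objective, the relaxation heuristic, and all names are assumptions made for the example, not the procedure of the cited works.

```python
import math

def kl_div(x, y):
    """Bregman distance of w(x) = sum_i (x_i log x_i - x_i) on the positive orthant."""
    return sum(xi * math.log(xi / yi) - xi + yi for xi, yi in zip(x, y))

def backtracking_bpgd(f, grad_f, x0, iters, L0=1.0, max_backtracks=60):
    """BPGD with g = 0 where the local constant L is found by doubling."""
    x, L = list(x0), L0
    values = [f(x)]
    for _ in range(iters):
        g, fx = grad_f(x), f(x)
        for _ in range(max_backtracks):
            # Closed-form Bregman step for the entropy kernel with step 1 / L.
            trial = [xi * math.exp(-gi / L) for xi, gi in zip(x, g)]
            linear = sum(gi * (ti - xi) for gi, ti, xi in zip(g, trial, x))
            model = fx + linear + L * kl_div(trial, x)
            if f(trial) <= model + 1e-12:
                break      # the inequality holds locally: accept the step
            L *= 2.0       # otherwise shorten the step 1 / L and retry
        x = trial
        values.append(f(x))
        L = max(L0, L / 2.0)  # let the estimate relax again (a heuristic choice)
    return x, values

# Illustrative problem: f(x) = 0.5 * ||A x - b||^2 with positive solution (1.2, 1.4).
A = [[1.0, 2.0], [3.0, 1.0]]
b = [4.0, 5.0]

def f(x):
    return 0.5 * sum((sum(a * xj for a, xj in zip(row, x)) - bi) ** 2
                     for row, bi in zip(A, b))

def grad_f(x):
    res = [sum(a * xj for a, xj in zip(row, x)) - bi for row, bi in zip(A, b)]
    return [sum(A[i][j] * res[i] for i in range(len(A))) for j in range(len(x))]

x_final, values = backtracking_bpgd(f, grad_f, [1.0, 1.0], iters=200)
```

Each accepted step satisfies the descent inequality against the Bregman model, so the objective does not increase even though no global $L$ exists for this kernel on the unbounded orthant.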

Regularization structure

BPGD supports nonsmooth convex $g$ (e.g., the $\ell_1$ norm, indicator functions), which is handled through proximal operators with respect to the Bregman kernel.

7. Connections to Existing Algorithms and Theoretical Unification

The BPGD framework subsumes:

  • Classical proximal gradient (when $w$ is quadratic).
  • Mirror descent/NoLips (when $g = 0$, general $w$).
  • Incremental and aggregated methods (as in the PLIAG scheme), distributed and block-coordinate algorithms (Zhang et al., 2017, Chraibi et al., 2024, Gao et al., 2019).
  • Proximal point and model-based minimization schemes, including the "MAP" property, which extends $L$-smooth-adaptivity to composite nonsmooth models (Mukkamala et al., 2020).

The framework establishes convergence theory in regimes far beyond Euclidean settings, covering nonconvex, nonsmooth, and large-scale optimization, and aligns with modern needs in machine learning, signal processing, and variational inference.

