Linearized Bregman Iterations
- Linearized Bregman iterations are an optimization method that leverages mirror descent and Bregman divergence to produce sparse solutions in empirical risk minimization.
- The multilevel approach alternates between fixed support updates and dynamic activation of new nonzeros, greatly reducing computational costs and ensuring robust convergence.
- Empirical results on benchmarks like CIFAR-10 and TinyImageNet show significant savings in FLOPs and training time while maintaining competitive accuracy under extreme sparsity.
Linearized Bregman iterations (LinBreg) are an optimization method for producing sparse solutions to empirical risk minimization in deep learning, grounded in mirror descent and Bregman divergence concepts. When combined with dynamically updated support—alternating between periods of fixed “static” sparsity and phases where new support (non-zeros) can be activated—LinBreg enables very efficient, theoretically principled exploration of sparse parameter spaces. Recent work has proposed a multilevel extension that adapts support freezing, providing both strong computational savings and robust convergence in deep neural network training under high global sparsity constraints (Lunk et al., 3 Feb 2026).
1. Mathematical Framework
The core LinBreg approach addresses composite optimization of the form $\min_\theta \mathcal{L}(\theta) + J(\theta)$, where $\mathcal{L}$ is a differentiable non-convex loss (empirical risk) and $J$ is a proper, convex, lower semi-continuous regularizer, typically promoting sparsity, e.g., $J(\theta) = \lambda \|\theta\|_1$.
To facilitate mirror descent, LinBreg uses a $\delta$-elasticized regularizer $J_\delta(\theta) = J(\theta) + \tfrac{1}{2\delta}\|\theta\|_2^2$ and introduces dual variables $v^k \in \partial J_\delta(\theta^k)$. Each iteration comprises a dual update $v^{k+1} = v^k - \tau \nabla \mathcal{L}(\theta^k)$ followed by the mirror map $\theta^{k+1} = \nabla J_\delta^*(v^{k+1})$. For $\ell_1$-regularization, this reduces to iterative soft-thresholding, $\theta_i^{k+1} = \delta\,\operatorname{sign}(v_i^{k+1})\max(|v_i^{k+1}| - \lambda,\, 0)$, with $\theta_i^{k+1}$ set to zero when $|v_i^{k+1}| \le \lambda$.
The Bregman divergence associated with $J_\delta$ is $D_{J_\delta}^{v}(\theta, \theta') = J_\delta(\theta) - J_\delta(\theta') - \langle v, \theta - \theta' \rangle$ for $v \in \partial J_\delta(\theta')$, and serves as the distance measure in the mirror-descent analysis.
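For the $\ell_1$ case, a single LinBreg iteration can be sketched in a few lines of NumPy. The function names, the least-squares loss, and all parameter values below are illustrative choices, not from the paper:

```python
import numpy as np

def soft_threshold(v, lam):
    """Componentwise shrinkage: sign(v) * max(|v| - lam, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def linbreg_step(theta, v, grad, tau, delta, lam):
    """One linearized Bregman iteration for J(theta) = lam * ||theta||_1.
    Dual update followed by the mirror map; entries with |v_i| <= lam
    are set exactly to zero, which is what produces sparsity."""
    v_new = v - tau * grad(theta)
    theta_new = delta * soft_threshold(v_new, lam)
    return theta_new, v_new

# Toy example: L(theta) = 0.5 * ||A @ theta - b||^2 with a sparse ground truth.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
x_true = np.zeros(10)
x_true[[1, 4]] = [2.0, -3.0]
b = A @ x_true
grad = lambda th: A.T @ (A @ th - b)

theta, v = np.zeros(10), np.zeros(10)
for _ in range(500):
    theta, v = linbreg_step(theta, v, grad, tau=1e-3, delta=1.0, lam=0.5)
```

Because the iterate is produced by thresholding the dual variable, coordinates stay exactly zero until enough gradient signal has accumulated, in contrast to plain subgradient descent on the $\ell_1$ objective.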
2. Multilevel LinBreg: Alternating Static and Dynamic Sparsity
In practice, LinBreg is enhanced by freezing the sparsity pattern (“support”) periodically:
- The parameter vector $\theta$ is split by active groups (i.e., groups $g$ for which $\theta_g \neq 0$).
- During $m$ consecutive “coarse” steps, updates are restricted to the support, so no new nonzeros are activated, reducing the dimensionality and cost of both forward/backward passes and prox computations.
- A full (“fine”) LinBreg step is then performed—computing the gradient in the full space and updating both the support and the active values, potentially activating new coordinates (restoring “dynamic” sparsity).
Transitions between phases may be scheduled adaptively based on the relative magnitude of the projected gradient, e.g., performing coarse-only updates while
$\|P_S \nabla \mathcal{L}(\theta^k)\| \ge \eta\, \|\nabla \mathcal{L}(\theta^k)\|,$
where $P_S$ is the restriction to the current support $S$ and $\eta \in (0,1)$ is a tolerance.
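The adaptive switching test can be sketched as follows; `eta` and `support_mask` are assumed names for the tolerance and the indicator of the active coordinates:

```python
import numpy as np

def stay_coarse(grad_full, support_mask, eta=0.5):
    """Keep taking support-restricted (coarse) steps while the gradient
    energy on the current support still dominates the full gradient norm;
    otherwise a full-space (fine) step is warranted to update the support."""
    restricted = np.where(support_mask, grad_full, 0.0)
    return np.linalg.norm(restricted) >= eta * np.linalg.norm(grad_full)
```

The intuition: when most of the gradient mass lives off the current support, coarse steps can no longer make progress and new coordinates should be allowed to activate.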
3. Algorithmic Pseudocode
1. Initialize $\theta^0 = 0$, $v^0 = 0$; choose $\tau$, $\delta$, $\lambda$, and the freezing interval $m$.
2. Perform one fine LinBreg step in the full parameter space (dual update plus soft-thresholding), which may activate new nonzeros.
3. Perform up to $m$ coarse steps with the gradient restricted to the current support, leaving inactive coordinates untouched.
4. Optionally end a coarse phase early when the projected-gradient criterion of Section 2 fails.
5. Repeat until convergence.
Hyperparameters include the sparsity control $\lambda$, the freezing interval $m$, the elasticity parameter $\delta$, and the step sizes of the fine and coarse phases, which must be chosen to satisfy the conditions of the theoretical guarantees.
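The alternating schedule of Section 2 can be sketched end-to-end in Python. This is an illustrative toy implementation on a least-squares loss; all names and parameter values are assumptions, not from the paper:

```python
import numpy as np

def soft_threshold(v, lam):
    """Componentwise shrinkage: sign(v) * max(|v| - lam, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def ml_linbreg(grad, dim, tau=1e-3, delta=1.0, lam=0.5, m=5, iters=300):
    """Multilevel LinBreg sketch: one full-space ('fine') step every m-th
    iteration; otherwise the dual update is restricted to the current
    support, so no new nonzeros can activate during coarse phases."""
    theta, v = np.zeros(dim), np.zeros(dim)
    for k in range(iters):
        g = grad(theta)
        if k % m != 0:                       # coarse step: freeze the support
            g = np.where(theta != 0, g, 0.0)
        v = v - tau * g                      # dual (mirror-descent) update
        theta = delta * soft_threshold(v, lam)
    return theta

# Toy problem: recover a sparse vector from noiseless linear measurements.
rng = np.random.default_rng(1)
A = rng.standard_normal((30, 12))
x_true = np.zeros(12)
x_true[[0, 7]] = [1.5, -2.0]
b = A @ x_true
theta = ml_linbreg(lambda t: A.T @ (A @ t - b), dim=12)
```

Note that coarse steps can still deactivate coordinates (a thresholded entry may fall back to zero); only activation of new coordinates is deferred to the fine steps.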
4. Convergence Properties
Under the following conditions:
- $\mathcal{L}$ is relatively smooth with respect to $J_\delta$
- A Polyak–Łojasiewicz-type Bregman inequality holds
- The stochastic (“coarse”) gradient estimator is unbiased with bounded variance
one establishes linear convergence in expectation up to a small error floor from stochasticity: $\mathbb{E}[\mathcal{L}(\theta^K) - \mathcal{L}^*] \le \rho^K\,(\mathcal{L}(\theta^0) - \mathcal{L}^*) + \tfrac{C\sigma^2}{1-\rho}$, where $\rho \in (0,1)$ is a geometric contraction rate and $\sigma^2$ bounds the variance of the coarse gradient estimator. As $\sigma^2 \to 0$, the error floor vanishes, so $\mathbb{E}[\mathcal{L}(\theta^K)] \to \mathcal{L}^*$.
The argument leverages alternating full (fine) LinBreg steps, which generate robust descent by relative smoothness and PL inequality, and coarse blocks, which descend in expectation but may incur additional variance. Telescoping across outer iterations yields convergence.
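The telescoping step can be made explicit. The following is a sketch, writing $\rho \in (0,1)$ for the per-cycle contraction factor and $\sigma^2$ for the coarse-gradient variance bound:

```latex
\mathbb{E}\big[\mathcal{L}(\theta^{K}) - \mathcal{L}^{*}\big]
  \le \rho\,\mathbb{E}\big[\mathcal{L}(\theta^{K-1}) - \mathcal{L}^{*}\big] + C\sigma^{2}
  \le \rho^{K}\big(\mathcal{L}(\theta^{0}) - \mathcal{L}^{*}\big)
      + C\sigma^{2}\sum_{j=0}^{K-1}\rho^{j}
  \le \rho^{K}\big(\mathcal{L}(\theta^{0}) - \mathcal{L}^{*}\big)
      + \frac{C\sigma^{2}}{1-\rho}.
```

Unrolling the one-cycle recursion $K$ times and bounding the geometric series gives the linear rate plus a variance-controlled error floor.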
5. Computational Complexity and Efficiency
Let $C_d$ denote the FLOP count for a dense forward pass, $C_s \approx (1-s)\,C_d$ that of a sparse pass (proportional to the remaining non-zeros), and $s \in [0,1]$ the achieved sparsity.
- Standard LinBreg: one full-space gradient and two prox computations per iteration, at a cost on the order of $C_d$.
- Multilevel (ML) LinBreg: one dense (full) step every $m$-th iteration; otherwise, all operations restricted to the active support.
The average per-iteration cost is therefore $C_{\mathrm{ML}} \approx \tfrac{1}{m}\,C_d + \tfrac{m-1}{m}\,C_s$.
In the regime of large $m$ and high sparsity (above 90%), the relative computational overhead falls to a small fraction of dense SGD training, substantially below that of standard LinBreg (Lunk et al., 3 Feb 2026).
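The cost model can be checked with a few lines of arithmetic, assuming a dense-pass cost of $C_d$, a sparse-pass cost of $(1-s)\,C_d$, and one dense step every $m$-th iteration:

```python
def ml_cost_ratio(s, m):
    """Average ML-LinBreg cost per iteration, relative to one dense pass:
    one dense step every m-th iteration, sparse steps otherwise."""
    c_sparse = 1.0 - s  # sparse-pass cost as a fraction of the dense cost
    return (1.0 + (m - 1) * c_sparse) / m

# e.g. at 95% sparsity with m = 10 coarse steps per fine step:
ratio = ml_cost_ratio(s=0.95, m=10)   # (1 + 9 * 0.05) / 10 = 0.145
```

With $m = 1$ the ratio degenerates to 1 (every step is dense), and it approaches $1-s$ as $m \to \infty$, matching the intuition that the dense fine steps dominate the overhead.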
6. Empirical Results
On standard benchmarks (CIFAR-10, TinyImageNet) and models (ResNet18, VGG16, WideResNet28-10), multilevel LinBreg converges as follows:
| Model/Method | Sparsity [%] | Test Acc. [%] |
|---|---|---|
| ResNet18, SGD Dense | 0.4 | 92.93 |
| ResNet18, Prune+FT @95% | 95.0 | 90.84 |
| ResNet18, LinBreg $\lambda$=0.2 | 96.0 | 90.35 |
| ResNet18, ML LinBreg $\lambda$=0.007 | 96.1 | 90.24 |
| VGG16, Prune+FT @92% | 92.0 | 91.39 |
| VGG16, ML LinBreg $\lambda$=0.003 | 91.3 | 90.71 |
| WRN28-10, Prune+FT @96% | 96.0 | 90.55 |
| WRN28-10, ML LinBreg $\lambda$=0.005 | 96.6 | 91.69 |
ML LinBreg consistently matches or exceeds the accuracy of standard LinBreg, pruning+fine-tune, and state-of-the-art dynamic sparse training baselines, with the advantage of never requiring a full dense epoch. Timing experiments with SparseProp layers show a substantial observed reduction in end-to-end training time versus dense baselines for CPU training, with corresponding reductions in both forward- and backward-pass times (Lunk et al., 3 Feb 2026).
7. Context, Significance, and Implications
The LinBreg/multilevel approach provides a mathematics-driven alternative to purely heuristic dynamic sparse training methods like those based on hard-pruning or drop-regrow mask updates. Its convergence is guaranteed under standard relative smoothness and Bregman-PL assumptions and is backed by empirical results at extreme sparsities. The adaptive freezing procedure maximally exploits the induced sparsity, reducing both the dimensionality of gradient computations and the computational graph for most iterations. This directly translates to tangible FLOP and wall-clock training savings on standard architectures at minimal loss of predictive accuracy.
A plausible implication is that LinBreg-based and multilevel mirror-descent frameworks can serve as "general-purpose" sparse solvers for deep learning models, particularly when ultra-high sparsity and computational efficiency are paramount, and may be extended to more complex settings such as group sparsity, block-structured regularization, or adaptive per-layer regularization. Further investigation is warranted into large-scale distributed implementations and potential extensions to non-convex, structured, or adaptive regularization penalties (Lunk et al., 3 Feb 2026).