
Linearized Bregman Iterations

Updated 10 February 2026
  • Linearized Bregman iterations are an optimization method that leverages mirror descent and Bregman divergence to produce sparse solutions in empirical risk minimization.
  • The multilevel approach alternates between fixed support updates and dynamic activation of new nonzeros, greatly reducing computational costs and ensuring robust convergence.
  • Empirical results on benchmarks like CIFAR-10 and TinyImageNet show significant savings in FLOPs and training time while maintaining competitive accuracy under extreme sparsity.

Linearized Bregman iterations (LinBreg) are an optimization method for producing sparse solutions to empirical risk minimization in deep learning, grounded in mirror descent and Bregman divergence concepts. When combined with dynamically updated support—alternating between periods of fixed “static” sparsity and phases where new support (non-zeros) can be activated—LinBreg enables very efficient, theoretically principled exploration of sparse parameter spaces. Recent work has proposed a multilevel extension that adapts support freezing, providing both strong computational savings and robust convergence in deep neural network training under high global sparsity constraints (Lunk et al., 3 Feb 2026).

1. Mathematical Framework

The core LinBreg approach addresses composite optimization of the form
$\min_{\theta \in \mathbb{R}^{d}}\; \mathcal{L}(\theta) + J(\theta),$
where $\mathcal{L}$ is a differentiable non-convex loss (empirical risk) and $J$ is a proper, convex, lower semi-continuous regularizer, typically promoting sparsity, e.g., $J(\theta) = \lambda \|\theta\|_1$.

To facilitate mirror descent, LinBreg uses a $\delta$-elasticized regularizer
$J_\delta(\theta) = \frac{1}{2\delta} \|\theta\|^2 + J(\theta)$
and introduces dual variables $v \in \partial J_\delta(\theta)$. Each iteration comprises
$\begin{aligned} v^{(k+1)} &= v^{(k)} - \tau \nabla \mathcal{L}(\theta^{(k)}), \\ \theta^{(k+1)} &= \mathrm{prox}_{\delta J}\bigl(\delta v^{(k+1)}\bigr). \end{aligned}$
For $\ell_1$-regularization, this reduces to iterative soft-thresholding, with $\theta_i$ set to zero whenever $|v_i| \le \lambda$.
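As a concrete illustration, the $\ell_1$ case can be sketched in a few lines of NumPy; function names and hyperparameter values below are illustrative, not taken from the paper:

```python
import numpy as np

def soft_threshold(x, lam):
    """Proximal map of lam * ||.||_1: shrink each entry toward zero."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def linbreg_step(theta, v, grad, tau=0.1, delta=1.0, lam=0.05):
    """One linearized Bregman iteration for J(theta) = lam * ||theta||_1.

    The dual variable v accumulates (scaled) gradients; the primal
    iterate is recovered via prox_{delta*J}(delta * v), which for the
    l1 penalty equals delta * soft_threshold(v, lam).
    """
    v_next = v - tau * grad
    theta_next = delta * soft_threshold(v_next, lam)
    return theta_next, v_next
```

Entries whose dual variable has not yet accumulated enough gradient mass ($|v_i| \le \lambda$) remain exactly zero, which is the mechanism that produces sparse iterates.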

The Bregman divergence associated with $J_\delta$ is

$D_{J_\delta}^{v}(\theta, \theta') \;=\; J_\delta(\theta) - J_\delta(\theta') - \langle v,\, \theta - \theta' \rangle, \qquad v \in \partial J_\delta(\theta'),$

which underlies the mirror-descent interpretation of the iteration above.

2. Multilevel LinBreg: Alternating Static and Dynamic Sparsity

In practice, LinBreg is enhanced by freezing the sparsity pattern (“support”) periodically:

  • The parameter vector $\theta$ is split by active groups (i.e., groups $g$ for which $\theta_g \neq 0$).
  • During $m$ consecutive "coarse" steps, updates are restricted to the support—no new nonzeros are activated, reducing the dimensionality and cost of both forward/backward passes and prox computations.
  • A full (“fine”) LinBreg step is then performed—computing the gradient in the full space and updating both the support and the active values, potentially activating new coordinates (restoring “dynamic” sparsity).

Transitions between phases may be scheduled adaptively based on the relative magnitude of the projected gradient, e.g., performing coarse-only updates while

$\|P_{S}\, \nabla \mathcal{L}(\theta)\| \;\geq\; \eta\, \|\nabla \mathcal{L}(\theta)\|,$

where $P_S$ denotes the restriction to the current support $S$ and $\eta \in (0,1)$ is a threshold.
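A simple version of this scheduling rule, with an illustrative threshold $\eta$ (the exact criterion used in the paper may differ):

```python
import numpy as np

def use_coarse_step(grad, support, eta=0.5):
    """Heuristic phase switch: stay in the cheap support-restricted
    ("coarse") mode while the gradient mass on the current support
    dominates the full gradient.

    grad    : full gradient of the loss
    support : boolean mask of currently active (non-zero) coordinates
    eta     : threshold ratio in (0, 1), illustrative value
    """
    g_support = np.linalg.norm(grad[support])
    g_full = np.linalg.norm(grad)
    return g_full == 0.0 or g_support >= eta * g_full
```

When most of the gradient norm lies off the support, the rule triggers a full ("fine") step so that new coordinates can be activated.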

3. Algorithmic Pseudocode

The multilevel scheme alternates a single full-space ("fine") LinBreg step with $m$ support-restricted ("coarse") steps, repeating this outer cycle until convergence.

Hyperparameters include the sparsity control $\lambda$, the freezing interval $m$, and the step-sizes $\tau$ and $\delta$, which must be chosen appropriately for the theoretical guarantees to hold.
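The outer loop can be sketched as follows, assuming the $\ell_1$ soft-thresholding prox; helper names and default values are illustrative, and the coarse steps here evaluate the full gradient for simplicity, whereas a real implementation would compute only the restricted gradient:

```python
import numpy as np

def soft_threshold(x, lam):
    """Proximal map of lam * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def ml_linbreg(grad_fn, theta0, tau=0.1, delta=1.0, lam=0.05,
               m=5, n_outer=100):
    """Multilevel LinBreg sketch: one fine step, then m coarse steps.

    Assumes theta0 is consistent with a zero dual variable
    (e.g., theta0 = 0), so the primal/dual pair starts in sync.
    """
    v = np.zeros_like(theta0, dtype=float)
    theta = np.array(theta0, dtype=float)
    for _ in range(n_outer):
        # Fine step: full-space update, support may grow or shrink.
        v = v - tau * grad_fn(theta)
        theta = delta * soft_threshold(v, lam)
        support = theta != 0
        # Coarse steps: updates restricted to the frozen support.
        for _ in range(m):
            g = grad_fn(theta)  # real code: restricted gradient only
            v[support] -= tau * g[support]
            theta[support] = delta * soft_threshold(v[support], lam)
    return theta
```

On a toy quadratic loss $\mathcal{L}(\theta) = \tfrac{1}{2}\|\theta - t\|^2$ with a sparse target $t$, the iterates converge to $t$ while coordinates that never accumulate gradient mass stay exactly zero.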

4. Convergence Properties

Under the following conditions:

  • $\mathcal{L}$ is relatively smooth with respect to $J_\delta$
  • A Polyak–Łojasiewicz-type Bregman inequality holds
  • The stochastic (“coarse”) gradient estimator is unbiased with bounded variance

one establishes linear convergence in expectation up to a small error from stochasticity:

$\mathbb{E}\bigl[\mathcal{L}(\theta^{(k)}) - \mathcal{L}^{*}\bigr] \;\leq\; \rho^{k}\,\bigl(\mathcal{L}(\theta^{(0)}) - \mathcal{L}^{*}\bigr) + \mathcal{O}(\sigma^{2}),$

where $\rho \in (0,1)$ is a geometric contraction rate and $\sigma^2$ bounds the variance of the coarse gradient estimator. As $k \to \infty$, $\rho^{k} \to 0$, so the expected suboptimality is ultimately governed by the $\mathcal{O}(\sigma^2)$ stochastic term.

The argument leverages alternating full (fine) LinBreg steps, which generate robust descent by relative smoothness and PL inequality, and coarse blocks, which descend in expectation but may incur additional variance. Telescoping across outer iterations yields convergence.

5. Computational Complexity and Efficiency

Let $C_{\text{dense}}$ denote the FLOP count for a dense forward pass, $C_{\text{sparse}}$ the count for a sparse pass (proportional to the remaining non-zeros), and $s \in [0,1]$ the achieved sparsity.

  • Standard LinBreg: on the order of $C_{\text{dense}}$ FLOPs per iteration (one full gradient and two prox evaluations in a sparse regime).
  • Multilevel (ML) LinBreg: one dense (full) step every $m$-th iteration; otherwise, all operations restricted to the active support.

The resulting per-iteration cost is approximately

$C_{\text{ML}} \;\approx\; \frac{1}{m}\, C_{\text{dense}} + \Bigl(1 - \frac{1}{m}\Bigr)\, C_{\text{sparse}}, \qquad C_{\text{sparse}} \approx (1 - s)\, C_{\text{dense}}.$

In the regime $m \gg 1$ and at high sparsity, the relative computational overhead falls to a small fraction of dense SGD training, compared to essentially the full dense cost per iteration for standard LinBreg (Lunk et al., 3 Feb 2026).
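The cost model above can be checked with a small helper (illustrative, assuming the sparse-pass cost scales linearly with the non-zero fraction):

```python
def ml_linbreg_flops(c_dense, sparsity, m):
    """Approximate per-iteration FLOPs for multilevel LinBreg:
    one dense step every m iterations, support-restricted steps
    (costing roughly the non-zero fraction of a dense pass) otherwise.
    """
    c_sparse = (1.0 - sparsity) * c_dense  # cost ~ remaining non-zeros
    return c_dense / m + (1.0 - 1.0 / m) * c_sparse
```

For example, with $m = 20$ and 98% sparsity the estimated per-iteration cost is about 7% of a dense step.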

6. Empirical Results

On standard benchmarks (CIFAR-10, TinyImageNet) and models (ResNet18, VGG16, WideResNet28-10), multilevel LinBreg converges as follows:

Model/Method                         Sparsity [%]   Test Acc. [%]
ResNet18, SGD (dense)                0.4            92.93
ResNet18, Prune+FT @95%              95.0           90.84
ResNet18, LinBreg (λ = 0.2)          96.0           90.35
ResNet18, ML LinBreg (λ = 0.007)     96.1           90.24
VGG16, Prune+FT @92%                 92.0           91.39
VGG16, ML LinBreg (λ = 0.003)        91.3           90.71
WRN28-10, Prune+FT @96%              96.0           90.55
WRN28-10, ML LinBreg (λ = 0.005)     96.6           91.69

ML LinBreg consistently matches or exceeds the accuracy of standard LinBreg, pruning+fine-tuning, and state-of-the-art dynamic sparse training baselines, with the advantage of never requiring a full dense epoch. Timing experiments with SparseProp layers show a substantial observed reduction in end-to-end CPU training time versus dense baselines, with both forward and backward pass times markedly reduced (Lunk et al., 3 Feb 2026).

7. Context, Significance, and Implications

The LinBreg/multilevel approach provides a mathematics-driven alternative to purely heuristic dynamic sparse training methods like those based on hard-pruning or drop-regrow mask updates. Its convergence is guaranteed under standard relative smoothness and Bregman-PL assumptions and is backed by empirical results at extreme sparsities. The adaptive freezing procedure maximally exploits the induced sparsity, reducing both the dimensionality of gradient computations and the computational graph for most iterations. This directly translates to tangible FLOP and wall-clock training savings on standard architectures at minimal loss of predictive accuracy.

A plausible implication is that LinBreg-based and multilevel mirror-descent frameworks can serve as "general-purpose" sparse solvers for deep learning models, particularly when ultra-high sparsity and computational efficiency are paramount, and may be extended to more complex settings such as group sparsity, block-structured regularization, or adaptive per-layer regularization. Further investigation is warranted into large-scale distributed implementations and potential extensions to non-convex, structured, or adaptive regularization penalties (Lunk et al., 3 Feb 2026).
