Linearized Bregman Iterations
- Linearized Bregman iterations are an optimization method that leverages mirror descent and Bregman divergence to produce sparse solutions in empirical risk minimization.
- The multilevel approach alternates between fixed support updates and dynamic activation of new nonzeros, greatly reducing computational costs and ensuring robust convergence.
- Empirical results on benchmarks like CIFAR-10 and TinyImageNet show significant savings in FLOPs and training time while maintaining competitive accuracy under extreme sparsity.
Linearized Bregman iterations (LinBreg) are an optimization method for producing sparse solutions to empirical risk minimization in deep learning, grounded in mirror descent and Bregman divergence concepts. When combined with dynamically updated support—alternating between periods of fixed “static” sparsity and phases where new support (non-zeros) can be activated—LinBreg enables very efficient, theoretically principled exploration of sparse parameter spaces. Recent work has proposed a multilevel extension that adapts support freezing, providing both strong computational savings and robust convergence in deep neural network training under high global sparsity constraints (Lunk et al., 3 Feb 2026).
1. Mathematical Framework
The core LinBreg approach addresses composite optimization of the form $\min_\theta \mathcal{L}(\theta) + J(\theta)$, where $\mathcal{L}$ is a differentiable non-convex loss (empirical risk) and $J$ is a proper, convex, lower semi-continuous regularizer, typically promoting sparsity, e.g., $J(\theta) = \lambda \|\theta\|_1$.
To facilitate mirror descent, LinBreg uses a $\delta$-elasticized regularizer $J_\delta(\theta) = J(\theta) + \tfrac{1}{2\delta}\|\theta\|_2^2$ and introduces dual variables $v^k \in \partial J_\delta(\theta^k)$. Each iteration comprises a dual update $v^{k+1} = v^k - \tau \nabla \mathcal{L}(\theta^k)$ followed by the mirror map $\theta^{k+1} = \nabla J_\delta^*(v^{k+1})$. For $\ell_1$-regularization, this reduces to iterative soft-thresholding, $\theta_i^{k+1} = \delta\,\operatorname{sign}(v_i^{k+1})\max(|v_i^{k+1}| - \lambda,\, 0)$, with $\theta_i^{k+1}$ set to zero when $|v_i^{k+1}| \le \lambda$.
The Bregman divergence associated with $J_\delta$ is $D_{J_\delta}^{v}(\theta, \theta') = J_\delta(\theta) - J_\delta(\theta') - \langle v, \theta - \theta' \rangle$ for $v \in \partial J_\delta(\theta')$, and serves as the distance measure in the mirror-descent analysis.
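For the $\ell_1$ case, a single LinBreg iteration can be sketched in a few lines of NumPy. The function names, the least-squares loss, and all parameter values below are illustrative choices, not from the paper:

```python
import numpy as np

def soft_threshold(v, lam):
    """Componentwise shrinkage: sign(v) * max(|v| - lam, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def linbreg_step(theta, v, grad, tau, delta, lam):
    """One linearized Bregman iteration for J(theta) = lam * ||theta||_1.
    Dual update followed by the mirror map; entries with |v_i| <= lam
    are set exactly to zero, which is what produces sparsity."""
    v_new = v - tau * grad(theta)
    theta_new = delta * soft_threshold(v_new, lam)
    return theta_new, v_new

# Toy example: L(theta) = 0.5 * ||A @ theta - b||^2 with a sparse ground truth.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
x_true = np.zeros(10)
x_true[[1, 4]] = [2.0, -3.0]
b = A @ x_true
grad = lambda th: A.T @ (A @ th - b)

theta, v = np.zeros(10), np.zeros(10)
for _ in range(500):
    theta, v = linbreg_step(theta, v, grad, tau=1e-3, delta=1.0, lam=0.5)
```

Because the iterate is produced by thresholding the dual variable, coordinates stay exactly zero until enough gradient signal has accumulated, in contrast to plain subgradient descent on the $\ell_1$ objective.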
2. Multilevel LinBreg: Alternating Static and Dynamic Sparsity
In practice, LinBreg is enhanced by freezing the sparsity pattern (“support”) periodically:
- The parameter vector $\theta$ is split by active groups (i.e., groups $g$ for which $\theta_g \neq 0$).
- During $m$ consecutive “coarse” steps, updates are restricted to the support, so no new nonzeros are activated, reducing the dimensionality and cost of both forward/backward passes and prox computations.
- A full (“fine”) LinBreg step is then performed—computing the gradient in the full space and updating both the support and the active values, potentially activating new coordinates (restoring “dynamic” sparsity).
Transitions between phases may be scheduled adaptively based on the relative magnitude of the projected gradient, e.g., performing coarse-only updates while
$\|P_S \nabla \mathcal{L}(\theta^k)\| \ge \eta\, \|\nabla \mathcal{L}(\theta^k)\|,$
where $P_S$ is the restriction to the current support $S$ and $\eta \in (0,1)$ is a tolerance.
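The adaptive switching test can be sketched as follows; `eta` and `support_mask` are assumed names for the tolerance and the indicator of the active coordinates:

```python
import numpy as np

def stay_coarse(grad_full, support_mask, eta=0.5):
    """Keep taking support-restricted (coarse) steps while the gradient
    energy on the current support still dominates the full gradient norm;
    otherwise a full-space (fine) step is warranted to update the support."""
    restricted = np.where(support_mask, grad_full, 0.0)
    return np.linalg.norm(restricted) >= eta * np.linalg.norm(grad_full)
```

The intuition: when most of the gradient mass lives off the current support, coarse steps can no longer make progress and new coordinates should be allowed to activate.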
3. Algorithmic Pseudocode
1. Initialize $\theta^0 = 0$, $v^0 = 0$; choose $\tau$, $\delta$, $\lambda$, and the freezing interval $m$.
2. Perform one fine LinBreg step in the full parameter space (dual update plus soft-thresholding), which may activate new nonzeros.
3. Perform up to $m$ coarse steps with the gradient restricted to the current support, leaving inactive coordinates untouched.
4. Optionally end a coarse phase early when the projected-gradient criterion of Section 2 fails.
5. Repeat until convergence.
Hyperparameters include the sparsity control $\lambda$, the freezing interval $m$, the elasticity parameter $\delta$, and the step sizes of the fine and coarse phases, which must be chosen to satisfy the conditions of the theoretical guarantees.
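The alternating schedule of Section 2 can be sketched end-to-end in Python. This is an illustrative toy implementation on a least-squares loss; all names and parameter values are assumptions, not from the paper:

```python
import numpy as np

def soft_threshold(v, lam):
    """Componentwise shrinkage: sign(v) * max(|v| - lam, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def ml_linbreg(grad, dim, tau=1e-3, delta=1.0, lam=0.5, m=5, iters=300):
    """Multilevel LinBreg sketch: one full-space ('fine') step every m-th
    iteration; otherwise the dual update is restricted to the current
    support, so no new nonzeros can activate during coarse phases."""
    theta, v = np.zeros(dim), np.zeros(dim)
    for k in range(iters):
        g = grad(theta)
        if k % m != 0:                       # coarse step: freeze the support
            g = np.where(theta != 0, g, 0.0)
        v = v - tau * g                      # dual (mirror-descent) update
        theta = delta * soft_threshold(v, lam)
    return theta

# Toy problem: recover a sparse vector from noiseless linear measurements.
rng = np.random.default_rng(1)
A = rng.standard_normal((30, 12))
x_true = np.zeros(12)
x_true[[0, 7]] = [1.5, -2.0]
b = A @ x_true
theta = ml_linbreg(lambda t: A.T @ (A @ t - b), dim=12)
```

Note that coarse steps can still deactivate coordinates (a thresholded entry may fall back to zero); only activation of new coordinates is deferred to the fine steps.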
4. Convergence Properties
Under the following conditions:
- $\mathcal{L}$ is relatively smooth with respect to $J_\delta$
- A Polyak–Łojasiewicz-type Bregman inequality holds
- The stochastic (“coarse”) gradient estimator is unbiased with bounded variance
one establishes linear convergence in expectation up to a small error floor from stochasticity: $\mathbb{E}[\mathcal{L}(\theta^K) - \mathcal{L}^*] \le \rho^K\,(\mathcal{L}(\theta^0) - \mathcal{L}^*) + \tfrac{C\sigma^2}{1-\rho}$, where $\rho \in (0,1)$ is a geometric contraction rate and $\sigma^2$ bounds the variance of the coarse gradient estimator. As $\sigma^2 \to 0$, the error floor vanishes, so $\mathbb{E}[\mathcal{L}(\theta^K)] \to \mathcal{L}^*$.
The argument leverages alternating full (fine) LinBreg steps, which generate robust descent by relative smoothness and PL inequality, and coarse blocks, which descend in expectation but may incur additional variance. Telescoping across outer iterations yields convergence.
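The telescoping step can be made explicit. The following is a sketch, writing $\rho \in (0,1)$ for the per-cycle contraction factor and $\sigma^2$ for the coarse-gradient variance bound:

```latex
\mathbb{E}\big[\mathcal{L}(\theta^{K}) - \mathcal{L}^{*}\big]
  \le \rho\,\mathbb{E}\big[\mathcal{L}(\theta^{K-1}) - \mathcal{L}^{*}\big] + C\sigma^{2}
  \le \rho^{K}\big(\mathcal{L}(\theta^{0}) - \mathcal{L}^{*}\big)
      + C\sigma^{2}\sum_{j=0}^{K-1}\rho^{j}
  \le \rho^{K}\big(\mathcal{L}(\theta^{0}) - \mathcal{L}^{*}\big)
      + \frac{C\sigma^{2}}{1-\rho}.
```

Unrolling the one-cycle recursion $K$ times and bounding the geometric series gives the linear rate plus a variance-controlled error floor.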
5. Computational Complexity and Efficiency
Let $C_d$ denote the FLOP count for a dense forward pass, $C_s \approx (1-s)\,C_d$ that of a sparse pass (proportional to the remaining non-zeros), and $s \in [0,1]$ the achieved sparsity.
- Standard LinBreg: one full-space gradient and two prox computations per iteration, at a cost on the order of $C_d$.
- Multilevel (ML) LinBreg: one dense (full) step every $m$-th iteration; otherwise, all operations restricted to the active support.
The average per-iteration cost is therefore $C_{\mathrm{ML}} \approx \tfrac{1}{m}\,C_d + \tfrac{m-1}{m}\,C_s$.
In the regime of large $m$ and high sparsity (above 90%), the relative computational overhead falls to a small fraction of dense SGD training, substantially below that of standard LinBreg (Lunk et al., 3 Feb 2026).
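The cost model can be checked with a few lines of arithmetic, assuming a dense-pass cost of $C_d$, a sparse-pass cost of $(1-s)\,C_d$, and one dense step every $m$-th iteration:

```python
def ml_cost_ratio(s, m):
    """Average ML-LinBreg cost per iteration, relative to one dense pass:
    one dense step every m-th iteration, sparse steps otherwise."""
    c_sparse = 1.0 - s  # sparse-pass cost as a fraction of the dense cost
    return (1.0 + (m - 1) * c_sparse) / m

# e.g. at 95% sparsity with m = 10 coarse steps per fine step:
ratio = ml_cost_ratio(s=0.95, m=10)   # (1 + 9 * 0.05) / 10 = 0.145
```

With $m = 1$ the ratio degenerates to 1 (every step is dense), and it approaches $1-s$ as $m \to \infty$, matching the intuition that the dense fine steps dominate the overhead.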
6. Empirical Results
On standard benchmarks (CIFAR-10, TinyImageNet) and models (ResNet18, VGG16, WideResNet28-10), multilevel LinBreg converges as follows:
| Model/Method | Sparsity [%] | Test Acc. [%] |
|---|---|---|
| ResNet18, SGD Dense | 0.4 | 92.93 |
| ResNet18, Prune+FT @95% | 95.0 | 90.84 |
| ResNet18, LinBreg $\lambda$=0.2 | 96.0 | 90.35 |
| ResNet18, ML LinBreg $\lambda$=0.007 | 96.1 | 90.24 |
| VGG16, Prune+FT @92% | 92.0 | 91.39 |
| VGG16, ML LinBreg $\lambda$=0.003 | 91.3 | 90.71 |
| WRN28-10, Prune+FT @96% | 96.0 | 90.55 |
| WRN28-10, ML LinBreg $\lambda$=0.005 | 96.6 | 91.69 |
ML LinBreg consistently matches or exceeds the accuracy of standard LinBreg, pruning+fine-tune, and state-of-the-art dynamic sparse training baselines, with the advantage of never requiring a full dense epoch. Timing experiments with SparseProp layers show a substantial observed reduction in end-to-end training time versus dense baselines for CPU training, with corresponding reductions in both forward- and backward-pass times (Lunk et al., 3 Feb 2026).
7. Context, Significance, and Implications
The LinBreg/multilevel approach provides a mathematics-driven alternative to purely heuristic dynamic sparse training methods like those based on hard-pruning or drop-regrow mask updates. Its convergence is guaranteed under standard relative smoothness and Bregman-PL assumptions and is backed by empirical results at extreme sparsities. The adaptive freezing procedure maximally exploits the induced sparsity, reducing both the dimensionality of gradient computations and the computational graph for most iterations. This directly translates to tangible FLOP and wall-clock training savings on standard architectures at minimal loss of predictive accuracy.
A plausible implication is that LinBreg-based and multilevel mirror-descent frameworks can serve as "general-purpose" sparse solvers for deep learning models, particularly when ultra-high sparsity and computational efficiency are paramount, and may be extended to more complex settings such as group sparsity, block-structured regularization, or adaptive per-layer regularization. Further investigation is warranted into large-scale distributed implementations and potential extensions to non-convex, structured, or adaptive regularization penalties (Lunk et al., 3 Feb 2026).