Proximal Block-Coordinate Descent

Updated 16 June 2026

Proximal BCD methods decompose optimization problems into a smooth component and block-separable nonsmooth regularizers, enabling robust handling of constraints.
They utilize adaptive block selection strategies, such as Gauss–Southwell rules, and proximal operators to accelerate convergence and reduce per-iteration cost.
Empirical results demonstrate that greedy variable block updates combined with Newton and proximal steps significantly speed up large-scale, structured optimization tasks.

Proximal Block-Coordinate Descent (BCD) methods are a class of optimization algorithms that target problems with composite structure, where the objective decomposes as the sum of a smooth component and a block-separable, possibly nonsmooth, regularizer. BCD methods offer cheap iterations, low memory requirements, and a natural fit for parallelization and for exploiting problem structure. The proximal variant handles nonsmoothness and constraints within each block update, significantly broadening the domain of applicability in machine learning, signal processing, large-scale statistics, and computational mathematics (Nutini et al., 2017).

1. Problem Classes and Structural Decomposition

Proximal BCD addresses structured problems of the form

$\min_{x \in \mathbb{R}^n} F(x) = f(x) + g(x),$

where $f$ is a differentiable (possibly nonconvex) function with Lipschitz-continuous gradient and $g$ is a block-separable, possibly nonsmooth function. Typical examples include $\ell_1$-regularized regression ($g(x)=\lambda \|x\|_1$), indicator constraints for nonnegativity ($g(x) = I_{x \geq 0}$), and graph-structured objectives ($f(x) = \sum_{(i,j)\in E} f_{ij}(x_i, x_j)$) (Nutini et al., 2017).

The decision variable $x$ is partitioned into $B$ blocks, $x=(x_1, \dots, x_B)$. Blocks may be fixed (predefined groups) or variable (chosen anew at each iteration, e.g., greedily or at random). Exploiting this block structure lets BCD operate at scales and over sparsity patterns that are infeasible for full-gradient methods.
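As a concrete instance of this decomposition, the sketch below builds a small Lasso-type objective and checks that the regularizer splits across a fixed partition. The function names, problem sizes, and the choice of four contiguous blocks are illustrative assumptions, not taken from the source.

```python
import numpy as np

def smooth_part(A, b, x):
    """f(x) = 0.5 * ||A x - b||^2: differentiable with Lipschitz gradient."""
    r = A @ x - b
    return 0.5 * float(r @ r)

def l1_part(x, lam):
    """g(x) = lam * ||x||_1: nonsmooth, but separable across coordinates."""
    return lam * float(np.abs(x).sum())

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 12))
b = rng.standard_normal(20)
x = rng.standard_normal(12)
lam = 0.1

# Fixed partition of the 12 coordinates into B = 4 contiguous blocks.
blocks = np.array_split(np.arange(x.size), 4)

# Block separability: g(x) equals the sum of g_b(x_b) over the blocks.
g_total = l1_part(x, lam)
g_blockwise = sum(l1_part(x[idx], lam) for idx in blocks)

# Composite objective F(x) = f(x) + g(x).
F = smooth_part(A, b, x) + g_total
```

Because $g$ is a sum over blocks, each block update only has to deal with its own piece $g_b$, which is what keeps the per-block proximal step cheap.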

2. Block Partitioning, Selection, and Update Rules

The efficacy of Proximal BCD is governed by three main algorithmic ingredients (Nutini et al., 2017):

Block Partitioning: Fixed blocks (pre-specified index groups, e.g., in group-Lasso) versus variable blocks (dynamic, often based on magnitude of gradients). Larger blocks may speed up per-iteration progress but at higher per-iteration cost; variable blocks facilitate adaptive, greedy regimes.
Block Selection (Greedy and Adaptive Rules):
- Gauss–Southwell (GS): pick block maximizing $\|\nabla_b f(x)\|_2$ .
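- Gauss–Southwell–Lipschitz (GSL): maximize $\|\nabla_b f(x)\|_2^2 / L_b$, where $L_b$ is the Lipschitz constant of the block gradient $\nabla_b f$; this maximizes the guaranteed per-iteration decrease in the bound $f(x^{k+1}) \leq f(x^k) - \frac{1}{2L_b}\|\nabla_b f(x^k)\|_2^2$.
- Gauss–Southwell–Quadratic (GSQ): For $f$ block-wise quadratic, maximize quadratic improvement using Hessian surrogates.
- Diagonal (GSD) or “top-$\tau$” rules, which take the $\tau$ highest-scoring coordinates, for efficient yet informative selection.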
Block Update (Proximal and Matrix-structured):
- Smooth blocks: Gradient step or Newton/matrix-preconditioned step, supporting adaptive step-size and line-search.
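- Nonsmooth/Composite blocks: Proximal operator for the block regularizer $g_b$, i.e.,
$x_b^{k+1} = \operatorname{prox}_{\alpha g_b}\big(x_b^k - \alpha \nabla_b f(x^k)\big) = \arg\min_{y} \big\{ \tfrac{1}{2\alpha}\|y - (x_b^k - \alpha \nabla_b f(x^k))\|_2^2 + g_b(y) \big\},$ with step size $\alpha > 0$ (e.g., $\alpha = 1/L_b$ for a block Lipschitz constant $L_b$).

For $g$ as the $\ell_1$-norm, the proximal map reduces to soft-thresholding.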
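The proximal block update can be sketched as follows for the $\ell_1$ case. This is a minimal illustration under stated assumptions: a least-squares $f$, $g = \lambda\|\cdot\|_1$, and step size $1/L_b$. The helper names (`soft_threshold`, `prox_block_step`, `lasso_objective`) are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal map of t * ||.||_1: shrink each entry of z toward zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_block_step(x, grad, block, L_b, lam):
    """One proximal-gradient update of the coordinates in `block`.

    Uses step size 1/L_b, where L_b is (an upper bound on) the Lipschitz
    constant of the block gradient, and g = lam * ||.||_1.
    """
    x_new = x.copy()
    z = x[block] - grad[block] / L_b          # gradient step on the block
    x_new[block] = soft_threshold(z, lam / L_b)  # prox of (1/L_b) * g_b
    return x_new

def lasso_objective(A, b, x, lam):
    r = A @ x - b
    return 0.5 * float(r @ r) + lam * float(np.abs(x).sum())

rng = np.random.default_rng(0)
A = rng.standard_normal((15, 6))
b = rng.standard_normal(15)
lam = 0.5
x = np.zeros(6)
block = np.array([0, 1, 2])

grad = A.T @ (A @ x - b)                      # gradient of 0.5 * ||A x - b||^2
L_b = np.linalg.norm(A[:, block], 2) ** 2     # largest eigenvalue of A_b^T A_b
x_next = prox_block_step(x, grad, block, L_b, lam)
```

Only the selected block changes, and with a valid $L_b$ the composite objective does not increase after the step.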

For graph-structured or sparsity-enforced problems, message-passing (e.g., Gaussian belief propagation) yields efficient $O(|b|)$ block updates, i.e., linear in the block size $|b|$, by exploiting tree/forest substructure (Nutini et al., 2017).
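To make the linear-time claim concrete, the sketch below handles the simplest tree, a chain, where the block Hessian is tridiagonal. It uses the classical Thomas algorithm (Gaussian elimination along the chain) as a stand-in for general forest message-passing, so it is a special case rather than the method of the cited paper. The function name `solve_chain` and the test matrix are assumptions.

```python
import numpy as np

def solve_chain(diag, off, rhs):
    """Solve H d = rhs for a symmetric tridiagonal H in O(n) time.

    diag: main diagonal (length n); off: off-diagonal (length n - 1).
    Assumes H is positive definite, so no pivoting is needed.
    """
    n = len(diag)
    c = np.zeros(n - 1)   # scaled superdiagonal after elimination
    y = np.zeros(n)       # scaled right-hand side after elimination
    denom = diag[0]
    y[0] = rhs[0] / denom
    # Forward sweep: eliminate the subdiagonal, one node at a time.
    for i in range(1, n):
        c[i - 1] = off[i - 1] / denom
        denom = diag[i] - off[i - 1] * c[i - 1]
        y[i] = (rhs[i] - off[i - 1] * y[i - 1]) / denom
    # Backward sweep: back-substitute from the last node to the first.
    d = np.zeros(n)
    d[-1] = y[-1]
    for i in range(n - 2, -1, -1):
        d[i] = y[i] - c[i] * d[i + 1]
    return d

rng = np.random.default_rng(0)
n = 8
diag = np.full(n, 3.0)                  # diagonally dominant, hence positive definite
off = rng.uniform(-1.0, 1.0, n - 1)
g_b = rng.standard_normal(n)            # block gradient (illustrative)

# Newton direction for the block: solve H_bb d = -grad_b.
d = solve_chain(diag, off, -g_b)
H = np.diag(diag) + np.diag(off, 1) + np.diag(off, -1)
```

A dense solve of the same system costs $O(|b|^3)$, which is why restricting blocks to forest-structured index sets allows much larger blocks at comparable cost.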

3. Convergence Rates and Theoretical Guarantees

Convergence analysis reveals global and local phenomena (Nutini et al., 2017):

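Under the Polyak–Łojasiewicz (PL) condition: For $f$ satisfying

$\tfrac{1}{2}\|\nabla f(x)\|^2 \geq \mu\,(f(x) - f^*) \quad \text{for all } x,$

with $\mu > 0$ and optimal value $f^*$, any per-iteration improvement of $f(x^{k+1}) \leq f(x^k) - \tfrac{1}{2L}\|\nabla f(x^k)\|^2$ yields linear convergence: $f(x^k) - f^* \leq (1 - \mu/L)^k\,(f(x^0) - f^*)$. Here the norm is the one in which progress is measured, which for greedy block rules is a mixed block norm.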

Block Greedy (GS, GSL, GSD, GSQ) Updates: These ensure the mixed-norm progress and guarantee the above linear rates, while for general nonconvex $f$, sublinear $O(1/k)$ rates (for the decrease of the squared gradient norm) apply.
Active-set identification and superlinear convergence: For composite objectives where the nonsmooth term $g$ is separable, the active set $\{i : g_i \text{ is nonsmooth at } x_i^*\}$ is identified in a finite number of iterations. After manifold identification, local convergence can be superlinear, and for certain structures (piecewise-quadratic $f$ and polyhedral $g$), finite termination is possible (Nutini et al., 2017).
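The linear rate under the PL condition can be checked numerically on a strongly convex quadratic with single-coordinate blocks. The sketch below is an illustration under stated assumptions: $f(x) = \tfrac12 x^\top A x - b^\top x$, exact coordinate steps, and the Euclidean-norm PL constant $\mu = \lambda_{\min}(A)$. That choice gives the conservative rate $1 - \mu/(n L_{\max})$ for the greedy rules, and the function name `coordinate_descent` is hypothetical.

```python
import numpy as np

def coordinate_descent(A, b, rule, iters, seed=0):
    """Minimize f(x) = 0.5 x^T A x - b^T x with single-coordinate blocks.

    rule: "gs"     -> argmax_i |grad_i|
          "gsl"    -> argmax_i |grad_i| / sqrt(L_i), with L_i = A[i, i]
          "random" -> uniformly random coordinate
    Returns the objective values f(x^0), ..., f(x^iters).
    """
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    L = np.diag(A).copy()            # coordinate-wise Lipschitz constants
    x = np.zeros(n)
    grad = A @ x - b
    values = [0.5 * x @ A @ x - b @ x]
    for _ in range(iters):
        if rule == "gs":
            i = int(np.argmax(np.abs(grad)))
        elif rule == "gsl":
            i = int(np.argmax(np.abs(grad) / np.sqrt(L)))
        else:
            i = int(rng.integers(n))
        step = grad[i] / L[i]        # exact minimization along coordinate i
        x[i] -= step
        grad -= step * A[:, i]       # O(n) gradient update
        values.append(0.5 * x @ A @ x - b @ x)
    return values

rng = np.random.default_rng(1)
M = rng.standard_normal((30, 10))
A = M.T @ M + 0.5 * np.eye(10)       # symmetric positive definite
b = rng.standard_normal(10)
f_star = -0.5 * b @ np.linalg.solve(A, b)

mu = np.linalg.eigvalsh(A)[0]        # PL constant in the Euclidean norm
L_max = np.diag(A).max()
rate = 1.0 - mu / (A.shape[0] * L_max)

gaps = {r: np.array(coordinate_descent(A, b, r, 200)) - f_star
        for r in ("gs", "gsl", "random")}
```

For the greedy rules, `gaps[r][k]` stays below `rate**k * gaps[r][0]`. Measuring PL in the norm matched to the selection rule typically gives a tighter constant than this Euclidean bound.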

4. Implementation Strategies and Practical Acceleration

Practical considerations strongly impact algorithmic performance (Nutini et al., 2017):

Estimating Block-Lipschitz Constants: By spectral computation or line-search; efficient line-search loops exploit sparsity and caching for repeated $f$-evaluations.
Block Partitioning: Grouping by quantiles of coordinate-wise Lipschitz constants (“Sort” partitioning) can enhance progress relative to uniform or random partitionings.
Variable-Block Selection at Single-Coordinate Cost: Selecting the top-$\tau$ coordinates by score $|\nabla_i f(x)|/\sqrt{L_i}$, with $L_i$ the coordinate-wise Lipschitz constant, enables fast per-iteration selection even for large $\tau$.
Message-Passing for Block Solves: For forest-structured blocks, structured Gaussian elimination or belief propagation yields $O(|b|)$ Newton updates in the block size $|b|$, dramatically reducing cost for sparse graphs.
Proximal Newton vs Two-Metric Projection: Two-metric projections (TMP) implement near-Newton behavior and active manifold identification at a fraction of the full Newton cost.
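Two of the strategies above, quantile-based (“Sort”) partitioning and top-scoring variable-block selection, can be sketched in a few lines. The function names, the block size `tau`, and the score $|\nabla_i f(x)|/\sqrt{L_i}$ are assumptions made for illustration; the exact variants in the cited paper may differ.

```python
import numpy as np

def sort_partition(L, num_blocks):
    """Fixed blocks that group coordinates with similar Lipschitz constants.

    Sorts the coordinates by L_i and cuts the sorted order into contiguous
    chunks, so each block covers one quantile range of the constants.
    """
    order = np.argsort(L)
    return np.array_split(order, num_blocks)

def top_tau_block(grad, L, tau):
    """Variable block: the tau coordinates with largest |grad_i| / sqrt(L_i).

    np.argpartition selects them in linear time, without a full sort.
    The returned indices are not ordered by score.
    """
    scores = np.abs(grad) / np.sqrt(L)
    return np.argpartition(scores, -tau)[-tau:]

L = np.array([5.0, 1.0, 3.0, 2.0])
fixed_blocks = sort_partition(L, 2)

grad = np.array([1.0, -6.0, 6.0, 0.5])
L_sel = np.array([1.0, 9.0, 4.0, 1.0])   # scores: 1.0, 2.0, 3.0, 0.5
block = top_tau_block(grad, L_sel, 2)
```

Because selection is a partial sort over per-coordinate scores, choosing a large variable block costs about the same as choosing a single greedy coordinate.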

5. Numerical Evidence and Empirical Results

Extensive experiments validate these theoretical and algorithmic advances (Nutini et al., 2017):

Problem Class	Greedy (GS/GSL) vs. Random/Cyclic	Effect of Block Model	Block Update Method	Sparse Graphs/Active-Set
Least Squares, Lasso	Greedy $>$ random $>$ cyclic	Variable $>$ fixed (greedy); reverse under random	Newton $>$ matrix $>$ gradient	Forest-structured blocks yield 10-100$\times$ speedup
$\ell_1$-regularized regression	-	-	Proximal-Newton/TMP outperform proximal-gradient	Early active-set identification and, with sufficient block size, finite termination

Empirically, BCD with large, variable, greedily chosen blocks (preferably with GSL/GSD rules and matrix/second-order updates) achieves the fastest decrease in objective, especially for ill-conditioned or structured problems.

6. Guidelines for Method Selection and Deployment

Nutini et al. (2017) suggest the following guidelines:

High-cost-per-iteration justified: If the incremental cost of a block update is comparable to the cost of a gradient evaluation, use large blocks (large $|b|$), greedy selection (GSL/GSD), and matrix or Newton updates with line-search.
Block selection: Use variable blocks with greedy selection, and fixed blocks under random selection.
Graph-structured (sparse) problems: Exploit graph partitioning for induced forests to enable $O(|b|)$ Newton updates.
Active set regime ($\ell_1$-type problems): After identification, switch to (proximal) Newton or TMP for local superlinear or finite convergence.
Implementation: Invest in efficient block selection/subproblem solves and adapt block-partitioning/selection rules to the problem’s structural properties; estimate block Lipschitz constants adaptively.

7. Extensions and Context within Optimization Literature

Proximal BCD generalizes to settings with nonconvexity, arbitrary block partitioning (fixed or variable), and a variety of block-selection and block-update rules, including message-passing and two-metric projection variants (Nutini et al., 2017). Its framework encapsulates and accelerates many special-case algorithms, e.g., group Lasso, SVM, sparse logistic regression, network flow on graphs, and structured SVMs. The integration of greedy rules, advanced block partitioning, and active-set mechanisms pushes performance and theoretical guarantees beyond classical cyclic or random coordinate methods, establishing proximal BCD as a key component in modern large-scale optimization.

References:

"Let's Make Block Coordinate Descent Converge Faster: Faster Greedy Rules, Message-Passing, Active-Set Complexity, and Superlinear Convergence" (Nutini et al., 2017)
