
Proximal Block-Coordinate Descent

Updated 16 June 2026
  • Proximal BCD methods decompose optimization problems into a smooth component and block-separable nonsmooth regularizers, enabling robust handling of constraints.
  • They utilize adaptive block selection strategies, such as Gauss–Southwell rules, and proximal operators to accelerate convergence and reduce per-iteration cost.
  • Empirical results demonstrate that greedy variable block updates combined with Newton and proximal steps significantly speed up large-scale, structured optimization tasks.

Proximal Block-Coordinate Descent (BCD) methods are a class of optimization algorithms that target problems with composite structure, where the objective decomposes as the sum of a smooth component and a block-separable, possibly nonsmooth, regularizer. BCD methods offer cheap iterations, low memory requirements, and a natural fit for parallelization and for exploiting problem structure. The proximal variant handles nonsmoothness and constraints within each block update, which broadens the domain of applicability in machine learning, signal processing, large-scale statistics, and computational mathematics (Nutini et al., 2017).

1. Problem Classes and Structural Decomposition

Proximal BCD addresses structured problems of the form

$$\min_{x \in \mathbb{R}^n} F(x) = f(x) + g(x),$$

where $f$ is a differentiable (possibly nonconvex) function with Lipschitz-continuous gradient and $g$ is a block-separable, possibly nonsmooth function. Typical examples include $\ell_1$-regularized regression ($g(x)=\lambda \|x\|_1$), indicator constraints for nonnegativity ($g(x) = I_{x \geq 0}$), and graph-structured objectives ($f(x) = \sum_{(i,j)\in E} f_{ij}(x_i, x_j)$) (Nutini et al., 2017).

The decision variable $x$ is partitioned as $x=(x_1, \dots, x_B)$, enabling both fixed (predefined groups) and variable (e.g., greedy, random) notions of blocks. Exploiting this block structure lets BCD operate at scales and over sparsity patterns that are infeasible for full-gradient methods.

2. Block Partitioning, Selection, and Update Rules

The efficacy of Proximal BCD is governed by three main algorithmic ingredients (Nutini et al., 2017):

  • Block Partitioning: Fixed blocks (pre-specified index groups, e.g., in group-Lasso) versus variable blocks (dynamic, often based on magnitude of gradients). Larger blocks may speed up per-iteration progress but at higher per-iteration cost; variable blocks facilitate adaptive, greedy regimes.
  • Block Selection (Greedy and Adaptive Rules):
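    • Gauss–Southwell (GS): pick the block $b$ maximizing $\|\nabla_b f(x)\|_2$.
    • Gauss–Southwell–Lipschitz (GSL): maximize $\|\nabla_b f(x)\|_2^2 / L_b$, where $L_b$ is the Lipschitz constant of block $b$; this maximizes the guaranteed per-iteration decrease $f(x^{k+1}) \le f(x^k) - \frac{1}{2L_b}\|\nabla_b f(x^k)\|_2^2$.
    • Gauss–Southwell–Quadratic (GSQ): for block-wise quadratic $f$, maximize the quadratic improvement using Hessian surrogates.
    • Gauss–Southwell–Diagonal (GSD) or “top-$\tau$” rules (pick the $\tau$ highest-scoring coordinates) for efficient yet informative selection.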
  • Block Update (Proximal and Matrix-structured):

    • Smooth blocks: Gradient step or Newton/matrix-preconditioned step, supporting adaptive step-size and line-search.
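    • Nonsmooth/Composite blocks: Proximal operator for $g_b$ with step size $\alpha > 0$, i.e.,

    $$\operatorname{prox}_{\alpha g_b}(v) = \arg\min_{u} \; \tfrac{1}{2\alpha}\|u - v\|_2^2 + g_b(u), \qquad x_b^{k+1} = \operatorname{prox}_{\alpha g_b}\bigl(x_b^k - \alpha \nabla_b f(x^k)\bigr).$$

    For $g$ as the $\ell_1$-norm, the proximal map reduces to soft-thresholding.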
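The selection and update ingredients above can be combined in a few lines. The following is a minimal sketch, not the reference implementation from (Nutini et al., 2017): it assumes a Lasso objective $\tfrac12\|Ax-y\|_2^2 + \lambda\|x\|_1$, fixed blocks, and a step size of $1/L_b$ per block. Because the plain gradient norm of $f$ does not vanish at a Lasso solution, this sketch ranks blocks by the length of their candidate proximal step instead, which is a swapped-in proximal variant of the Gauss–Southwell rule. The names `soft_threshold` and `prox_bcd_lasso` are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal map of t * ||.||_1: shrink every entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_bcd_lasso(A, y, lam, blocks, n_iters=500, tol=1e-12):
    """Greedy proximal BCD for 0.5*||Ax - y||^2 + lam*||x||_1.

    blocks: list of integer index arrays forming a fixed partition (assumption).
    """
    x = np.zeros(A.shape[1])
    r = A @ x - y  # residual, updated incrementally after each block step
    # Block Lipschitz constants: squared spectral norm of each column block.
    L = [np.linalg.norm(A[:, b], 2) ** 2 for b in blocks]
    for _ in range(n_iters):
        grad = A.T @ r
        # Candidate proximal-gradient update for every block.
        cand = [soft_threshold(x[b] - grad[b] / Lb, lam / Lb)
                for b, Lb in zip(blocks, L)]
        # Greedy choice: the block whose proximal step is longest.
        steps = [np.linalg.norm(c - x[b]) for c, b in zip(cand, blocks)]
        k = int(np.argmax(steps))
        if steps[k] < tol:  # no block can move: stationary point reached
            break
        b = blocks[k]
        r += A[:, b] @ (cand[k] - x[b])
        x[b] = cand[k]
    return x
```

A call such as `prox_bcd_lasso(A, y, 0.1, [np.arange(0, 3), np.arange(3, 6)])` updates one three-coordinate block per iteration. Recomputing the full gradient for selection keeps the sketch short; a practical implementation would track scores incrementally.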

For graph-structured or sparsity-enforced problems, message-passing (e.g., Gaussian belief propagation) yields efficient linear-time, $O(|b|)$, block updates by exploiting tree/forest substructure (Nutini et al., 2017).
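To illustrate why tree structure makes block solves cheap, here is a sketch for the simplest forest, a chain. It assumes the block Hessian is symmetric, positive definite, and tridiagonal, so a Newton-type system $H d = \text{rhs}$ can be solved with one forward elimination pass and one backward substitution pass, each touching every node once. The function name `solve_chain` is hypothetical, and general forests need a leaf-to-root ordering that is not shown here.

```python
import numpy as np

def solve_chain(diag, off, rhs):
    """Solve H d = rhs in O(n) for a symmetric tridiagonal H (chain graph).

    diag: the n main-diagonal entries of H.
    off:  the n-1 off-diagonal entries, off[i] = H[i, i+1] = H[i+1, i].
    Assumes H is positive definite, so no pivoting is needed.
    """
    n = len(diag)
    d = np.array(diag, dtype=float)
    b = np.array(rhs, dtype=float)
    # Forward pass: eliminate node i-1 from node i (message toward the root).
    for i in range(1, n):
        w = off[i - 1] / d[i - 1]
        d[i] -= w * off[i - 1]
        b[i] -= w * b[i - 1]
    # Backward pass: substitute from the root back toward the first node.
    x = np.empty(n)
    x[-1] = b[-1] / d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (b[i] - off[i] * x[i + 1]) / d[i]
    return x
```

For a block of size $|b|$ this costs $O(|b|)$ operations, compared with $O(|b|^3)$ for a dense factorization of the same system.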

3. Convergence Rates and Theoretical Guarantees

Convergence analysis reveals global and local phenomena (Nutini et al., 2017):

  • Under the Polyak–Łojasiewicz (PL) condition: For $f$ with optimal value $f^*$ satisfying, for some $\mu > 0$,

$$\tfrac{1}{2}\|\nabla f(x)\|^2 \ge \mu\,\bigl(f(x) - f^*\bigr) \quad \text{for all } x,$$

any per-iteration improvement of $f(x^{k+1}) \le f(x^k) - \tfrac{1}{2L}\|\nabla f(x^k)\|^2$ yields linear convergence: $f(x^k) - f^* \le (1 - \mu/L)^k\,\bigl(f(x^0) - f^*\bigr)$.

  • Block Greedy (GS, GSL, GSD, GSQ) Updates: These ensure progress of this form in a block-defined mixed norm and guarantee the above linear rates, while for general nonconvex $f$, sublinear rates $O(1/k)$ (gradient norm decrease) apply.
  • Active-set identification and superlinear convergence: For composite objectives where the nonsmooth term $g$ is separable, the active set $\mathcal{Z} = \{i : g_i \text{ is nonsmooth at } x_i^*\}$ is identified in finitely many iterations. After manifold identification, local convergence can be superlinear, and for certain structures (piecewise-quadratic $f$ and polyhedral $g$), finite termination is possible (Nutini et al., 2017).
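The linear rate under the PL condition follows from a two-line argument. The sketch below assumes a progress bound with constant $L$ and a PL constant $\mu \le L$, both measured in the same norm, and writes $f^*$ for the optimal value. These symbols are chosen here for illustration and need not match the notation of (Nutini et al., 2017).

```latex
\begin{align*}
f(x^{k+1}) - f^*
  &\le f(x^k) - f^* - \frac{1}{2L}\,\|\nabla f(x^k)\|^2
     && \text{(per-iteration progress)} \\
  &\le f(x^k) - f^* - \frac{\mu}{L}\,\bigl(f(x^k) - f^*\bigr)
     && \text{(PL: } \tfrac12\|\nabla f(x^k)\|^2 \ge \mu\,(f(x^k) - f^*)\text{)} \\
  &= \Bigl(1 - \frac{\mu}{L}\Bigr)\bigl(f(x^k) - f^*\bigr).
\end{align*}
```

Applying the inequality $k$ times gives $f(x^k) - f^* \le (1 - \mu/L)^k\,(f(x^0) - f^*)$. Convexity is never used, which is why the argument also covers nonconvex $f$ that satisfy PL.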

4. Implementation Strategies and Practical Acceleration

Practical considerations strongly impact algorithmic performance (Nutini et al., 2017):

  • Estimating Block-Lipschitz Constants: By spectral computation or line-search; efficient line-search loops exploit sparsity and caching for repeated $f$-evaluations.
  • Block Partitioning: Grouping by quantiles of coordinate-wise Lipschitz constants (“Sort” partitioning) can enhance progress relative to uniform or random partitionings.
  • Variable-Block Selection at Single-Coordinate Cost: Selecting the top-$\tau$ coordinates ranked by $|\nabla_i f(x)|$ enables fast per-iteration selection even for large block sizes $\tau$.
  • Message-Passing for Block Solves: For forest-structured blocks, structured Gaussian elimination or belief propagation yields $O(|b|)$ Newton updates, dramatically reducing cost for sparse graphs.
  • Proximal Newton vs Two-Metric Projection: Two-metric projections (TMP) implement near-Newton behavior and active manifold identification at a fraction of the full Newton cost.
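Three of the strategies above are short enough to sketch directly. The code below is illustrative only and all function names are hypothetical. It assumes a doubling backtracking rule for the block Lipschitz estimate, equal-size groups of coordinates sorted by Lipschitz constant for "Sort" partitioning, and ranking by plain gradient magnitude for top-$\tau$ selection.

```python
import numpy as np

def estimate_block_lipschitz(f, grad_b, x, b, L0=1.0, max_doublings=60):
    """Backtracking estimate of L_b: double L until a 1/L gradient step on
    block b satisfies f(x_new) <= f(x) - ||grad_b||^2 / (2 L)."""
    L, fx, sq = L0, f(x), float(grad_b @ grad_b)
    for _ in range(max_doublings):
        trial = x.copy()
        trial[b] -= grad_b / L
        if f(trial) <= fx - sq / (2.0 * L):
            break
        L *= 2.0
    return L

def sort_partition(L_coord, n_blocks):
    """Group coordinates with similar Lipschitz constants into the same block."""
    order = np.argsort(L_coord)
    return np.array_split(order, n_blocks)

def top_tau(grad, tau):
    """Indices of the tau largest |grad_i|, found in O(n) without a full sort."""
    return np.argpartition(-np.abs(grad), tau - 1)[:tau]
```

`np.argpartition` returns the selected indices in no particular order, which is sufficient for forming a variable block. The doubling rule can overestimate the true constant by up to a factor of two when started below it.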

5. Numerical Evidence and Empirical Results

Extensive experiments validate these theoretical and algorithmic advances (Nutini et al., 2017):

| Problem Class | Greedy (GS/GSL) vs. Random/Cyclic | Effect of Block Model | Block Update Method | Sparse Graphs/Active-Set |
|---|---|---|---|---|
| Least Squares, Lasso | Greedy $>$ random $>$ cyclic | Variable $>$ fixed (greedy); reverse under random | Newton $>$ matrix $>$ gradient | Forest-structured blocks yield 10-100$\times$ speedup |
| $\ell_1$ regression | - | - | Proximal-Newton/TMP outperform proximal-gradient | Early active set identification and, with sufficient block size, finite termination |

Empirically, BCD with large, variable, greedily chosen blocks (preferably with GSL/GSD rules and matrix/second-order updates) achieves the fastest decrease in objective, especially for ill-conditioned or structured problems.

6. Guidelines for Method Selection and Deployment

From the synthesis in (Nutini et al., 2017):

  • High-cost-per-iteration justified: If the incremental cost of block update is comparable to gradient-cost, use a large block size $\tau$, greedy selection (GSL/GSD), and matrix or Newton updates with line-search.
  • Block selection: Use variable blocks with greedy selection, fixed blocks under random selection.
  • Graph-structured (sparse) problems: Exploit graph partitioning for induced forests to enable $O(|b|)$ Newton updates.
  • Active set regime ($\ell_1$-type problems): After identification, switch to (proximal) Newton or TMP for local superlinear or finite convergence.
  • Implementation: Invest in efficient block selection/subproblem solves and adapt block-partitioning/selection rules to the problem’s structural properties; estimate block Lipschitz constants adaptively.

7. Extensions and Context within Optimization Literature

Proximal BCD generalizes to settings with nonconvexity, arbitrary block partitioning (fixed or variable), and a diversity of block-selection rules, including message-passing and two-metric projection variants (Nutini et al., 2017). Its framework encapsulates and accelerates many special-case algorithms, e.g., group Lasso, SVM, sparse logistic regression, network flow on graphs, and structured SVMs. The integration of greedy rules, advanced block partitioning, and active-set mechanisms pushes performance and theoretical guarantees beyond classical cyclic or random coordinate methods, establishing proximal BCD as a key component in modern large-scale optimization.


References:

  • "Let's Make Block Coordinate Descent Converge Faster: Faster Greedy Rules, Message-Passing, Active-Set Complexity, and Superlinear Convergence" (Nutini et al., 2017)