Proximal Block-Coordinate Descent
- Proximal BCD methods decompose optimization problems into a smooth component and block-separable nonsmooth regularizers, enabling robust handling of constraints.
- They utilize adaptive block selection strategies, such as Gauss–Southwell rules, and proximal operators to accelerate convergence and reduce per-iteration cost.
- Empirical results demonstrate that greedy variable block updates combined with Newton and proximal steps significantly speed up large-scale, structured optimization tasks.
Proximal Block-Coordinate Descent (BCD) methods are a class of optimization algorithms that target problems with composite structure, where the objective decomposes as the sum of a smooth component and a block-separable, possibly nonsmooth, regularizer. BCD methods offer cheap iterations and low memory requirements, and they lend themselves naturally to parallelization and to exploiting problem structure. The proximal variant handles nonsmoothness and constraints within each block update, significantly broadening the domain of applicability in machine learning, signal processing, large-scale statistics, and computational mathematics (Nutini et al., 2017).
1. Problem Classes and Structural Decomposition
Proximal BCD addresses structured problems of the form
$$\min_{x \in \mathbb{R}^n} F(x) = f(x) + \sum_{b} g_b(x_b),$$
where $f$ is a differentiable (possibly nonconvex) function with Lipschitz-continuous gradient and $g = \sum_b g_b$ is a block-separable, possibly nonsmooth function. Typical examples include $\ell_1$-regularized regression ($g(x) = \lambda \|x\|_1$), indicator constraints for nonnegativity ($g$ the indicator function of $\{x : x \ge 0\}$), and graph-structured objectives (Nutini et al., 2017).
The decision variable $x$ is partitioned into blocks $x_b$, enabling both fixed (predefined groups) and variable (e.g., greedy, random) notions of blocks. Exploiting this block structure lets BCD operate at scales and over sparsity patterns that are infeasible for full-gradient methods.
2. Block Partitioning, Selection, and Update Rules
The efficacy of Proximal BCD is governed by three main algorithmic ingredients (Nutini et al., 2017):
- Block Partitioning: Fixed blocks (pre-specified index groups, e.g., in group-Lasso) versus variable blocks (dynamic, often based on magnitude of gradients). Larger blocks may speed up per-iteration progress but at higher per-iteration cost; variable blocks facilitate adaptive, greedy regimes.
- Block Selection (Greedy and Adaptive Rules):
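- Gauss–Southwell (GS): pick the block $b$ maximizing $\|\nabla_b f(x)\|$.
- Gauss–Southwell–Lipschitz (GSL): maximize $\|\nabla_b f(x)\|^2 / L_b$, where $L_b$ is the block Lipschitz constant; this maximizes the guaranteed per-iteration decrease in the bound $f(x^{k+1}) \le f(x^k) - \frac{1}{2 L_b} \|\nabla_b f(x^k)\|^2$.
- Gauss–Southwell–Quadratic (GSQ): for $f$ with block-wise quadratic upper bounds, maximize the guaranteed quadratic improvement using Hessian surrogates.
- Gauss–Southwell–Diagonal (GSD) or top-$\tau$ rules for efficient yet informative selection.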
- Block Update (Proximal and Matrix-structured):
- Smooth blocks: Gradient step or Newton/matrix-preconditioned step, supporting adaptive step-size and line-search.
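- Nonsmooth/Composite blocks: Proximal operator for $g_b$, i.e.,
$$x_b^{k+1} = \operatorname{prox}_{\alpha g_b}\big(x_b^k - \alpha \nabla_b f(x^k)\big), \qquad \operatorname{prox}_{\alpha g_b}(y) = \arg\min_{z} \tfrac{1}{2} \|z - y\|^2 + \alpha\, g_b(z).$$
For $g_b$ as the $\ell_1$-norm, the proximal map reduces to soft-thresholding.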
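As a minimal sketch of the proximal block update for an $\ell_1$ regularizer, the snippet below applies soft-thresholding to one block after a gradient step. The function names (`soft_threshold`, `prox_block_step`), the toy data, and the choice of a unit step size are illustrative assumptions, not taken from Nutini et al. (2017).

```python
import numpy as np

def soft_threshold(y, t):
    """Proximal map of t * ||.||_1: shrink each entry of y toward zero by t."""
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

def prox_block_step(x, grad, block, step, lam):
    """One proximal-gradient update on the coordinates listed in `block`.

    Assumes g_b(x_b) = lam * ||x_b||_1; `step` plays the role of 1 / L_b.
    Coordinates outside the block are left unchanged.
    """
    x_new = x.copy()
    x_new[block] = soft_threshold(x[block] - step * grad[block], step * lam)
    return x_new

# Toy lasso instance (hypothetical data): f(x) = 0.5 * ||A x - y||^2.
A = np.eye(3)
y = np.array([3.0, 0.5, -2.0])
x = np.zeros(3)
grad = A.T @ (A @ x - y)          # gradient of the smooth part at x

# Update only the block {0, 1}; coordinate 2 is untouched.
x_new = prox_block_step(x, grad, block=[0, 1], step=1.0, lam=1.0)
```

In this toy case the first coordinate moves to 2.0, while the second is thresholded exactly to zero, which is the mechanism behind sparse iterates.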
For graph-structured or sparsity-enforced problems, message-passing (e.g., Gaussian belief propagation) yields efficient linear-time, $O(|b|)$, block updates by exploiting tree/forest substructure (Nutini et al., 2017).
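To illustrate why tree structure makes block solves cheap, consider the simplest tree, a chain, for which the block Hessian is tridiagonal. The sketch below is ordinary tridiagonal Gaussian elimination (the Thomas algorithm), shown here as a special case only; it is not the general forest message-passing scheme of the paper, and `solve_chain` and the sample matrix are hypothetical.

```python
import numpy as np

def solve_chain(diag, off, rhs):
    """Solve H d = rhs in O(n) for a symmetric tridiagonal H.

    diag: length-n main diagonal; off: length n-1 off-diagonal.
    Forward elimination followed by back substitution, without pivoting,
    so this assumes H is positive definite (or diagonally dominant).
    """
    n = len(diag)
    d = np.array(diag, dtype=float)
    r = np.array(rhs, dtype=float)
    for i in range(1, n):                 # eliminate the sub-diagonal
        w = off[i - 1] / d[i - 1]
        d[i] -= w * off[i - 1]
        r[i] -= w * r[i - 1]
    sol = np.empty(n)
    sol[-1] = r[-1] / d[-1]
    for i in range(n - 2, -1, -1):        # back substitution
        sol[i] = (r[i] - off[i] * sol[i + 1]) / d[i]
    return sol

# Hypothetical chain-structured block Hessian and right-hand side.
diag = np.array([4.0, 4.0, 4.0, 4.0])
off = np.array([1.0, 1.0, 1.0])
rhs = np.array([1.0, 2.0, 3.0, 4.0])
d_chain = solve_chain(diag, off, rhs)

# Dense reference, cubic cost in general, for comparison only.
H = np.diag(diag) + np.diag(off, 1) + np.diag(off, -1)
d_dense = np.linalg.solve(H, rhs)
```

Each variable is visited a constant number of times, so the cost grows linearly with the block size rather than cubically as for a dense solve.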
3. Convergence Rates and Theoretical Guarantees
Convergence analysis reveals global and local phenomena (Nutini et al., 2017):
- Under the Polyak–Łojasiewicz (PL) condition: For $f$ satisfying
$$\tfrac{1}{2} \|\nabla f(x)\|^2 \ge \mu \big(f(x) - f^*\big),$$
any per-iteration improvement of the form $f(x^{k+1}) \le f(x^k) - \frac{1}{2L} \|\nabla f(x^k)\|^2$ yields linear convergence: $f(x^k) - f^* \le (1 - \mu/L)^k \big(f(x^0) - f^*\big)$.
- Block Greedy (GS, GSL, GSD, GSQ) Updates: These ensure the mixed-norm progress and guarantee the above linear rates, while for general nonconvex $f$, sublinear $O(1/k)$ rates (in terms of gradient-norm decrease) apply.
- Active-set identification and superlinear convergence: For composite objectives where the nonsmooth term $g$ is separable, the active set $\{i : g_i \text{ is nonsmooth at } x_i^*\}$ is identified in a finite number of iterations. After manifold identification, local convergence can be superlinear, and for certain structures (piecewise-quadratic $f$ and polyhedral $g$), finite termination is possible (Nutini et al., 2017).
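The linear rate under the PL condition can be checked numerically on a small example. The sketch below runs single-coordinate Gauss–Southwell descent with exact coordinate minimization on a randomly generated strongly convex quadratic (an assumption made for illustration) and compares the optimality gap with the bound $(1 - \mu/(L n))^k$, where $L$ is the largest coordinate-wise Lipschitz constant and the factor $n$ comes from $\|\nabla f\|_\infty^2 \ge \|\nabla f\|_2^2 / n$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)        # symmetric positive definite
b = rng.standard_normal(n)

def f(x):
    return 0.5 * x @ A @ x - b @ x

x_star = np.linalg.solve(A, b)
f_star = f(x_star)
mu = np.linalg.eigvalsh(A)[0]      # strong convexity (hence PL) constant
L = A.diagonal().max()             # largest coordinate-wise Lipschitz constant
rate = 1.0 - mu / (L * n)          # guaranteed contraction per iteration

x = np.zeros(n)
gaps = [f(x) - f_star]
for k in range(200):
    g = A @ x - b
    i = int(np.argmax(np.abs(g)))  # Gauss-Southwell coordinate
    x[i] -= g[i] / A[i, i]         # exact minimization along coordinate i
    gaps.append(f(x) - f_star)

# Theoretical upper bound on the gap after k iterations.
bounds = [gaps[0] * rate**k for k in range(len(gaps))]
```

The observed gaps typically fall well below the bound, since the worst-case analysis discards most of the progress made by the greedy choice.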
4. Implementation Strategies and Practical Acceleration
Practical considerations strongly impact algorithmic performance (Nutini et al., 2017):
- Estimating Block-Lipschitz Constants: By spectral computation or line-search; efficient line-search loops exploit sparsity and caching for repeated $f$-evaluations.
- Block Partitioning: Grouping by quantiles of coordinate-wise Lipschitz constants (“Sort” partitioning) can enhance progress relative to uniform or random partitionings.
- Variable-Block Selection at Single-Coordinate Cost: Selecting the top-$\tau$ coordinates by gradient magnitude keeps per-iteration selection fast even for large $\tau$.
- Message-Passing for Block Solves: For forest-structured blocks, structured Gaussian elimination or belief propagation yields linear-time, $O(|b|)$, Newton updates, dramatically reducing cost for sparse graphs.
- Proximal Newton vs Two-Metric Projection: Two-metric projections (TMP) implement near-Newton behavior and active manifold identification at a fraction of the full Newton cost.
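The top-$\tau$ variable-block selection mentioned in the list above can be sketched with a partial sort. The helper `top_tau_block` and the optional scaling by $\sqrt{L_i}$ are illustrative assumptions; the exact scoring used in the paper's experiments may differ.

```python
import numpy as np

def top_tau_block(grad, tau, lipschitz=None):
    """Indices of the tau coordinates with the largest greedy score.

    The score is |grad_i| (Gauss-Southwell style) or, if coordinate-wise
    Lipschitz constants are supplied, |grad_i| / sqrt(L_i).
    Requires 1 <= tau <= len(grad).
    """
    score = np.abs(grad)
    if lipschitz is not None:
        score = score / np.sqrt(lipschitz)
    # argpartition moves the tau largest scores into the last tau slots
    # in O(n) average time, without fully sorting the array.
    idx = np.argpartition(score, -tau)[-tau:]
    return np.sort(idx)

grad = np.array([0.1, -3.0, 2.0, 0.5, -1.0])
plain = top_tau_block(grad, tau=2)
scaled = top_tau_block(grad, tau=2,
                       lipschitz=np.array([1.0, 100.0, 1.0, 1.0, 1.0]))
```

With the Lipschitz scaling, the coordinate with the largest raw gradient is skipped because its large curvature limits the achievable decrease.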
5. Numerical Evidence and Empirical Results
Extensive experiments validate these theoretical and algorithmic advances (Nutini et al., 2017):
| Problem Class | Greedy (GS/GSL) vs. Random/Cyclic | Effect of Block Model | Block Update Method | Sparse Graphs/Active-Set |
|---|---|---|---|---|
| Least Squares, Lasso | Greedy outperforms random and cyclic | Variable blocks outperform fixed blocks under greedy selection; the reverse holds under random selection | Newton and matrix updates outperform gradient updates | Forest-structured blocks yield 10-100× speedup |
| $\ell_1$-regularized regression | | | Proximal-Newton/TMP outperform proximal-gradient | Early active-set identification and, with sufficient block size, finite termination |
Empirically, BCD with large, variable, greedily chosen blocks (preferably with GSL/GSD rules and matrix/second-order updates) achieves the fastest decrease in objective, especially for ill-conditioned or structured problems.
6. Guidelines for Method Selection and Deployment
The following guidelines are synthesized from (Nutini et al., 2017):
- When a high per-iteration cost is justified: If the incremental cost of a block update is comparable to the cost of a gradient evaluation, use a large block size $\tau$, greedy selection (GSL/GSD), and matrix or Newton updates with line-search.
- Block selection: Use variable blocks with greedy selection, and fixed blocks under random selection.
- Graph-structured (sparse) problems: Exploit graph partitioning into induced forests to enable linear-time, $O(|b|)$, Newton updates.
- Active-set regime ($\ell_1$-type problems): After identification, switch to (proximal) Newton or TMP for local superlinear or finite convergence.
- Implementation: Invest in efficient block selection/subproblem solves and adapt block-partitioning/selection rules to the problem’s structural properties; estimate block Lipschitz constants adaptively.
7. Extensions and Context within Optimization Literature
Proximal BCD generalizes to settings with nonconvexity, arbitrary block partitioning (fixed or variable), and a diversity of block-selection rules, including message-passing and two-metric projection variants (Nutini et al., 2017). Its framework encapsulates and accelerates many special-case algorithms, e.g., group Lasso, SVM, sparse logistic regression, network flow on graphs, and structured SVMs. The integration of greedy rules, advanced block partitioning, and active-set mechanisms pushes performance and theoretical guarantees beyond classical cyclic or random coordinate methods, establishing proximal BCD as a key component in modern large-scale optimization.
References:
- "Let's Make Block Coordinate Descent Converge Faster: Faster Greedy Rules, Message-Passing, Active-Set Complexity, and Superlinear Convergence" (Nutini et al., 2017)