Coordinate-Wise Descent Algorithms
- Coordinate-wise descent algorithms are iterative methods that update individual variables or blocks sequentially, making them effective for high-dimensional optimization.
- They use a range of selection rules—cyclic, randomized, and greedy—to balance convergence speed with computational effort.
- Recent advancements extend these methods to nonconvex, online, and distributed settings, often incorporating adaptive and importance sampling techniques.
Coordinate-wise descent algorithms are iterative optimization methods that minimize an objective function by updating one variable or one block of variables at a time, keeping the others fixed. These methods have seen renewed interest due to their scalability for high-dimensional problems and their ability to exploit problem structure for efficient computation. The landscape of coordinate-wise descent encompasses deterministic and randomized variants, primal and primal-dual frameworks, selection rules (cyclic, random, greedy), and recent extensions to nonconvex, online, distributed, and manifold settings.
1. Foundational Principles and Algorithmic Structure
At each iteration, a coordinate-wise descent algorithm selects a coordinate (or block) and solves an update subproblem along that direction. For a convex differentiable objective $f:\mathbb{R}^n \to \mathbb{R}$, the prototypical gradient step for coordinate $i$ is

$$x_i^{k+1} = x_i^k - \alpha_k \, \nabla_i f(x^k),$$

with the other coordinates unchanged. More generally, for composite objectives $F(x) = f(x) + \sum_{i=1}^n h_i(x_i)$ with smooth $f$ and separable (possibly nonsmooth) $h_i$, or block forms, the update may use exact minimization along coordinate $i$, a coordinate-wise proximal mapping, or a surrogate quadratic approximation (Wright, 2015, Shi et al., 2016).
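As a concrete illustration, the gradient step above can be sketched on a small quadratic, where the coordinate step with stepsize $1/L_i$ coincides with exact minimization along coordinate $i$ (the problem instance below is illustrative, not from the cited works):

```python
import numpy as np

# Minimal sketch of the coordinate gradient step on a quadratic
# f(x) = 0.5 x^T A x - b^T x, whose i-th partial derivative is (A x - b)_i
# and whose coordinate-wise Lipschitz constants are L_i = A_ii.
rng = np.random.default_rng(0)
n = 5
R = rng.standard_normal((n, n))
A = R.T @ R / n + np.eye(n)      # symmetric positive definite, well conditioned
b = rng.standard_normal(n)
L = np.diag(A)                   # coordinate-wise Lipschitz constants L_i = A_ii

x = np.zeros(n)
for k in range(1000):
    i = k % n                    # cyclic selection for this sketch
    g_i = A[i] @ x - b[i]        # partial derivative along coordinate i
    x[i] -= g_i / L[i]           # gradient step with stepsize 1/L_i

x_star = np.linalg.solve(A, b)
print(np.linalg.norm(x - x_star))
```

For a quadratic, each such step solves the one-dimensional subproblem exactly, so this sketch is also an instance of exact coordinate minimization (Gauss–Seidel).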
The coordinate-selection rule critically affects convergence and efficiency:
- Cyclic: visit each coordinate in a fixed or permuted cyclic order;
- Randomized: select coordinates independently from a fixed or adaptively updated probability distribution;
- Greedy (Gauss–Southwell): pick the coordinate with maximum partial derivative magnitude or maximal expected objective decrease.
Coordinate updates can be generalized to blocks—referred to as block coordinate descent (BCD)—with similar algorithmic templates and analysis (Wright, 2015, Peng et al., 2016).
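The three selection rules can be compared side by side on a small quadratic; this sketch (problem instance and function names are illustrative) uses exact coordinate minimization, which for a quadratic equals the $1/L_i$ gradient step:

```python
import numpy as np

def cd(A, b, rule, iters=600, seed=0):
    """Coordinate descent on f(x) = 0.5 x^T A x - b^T x under a selection rule."""
    rng = np.random.default_rng(seed)
    n = len(b)
    x = np.zeros(n)
    for k in range(iters):
        g = A @ x - b                      # full gradient (affordable at this size)
        if rule == "cyclic":
            i = k % n                      # fixed cyclic order
        elif rule == "random":
            i = int(rng.integers(n))       # uniform random sampling
        else:                              # "greedy": Gauss-Southwell rule,
            i = int(np.argmax(np.abs(g)))  # largest partial-derivative magnitude
        x[i] -= g[i] / A[i, i]             # exact minimization along coordinate i
    return x

rng = np.random.default_rng(1)
n = 6
R = rng.standard_normal((n, n))
A = R.T @ R / n + np.eye(n)
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)

errs = {rule: np.linalg.norm(cd(A, b, rule) - x_star)
        for rule in ("cyclic", "random", "greedy")}
print(errs)
```

Note that the greedy rule needs the full gradient to pick a coordinate, which illustrates its extra per-iteration cost on problems where partial derivatives are not cheaply maintained.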
2. Convergence Theory and Worst-case Guarantees
For convex objectives with coordinate-wise Lipschitz gradients (constants $L_i$, $L_{\max} = \max_i L_i$), randomized coordinate descent (RCD) with uniform sampling and stepsizes $1/L_i$ generates iterates with expected sublinear convergence $\mathbb{E}[f(x^k)] - f^\star = O(n/k)$ and, under $\mu$-strong convexity, linear convergence at rate $1 - \mu/(nL_{\max})$ per iteration (Wright, 2015, Shi et al., 2016).
Cyclic methods have convergence rates of the same order but with potentially larger constants. Beck–Tetruashvili's analysis of cyclic coordinate descent on $n$-dimensional convex objectives gives

$$f(x^k) - f^\star \le \frac{4 L_{\max}\left(1 + n L^2 / L_{\min}^2\right) R^2(x^0)}{k + 8/n}$$

for stepsizes $\alpha_i = 1/L_i$, where $L$ is the global Lipschitz constant and $R(x^0)$ bounds the distance from the initial sublevel set to the solution set (Wright, 2015, Kamri et al., 2022).
Recent advances use performance estimation problems (PEP), cast as semidefinite programs, to compute tight numerical worst-case bounds for cyclic coordinate descent (CCD) and alternating minimization (AM), dramatically improving previous analytical constants (by factors of up to 10 for CCD) (Kamri et al., 2022, Kamri et al., 22 Jul 2025). Notably, in the block setting the PEP yields exact worst-case bounds after a given number of cycles, with distinct tight constants for CCD and for AM (Kamri et al., 2022). Lower bounds show that deterministic cyclic methods can be slower than full gradient descent by a dimension-dependent factor in the worst case (Kamri et al., 22 Jul 2025).
Accelerated randomized coordinate descent achieves the accelerated $O(1/k^2)$ rate, i.e., $O(n/\sqrt{\varepsilon})$ iteration complexity in the convex case (Wright, 2015, Qu et al., 2014, Kamri et al., 22 Jul 2025), but deterministic cyclic accelerated analogues are provably inefficient in worst-case performance; randomness is essential for acceleration in practice (Kamri et al., 2022, Kamri et al., 22 Jul 2025).
3. Selection Rules, Sampling, and Adaptivity
Adaptive selection rules (ASR) and importance sampling schemes dynamically assign coordinate-update frequencies based on per-coordinate progress rates or local Lipschitz constants, yielding dramatic speedups and often near-optimal allocation of computational resources (Glasmachers et al., 2014, Qu et al., 2014). For instance, Nesterov's analysis suggests sampling coordinate $i$ with probability proportional to its Lipschitz constant, $p_i = L_i / \sum_j L_j$, to minimize iteration complexity (Qu et al., 2014, Glasmachers et al., 2014). Online adaptation mechanisms, e.g., ACF-CD, estimate optimal sampling frequencies at runtime, automatically adjusting to the changing landscape and yielding substantial practical improvements (Glasmachers et al., 2014).
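A minimal sketch of Lipschitz-based importance sampling, on a quadratic with heterogeneous coordinate curvatures (the problem instance and parameter choices are illustrative):

```python
import numpy as np

# Randomized CD with importance sampling p_i = L_i / sum_j L_j versus uniform
# sampling, on f(x) = 0.5 x^T A x - b^T x with heterogeneous curvatures L_i = A_ii.
rng = np.random.default_rng(2)
scales = np.array([1.0, 1.0, 1.0, 2.0, 2.0, 3.0])
n = len(scales)
R = rng.standard_normal((n, n))
M = R.T @ R / n + np.eye(n)
A = np.diag(scales) @ M @ np.diag(scales)   # A_ii = scales_i^2 * M_ii
b = rng.standard_normal(n)
L = np.diag(A)                              # coordinate-wise Lipschitz constants

def rcd(probs, iters=5000, seed=4):
    r = np.random.default_rng(seed)
    x = np.zeros(n)
    for _ in range(iters):
        i = int(r.choice(n, p=probs))       # sample a coordinate from probs
        x[i] -= (A[i] @ x - b[i]) / L[i]    # step 1/L_i along coordinate i
    return x

p_imp = L / L.sum()                         # Lipschitz importance sampling
x_imp = rcd(p_imp)
x_unif = rcd(np.full(n, 1.0 / n))           # uniform sampling baseline
x_star = np.linalg.solve(A, b)
print(np.linalg.norm(x_imp - x_star), np.linalg.norm(x_unif - x_star))
```

Both samplers converge here; the theory cited above predicts the importance-sampled variant has the better worst-case iteration complexity when the $L_i$ are strongly heterogeneous.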
Randomized block selection is robust across a vast range of settings (serial, parallel, distributed, or asynchronous) as long as probabilities are positive for all coordinates (Qu et al., 2014, Peng et al., 2016, Costantini et al., 26 Apr 2025). Arbitrary sampling and importance sampling allow coordinate updates to be tuned to the data distribution or computational architecture.
Greedy rules (Gauss–Southwell) maximize per-step reduction but incur additional cost, especially for nonseparable or large-block scenarios (Wright, 2015, Shi et al., 2016). In distributed and asynchronous settings (e.g., setwise CD in decentralized optimization), local greedy (GS) selection within accessible coordinate subsets affords iteration speedups proportional to set size compared to uniform sampling (Costantini et al., 26 Apr 2025).
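The setwise greedy idea can be sketched as follows: at each step only a subset of coordinates is accessible, and the Gauss–Southwell rule is applied within that subset. The subset structure and sizes below are illustrative, not taken from the cited work:

```python
import numpy as np

# Local greedy (Gauss-Southwell) selection within a randomly accessible subset
# S_k of coordinates, on f(x) = 0.5 x^T A x - b^T x.
rng = np.random.default_rng(3)
n, set_size = 10, 3
R = rng.standard_normal((n, n))
A = R.T @ R / n + np.eye(n)
b = rng.standard_normal(n)

x = np.zeros(n)
for _ in range(1500):
    S = rng.choice(n, size=set_size, replace=False)   # accessible coordinate set
    g = A @ x - b
    i = int(S[np.argmax(np.abs(g[S]))])               # greedy pick *within* S
    x[i] -= g[i] / A[i, i]                            # exact coordinate step

x_star = np.linalg.solve(A, b)
print(np.linalg.norm(x - x_star))
```

Picking greedily inside each accessible set is at least as good per step as sampling uniformly from it, which is the intuition behind the set-size-proportional speedups reported for decentralized settings.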