
Coordinate-Wise Descent Algorithms

Updated 5 January 2026
  • Coordinate-wise descent algorithms are iterative methods that update individual variables or blocks sequentially, making them effective for high-dimensional optimization.
  • They use a range of selection rules—cyclic, randomized, and greedy—to balance convergence speed with computational effort.
  • Recent advancements extend these methods to nonconvex, online, and distributed settings, often incorporating adaptive and importance sampling techniques.

Coordinate-wise descent algorithms are iterative optimization methods that minimize an objective function by updating one variable or one block of variables at a time, keeping the others fixed. These methods have seen renewed interest due to their scalability for high-dimensional problems and their ability to exploit problem structure for efficient computation. The landscape of coordinate-wise descent encompasses deterministic and randomized variants, primal and primal-dual frameworks, selection rules (cyclic, random, greedy), and recent extensions to nonconvex, online, distributed, and manifold settings.

1. Foundational Principles and Algorithmic Structure

At each iteration, a coordinate-wise descent algorithm selects a coordinate (or block) $i$ and solves an update subproblem along that direction. For a convex differentiable objective $f$ over $x \in \mathbb{R}^n$, the prototypical gradient step for coordinate $i$ is

$$x_i^{(k+1)} = x_i^{(k)} - \alpha_k \,[\nabla f(x^{(k)})]_i,$$

with the other coordinates unchanged. More generally, for composite objectives $h(x) = f(x) + \sum_i r_i(x_i)$, or block forms, the update may use exact minimization along coordinate $i$, a coordinate-wise proximal mapping, or a surrogate quadratic approximation (Wright, 2015, Shi et al., 2016).
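
For the composite case, a concrete instance is the lasso, $h(x) = \tfrac{1}{2}\|Ax-b\|^2 + \lambda\|x\|_1$, where the coordinate-wise proximal mapping reduces to soft-thresholding. The sketch below (names such as `cd_lasso` are illustrative, not taken from the cited papers) assumes only NumPy:

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t * |.|
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def cd_lasso(A, b, lam, n_cycles=100):
    """Cyclic proximal coordinate descent for
    min_x 0.5*||Ax - b||^2 + lam*||x||_1  (separable r_i(x_i) = lam*|x_i|)."""
    m, n = A.shape
    x = np.zeros(n)
    L = (A ** 2).sum(axis=0)        # coordinate-wise Lipschitz constants L_i = ||A_i||^2
    r = A @ x - b                   # residual, maintained incrementally
    for _ in range(n_cycles):
        for i in range(n):
            if L[i] == 0.0:
                continue
            g_i = A[:, i] @ r       # partial derivative of the smooth part
            x_new = soft_threshold(x[i] - g_i / L[i], lam / L[i])
            r += A[:, i] * (x_new - x[i])
            x[i] = x_new
    return x
```

Here each inner step is simultaneously the proximal gradient step with stepsize $1/L_i$ and the exact minimizer along coordinate $i$, which is why coordinate methods are a natural fit for lasso-type problems.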

The coordinate-selection rule critically affects convergence and efficiency:

  • Cyclic: visit each coordinate in a fixed or permuted cyclic order;
  • Randomized: select coordinates independently from a fixed or adaptively updated probability distribution;
  • Greedy (Gauss–Southwell): pick the coordinate with maximum partial derivative magnitude or maximal expected objective decrease.

Coordinate updates can be generalized to blocks—referred to as block coordinate descent (BCD)—with similar algorithmic templates and analysis (Wright, 2015, Peng et al., 2016).
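
The three selection rules can be compared directly on a toy strongly convex quadratic. This is a sketch assuming exact minimization along the chosen coordinate (the function name `coord_descent` and the setup are illustrative):

```python
import numpy as np

def coord_descent(Q, c, rule="cyclic", n_steps=4000, seed=0):
    """Minimize f(x) = 0.5 x^T Q x + c^T x by exact minimization along one
    coordinate per step, under the given selection rule."""
    rng = np.random.default_rng(seed)
    n = len(c)
    x = np.zeros(n)
    for k in range(n_steps):
        g = Q @ x + c                   # full gradient (for clarity, not efficiency)
        if rule == "cyclic":
            i = k % n
        elif rule == "random":
            i = int(rng.integers(n))
        else:                           # "greedy" (Gauss-Southwell)
            i = int(np.argmax(np.abs(g)))
        x[i] -= g[i] / Q[i, i]          # exact minimizer along coordinate i
    return x
```

In practice only the greedy rule needs gradient information beyond coordinate $i$; cyclic and randomized variants would compute $g_i$ alone, which is the source of their per-iteration cheapness.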

2. Convergence Theory and Worst-case Guarantees

For convex objectives with coordinate-wise Lipschitz gradients, randomized coordinate descent (RCD) with uniform sampling and stepsizes $\alpha_i \leq 1/L_i$ generates iterates with expected sublinear convergence $O(n/k)$ and, under strong convexity, linear convergence at rate $O\big((1 - \tfrac{\mu}{n L_{\max}})^k\big)$ (Wright, 2015, Shi et al., 2016).
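
These rates can be sanity-checked numerically. The sketch below runs uniform RCD with stepsizes $1/L_i$ on a random strongly convex quadratic; since the bound holds in expectation, a single run is only indicative:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
M = rng.standard_normal((n, n))
Q = M @ M.T + np.eye(n)          # f(x) = 0.5 x^T Q x, minimized at x* = 0 with f* = 0
mu = np.linalg.eigvalsh(Q)[0]    # strong convexity parameter
L_max = Q.diagonal().max()       # largest coordinate-wise Lipschitz constant
rate = 1.0 - mu / (n * L_max)    # predicted per-step contraction factor, in expectation

f = lambda v: 0.5 * v @ Q @ v
x = rng.standard_normal(n)
f0 = f(x)
for _ in range(2000):
    i = rng.integers(n)          # uniform coordinate sampling
    x[i] -= (Q[i] @ x) / Q[i, i] # stepsize 1/L_i (exact minimizer for a quadratic)
# The observed decrease f(x)/f0 can be compared against rate**2000
```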

Cyclic methods have convergence rates of similar order but with potentially larger constants. Beck–Tetruashvili's analysis for cyclic coordinate descent on $n$-dimensional convex objectives gives

$$f(x^k) - f^* \leq \frac{4n\,(1 + nL^2/L_{\max}^2)\,R_0^2}{k+8},$$

for stepsizes $\alpha = 1/L_{\max}$ (Wright, 2015, Kamri et al., 2022).

Recent advances use performance estimation problems (PEP), cast as semidefinite programs, to compute tight numerical worst-case bounds for CCD and alternating minimization (AM), dramatically improving previous analytical constants (by factors of up to 10 for CCD) (Kamri et al., 2022, Kamri et al., 22 Jul 2025). Notably, for $p$ blocks and step size $\alpha = 1/L$, the exact bound after $K$ cycles is

$$f(x^{pK}) - f^* \leq \frac{C}{L(K+1)}\, R^2,$$

with $C \approx 1.87$ for CCD and $C \approx 1.02$ for AM at $p=2$ (Kamri et al., 2022). Lower bounds show that deterministic cyclic methods can be up to $p$ times slower than full gradient descent in the worst case (Kamri et al., 22 Jul 2025).

Accelerated randomized coordinate descent achieves $O(p^2/N^2)$ complexity (Wright, 2015, Qu et al., 2014, Kamri et al., 22 Jul 2025), but deterministic cyclic accelerated analogues are provably inefficient in the worst case; randomness is essential for acceleration (Kamri et al., 2022, Kamri et al., 22 Jul 2025).

3. Selection Rules, Sampling, and Adaptivity

Adaptive selection rules (ASR) and importance sampling schemes dynamically assign coordinate-update frequencies based on per-coordinate progress rates or local Lipschitz constants, yielding dramatic speedups and often near-optimal allocation of computational resources (Glasmachers et al., 2014, Qu et al., 2014). For instance, Nesterov's analysis suggests sampling coordinates with probability proportional to their Lipschitz constants, $p_i \propto L_i$, to minimize iteration complexity (Qu et al., 2014, Glasmachers et al., 2014). Online adaptation mechanisms, e.g., ACF-CD, estimate optimal sampling frequencies at runtime, automatically adjusting to the changing landscape and yielding substantial practical improvements (Glasmachers et al., 2014).
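
A minimal sketch of Lipschitz-based importance sampling on a quadratic, where $L_i = Q_{ii}$ (the function name `rcd_importance` is illustrative):

```python
import numpy as np

def rcd_importance(Q, c, n_steps=4000, seed=0):
    """Randomized CD on f(x) = 0.5 x^T Q x + c^T x, drawing coordinate i
    with probability p_i proportional to its Lipschitz constant L_i = Q_ii."""
    rng = np.random.default_rng(seed)
    n = len(c)
    L = Q.diagonal()
    p = L / L.sum()                      # importance sampling: p_i proportional to L_i
    x = np.zeros(n)
    for _ in range(n_steps):
        i = int(rng.choice(n, p=p))
        x[i] -= (Q[i] @ x + c[i]) / L[i] # stepsize 1/L_i
    return x
```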

Randomized block selection is robust across a vast range of settings (serial, parallel, distributed, or asynchronous) as long as probabilities are positive for all coordinates (Qu et al., 2014, Peng et al., 2016, Costantini et al., 26 Apr 2025). Arbitrary sampling and importance sampling allow coordinate updates to be tuned to the data distribution or computational architecture.

Greedy rules (Gauss–Southwell) maximize per-step reduction but incur additional cost, especially for nonseparable or large-block scenarios (Wright, 2015, Shi et al., 2016). In distributed and asynchronous settings (e.g., setwise CD in decentralized optimization), local greedy (GS) selection within accessible coordinate subsets affords iteration speedups proportional to the set size $|S|$ compared to uniform sampling (Costantini et al., 26 Apr 2025).
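
The setwise pattern (greedy selection restricted to the currently accessible subset $S$) can be sketched as follows, assuming a quadratic objective; this is a toy illustration, not the cited paper's algorithm:

```python
import numpy as np

def setwise_gs_step(Q, c, x, S):
    """One setwise step on f(x) = 0.5 x^T Q x + c^T x: among the accessible
    coordinate subset S, update the one with largest partial-derivative magnitude."""
    g_S = Q[S] @ x + c[S]            # partial derivatives on S only
    j = int(np.argmax(np.abs(g_S)))  # local Gauss-Southwell choice within S
    i = int(S[j])
    x[i] -= g_S[j] / Q[i, i]         # exact minimizer along coordinate i
    return x
```

Only the $|S|$ partial derivatives of the accessible subset are computed, which is what makes local greedy selection affordable when each worker sees only part of the coordinates.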

4. Extensions: Nonconvex, Fractional, Distributed
