
Coordinate Descent Algorithms

Updated 30 March 2026
  • Coordinate Descent Algorithms are iterative optimization methods that update one coordinate at a time, leveraging problem structure and sparsity for efficient solutions.
  • They use various update strategies—exact minimization, gradient or proximal steps—with theoretical convergence guarantees under convex or strongly convex conditions.
  • Applications span sparse regression, robust statistics, kernel methods, and parallel implementations, making them vital for large-scale machine learning and signal processing.

Coordinate descent algorithms constitute a broad class of optimization methods that iteratively update one (or a small subset) of the variables (coordinates) of the decision vector at a time, rather than updating the entire vector as in classical gradient-based algorithms. This fine-grained approach is particularly effective in large-scale problems, especially when the problem structure allows efficient updates along single coordinates or blocks. Coordinate descent (CD) methods are foundational for contemporary machine learning, signal processing, statistics, and numerical optimization, and serve as the backbone for highly scalable regularized regression, kernel methods, sparse learning, and composite optimization frameworks.

1. Algorithmic Principles and Variants

At each iteration, a coordinate descent method selects a coordinate (or a block of coordinates) and performs a one-dimensional or low-dimensional minimization or descent step, leaving other coordinates fixed. This update can be realized in several ways:

  • Exact Coordinate Minimization: For a differentiable $f$, update $x_i \leftarrow \arg\min_t f(x_{-i}, t)$, where $x_{-i}$ denotes the remaining coordinates held fixed.
  • Gradient or Proximal Step: For composite objectives $F(x) = f(x) + r(x)$ with $r$ proximable, update $x_i \leftarrow \arg\min_t f(x^k) + \nabla_i f(x^k)(t - x^k_i) + \frac{L_i}{2}(t - x^k_i)^2 + r_i(t)$.
  • Gauss–Seidel or Jacobi Updates: Updates can be performed sequentially (updating xix_i based on the latest values) or in parallel (Jacobi-style using stale information) (Dinuzzo, 2010).
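For a strongly convex quadratic, the exact coordinate minimization above has a closed form; a minimal Gauss–Seidel-style sketch (function and variable names are illustrative, assuming $A$ symmetric positive definite):

```python
import numpy as np

def exact_coordinate_descent(A, b, sweeps=100):
    """Cyclic exact coordinate minimization for
    f(x) = 0.5 * x^T A x - b^T x, with A symmetric positive definite.
    Each one-dimensional subproblem is solved in closed form, and every
    update uses the latest values of the other coordinates (Gauss-Seidel)."""
    n = len(b)
    x = np.zeros(n)
    for _ in range(sweeps):
        for i in range(n):
            # argmin_t f(x_{-i}, t) = (b_i - sum_{j != i} A_ij x_j) / A_ii
            x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]
    return x
```

Each sweep costs $O(n^2)$ for dense $A$; for sparse problems the inner product touches only the nonzeros of row $i$.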

Selection strategies include:

  • Cyclic: Coordinates are updated in a fixed or randomly permuted order.
  • Randomized: Each coordinate is selected at random, often with uniform or importance-weighted probability.
  • Gauss–Southwell (Greedy): The coordinate with the largest (absolute) gradient or largest potential decrease is chosen.
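The Gauss–Southwell rule can be sketched as follows, assuming per-coordinate Lipschitz constants are available (the interface is illustrative):

```python
import numpy as np

def gauss_southwell_step(grad, x, L):
    """One greedy coordinate step: choose the coordinate whose partial
    derivative has the largest magnitude, then take a 1/L_i gradient step.
    `grad` returns the full gradient; `L` holds per-coordinate Lipschitz
    constants."""
    g = grad(x)
    i = int(np.argmax(np.abs(g)))  # Gauss-Southwell (greedy) rule
    x = x.copy()
    x[i] -= g[i] / L[i]
    return x
```

On a quadratic $f(x) = \frac{1}{2}x^\top A x - b^\top x$ this step is the exact coordinate minimizer, since $L_i = A_{ii}$.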

Block Coordinate Descent (BCD): Extends CD to update blocks of variables simultaneously, exploiting separable or nearly-separable problem structure (Shi et al., 2016).

Coordinate-Friendly Algorithms: Updates require computational effort proportional to the sparsity or structure of the coordinate (Shi et al., 2016).

2. Convergence Guarantees and Complexity

The convergence rates of coordinate descent are well understood under convexity and (where present) strong convexity:

  • Randomized: $O(1/k)$ (convex); $O((1 - \mu/(n\bar L))^k)$ (strongly convex) for step size $1/L_i$.
  • Cyclic: $O(1/k)$ (convex); $O((1 - c/n)^k)$ (strongly convex), with constants depending on the structure of the $L_i$.
  • Gauss–Southwell: $O(1/k)$ or better (convex); a smaller rate constant in the strongly convex case, since the largest-gradient coordinate is chosen.

Here, $\bar L = \frac{1}{n}\sum_i L_i$, and $L_i$ is the Lipschitz constant of $\nabla_i f$ (Shi et al., 2016, Wright, 2015). For composite problems with separable regularization, per-coordinate subproblems often admit closed-form updates (such as soft-thresholding for $\ell_1$-regularized problems) (Scherrer et al., 2012).
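The soft-thresholding update just mentioned can be sketched for the Lasso objective $\frac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_1$ (a standard construction; function names are illustrative). The residual is cached so each coordinate update costs $O(m)$ rather than a full matrix product:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t*|.|: shrink z toward zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(A, b, lam, sweeps=200):
    """Cyclic coordinate descent for 0.5*||Ax - b||^2 + lam*||x||_1.
    Each 1-D subproblem has a closed-form soft-thresholding solution;
    the residual r = b - A x is kept in sync incrementally."""
    m, n = A.shape
    x = np.zeros(n)
    r = b.copy()                      # residual b - A x
    col_sq = (A ** 2).sum(axis=0)     # per-column squared norms
    for _ in range(sweeps):
        for i in range(n):
            rho = A[:, i] @ r + col_sq[i] * x[i]
            xi_new = soft_threshold(rho, lam) / col_sq[i]
            r += A[:, i] * (x[i] - xi_new)   # update cached residual
            x[i] = xi_new
    return x
```

With $\lambda = 0$ the update reduces to exact coordinate minimization of the least-squares objective; for $\lambda \ge \|A^\top b\|_\infty$ the all-zero vector is stationary.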

Accelerated Coordinate Descent achieves $O(1/k^2)$ rates in convex problems and faster linear rates in the strongly convex setting using Nesterov-type momentum and non-uniform sampling (Allen-Zhu et al., 2015, Qu et al., 2014).

The per-iteration cost is typically $O(1)$ or $O(s)$ (for block size $s$), and can be reduced further by efficient data structures and by exploiting sparsity.

3. Sampling Schemes and Adaptive Strategies

The efficiency of CD depends heavily on the coordinate selection probability:

  • Uniform Sampling: Simple, but may be suboptimal if $L_i$ or coordinate importance varies greatly across coordinates.
  • Non-Uniform Sampling: Theoretical and empirical results show that sampling coordinates with probability proportional to $L_i$ or $\sqrt{L_i}$ yields improved rates, up to a $\sqrt{n}$ acceleration over uniform sampling (Allen-Zhu et al., 2015, Qu et al., 2014).
  • Adaptive Online Frequency Adaptation: The Adaptive Coordinate Frequencies (ACF) method dynamically adapts the sampling probabilities to equalize coordinate-wise expected progress, leading to speed-ups by one or more orders of magnitude in machine learning applications. ACF is achieved by tracking the observed per-coordinate objective decrement and updating the coordinate frequencies accordingly (Glasmachers et al., 2014).
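A minimal sketch of the $p_i \propto L_i$ sampling scheme described above (illustrative names; this is the static non-uniform rule, not the adaptive ACF method):

```python
import numpy as np

def lipschitz_sampler(L, rng=None):
    """Sample a coordinate index with probability proportional to L_i,
    so coordinates with larger curvature are visited more often."""
    rng = np.random.default_rng() if rng is None else rng
    p = np.array(L, dtype=float)
    p /= p.sum()
    return rng.choice(len(p), p=p)
```

In a randomized CD loop, the sampled coordinate would then be updated with its own step size $1/L_i$.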

4. Extensions: Manifold, Polytope, and Primal–Dual CD

Coordinate descent has been generalized substantially beyond vector spaces:

  • Manifold Coordinate Descent: Coordinate directions become tangent-space directions; feasibility is maintained by retracting updated points back onto the manifold (e.g., for Stiefel, Grassmann, hyperbolic, symplectic, and symmetric positive semi-definite manifolds). Riemannian coordinate descent enables updates at $O(p)$ cost per coordinate, achieving significant speedups on high-dimensional matrix problems (Han et al., 2024).
  • Polytope CD (PolyCD, PolyCDwA): For convex optimization over polytopes with tractable vertex sets ($\ell_1$-balls, simplices), updates proceed along directions toward or away from vertices. Linear convergence is achieved when strong convexity holds (Mazumder et al., 2023).
  • Primal–Dual and Non-Separable Problems: Primal-dual coordinate descent methods (e.g., coordinate Vũ–Condat, Chambolle–Pock) address composite and constraint-coupled objectives with non-separable regularizers, including total variation and distributed consensus problems (Fercoq et al., 2015, Bianchi et al., 2014, Alacaoglu et al., 2017). Randomized block-coordinate variants yield sublinear or linear rates under strong convexity, exploiting coordinate-wise step sizes set by local Lipschitz constants.
  • Smoothing and Beyond Separability: For nonsmooth or nonseparable $F(x)$, coordinate descent can be applied to smooth approximations (Moreau, forward–backward, Douglas–Rachford envelopes, Nesterov's smoothing) and relative coordinate-smooth functions, preserving fast convergence (Chorobura et al., 2024).
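The retraction idea can be illustrated on the unit sphere (a toy sketch only, not the Riemannian CD method of the cited work; all names illustrative): perturb one ambient coordinate, then renormalize to return to the manifold.

```python
import numpy as np

def sphere_cd(A, x0, step=0.1, iters=500):
    """Toy coordinate descent for f(x) = x^T A x on the unit sphere:
    take a partial-gradient step in one ambient coordinate, then
    retract back onto the manifold by renormalizing."""
    x = x0 / np.linalg.norm(x0)
    n = len(x)
    for k in range(iters):
        i = k % n                    # cyclic coordinate choice
        g = 2.0 * (A[i] @ x)         # ambient partial derivative
        x[i] -= step * g             # coordinate step ...
        x /= np.linalg.norm(x)       # ... followed by retraction
    return x
```

For a symmetric $A$, minimizing $x^\top A x$ on the sphere recovers an eigenvector for the smallest eigenvalue, so the sketch doubles as a crude inverse-power-style method.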

5. Parallel and Asynchronous Implementations

Coordinate descent is especially amenable to parallel and asynchronous architectures:

  • Synchronous Parallel CD: Multiple coordinates or blocks are updated in parallel, with synchronization at epoch boundaries. Convergence is maintained if the degree of parallelism is not too large relative to feature coupling (Scherrer et al., 2012).
  • Asynchronous (Lock-Free) CD: Processors update coordinates without locks, reading and writing to shared memory asynchronously. Near-linear speedup is provably attainable if the number of concurrent threads is $O(\sqrt{n})$ (unconstrained) or $O(n^{1/4})$ (separable constraints), provided the problem has bounded delay and sparsity (Liu et al., 2013).
  • Parallel Algorithms: Shotgun, thread-greedy, and coloring-based parallel schemes offer varying tradeoffs in overhead, speedup, and scalability, with empirical speedups up to 30× over serial per epoch and maintained convergence (Scherrer et al., 2012).
The parallel schemes compare as follows:

  • Shotgun: updates a random block subset; conflicts controlled via atomics, with parallelism bounded by the spectral radius of the data; preferred for weakly correlated covariates.
  • Thread-Greedy: each thread updates its locally best coordinate; minimal synchronization (one atomic operation per thread); robust general-purpose performance.
  • Coloring-based: coordinates in a color class are updated synchronously; no atomics needed if the conflict graph is correctly colored; preferred for sparse matrices where conflicts are expensive.

6. Applications and Specialized CD Methods

Coordinate descent underpins a broad spectrum of modern algorithms:

  • Sparse Regression/LASSO and Elastic Net: Soft-thresholding coordinate updates, strong scaling for large sparse data; warm-starts and variable screening rules substantially accelerate regularization path computation (Scherrer et al., 2012, Kim et al., 15 Oct 2025).
  • Robust Regression: Median- or weighted-median coordinate descent for LAD (least-absolute-deviation) regression delivers robust solutions at $O(pn\log n)$ per iteration, outperforming simplex-based LP solvers in high dimensions and in $p > n$ settings (Naik et al., 19 Mar 2026).
  • Huber Regression: Exact coordinate descent exploits the structure of the Huber loss, updating only over inlier observations for each coordinate; adaptive screening reduces the active set size and leads to significant computational gains in high dimensions with heavy-tailed and correlated features (Kim et al., 15 Oct 2025).
  • Kernel Methods: Closed-form one-dimensional line-searches are feasible for additively separable losses, and are at the core of LIBLINEAR and glmnet (Dinuzzo, 2010).
  • Online Convex Optimization: CD methods yield regret $O(\sqrt{T})$ (convex case) or $O(\log T)$ (strongly convex) in online settings, with static and dynamic versions analyzed under different selection protocols (Lin et al., 2022).
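The weighted-median update behind the LAD method above can be sketched as follows (dense design assumed; all names illustrative). Fixing the other coordinates, the one-dimensional objective is a weighted sum of absolute values, minimized by a weighted median:

```python
import numpy as np

def weighted_median(points, weights):
    """Smallest point at which cumulative weight reaches half the total."""
    order = np.argsort(points)
    cum = np.cumsum(weights[order])
    k = np.searchsorted(cum, 0.5 * cum[-1])
    return points[order][k]

def lad_cd(A, b, sweeps=100):
    """Coordinate descent for min_x ||Ax - b||_1.  The 1-D subproblem in
    coordinate i is sum_j |a_ji| * |t - c_j / a_ji|, whose minimizer is a
    weighted median -- sorting gives the O(m log m) cost per coordinate
    noted in the text."""
    m, n = A.shape
    x = np.zeros(n)
    for _ in range(sweeps):
        for i in range(n):
            a = A[:, i]
            mask = a != 0
            c = b - A @ x + a * x[i]   # partial residual with x_i removed
            x[i] = weighted_median(c[mask] / a[mask], np.abs(a[mask]))
    return x
```

With a single all-ones column this reduces to the ordinary median of $b$, the textbook LAD intercept.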

7. Hybrid and Domain-Specific Innovations

Hybrid variants and domain-specific adaptations further broaden the repertoire of coordinate descent:

  • Hybrid Line Search and Gradient CD: Selects either a gradient or a line-search update per coordinate based on a threshold; can achieve faster per-epoch training loss reduction in neural networks, especially when parallelized, though per-iteration cost is dominated by line search (Hsiao et al., 2024).
  • Nonconvex and Structured Problems: Random blockwise CD retains convergence to stationary points and sublinear expected rates even in nonconvex settings through careful step-size and sampling strategies (Patrascu et al., 2013).
  • Coordinate Descent from Optimal Control: CD emerges naturally via continuous-time optimal control principles, with maximum dissipation enforced via Lyapunov functions built from coordinate norms (Ross, 2023).

Coordinate descent algorithms thus form the algorithmic backbone for a vast range of modern optimization, machine learning, and signal processing tasks. Their efficiency results from the ability to exploit problem separability, block structure, and sparsity; their flexibility from the ease of integrating acceleration, adaptive sampling, distributed computation, and hybridization with other numerical primitives (Wright, 2015, Shi et al., 2016). Continuing research extends CD to new domains, more general structures (manifolds, polytopes), and further strengthens its practical and theoretical guarantees.

