Randomized Coordinate Descent

Updated 8 June 2026
  • Randomized Coordinate Descent (RCD) is an iterative optimization method that updates randomly selected variable subsets, enabling efficient solutions to high-dimensional problems.
  • RCD leverages diverse sampling strategies—including uniform, importance, and volume sampling—to accelerate convergence in both convex and nonconvex optimization settings.
  • Advanced RCD variants incorporate curvature information and parallel/asynchronous updates, enhancing performance in machine learning, signal processing, and large-scale scientific computing.

Randomized Coordinate Descent (RCD) is a class of iterative optimization algorithms for high-dimensional problems in which, at each iteration, only a randomly chosen subset of variables (or a single variable) is updated, typically via a partial minimization over those coordinates. RCD methods have become fundamental tools in large-scale convex and nonconvex optimization, machine learning, signal processing, and scientific computing, owing to their low per-iteration cost, scalability, and favorable convergence properties under mild smoothness or separability assumptions.

1. Basic Principles and Algorithmic Variants

RCD methods address optimization problems of the form

$$\min_{x\in\mathbb{R}^n}\ F(x)\ :=\ f(x) + \Psi(x),$$

where $f$ is smooth (often convex, but possibly nonconvex or strongly convex) and $\Psi$ is convex and possibly nonsmooth, typically block- or coordinate-separable ($\Psi(x) = \sum_{i=1}^n \Psi_i(x_i)$) (Fountoulakis et al., 2015, Fountoulakis et al., 2014). At each iteration, a coordinate or a block $S\subset\{1,\dots,n\}$ is selected at random according to a specified sampling distribution, and an update is performed only on $x_S$. Standard RCD updates correspond to minimizing a first-order quadratic approximation, but modern RCD methods incorporate block structure, partial second-order information, and flexible sampling:

  • Classical serial RCD: Pick $i\in[n]$ uniformly at random; update $x_i$ by $x_i \leftarrow x_i - \frac{1}{L_i}\nabla_i f(x)$ (where $L_i$ is a coordinate Lipschitz constant).
  • Block RCD: Randomly pick a block $S\subset[n]$; update $x_S$ via the (blockwise) subproblem.
  • Flexible and robust RCD: Incorporate curvature via a user-chosen positive-definite matrix $H_S$ in the quadratic model (Fountoulakis et al., 2015, Fountoulakis et al., 2014).
  • Arbitrary/importance/block/volume sampling: Allow nonuniform, block, or determinant-based sampling for acceleration (Rodomanov et al., 2019, Qu et al., 2014).
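The classical serial update in the list above can be sketched in a few lines. This is a minimal illustration, not code from any cited paper: it assumes the smooth quadratic $f(x) = \tfrac12 x^\top A x - b^\top x$ with $A$ symmetric positive definite, for which the coordinate Lipschitz constants are $L_i = A_{ii}$. The function name `rcd_quadratic` and the test problem are hypothetical.

```python
import numpy as np

def rcd_quadratic(A, b, num_iters=5000, seed=0):
    """Serial RCD with uniform sampling on f(x) = 0.5 x^T A x - b^T x.

    Assumes A is symmetric positive definite, so L_i = A[i, i] > 0.
    """
    rng = np.random.default_rng(seed)
    n = b.shape[0]
    x = np.zeros(n)
    L = np.diag(A).copy()            # coordinate Lipschitz constants
    for _ in range(num_iters):
        i = rng.integers(n)          # uniform coordinate choice
        g_i = A[i] @ x - b[i]        # i-th partial derivative of f at x
        x[i] -= g_i / L[i]           # x_i <- x_i - (1/L_i) * grad_i f(x)
    return x

# Small, well-conditioned illustrative problem (assumed data).
rng = np.random.default_rng(1)
M = rng.standard_normal((20, 20))
A = M @ M.T + 20.0 * np.eye(20)
b = rng.standard_normal(20)
x_hat = rcd_quadratic(A, b)
```

Each iteration touches one row of `A`, so the per-iteration cost is $O(n)$ rather than the $O(n^2)$ of a full gradient step. For this quadratic the minimizer solves $Ax = b$, which gives a direct way to check the result.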

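The typical update solves, exactly or approximately,

$$t_S \approx \arg\min_{t}\ \langle \nabla_S f(x),\, t\rangle + \tfrac{1}{2}\, t^\top H_S\, t + \Psi_S(x_S + t),$$

where $H_S$ is a positive-definite curvature matrix for the block and $\Psi_S$ collects the terms $\Psi_i$ with $i\in S$, and performs a step $x_S \leftarrow x_S + \alpha\, t_S$ with a stepsize $\alpha$ found by line search or a predetermined rule (Fountoulakis et al., 2015).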
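As a concrete instance of this kind of subproblem, the sketch below treats the lasso objective $F(x) = \tfrac12\|Ax-b\|^2 + \lambda\|x\|_1$ with single-coordinate blocks, the scalar curvature choice $L_i = \|A_{:,i}\|^2$, and a fixed unit stepsize. Under these assumptions the coordinate subproblem has a closed-form solution given by soft-thresholding. The names `soft_threshold`, `prox_rcd_lasso`, and `lasso_objective` are illustrative and do not come from the cited papers.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * |.| evaluated at the scalar z."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def prox_rcd_lasso(A, b, lam, num_iters=20000, seed=0):
    """Proximal RCD for 0.5 * ||A x - b||^2 + lam * ||x||_1.

    Assumes every column of A is nonzero, so that L_i > 0.
    """
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    x = np.zeros(n)
    r = A @ x - b                      # residual A x - b, kept up to date
    L = (A ** 2).sum(axis=0)           # L_i = squared norm of column i
    for _ in range(num_iters):
        i = rng.integers(n)            # uniform coordinate choice
        g_i = A[:, i] @ r              # partial derivative of the smooth part
        # Minimizer of g_i*t + (L_i/2)*t^2 + lam*|x_i + t|, written in x_i + t.
        x_new = soft_threshold(x[i] - g_i / L[i], lam / L[i])
        r += (x_new - x[i]) * A[:, i]  # O(m) residual update
        x[i] = x_new
    return x

def lasso_objective(A, b, lam, x):
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))
```

Maintaining the residual `r` keeps the cost of each iteration proportional to the number of rows of `A`, which is one reason coordinate methods are attractive for this problem class.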

2. Sampling Schemes and Parallelization

Sampling strategy is a core algorithmic design choice, with direct impact on convergence rates and practical efficiency.

  • Uniform/serial sampling: Each coordinate (or block) is chosen with equal probability. This is the default in RCD theory (Nesterov, 2012).
  • Importance/probability-weighted sampling: Coordinates are selected with probabilities $p_i$ proportional to their Lipschitz constants $L_i$ (or other curvature metrics), yielding complexity improvements when the $L_i$ vary significantly (Qu et al., 2014, Csiba et al., 2016).
  • Block/mini-batch sampling: Larger blocks (size $\tau$) may be sampled per iteration as in "Flexible Coordinate Descent" (FCD), with $\tau$-nice sampling (Fountoulakis et al., 2015).
  • Volume sampling: Subsets are selected with probability proportional to the determinant of the corresponding principal minor of the Hessian approximation, accelerating convergence in the presence of spectral gaps (Rodomanov et al., 2019).
  • Arbitrary sampling: Generalizes selection to arbitrary distributions and allows incorporation into parallel and asynchronous frameworks (Qu et al., 2014, Fountoulakis et al., 2014).
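Three of the schemes in this list are simple enough to sketch directly. The helpers below are illustrative only (the function names are hypothetical): uniform sampling of one coordinate, importance sampling with probabilities proportional to given coordinate Lipschitz constants, and a sampling that draws a subset of fixed size $\tau$ uniformly among all subsets of that size, which is the usual meaning of $\tau$-nice sampling. Volume sampling is omitted because it requires determinant computations that do not fit a short sketch.

```python
import numpy as np

def uniform_sampling(n, rng):
    """Return one coordinate index, each with probability 1/n."""
    return int(rng.integers(n))

def importance_sampling(L, rng):
    """Return one coordinate index i with probability L_i / sum(L)."""
    p = L / L.sum()
    return int(rng.choice(len(L), p=p))

def tau_nice_sampling(n, tau, rng):
    """Return tau distinct indices, uniform over all subsets of size tau."""
    return rng.choice(n, size=tau, replace=False)
```

A sampler of this form can be dropped into a coordinate descent loop in place of the uniform draw. Note that changing the sampling distribution generally changes which stepsizes are safe, so the sampler and the stepsize rule should be chosen together.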

Parallel RCD methods update multiple non-intersecting blocks in parallel at each step, requiring careful synchronization for feasibility under constraints (Reddi et al., 2014). Asynchronous variants operate without consistent locking and achieve near-linear speedups on sparse or weakly-coupled problems (Reddi et al., 2014).

3. Convergence Theory and Complexity

Rigorous convergence analyses are available for RCD in convex, strongly convex, and nonconvex regimes, frequently expressed in terms of matrix smoothness and spectral constants:

  • Sublinear rates (convex case): For $f$ convex and $L$-smooth (possibly with blockwise or matrix-valued $L$),

$$\mathbb{E}\big[F(x^k)\big] - F^\star \le \frac{C}{k},$$

where $C$ depends on the initial distance to optimality and on problem parameters. For serial sampling, $C$ typically grows with the number of coordinates $n$ and with the coordinate Lipschitz constants $L_i$ (Qu et al., 2014, Fountoulakis et al., 2015).

  • Linear rates (strongly convex case): If $f$ is $\mu$-strongly convex, the method enjoys

$$\mathbb{E}\big[F(x^k)\big] - F^\star \le (1-\rho)^k\,\big(F(x^0) - F^\star\big),$$

with $\rho\in(0,1)$ scaling inversely in $n$ and/or in the effective condition number (e.g. $L_{\max}/\mu$ with $L_{\max} = \max_i L_i$) (Kovalev et al., 2018, Fountoulakis et al., 2014, Fountoulakis et al., 2015).

  • Spectral acceleration and block sampling: Selecting larger blocks or using volume-based sampling can significantly improve the rate when the curvature matrix has large spectral gaps, with gains nonlinear in the block size $\tau$ (Rodomanov et al., 2019).
  • Nonconvex problems: For smooth $f$, RCD converges to stationary points, escaping strict saddles almost surely under generic conditions (Chen et al., 2021). The expected minimum gradient norm decreases sublinearly,

$$\min_{0\le k<K}\ \mathbb{E}\big[\|\nabla f(x^k)\|^2\big] = O\!\left(\frac{\mathcal{L}\,\big(f(x^0) - f^{\inf}\big)}{K}\right),$$

where $f^{\inf}$ is a lower bound on $f$ and $\mathcal{L}$ is an appropriate matrix-smoothness constant (Szlendak et al., 2023).

  • Composite/regularized objectives: For $F = f + \Psi$ with nonsmooth $\Psi$, blockwise-proximal updates and inexact solves yield convergence under mild separability and curvature properties (Fountoulakis et al., 2015, Fountoulakis et al., 2014).
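To see where a linear rate of the kind stated for the strongly convex case comes from, the following is the standard one-step argument in the simplest setting. It is a sketch under stated assumptions: $\Psi = 0$, uniform serial sampling, stepsize $1/L_i$, $f$ is $\mu$-strongly convex in the Euclidean norm, and $L_{\max} = \max_i L_i$. Coordinate smoothness applied to the step $x^+ = x - \frac{1}{L_i}\nabla_i f(x)\, e_i$ gives

$$f(x^+) \le f(x) - \frac{1}{2L_i}\big(\nabla_i f(x)\big)^2 .$$

Taking the expectation over $i$ drawn uniformly from $[n]$,

$$\mathbb{E}\big[f(x^+)\mid x\big] \le f(x) - \frac{1}{2n}\sum_{i=1}^n \frac{\big(\nabla_i f(x)\big)^2}{L_i} \le f(x) - \frac{1}{2nL_{\max}}\,\|\nabla f(x)\|^2 .$$

Strong convexity implies $\|\nabla f(x)\|^2 \ge 2\mu\,\big(f(x) - f^\star\big)$, so

$$\mathbb{E}\big[f(x^+) - f^\star \mid x\big] \le \Big(1 - \frac{\mu}{nL_{\max}}\Big)\big(f(x) - f^\star\big),$$

which is a linear rate with contraction factor $1 - \mu/(nL_{\max})$ in this special case. Other samplings and curvature models change the constant but follow the same pattern.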

4. Extensions and Advanced Methodologies

Recent research has generalized RCD in multiple directions:

  • Second-order and curvature-adaptive variants: Robust Coordinate Descent (RCD) and Flexible Coordinate Descent (FCD) incorporate blockwise Hessian or quasi-Newton approximations, yielding improved robustness and acceleration on ill-conditioned or highly coupled problems (Fountoulakis et al., 2014, Fountoulakis et al., 2015). Local superlinear rates are sometimes achievable.
  • Bregman and non-Euclidean descent: Randomized Bregman Coordinate Descent extends applicability to problems lacking Lipschitz gradient by using relative smoothness and Bregman divergence, retaining efficient complexity bounds and supporting acceleration (Gao et al., 2020).
  • Sketch-and-project and subspace-constrained methods: Subspace-constrained RCD (SC-RCD) imposes affine constraints to implicitly precondition the problem using low-rank spectral approximations, reducing the iteration count for matrices with decaying spectra (Lok et al., 2025).
  • Online/stochastic and variance-reduced RCD: Algorithms such as ORBCD, SARCD, OARCD, and variance-reduced RBCD push RCD into streaming and online settings, matching the best SGD rates while preserving low per-iteration costs, sometimes adding acceleration via Nesterov-style momentum (Wang et al., 2014, Bhandari et al., 2018).
  • Distributed and quantized RCD: Practical deployment on distributed architectures imposes quantization constraints, for which modified convergence guarantees are available provided quantization errors are properly bounded (Gamal et al., 2016).
  • Generalized sampling: The ALPHA framework unifies deterministic, stochastic, serial, and parallel variants of RCD under arbitrary sampling, supporting both accelerated and non-accelerated regimes (Qu et al., 2014).
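The idea behind the curvature-adaptive variants in this list can be illustrated with the simplest possible case. The sketch below is not the FCD or Robust Coordinate Descent algorithm from the cited papers: it assumes a smooth quadratic $f(x) = \tfrac12 x^\top A x - b^\top x$ with $A$ symmetric positive definite, no regularizer, blocks of fixed size drawn uniformly, the exact block Hessian as curvature matrix, and a unit stepsize. The function name `block_newton_cd` is hypothetical.

```python
import numpy as np

def block_newton_cd(A, b, tau, num_iters=2000, seed=0):
    """Block coordinate descent using the exact block Hessian A[S, S].

    Assumes A is symmetric positive definite, so every principal
    submatrix A[S, S] is invertible.
    """
    rng = np.random.default_rng(seed)
    n = b.shape[0]
    x = np.zeros(n)
    for _ in range(num_iters):
        S = rng.choice(n, size=tau, replace=False)  # random block of size tau
        g_S = A[S] @ x - b[S]                       # block gradient
        H_S = A[np.ix_(S, S)]                       # block curvature matrix
        x[S] -= np.linalg.solve(H_S, g_S)           # exact minimization over x_S
    return x
```

Compared with the scalar update $x_i \leftarrow x_i - \frac{1}{L_i}\nabla_i f(x)$, each step here solves a $\tau\times\tau$ linear system, trading a higher per-iteration cost for a step that accounts for coupling between the coordinates inside the block.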

5. Applications and Empirical Performance

RCD methods are standard for training linear predictors (e.g., logistic regression, SVMs), empirical risk minimization, sparse regression, tensor and matrix factorization, and solving large-scale linear systems. For overdetermined least-squares problems, they have been reported to outperform Kaczmarz-type iterative methods in both iteration count and computational cost (Dumitrescu, 2014).

Flexibility and scalability make RCD attractive for massive data problems. Curvature-aware variants (FCD, RCD with block-diagonal curvature) outperform traditional coordinate descent, especially in high-dimensional, ill-conditioned, or nonseparable problems (Fountoulakis et al., 2015, Fountoulakis et al., 2014). Volume sampling and spectral augmentation yield substantial speedups when the Hessian (or equivalent matrix) has significant spectral gaps (Rodomanov et al., 2019, Kovalev et al., 2018). Reported benchmarks support the theoretical advantages across a range of synthetic and real-world problems, from regression and classification to kernel methods and deep network training (Szlendak et al., 2023, Fountoulakis et al., 2015, Lok et al., 2025).

6. Stability, Generalization, and Statistical Insights

Algorithmic stability analyses have recently been developed for RCD, establishing that RCD enjoys better argument stability than stochastic gradient descent (SGD) at a fixed pass count (Wang et al., 2021). This implies sharper generalization error bounds and allows for principled early stopping to balance optimization and estimation errors. For convex and strongly convex objectives, optimal excess risk rates are achievable, matching those of SGD but with smaller estimation error per coordinate. High-probability generalization results also hold.

7. Limitations and Open Directions

Despite broad applicability, classical RCD becomes inefficient for problems with strong coupling (large off-diagonal Hessian entries) unless enriched with curvature information or preconditioning (Fountoulakis et al., 2014, Lok et al., 2025). The design of sampling probabilities for optimal performance in arbitrary and block settings remains an active research area, as does combining RCD with higher-order or sketching techniques for further acceleration (Kovalev et al., 2018, Lok et al., 2025). Open directions include extending RCD's convergence theory to milder smoothness assumptions and more general composite regularizers, adaptive step-size and coordinate selection, distributed asynchronous settings, and non-stationary or time-varying optimization landscapes.

