Greedy Coordinate Descent

Updated 8 June 2026
  • Greedy Coordinate Descent is an optimization method that updates the coordinate with the largest potential decrease, making it effective for sparse and high-dimensional problems.
  • It leverages greedy, block, and hybrid selection rules to achieve attractive convergence rates, including linear behavior under strong convexity and sublinear decay otherwise.
  • Efficient implementations using maximum inner product search, parallel strategies, and accelerated variants drive practical speedups in applications from Lasso regression to neural network quantization.

Greedy Coordinate Descent (GCD) is a variant of coordinate descent in which the coordinate to update at each iteration is chosen by a greedy rule, typically the coordinate promising the largest decrease in a surrogate or true objective. In high-dimensional optimization, spanning convex, nonconvex, and composite problems, GCD offers attractive convergence rates, practical speedups, and flexible algorithmic paradigms that exploit problem structure. The method generalizes to block variants and hybrid strategies, and is closely related to the classical Gauss–Southwell rule. It forms the backbone of numerous state-of-the-art solvers for $\ell_1$-regularized learning, quadratic programming, discrete optimization, quantization of neural networks, and large-scale empirical risk minimization.

1. Problem Formulation and Greedy Selection Rules

The most general setting considers composite objectives of the form

$$\min_{\alpha \in \mathbb{R}^n} F(\alpha) := f(\alpha) + \sum_{i=1}^n g_i(\alpha_i),$$

where $f$ is convex and coordinatewise $L$-smooth, each $g_i$ is convex (or possibly nonconvex and separable), and typical choices include $\ell_1$ penalties and box constraints (Karimireddy et al., 2018).

At each iteration, GCD evaluates, for each coordinate $i$, a potential reduction using a local surrogate objective. In the smooth convex case, the canonical Gauss–Southwell rule picks

$$i_t = \arg\max_i |\nabla_i f(\alpha)|.$$
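The Gauss–Southwell rule above can be sketched on a small smooth problem. The example below is a minimal illustration, assuming a strictly convex quadratic $f(\alpha) = \tfrac12 \alpha^\top Q \alpha - b^\top \alpha$ with exact minimization along the chosen coordinate; the function names and the test matrix are hypothetical and not taken from the cited papers.

```python
def gradient(Q, b, alpha):
    """Gradient of f(alpha) = 0.5 * alpha^T Q alpha - b^T alpha."""
    n = len(b)
    return [sum(Q[i][j] * alpha[j] for j in range(n)) - b[i] for i in range(n)]


def gauss_southwell_cd(Q, b, iters=1000, tol=1e-12):
    """Greedy coordinate descent with the Gauss-Southwell rule (illustrative)."""
    n = len(b)
    alpha = [0.0] * n
    grad = gradient(Q, b, alpha)
    for _ in range(iters):
        # Gauss-Southwell: pick the coordinate with the largest |gradient|.
        i = max(range(n), key=lambda j: abs(grad[j]))
        if abs(grad[i]) < tol:
            break
        # Exact minimization along coordinate i; Q[i][i] is its curvature.
        step = -grad[i] / Q[i][i]
        alpha[i] += step
        # Maintain the gradient incrementally in O(n) per update.
        for j in range(n):
            grad[j] += Q[j][i] * step
    return alpha


# Hypothetical, diagonally dominant (hence positive definite) test problem.
Q = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
alpha = gauss_southwell_cd(Q, b)
```

Keeping the gradient up to date after each step is what makes the argmax affordable here: selection and maintenance both cost $O(n)$ per iteration for a dense quadratic.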

For nonsmooth or composite problems, each coordinate is scored by the magnitude of its minimum-norm subgradient (the Gauss–Southwell-s rule): $$s_i(\alpha) := \min_{s \in \partial g_i(\alpha_i)} |\nabla_i f(\alpha) + s|,\quad i_t = \arg\max_i s_i(\alpha).$$ For quadratic problems with nonnegativity or box constraints, the greedy rule computes, for each $i$, the objective decrease obtained by the optimal feasible update of that coordinate alone, and selects the coordinate with the largest decrease (Wu et al., 2020).
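For an $\ell_1$ penalty $g_i(\alpha_i) = \lambda|\alpha_i|$, the subgradient-based score has a closed form, and the coordinate update is a soft-thresholding step. The sketch below assumes a quadratic smooth part $f(\alpha) = \tfrac12 \alpha^\top Q \alpha - c^\top \alpha$ (for example $Q = A^\top A$, $c = A^\top y$ in the Lasso); the helper names and the toy data are hypothetical.

```python
def soft_threshold(z, tau):
    """Proximal operator of tau * |.| applied to the scalar z."""
    if z > tau:
        return z - tau
    if z < -tau:
        return z + tau
    return 0.0


def gs_s_score(grad_i, alpha_i, lam):
    """min over s in the subdifferential of lam*|alpha_i| of |grad_i + s|."""
    if alpha_i > 0:
        return abs(grad_i + lam)
    if alpha_i < 0:
        return abs(grad_i - lam)
    # At zero the subdifferential is the interval [-lam, lam].
    return max(abs(grad_i) - lam, 0.0)


def greedy_lasso(Q, c, lam, iters=1000, tol=1e-12):
    n = len(c)
    alpha = [0.0] * n
    grad = [-ci for ci in c]  # gradient of the smooth part at alpha = 0
    for _ in range(iters):
        scores = [gs_s_score(grad[j], alpha[j], lam) for j in range(n)]
        i = max(range(n), key=scores.__getitem__)
        if scores[i] < tol:
            break  # every coordinate satisfies its optimality condition
        # Exact one-dimensional minimizer of the composite objective.
        new_value = soft_threshold(alpha[i] - grad[i] / Q[i][i], lam / Q[i][i])
        step = new_value - alpha[i]
        alpha[i] = new_value
        for j in range(n):
            grad[j] += Q[j][i] * step
    return alpha


# Toy problem (assumed values): the second coordinate stays at zero.
Q = [[2.0, 0.5],
     [0.5, 1.0]]
c = [1.0, 0.1]
alpha = greedy_lasso(Q, c, lam=0.5)
```

On this toy problem the first coordinate moves to 0.25 in a single step, after which both scores vanish and the loop stops.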

In block and hybrid schemes, a working set of size $k$ is chosen greedily by maximizing $k$-dimensional decrease proxies, and the block subproblem is solved exactly or approximately (Yuan et al., 2017, Nutini et al., 2017, Bo et al., 2012).
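A block variant can be sketched in the same style. The example below is one simple instantiation under stated assumptions, not the scheme of any cited paper: the working set is the $k$ coordinates with the largest gradient magnitude (one possible decrease proxy), and the block subproblem is solved approximately by a few cyclic passes restricted to that set. The objective is again a hypothetical quadratic $\tfrac12 \alpha^\top Q \alpha - b^\top \alpha$.

```python
def block_greedy_cd(Q, b, k=2, outer_iters=500, inner_sweeps=5, tol=1e-12):
    n = len(b)
    alpha = [0.0] * n
    grad = [-bi for bi in b]  # gradient at alpha = 0
    for _ in range(outer_iters):
        if max(abs(g) for g in grad) < tol:
            break
        # Greedy working set: the k coordinates with the largest |gradient|.
        block = sorted(range(n), key=lambda j: abs(grad[j]), reverse=True)[:k]
        # Approximate block solve: cyclic exact coordinate steps inside the block.
        for _ in range(inner_sweeps):
            for i in block:
                step = -grad[i] / Q[i][i]
                alpha[i] += step
                for j in range(n):
                    grad[j] += Q[j][i] * step
    return alpha


# Hypothetical symmetric, diagonally dominant test problem.
Q = [[4.0, 1.0, 0.0, 0.5],
     [1.0, 3.0, 0.5, 0.0],
     [0.0, 0.5, 5.0, 1.0],
     [0.5, 0.0, 1.0, 4.0]]
b = [1.0, 2.0, 3.0, 4.0]
alpha = block_greedy_cd(Q, b, k=2)
```

Raising `inner_sweeps` moves the inner solve closer to an exact block minimization at a higher cost per outer iteration.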

2. Theoretical Guarantees and Convergence Rates

Strongly Convex Case

For composite $F$ that is strongly convex with respect to the $\ell_1$ norm with modulus $\mu_1$, the Gauss–Southwell-proximal (greedy) rule yields linear convergence independent of the ambient dimension: the suboptimality $F(\alpha_t) - F^\star$, where $F^\star$ is the optimal value, contracts geometrically under exact as well as approximate greedy selection (Karimireddy et al., 2018). The proof classifies steps as "good" (the proximal update does not cross a kink of the nonsmooth $g_i$) or "bad"; at least half of the steps are good, and each good step contracts the residual objective by a fixed fraction. No explicit dependence on $n$ appears in the rate.
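For intuition about where a dimension-free rate can come from, the smooth special case ($g_i \equiv 0$) admits a short textbook-style argument. The sketch below is that standard argument, not the composite proof of the cited paper. It assumes a step of size $1/L$ on the Gauss–Southwell coordinate and $\mu_1$-strong convexity of $f$ in the $\ell_1$ norm, and it writes $f^\star$ for the optimal value (notation introduced here).

```latex
\begin{align*}
f(\alpha_{t+1}) &\le f(\alpha_t) - \frac{1}{2L}\,|\nabla_{i_t} f(\alpha_t)|^2
  = f(\alpha_t) - \frac{1}{2L}\,\|\nabla f(\alpha_t)\|_\infty^2
  && \text{(coordinatewise smoothness, greedy choice)} \\
f^\star &\ge f(\alpha_t) - \frac{1}{2\mu_1}\,\|\nabla f(\alpha_t)\|_\infty^2
  && \text{(strong convexity in } \ell_1\text{; its dual norm is } \ell_\infty\text{)} \\
f(\alpha_{t+1}) - f^\star &\le \Bigl(1 - \frac{\mu_1}{L}\Bigr)\bigl(f(\alpha_t) - f^\star\bigr)
  && \text{(combine the two lines)}
\end{align*}
```

The contraction factor involves only $\mu_1$ and $L$; the dimension enters solely through how $\mu_1$ compares with the Euclidean strong convexity modulus.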

General Convex and Non-Strongly Convex Case

Without strong convexity, greedy CD achieves sublinear decay of the suboptimality, with a constant that scales with the $\ell_1$ diameter of the level set (Karimireddy et al., 2018). For purely smooth objectives, the rate is $O(1/t)$ in the number of iterations $t$, matching classical coordinate descent but with smaller constants due to greedy selection (Liu et al., 2015, Lu et al., 2018).

Block, Hybrid, and Constrained Extensions

In block-greedy and hybrid-combinatorial schemes, convergence to block-$k$ stationary points is guaranteed. Such stationary points are strictly stronger than those achievable by coordinatewise optimality, yielding fewer spurious local minima in nonconvex settings (Yuan et al., 2017). For equality- and box-constrained optimization, greedy two-coordinate schemes achieve linear convergence under a (proximal) Polyak–Łojasiewicz (PL) condition, with rates again independent of $n$ (Ramesh et al., 2023). For nonnegativity-constrained QPs, GCD achieves global convergence, and with positive definite Hessians the rate is linear (Wu et al., 2020).
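The two-coordinate idea can be illustrated on a sum-constrained quadratic. The sketch below is a simplified instance under stated assumptions: it minimizes $\tfrac12 x^\top Q x - b^\top x$ subject to $\sum_i x_i$ staying fixed, ignores box constraints, picks the pair with the largest gradient gap, and uses an exact line search along $e_j - e_i$. Names and data are hypothetical and the code is not the algorithm of the cited paper.

```python
def two_coordinate_greedy(Q, b, x0, iters=1000, tol=1e-10):
    """Greedy pairwise updates that preserve sum(x); Q is symmetric positive definite."""
    n = len(b)
    x = list(x0)
    grad = [sum(Q[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
    for _ in range(iters):
        i = max(range(n), key=grad.__getitem__)  # largest partial derivative
        j = min(range(n), key=grad.__getitem__)  # smallest partial derivative
        gap = grad[i] - grad[j]
        if gap < tol:
            break  # all partial derivatives agree: KKT point for the sum constraint
        # Exact line search along e_j - e_i for a quadratic objective.
        curvature = Q[i][i] + Q[j][j] - 2.0 * Q[i][j]
        delta = gap / curvature
        x[i] -= delta
        x[j] += delta
        for k in range(n):
            grad[k] += delta * (Q[k][j] - Q[k][i])
    return x, grad


Q = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x, grad = two_coordinate_greedy(Q, b, x0=[1.0, 0.0, 0.0])
```

Moving mass from the coordinate with the largest partial derivative to the one with the smallest keeps the iterate feasible, and the gap between the two serves as both the selection score and the stopping criterion.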

3. Efficient Implementation and Algorithmic Advances

Maximum Inner Product Search (MIPS)

For problems whose coordinate gradients can be written as inner products between per-coordinate feature vectors and a shared query vector, the greedy coordinate choice can be rephrased as a maximum inner product search (MIPS) over a dynamically maintained active set of lifted feature vectors (Karimireddy et al., 2018). Approximate nearest-neighbor indexes (e.g., LSH, HNSW) can answer these queries in sublinear time per iteration, bringing the wall-clock cost of selection close to that of a single coordinate-gradient calculation.
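The reduction to inner product search is easiest to see for least squares, where $\nabla_i f(\alpha) = \langle A_i, A\alpha - y \rangle$. The sketch below assumes that setting and lifts each column $A_i$ to the pair $\{+A_i, -A_i\}$ so that the largest inner product with the residual equals the largest gradient magnitude. The brute-force `mips_query` is a stand-in for an approximate index such as LSH or HNSW; all names and data are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def build_lifted_set(columns):
    """Store each column with both signs: max <v, q> then equals max |<A_i, q>|."""
    lifted = []
    for i, col in enumerate(columns):
        lifted.append((i, 1.0, list(col)))
        lifted.append((i, -1.0, [-c for c in col]))
    return lifted


def mips_query(lifted, query):
    """Brute-force maximum inner product search over the lifted set."""
    return max(lifted, key=lambda item: dot(item[2], query))


def greedy_least_squares(columns, y, iters=5000, tol=1e-12):
    alpha = [0.0] * len(columns)
    residual = [-yi for yi in y]  # r = A alpha - y at alpha = 0
    lifted = build_lifted_set(columns)
    for _ in range(iters):
        i, sign, vec = mips_query(lifted, residual)
        grad_i = sign * dot(vec, residual)  # recovers <A_i, r>
        if abs(grad_i) < tol:
            break
        step = -grad_i / dot(columns[i], columns[i])
        alpha[i] += step
        residual = [r + step * a for r, a in zip(residual, columns[i])]
    return alpha


# Columns of a small invertible matrix; y is built from alpha* = [1, -2, 0.5].
columns = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
y = [-1.0, -1.5, 0.5]
alpha = greedy_least_squares(columns, y)
```

With an approximate index the query may return a near-best rather than the best coordinate, which corresponds to the approximate greedy selection covered by the convergence theory.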

Block/Parallel Greedy Schemes

Block-greedy and thread-greedy coordinate descent select the best coordinate per block or per thread, with each update executed in parallel (Scherrer et al., 2012, Scherrer et al., 2012). For block sizes fitted to hardware architecture and blocks formed via clustering for low cross-block correlation, empirical speedups are substantial, especially when $\ell_1$-regularized solutions are denser (Scherrer et al., 2012).
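The thread-greedy pattern can be simulated serially. In the sketch below, each block stands for one thread: every block proposes its best coordinate from the same gradient snapshot, and all proposals are then applied together. It assumes a quadratic objective $\tfrac12 \alpha^\top Q \alpha - b^\top \alpha$ with a strictly diagonally dominant $Q$, which plays the role of low cross-block correlation and keeps the simultaneous steps from overshooting. Names and data are hypothetical.

```python
def thread_greedy_cd(Q, b, blocks, rounds=1000, tol=1e-12):
    """Serial simulation of thread-greedy CD; blocks must partition the coordinates."""
    n = len(b)
    alpha = [0.0] * n
    grad = [-bi for bi in b]  # gradient at alpha = 0
    for _ in range(rounds):
        # Each block proposes its best coordinate from the shared snapshot.
        proposals = []
        for block in blocks:
            i = max(block, key=lambda j: abs(grad[j]))
            proposals.append((i, -grad[i] / Q[i][i]))
        if max(abs(grad[i]) for i, _ in proposals) < tol:
            break
        # Apply all proposals; steps were computed before any of them was applied.
        for i, step in proposals:
            alpha[i] += step
            for j in range(n):
                grad[j] += Q[j][i] * step
    return alpha


Q = [[4.0, 1.0, 0.0, 0.5],
     [1.0, 3.0, 0.5, 0.0],
     [0.0, 0.5, 5.0, 1.0],
     [0.5, 0.0, 1.0, 4.0]]
b = [1.0, 2.0, 3.0, 4.0]
alpha = thread_greedy_cd(Q, b, blocks=[[0, 1], [2, 3]])
```

When coordinates in different blocks are strongly correlated, simultaneous steps can interfere, which is one motivation for forming blocks by clustering correlated features together.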

Accelerated and Stochastic Greedy Methods

Recent advances include semi-greedy and fully greedy accelerated coordinate descent (ASCD, AGCD), which combine Nesterov-style momentum with Gauss–Southwell selection. ASCD achieves $O(1/t^2)$ convergence and accelerated linear convergence under strong convexity, while AGCD typically performs even better in practice, though its theoretical guarantees require additional mild conditions (Lu et al., 2018). Mini-batch, block, and stochastic strategies further reduce per-iteration cost or hardware burden (Song et al., 2017).

Large-Scale and Application-Specific Innovations

Greedy CD has been specialized to large-scale Gaussian process regression through block selection solving zero-norm constrained subproblems greedily (Bo et al., 2012), LLM quantization via a tailored greedy descent over discrete codebooks (Nair et al., 2024), and large least squares by double-greedy subspace selection and orthogonalization (Jin et al., 2022). For distributed settings (e.g., feature-wise parallelism in Hadoop clusters), greedy block selection yields faster convergence in both cycle count and wall-clock time, drastically reducing expensive inter-node communication (Mahajan et al., 2014).

4. Practical Performance and Applications

Greedy coordinate descent's practical impact is evidenced in several settings:

  • Sparse regression ($\ell_1$-regularized least squares, Lasso): Greedy CD methods, including soft-thresholding coordinate updates and hybrid Ray-Refinement strategies, need fewer sweeps than traditional CD, with the size of the gap depending on the problem regime (Liu et al., 2015).
  • Large-scale linear SVMs: Dual coordinate methods with greedy selection attain faster sparsity and reduced computation (Karimireddy et al., 2018).
  • Nonnegative matrix factorization and NQP: On pure NQP, GCD is orders of magnitude faster than CCD, RCD, and accelerated gradient methods, and achieves rapid convergence in matrix factorization quality (Wu et al., 2020).
  • Deep model quantization: Greedy coordinate selection at the quantization-code level (as in CDQuant) consistently drives lower layerwise reconstruction error and outperforms cyclic/CD variants such as GPTQ, while scaling to large LLMs (Nair et al., 2024).
  • High-dimensional empirical risk minimization (DP/ERM): In private optimization settings, GCD leverages structural sparsity to reduce the penalty incurred by differentially private selection, yielding a utility bound logarithmic rather than polynomial in dimension (Mangold et al., 2022).
  • Inverse problems and system-solving: Greedy versions of Gauss–Seidel and block Kaczmarz schemes exhibit provably faster descent and wall-clock efficiency for linear and quadratic systems, especially when variable selection is judiciously adapted to problem structure (Zhang et al., 2020, Thoppe et al., 2014, Jin et al., 2022).
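To make the nonnegative quadratic programming item above concrete, here is a minimal sketch of greedy selection by largest one-dimensional decrease. It assumes the objective $\tfrac12 x^\top Q x - b^\top x$ with $x \ge 0$ and a positive diagonal of $Q$; the function name and the toy data are hypothetical, and the code is an illustration of the idea rather than the implementation of the cited work.

```python
def greedy_nqp(Q, b, iters=1000, tol=1e-14):
    """Greedy CD for min 0.5*x^T Q x - b^T x subject to x >= 0 (illustrative)."""
    n = len(b)
    x = [0.0] * n
    grad = [-bi for bi in b]  # gradient at x = 0
    for _ in range(iters):
        best_i, best_step, best_decrease = -1, 0.0, 0.0
        for i in range(n):
            # Optimal feasible value of coordinate i with the others fixed.
            new_value = max(0.0, x[i] - grad[i] / Q[i][i])
            step = new_value - x[i]
            # Exact change in the objective for a quadratic, with sign flipped.
            decrease = -(grad[i] * step + 0.5 * Q[i][i] * step * step)
            if decrease > best_decrease:
                best_i, best_step, best_decrease = i, step, decrease
        if best_decrease <= tol:
            break  # no coordinate can reduce the objective any further
        x[best_i] += best_step
        for j in range(n):
            grad[j] += Q[j][best_i] * best_step
    return x


# Toy problem (assumed values): the unconstrained minimizer [1, -1] is infeasible.
Q = [[2.0, 1.0],
     [1.0, 2.0]]
b = [1.0, -1.0]
x = greedy_nqp(Q, b)
```

On this toy problem the method moves the first coordinate to 0.5 and leaves the second at the bound, where its partial derivative is positive, so the constrained optimality conditions hold.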

5. Limitations, Variants, and Open Directions

Key limitations and active directions include:

  • Selection cost: Greedy selection requires full or blockwise evaluation of decrease proxies, inducing at least $O(n)$ work per iteration (more for block rules) unless structure/MIPS tricks are available (Karimireddy et al., 2018, Bo et al., 2012, Nutini et al., 2017).
  • Scalability to very high dimensions: Parallel/thread-greedy and block-greedy schemes ameliorate per-iteration overhead but may entail tradeoffs in load balancing and atomic update costs, especially for highly sparse or clustered features (Scherrer et al., 2012, Scherrer et al., 2012).
  • Nonconvex and discrete optimization: While greedy CD methods drive toward block-$k$ stationary points, global optima are not guaranteed without exhaustive combinatorial search. The combinatorial subproblem dimension is a practical bottleneck (usually only small $k$ is feasible) (Yuan et al., 2017).
  • Worst-case bounds: For some accelerated or block-greedy variants, worst-case theoretical rates are known only under additional technical conditions or for specific classes of problems (e.g., strong or quadratic growth), though empirical speedups are robust (Lu et al., 2018, Nutini et al., 2017).
  • Hybridization and adaptivity: Combining greedy updates with stochastic, block, or cyclic schemes (e.g., switching to UCD at late stages) balances early rapid decrease with late-stage efficiency; hybrid rules remain an area of active algorithmic development (Karimireddy et al., 2018, Nutini et al., 2017).

6. Summary Table: Representative Variants and Guarantees

| Variant / Application | Greedy Rule Type | Theoretical Rate | Specialized Implementation | Reference |
|---|---|---|---|---|
| Composite convex (sparse SVM/L1) | Gauss–Southwell-prox | Linear ($n$-independent, strong conv.) | MIPS search, sublinear iteration cost | (Karimireddy et al., 2018) |
| Non-neg. quadratic programming | Optimal decrease per coordinate | Linear (positive definite Hessian) | Incremental gradient maintenance after each update | (Wu et al., 2020) |
| Hybrid discrete/sparse opt. | Block-$k$ greedy | Block-$k$ stationary, linear (binary) | Exhaustive block search ($k$ small) | (Yuan et al., 2017) |
| Distributed block CD (L1-class.) | Surrogate decrease | Q-linear under strong convexity | Local greedy in block, AllReduce, line search | (Mahajan et al., 2014) |
| Quantized LLMs (CDQuant) | Greedy discrete coordinate | Finite, monotonic, local opt. | Full/Block search, per-row Hessian caching | (Nair et al., 2024) |
| Accelerated GCD (ASCD/AGCD) | Greedy (with momentum) | $O(1/t^2)$ (semi-greedy), heuristic | Hybrid random/greedy update policy | (Lu et al., 2018) |
| Gaussian process regression | Block greedy via obj. decrease | Linear, global opt. | Progressive block building, kernel subsampling | (Bo et al., 2012) |
| 2-coordinate equality-constr. | Max. gradient gap | Linear, $n$-independent (PL) | Sorting-based selection, block steepest descent | (Ramesh et al., 2023) |

7. Research Impact and Future Prospects

The greedy coordinate descent paradigm forms a unifying mechanism underlying many contemporary large-scale optimization methods, allowing practitioners to exploit problem structure, data sparsity, and low-dimensional active sets for accelerated convergence. The framework's flexibility supports parallel/distributed architectures, hybridized block/coordinate schemes, and integration with second-order updates (block cubic Newton), positioning GCD as a foundational, continually evolving family of algorithms for modern data-intensive optimization (Karimireddy et al., 2018, Nutini et al., 2017, Cristofari, 2024). Continued efforts in reducing selection overhead, integrating global optima seeking (as in combinatorial hybrids), and exploiting adaptive block formation, as well as theoretical refinement for nonconvex and constrained settings, are central areas of future investigation.
