Greedy Coordinate Descent
- Greedy Coordinate Descent is an optimization method that updates the coordinate with the largest potential decrease, making it effective for sparse and high-dimensional problems.
- It leverages greedy, block, and hybrid selection rules to achieve attractive convergence rates, including linear behavior under strong convexity and sublinear decay otherwise.
- Efficient implementations using maximum inner product search, parallel strategies, and accelerated variants drive practical speedups in applications from Lasso regression to neural network quantization.
Greedy Coordinate Descent (GCD) is a variant of coordinate descent in which the coordinate to update at each iteration is chosen by a greedy rule, typically the coordinate promising the largest decrease in a surrogate or true objective. In high-dimensional optimization (convex, nonconvex, and composite problems alike), GCD offers attractive rates, practical speedups, and flexible algorithmic paradigms that exploit problem structure. The method generalizes to both block variants and hybrid strategies, and is closely related to the classical Gauss–Southwell rule. It forms the backbone of numerous state-of-the-art solvers for $\ell_1$-regularized learning, quadratic programming, discrete optimization, quantization of neural networks, and large-scale empirical risk minimization.
1. Problem Formulation and Greedy Selection Rules
The most general setting considers composite objectives of the form
$$\min_{x \in \mathbb{R}^n} F(x) := f(x) + \sum_{i=1}^{n} g_i(x_i),$$
where $f$ is convex and coordinatewise $L$-smooth, each $g_i$ is convex (or possibly nonconvex and separable), and typical choices include $\ell_1$ penalties and box constraints (Karimireddy et al., 2018).
At each iteration, GCD evaluates, for each coordinate $i$, a potential reduction using a local surrogate objective. In the smooth convex case, the canonical Gauss–Southwell rule picks
$$i_t \in \arg\max_{i} \, |\nabla_i f(x_t)|.$$
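For nonsmooth or composite problems, the optimal one-dimensional decrease is used:
$$i_t \in \arg\min_{i} \, \min_{\gamma \in \mathbb{R}} \Big\{ \nabla_i f(x_t)\,\gamma + \frac{L}{2}\gamma^2 + g_i(x_{t,i} + \gamma) - g_i(x_{t,i}) \Big\}.$$
For quadratic problems with nonnegativity or box constraints, the greedy rule computes for each coordinate $i$ the objective decrease $\Delta_i$ achieved by the exact constrained one-dimensional update and selects $i_t \in \arg\max_i \Delta_i$ (Wu et al., 2020).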
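The greedy rule for nonnegativity-constrained quadratics can be illustrated with a minimal sketch. It assumes the objective $\tfrac{1}{2}x^\top A x - b^\top x$ with symmetric positive definite $A$; the function name `greedy_cd_nqp`, the stopping tolerance, and the random test problem are illustrative choices, not taken from Wu et al. (2020).

```python
import numpy as np

def greedy_cd_nqp(A, b, num_iters=5000, tol=1e-12):
    """Greedy CD sketch for min 0.5*x^T A x - b^T x subject to x >= 0.

    Assumes A is symmetric positive definite (illustrative setting).
    """
    n = A.shape[0]
    x = np.zeros(n)
    grad = -b.astype(float).copy()  # gradient A x - b evaluated at x = 0
    diag = np.diag(A)
    for _ in range(num_iters):
        # Exact constrained minimizer along each coordinate, written as a step.
        step = np.maximum(0.0, x - grad / diag) - x
        # Exact objective decrease of each candidate one-dimensional update.
        decrease = -(grad * step + 0.5 * diag * step**2)
        i = int(np.argmax(decrease))  # greedy choice: largest decrease
        if decrease[i] <= tol:
            break
        x[i] += step[i]
        grad += step[i] * A[:, i]  # O(n) gradient maintenance
    return x

def nqp_objective(A, b, x):
    return 0.5 * x @ A @ x - b @ x

# Small hypothetical test problem.
rng = np.random.default_rng(0)
M = rng.standard_normal((30, 10))
A = M.T @ M + 0.1 * np.eye(10)
b = rng.standard_normal(10)
x_hat = greedy_cd_nqp(A, b)
```

Maintaining the gradient incrementally keeps each iteration at $O(n)$ work once a coordinate is chosen, so the cost of the scan over candidates dominates.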
In block and hybrid schemes, a working set $B$ of size $k$ is chosen greedily by maximizing $k$-dimensional decrease proxies, and the block subproblem is solved exactly or approximately (Yuan et al., 2017, Nutini et al., 2017, Bo et al., 2012).
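As a rough illustration of block-greedy selection, the sketch below picks the $k$ coordinates with the largest gradient magnitude and solves the block subproblem exactly. It assumes an unconstrained strongly convex quadratic; the top-$k$ gradient proxy is one simple choice and is not the combinatorial search of Yuan et al. (2017). The name `block_greedy_cd` and the test data are hypothetical.

```python
import numpy as np

def block_greedy_cd(A, b, k=3, max_iters=10000, tol=1e-10):
    """Block-greedy CD sketch for min 0.5*x^T A x - b^T x (A symmetric positive definite)."""
    n = A.shape[0]
    x = np.zeros(n)
    grad = -b.astype(float).copy()  # gradient A x - b at x = 0
    for _ in range(max_iters):
        if np.max(np.abs(grad)) < tol:
            break
        # Working set: indices of the k largest gradient magnitudes.
        block = np.argpartition(-np.abs(grad), k - 1)[:k]
        # Exact minimization over the block: solve A_BB * delta = -grad_B.
        delta = np.linalg.solve(A[np.ix_(block, block)], -grad[block])
        x[block] += delta
        grad += A[:, block] @ delta  # keep the full gradient up to date
    return x

rng = np.random.default_rng(1)
M = rng.standard_normal((40, 12))
A_blk = M.T @ M + 0.1 * np.eye(12)
b_blk = rng.standard_normal(12)
x_blk = block_greedy_cd(A_blk, b_blk, k=3)
```

With $k = 1$ this reduces to an exact-step Gauss–Southwell update; larger $k$ trades a more expensive subproblem for more progress per iteration.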
2. Theoretical Guarantees and Convergence Rates
Strongly Convex Case
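For composite $F$ that is strongly convex with respect to the $\ell_1$ norm with modulus $\mu_1$, the Gauss–Southwell proximal (greedy) rule yields linear convergence independent of the ambient dimension:
$$F(x_t) - F^\star \le \Big(1 - \frac{\Theta^2 \mu_1}{L}\Big)^{\lceil t/2 \rceil} \big(F(x_0) - F^\star\big)$$
for exact ($\Theta = 1$) or $\Theta$-approximate selection, $\Theta \in (0, 1]$, where $L$ is the coordinatewise smoothness constant (Karimireddy et al., 2018). The proof classifies steps as "good" (the proximal update does not cross a kink of the nonsmooth $g_i$) or "bad"; at least half of the steps are good, and each good step contracts the residual objective by a fixed fraction. No explicit dependence on the dimension $n$ appears in the rate.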
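For intuition, the dimension-free contraction can be derived in a few lines in the purely smooth case ($g_i \equiv 0$) with exact selection. This is the standard Gauss–Southwell argument rather than the composite proof with good and bad steps. The notation is assumed here: $L$ is the coordinatewise smoothness constant, $\mu_1$ the strong convexity modulus with respect to the $\ell_1$ norm, and $x^+ = x - \tfrac{1}{L}\nabla_i f(x)\, e_i$ with $i \in \arg\max_j |\nabla_j f(x)|$.

$$
\begin{aligned}
f(x^+) &\le f(x) - \frac{1}{2L}\,\|\nabla f(x)\|_\infty^2
  && \text{(coordinatewise smoothness, greedy choice of } i\text{)} \\
f^\star &\ge f(x) - \frac{1}{2\mu_1}\,\|\nabla f(x)\|_\infty^2
  && \text{(minimize the strong convexity lower bound over } y\text{; } \ell_\infty \text{ is dual to } \ell_1\text{)} \\
f(x^+) - f^\star &\le \Big(1 - \frac{\mu_1}{L}\Big)\big(f(x) - f^\star\big)
  && \text{(combine the two inequalities)}
\end{aligned}
$$

Since $\mu_1$ is measured in the $\ell_1$ norm, the factor $1/n$ that appears in the analysis of uniformly random coordinate selection does not enter explicitly.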
General Convex and Non-Strongly Convex Case
Without strong convexity, greedy CD achieves sublinear decay:
$$F(x_t) - F^\star \le O\!\left(\frac{L D^2}{t}\right),$$
with $L$ the coordinatewise smoothness constant and $D$ the $\ell_1$ diameter of the level set (Karimireddy et al., 2018). For purely smooth objectives, the rate is $O(1/t)$ in the number of iterations $t$, matching classical coordinate descent but with smaller constants due to greedy selection (Liu et al., 2015, Lu et al., 2018).
Block, Hybrid, and Constrained Extensions
In block-greedy and hybrid-combinatorial schemes, convergence to block-$k$ stationary points is guaranteed. Such stationary points satisfy a strictly stronger condition than coordinatewise optimality, so fewer spurious local minima remain in nonconvex settings (Yuan et al., 2017). For equality- and box-constrained optimization, greedy two-coordinate schemes achieve linear convergence under a (proximal) Polyak–Łojasiewicz (PL) condition, with rates again independent of the dimension $n$ (Ramesh et al., 2023). For nonnegativity-constrained QPs, GCD achieves global convergence, and with positive definite Hessians the convergence rate is linear (Wu et al., 2020).
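A greedy two-coordinate update under a single equality constraint can be sketched as follows. The example assumes a strongly convex quadratic with the constraint that the entries of $x$ keep a fixed sum and no box constraints; the pair is chosen by the largest gradient gap and the step uses exact line search. The function name and test data are hypothetical, and the sketch is not the exact algorithm of Ramesh et al. (2023).

```python
import numpy as np

def greedy_two_coordinate(A, b, x0, max_iters=10000, tol=1e-10):
    """Sketch: min 0.5*x^T A x - b^T x subject to sum(x) = sum(x0), A symmetric positive definite."""
    x = x0.astype(float).copy()
    grad = A @ x - b
    for _ in range(max_iters):
        i = int(np.argmax(grad))  # coordinate to decrease
        j = int(np.argmin(grad))  # coordinate to increase
        gap = grad[i] - grad[j]   # maximum gradient gap
        if gap < tol:
            break  # gradient is (nearly) constant: KKT point on the constraint
        # Exact line search along d = e_j - e_i, which preserves sum(x).
        curvature = A[i, i] + A[j, j] - 2.0 * A[i, j]
        delta = gap / curvature
        x[i] -= delta
        x[j] += delta
        grad += delta * (A[:, j] - A[:, i])
    return x

rng = np.random.default_rng(2)
M = rng.standard_normal((25, 8))
A_eq = M.T @ M + 0.1 * np.eye(8)
b_eq = rng.standard_normal(8)
x0_eq = np.full(8, 1.0 / 8.0)  # feasible start with sum(x) = 1
x_eq = greedy_two_coordinate(A_eq, b_eq, x0_eq)
```

Moving mass between exactly two coordinates is the smallest update that keeps the iterate feasible, which is why the selection rule works with pairs rather than single coordinates.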
3. Efficient Implementation and Algorithmic Advances
Maximum Inner Product Search (MIPS)
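For problems admitting the structure $f(x) = \ell(Ax)$ with data matrix $A = [A_1, \dots, A_n]$ and $g_i$ an $\ell_1$ penalty or box constraint, the greedy coordinate update can be rephrased as a maximum inner product search (MIPS):
$$\max_{v \in \mathcal{P}_t} \, \langle v, q_t \rangle,$$
where the query $q_t$ is built from $\nabla \ell(Ax_t)$, $\mathcal{P}_t$ is a dynamically maintained active set of lifted feature vectors, and the maximizer identifies the coordinate to update (Karimireddy et al., 2018). Modern nearest neighbor algorithms (e.g., LSH, HNSW) can answer these queries in sublinear time per iteration, bringing the wall-clock cost of selection close to that of a single coordinate-gradient calculation.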
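The reduction to MIPS is easiest to see in the smooth least-squares case, where the coordinate gradient is the inner product of a data column with the residual. The sketch below assumes that setting and uses a brute-force search as a stand-in for an approximate index such as LSH or HNSW; the lifting to signed columns handles the absolute value in the Gauss–Southwell rule. It does not reproduce the composite-case lifting of Karimireddy et al. (2018), and all function names are hypothetical.

```python
import numpy as np

def brute_force_mips(candidates, query):
    """Exact maximum inner product search over the rows of `candidates`.

    An approximate nearest-neighbor index would replace this in practice.
    """
    return int(np.argmax(candidates @ query))

def gs_coordinate_via_mips(A, residual):
    """Gauss-Southwell coordinate for f(x) = 0.5*||A x - y||^2 via MIPS.

    The gradient entry is <A_i, residual>, so the largest |gradient| is the
    largest inner product over the signed columns {+A_i, -A_i}.
    """
    n = A.shape[1]
    lifted = np.concatenate([A.T, -A.T], axis=0)  # rows: +A_1..+A_n, -A_1..-A_n
    winner = brute_force_mips(lifted, residual)
    return winner % n  # map the signed copy back to its coordinate

rng = np.random.default_rng(3)
A_mips = rng.standard_normal((50, 20))
y_mips = rng.standard_normal(50)
x_mips = rng.standard_normal(20)
residual = A_mips @ x_mips - y_mips
i_mips = gs_coordinate_via_mips(A_mips, residual)
```

The candidate set is fixed across iterations while only the query changes, which is the property that lets a prebuilt index amortize its construction cost.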
Block/Parallel Greedy Schemes
Block-greedy and thread-greedy coordinate descent select the best coordinate per block or per thread, with each update executed in parallel (Scherrer et al., 2012, Scherrer et al., 2012). For block sizes fitted to the hardware architecture and blocks formed via clustering for low cross-block correlation, empirical speedups are substantial, especially when $\ell_1$-regularized solutions are denser (Scherrer et al., 2012).
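The per-block selection pattern can be simulated sequentially, as in the following sketch. It assumes a smooth strongly convex quadratic and contiguous blocks, and it damps the simultaneous update by the number of blocks so that the objective decreases monotonically (the damped iterate is an average of single-coordinate updates, each of which decreases a convex objective). That damping is a conservative assumption made for this example, not the update rule of Scherrer et al. (2012), and the names are hypothetical.

```python
import numpy as np

def thread_greedy_cd(A, b, num_blocks=4, num_iters=5000):
    """Simulated thread-greedy CD for min 0.5*x^T A x - b^T x (A symmetric positive definite)."""
    n = A.shape[0]
    blocks = np.array_split(np.arange(n), num_blocks)  # one block per simulated thread
    diag = np.diag(A)
    x = np.zeros(n)
    grad = -b.astype(float).copy()  # gradient A x - b at x = 0
    history = [0.0]                 # objective value at x = 0
    for _ in range(num_iters):
        # Each "thread" proposes the coordinate with the largest exact decrease in its block.
        decrease = grad**2 / (2.0 * diag)
        chosen = np.array([blk[np.argmax(decrease[blk])] for blk in blocks])
        # Damped simultaneous update: averaging the proposals keeps the objective monotone.
        steps = -grad[chosen] / diag[chosen] / num_blocks
        x[chosen] += steps
        grad += A[:, chosen] @ steps
        history.append(0.5 * x @ A @ x - b @ x)
    return x, history

rng = np.random.default_rng(4)
M = rng.standard_normal((40, 12))
A_par = M.T @ M + 0.1 * np.eye(12)
b_par = rng.standard_normal(12)
x_par, hist_par = thread_greedy_cd(A_par, b_par)
```

When blocks are weakly correlated, a less conservative step is usually safe, which is the motivation for forming blocks by clustering.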
Accelerated and Stochastic Greedy Methods
Recent advances include semi-greedy and fully greedy accelerated coordinate descent (ASCD, AGCD), which combine Nesterov-style momentum with Gauss–Southwell selection. ASCD achieves $O(1/t^2)$ convergence in the iteration count $t$ and accelerated linear convergence under strong convexity, while AGCD typically performs even better in practice, though its theoretical guarantees require additional mild conditions (Lu et al., 2018). Mini-batch, block, and stochastic strategies further reduce per-iteration cost or hardware burden (Song et al., 2017).
Large-Scale and Application-Specific Innovations
Greedy CD has been specialized to large-scale Gaussian process regression through block selection solving zero-norm constrained subproblems greedily (Bo et al., 2012), LLM quantization via a tailored greedy descent over discrete codebooks (Nair et al., 2024), and large least squares by double-greedy subspace selection and orthogonalization (Jin et al., 2022). For distributed settings (e.g., feature-wise parallelism in Hadoop clusters), greedy block selection yields faster convergence in both cycle count and wall-clock time, drastically reducing expensive inter-node communication (Mahajan et al., 2014).
4. Practical Performance and Applications
Greedy coordinate descent's practical impact is evidenced in several settings:
- Sparse regression ($\ell_1$-regularized least squares, Lasso): Greedy CD methods, including soft-thresholding coordinate updates and hybrid Ray-Refinement strategies, dominate traditional CD in number of sweeps, especially in low-$\lambda$ regimes (Liu et al., 2015).
- Large-scale linear SVMs: Dual coordinate methods with greedy selection attain faster sparsity and reduced computation (Karimireddy et al., 2018).
- Nonnegative matrix factorization and NQP: On pure NQP, GCD is orders of magnitude faster than CCD, RCD, and accelerated gradient methods, and achieves rapid convergence in matrix factorization quality (Wu et al., 2020).
- Deep model quantization: Greedy coordinate selection at the quantization-code level (as in CDQuant) consistently yields lower layerwise reconstruction error and outperforms cyclic/CD variants such as GPTQ, while scaling to large language models (Nair et al., 2024).
- High-dimensional empirical risk minimization (DP/ERM): In private optimization settings, GCD leverages structural sparsity to reduce the penalty incurred by differentially private selection, yielding a utility bound logarithmic rather than polynomial in dimension (Mangold et al., 2022).
- Inverse problems and system-solving: Greedy versions of Gauss–Seidel and block Kaczmarz schemes exhibit provably faster descent and wall-clock efficiency for linear and quadratic systems, especially when variable selection is judiciously adapted to problem structure (Zhang et al., 2020, Thoppe et al., 2014, Jin et al., 2022).
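The soft-thresholding coordinate update mentioned for sparse regression can be sketched as a greedy Lasso solver. The example assumes the objective $\tfrac{1}{2}\|Ax - y\|_2^2 + \lambda \|x\|_1$ and selects the coordinate whose exact one-dimensional minimization gives the largest decrease; it is a plain illustration, not the Ray-Refinement hybrid of Liu et al. (2015). Function names, the regularization value, and the test data are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def greedy_cd_lasso(A, y, lam, max_iters=20000, tol=1e-12):
    """Greedy CD sketch for min 0.5*||A x - y||^2 + lam*||x||_1."""
    n = A.shape[1]
    x = np.zeros(n)
    residual = -y.astype(float).copy()  # A x - y at x = 0
    col_sq = np.sum(A**2, axis=0)       # per-coordinate curvature ||A_i||^2
    for _ in range(max_iters):
        grad = A.T @ residual
        # Exact coordinate minimizers via soft-thresholding, written as steps.
        step = soft_threshold(x - grad / col_sq, lam / col_sq) - x
        # Exact decrease of the composite objective for each candidate step.
        decrease = -(grad * step + 0.5 * col_sq * step**2
                     + lam * (np.abs(x + step) - np.abs(x)))
        i = int(np.argmax(decrease))
        if decrease[i] <= tol:
            break
        x[i] += step[i]
        residual += step[i] * A[:, i]
    return x

def lasso_objective(A, y, lam, x):
    return 0.5 * np.sum((A @ x - y) ** 2) + lam * np.sum(np.abs(x))

rng = np.random.default_rng(5)
A_l = rng.standard_normal((50, 20))
y_l = rng.standard_normal(50)
lam_l = 2.0
x_l = greedy_cd_lasso(A_l, y_l, lam_l)
```

Recomputing the full gradient each iteration costs $O(mn)$ here for clarity; an implementation aimed at speed would maintain it incrementally or use a search structure for the selection step.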
5. Limitations, Variants, and Open Directions
Key limitations and active directions include:
- Selection cost: Greedy selection requires full or blockwise evaluation of decrease proxies, costing at least $O(n)$ per iteration (and more for block selection) unless structure or MIPS techniques are available (Karimireddy et al., 2018, Bo et al., 2012, Nutini et al., 2017).
- Scalability to very high dimensions: Parallel/thread-greedy and block-greedy schemes ameliorate per-iteration overhead but may entail tradeoffs in load balancing and atomic update costs, especially for highly sparse or clustered features (Scherrer et al., 2012, Scherrer et al., 2012).
- Nonconvex and discrete optimization: While greedy CD methods drive toward block-$k$ stationary points, global optima are not guaranteed without exhaustive combinatorial search. The dimension of the combinatorial subproblem is a practical bottleneck (usually only small $k$ is feasible) (Yuan et al., 2017).
- Worst-case bounds: For some accelerated or block-greedy variants, worst-case theoretical rates are known only under additional technical conditions or for specific classes of problems (e.g., strong or quadratic growth), though empirical speedups are robust (Lu et al., 2018, Nutini et al., 2017).
- Hybridization and adaptivity: Combining greedy updates with stochastic, block, or cyclic schemes (e.g., switching to uniform coordinate descent, UCD, at late stages) balances rapid early decrease with late-stage efficiency; hybrid rules remain an area of active algorithmic development (Karimireddy et al., 2018, Nutini et al., 2017).
6. Summary Table: Representative Variants and Guarantees
| Variant / Application | Greedy Rule Type | Theoretical Rate | Specialized Implementation | Reference |
|---|---|---|---|---|
| Composite convex (sparse SVM/L1) | Gauss–Southwell-prox | Linear ($n$-independent, strong conv.) | MIPS search, sublinear iteration cost | (Karimireddy et al., 2018) |
| Non-neg. quadratic programming | Optimal decrease per coordinate | Linear (positive definite Hessian) | Gradient maintenance for $O(n)$ update | (Wu et al., 2020) |
| Hybrid discrete/sparse opt. | Block-$k$ greedy | Block-$k$ stationary, linear (binary) | Exhaustive block search ($k$ small) | (Yuan et al., 2017) |
| Distributed block CD (L1-class.) | Surrogate decrease | Q-linear under strong convexity | Local greedy in block, AllReduce, line search | (Mahajan et al., 2014) |
| Quantized LLMs (CDQuant) | Greedy discrete coordinate | Finite, monotonic, local opt. | Full/Block search, per-row Hessian caching | (Nair et al., 2024) |
| Accelerated GCD (ASCD/AGCD) | Greedy (with momentum) | $O(1/t^2)$ (semi-greedy), heuristic | Hybrid random/greedy update policy | (Lu et al., 2018) |
| Gaussian process regression | Block greedy via obj. decrease | Linear, global opt. | Progressive block building, kernel subsampling | (Bo et al., 2012) |
| 2-coordinate equality-constr. | Max. gradient gap | Linear, $n$-independent (PL) | Sorting-based selection, block steepest descent | (Ramesh et al., 2023) |
7. Research Impact and Future Prospects
The greedy coordinate descent paradigm forms a unifying mechanism underlying many contemporary large-scale optimization methods, allowing practitioners to exploit problem structure, data sparsity, and low-dimensional active sets for accelerated convergence. The framework's flexibility supports parallel/distributed architectures, hybridized block/coordinate schemes, and integration with second-order updates (block cubic Newton), positioning GCD as a foundational, continually evolving family of algorithms for modern data-intensive optimization (Karimireddy et al., 2018, Nutini et al., 2017, Cristofari, 2024). Continued efforts in reducing selection overhead, integrating global optima seeking (as in combinatorial hybrids), and exploiting adaptive block formation, as well as theoretical refinement for nonconvex and constrained settings, are central areas of future investigation.