Greedy Coordinate-Gradient Hybrid Optimization

Updated 19 April 2026

The Greedy Coordinate-Gradient Hybrid is an optimization method that combines coordinate-wise greedy selection with gradient-based updates.
It uses a threshold rule to choose, per coordinate, between a cheap gradient descent step and a more thorough one-dimensional search that sharpens local minimization.
Empirical and theoretical analyses report faster per-epoch convergence than pure gradient descent and robust performance on high-dimensional and composite objectives.

A Greedy Coordinate-Gradient Hybrid algorithm refers to a class of optimization methodologies that combine greedy coordinate-wise selection strategies with gradient (or subgradient/proximal) information, often selectively mixing coordinate descent steps with gradient-descent or other updates such as explicit line search, cubic Newton, or combinatorial search. These methods are designed to exploit both problem structure (such as sparsity or smoothness) and computational advantages from decomposable updates, yielding improved convergence rates, sharper minimization on “problematic” coordinates, and, frequently, strong empirical performance on large-scale, composite, or overparameterized models.

1. Algorithmic Framework and Motivation

The Greedy Coordinate-Gradient Hybrid paradigm addresses the inefficiencies of pure gradient descent (GD) and coordinate descent (CD). In GD, updates are distributed globally but may be inefficient when only a few parameters are far from optimality. CD, especially with random selection, can be slow in high dimensions due to uniform treatment of all coordinates. The hybrid approach evaluates per-coordinate gradients and applies a greedy criterion to assign, for each coordinate, either a fast gradient-based update or a more refined subroutine (e.g., one-dimensional line search, proximal or cubic minimization) (Hsiao et al., 2024, Karimireddy et al., 2018, Cristofari, 2024, Yuan et al., 2017).

For instance, in the neural network training context (Hsiao et al., 2024), the method determines, for each parameter $\theta_j$, whether $|g_j|$ surpasses a threshold $\tau$. If so, a GD step is performed; otherwise, a one-dimensional line search is executed along $e_j$. This design ensures that coordinates with large gradients benefit from the speed of GD, while “stagnant” or near-critical coordinates are refined thoroughly, exploiting potentially nonconvex or ill-behaved landscapes more efficiently than either method alone.

2. Formal Problem Statement and Update Rules

Let $\theta \in \mathbb{R}^d$ denote the aggregated parameter vector (across model weights and biases). The aim is to minimize an objective, typically of the form:

$L(\theta) = \frac{1}{2}\sum_{i=1}^n (f(\theta; X_i) - y_i)^2$

where $f$ specifies, for example, a two-layer ReLU network:

$f(W,A,x) = \frac{1}{\sqrt{m}}\sum_{r=1}^m a_r \sigma(w_r^T x), \;\;\; \sigma(z) = \max\{z, 0\}$

For each coordinate $j$, the partial gradient is:

$g_j(\theta) \equiv \frac{\partial L}{\partial \theta_j} = \sum_{i=1}^n (f(\theta; X_i) - y_i) \frac{\partial f(\theta; X_i)}{\partial \theta_j}$
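The loss and partial gradients above can be sketched for the two-layer ReLU model in plain Python. This is a minimal illustration, not the reference implementation of any cited paper: the function names (`predict`, `loss`, `gradients`) and the list-of-lists parameter layout are assumptions, and the subgradient of the ReLU at zero is taken to be 0.

```python
import math
import random

def relu(z):
    return z if z > 0.0 else 0.0

def predict(W, a, x):
    """f(W, a, x) = (1/sqrt(m)) * sum_r a_r * relu(w_r . x)."""
    m = len(W)
    total = 0.0
    for r in range(m):
        pre = sum(wk * xk for wk, xk in zip(W[r], x))
        total += a[r] * relu(pre)
    return total / math.sqrt(m)

def loss(W, a, X, y):
    """L = 0.5 * sum_i (f(x_i) - y_i)^2."""
    return 0.5 * sum((predict(W, a, xi) - yi) ** 2 for xi, yi in zip(X, y))

def gradients(W, a, X, y):
    """Return (dL/dW, dL/da), i.e. g_j for every weight and output coefficient."""
    m, d = len(W), len(W[0])
    gW = [[0.0] * d for _ in range(m)]
    ga = [0.0] * m
    scale = 1.0 / math.sqrt(m)
    for xi, yi in zip(X, y):
        residual = predict(W, a, xi) - yi          # f(theta; X_i) - y_i
        for r in range(m):
            pre = sum(wk * xk for wk, xk in zip(W[r], xi))
            if pre > 0.0:                          # ReLU active: derivative is 1
                ga[r] += residual * scale * pre
                for k in range(d):
                    gW[r][k] += residual * scale * a[r] * xi[k]
    return gW, ga
```

A central finite-difference comparison against `loss` is a convenient sanity check for the analytic gradients, provided no pre-activation sits exactly at the ReLU kink.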

The Greedy Coordinate-Gradient Hybrid update rule is:

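For each coordinate $j$, if $|g_j(\theta)| > \tau$, perform a gradient (or subgradient) descent step with step size $\eta$:

$\theta_j \leftarrow \theta_j - \eta\, g_j(\theta)$

Else, perform a one-dimensional search (line search or combinatorial minimization) along $e_j$:

$\theta_j \leftarrow \theta_j + \alpha_j^\ast, \;\;\; \alpha_j^\ast \in \arg\min_{\alpha} L(\theta + \alpha e_j)$

The threshold $\tau$ tunes the tradeoff between rapid, inexpensive descent (for large-gradient coordinates) and expensive, thorough local minimization (for small-gradient directions).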
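The threshold rule for a single coordinate can be sketched as follows. The names `hybrid_step` and `golden_section`, the step size `eta`, and the search bracket `radius` are illustrative assumptions; golden-section search is swapped in here as one possible one-dimensional search and assumes the restricted loss is unimodal on the bracket.

```python
import math

def golden_section(phi, lo, hi, tol=1e-6):
    """Minimize a unimodal scalar function phi on the interval [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    fc, fd = phi(c), phi(d)
    while hi - lo > tol:
        if fc < fd:                      # minimum lies in [lo, d]
            hi, d, fd = d, c, fc
            c = hi - inv_phi * (hi - lo)
            fc = phi(c)
        else:                            # minimum lies in [c, hi]
            lo, c, fc = c, d, fd
            d = lo + inv_phi * (hi - lo)
            fd = phi(d)
    return 0.5 * (lo + hi)

def hybrid_step(loss, grad_j, theta, j, tau, eta, radius=1.0):
    """Return the new value of theta[j] under the threshold rule."""
    g = grad_j(theta, j)
    if abs(g) > tau:
        return theta[j] - eta * g        # large gradient: cheap gradient step

    def phi(alpha):                      # loss restricted to theta + alpha * e_j
        trial = list(theta)
        trial[j] += alpha
        return loss(trial)

    # small gradient: one-dimensional search within the assumed bracket
    return theta[j] + golden_section(phi, -radius, radius)
```

On a separable quadratic such as $L(\theta) = \tfrac{1}{2}(10(\theta_0 - 1)^2 + (\theta_1 - 2)^2)$ with $\tau = 0.5$, a point like $(0, 1.9)$ sends the first coordinate to the gradient step and the second to the line search.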

The above framework is extended, for composite or discrete objectives, to interleave greedy/randomized selection of coordinate blocks, global search within the block (e.g., combinatorial enumeration for support patterns), or coordinate-wise cubic Newton updates (Yuan et al., 2017, Cristofari, 2024).

3. Pseudocode, Key Equations, and Variants

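A typical epoch of the greedy coordinate-gradient hybrid (in neural network regression) proceeds, in outline, as follows (Hsiao et al., 2024):

Compute the partial gradients $g_j(\theta)$ for all coordinates $j = 1, \dots, d$ at the current iterate.
For each coordinate with $|g_j| > \tau$, propose a gradient step.
For each remaining coordinate, propose the result of a one-dimensional search along $e_j$.
Apply the proposed updates to obtain the next iterate.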

The crucial equations:

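Gradient update: $\theta_j \leftarrow \theta_j - \eta\, g_j(\theta)$, with step size $\eta$,
Line search: $\theta_j \leftarrow \theta_j + \arg\min_{\alpha} L(\theta + \alpha e_j)$,
Jacobi step: $\theta^{(t+1)} = \theta^{(t)} + \sum_{j=1}^d \Delta_j(\theta^{(t)})\, e_j$, where every coordinate change $\Delta_j$ is computed from the same iterate $\theta^{(t)}$ and applied simultaneously.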
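A full epoch can be sketched on a simpler model than the network above. The example assumes a linear least-squares loss $L(\theta) = \tfrac{1}{2}\sum_i (x_i^T\theta - y_i)^2$, for which the one-dimensional minimizer along $e_j$ has a closed form, and it assumes that "Jacobi step" carries its usual meaning: all coordinate proposals are computed from the same iterate and applied together. The function name `hybrid_jacobi_epoch` is hypothetical.

```python
def hybrid_jacobi_epoch(X, y, theta, tau, eta):
    """One Jacobi-style epoch of the threshold rule on 0.5 * ||X theta - y||^2.

    Assumes every column of X is nonzero, so each coordinate curvature is positive.
    """
    n, d = len(X), len(theta)
    residual = [sum(X[i][k] * theta[k] for k in range(d)) - y[i] for i in range(n)]
    grad = [sum(residual[i] * X[i][j] for i in range(n)) for j in range(d)]
    curv = [sum(X[i][j] ** 2 for i in range(n)) for j in range(d)]

    new_theta = list(theta)
    for j in range(d):                   # every proposal uses the old iterate
        if abs(grad[j]) > tau:
            new_theta[j] = theta[j] - eta * grad[j]        # gradient step
        else:
            new_theta[j] = theta[j] - grad[j] / curv[j]    # exact 1D minimizer
    return new_theta

# Usage on a small, well-conditioned problem with exact solution (1, -2).
X = [[1.0, 0.2], [0.2, 1.0], [0.1, -0.1]]
y = [0.6, -1.8, 0.3]
theta = [0.0, 0.0]
for _ in range(200):
    theta = hybrid_jacobi_epoch(X, y, theta, tau=0.5, eta=0.5)
```

Because the proposals are applied simultaneously, a Jacobi-style epoch is not guaranteed to decrease the loss on strongly coupled problems; the example matrix is chosen so that $X^T X$ is diagonally dominant and the iteration contracts.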

For hybrid block methods, e.g., block cubic Newton (Cristofari, 2024), the greedy (Gauss–Southwell) rule selects the block of variables with the largest first-order stationarity violation, and the update is an approximate minimizer of a cubic model of the objective over that block.
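The greedy selection and cubic-model update can be illustrated in the simplest setting, blocks of size one. The sketch below is a simplification under stated assumptions, not the method of Cristofari (2024): the caller supplies the gradient, the diagonal Hessian entries, and a constant `M` standing in for the Lipschitz constant of the Hessian, and the one-dimensional cubic model $m(s) = g s + \tfrac{1}{2} h s^2 + \tfrac{M}{6}|s|^3$ is minimized in closed form. The function names are hypothetical.

```python
import math

def cubic_step(g, h, M):
    """Global minimizer of m(s) = g*s + 0.5*h*s**2 + (M/6)*|s|**3, with M > 0."""
    # The minimizer moves against the gradient; its length t solves
    # -|g| + h*t + (M/2)*t**2 = 0 with t >= 0.
    t = (-h + math.sqrt(h * h + 2.0 * M * abs(g))) / M
    return -t if g > 0 else t            # for g == 0 and h < 0 either sign is optimal

def greedy_cubic_update(grad, hess_diag, M):
    """Pick the coordinate with the largest |g_j| and return (j, step)."""
    j = max(range(len(grad)), key=lambda k: abs(grad[k]))
    return j, cubic_step(grad[j], hess_diag[j], M)

j, s = greedy_cubic_update([0.1, -2.0, 0.5], [1.0, 1.0, 1.0], M=4.0)
```

The returned step satisfies the stationarity condition of the model, $g + h s + \tfrac{M}{2} s |s| = 0$, which gives a quick way to check the closed form.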

4. Convergence Properties and Computational Complexity

Empirical results and theoretical analyses demonstrate that hybrid methods:

Consistently achieve lower objective/empirical loss per epoch than pure GD (Hsiao et al., 2024, Karimireddy et al., 2018, Yuan et al., 2017).
For composite, strongly convex objectives, guarantee dimension-independent Q-linear convergence rates for the GS-s hybrid (Karimireddy et al., 2018).
For combinatorial block search hybrids, show global convergence (in expectation) to block-$k$ stationary points, with explicit rates that become strictly linear (Q-linear) in support-stabilized phases (Yuan et al., 2017).
For the block cubic Newton hybrid (Cristofari, 2024), converge globally to stationary points, needing at most $\mathcal{O}(\epsilon^{-3/2})$ iterations to drive the stationarity violation of the selected block below a tolerance $\epsilon$ and at most $\mathcal{O}(\epsilon^{-2})$ iterations for full stationarity.

Wall-clock cost depends on the computational bottleneck: coordinate-wise line search or combinatorial search is expensive, but the steps are independent and can be parallelized. In practice, the threshold $\tau$ controls how many coordinates receive the expensive search (under the rule above, a smaller $\tau$ sends fewer coordinates to line search), and the searches that remain can exploit GPU/CPU parallelism, often closing the gap with highly optimized GD in wall time (Hsiao et al., 2024).

5. Comparison with Pure Coordinate and Pure Gradient Approaches

Scheme	Update performed	Convergence characteristics	Computational features
Gradient Descent	All coordinates, gradient	Fast descent for large-gradient dirs	Cheap per-iteration; can be suboptimal
Coordinate Descent	1D line search or step	Careful descent, high local accuracy	Can stall if no coordinate makes substantive progress
Hybrid	Greedy: GD or LS per-dir	Combines rapid large-step movement with fine local minimization; empirically better per epoch	Coordination cost, higher per-iteration cost if not parallelized

The hybrid interpolates between the two extremes: as $\tau \to \infty$, every coordinate falls below the threshold and it behaves as pure line search CD; as $\tau \to 0$, it recovers pure GD. By treating only small-gradient coordinates with extra care, it avoids the inefficiency of sweeping all coordinates with heavy search at every step.
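Under the rule stated in Section 2 (gradient step when $|g_j| > \tau$, one-dimensional search otherwise), the two limiting regimes can be checked by counting how a gradient vector is partitioned for different thresholds. The helper name `split_by_threshold` and the sample gradient values are illustrative only.

```python
def split_by_threshold(grads, tau):
    """Return (gd_coords, ls_coords): indices sent to gradient steps vs. line search."""
    gd_coords = [j for j, g in enumerate(grads) if abs(g) > tau]
    ls_coords = [j for j, g in enumerate(grads) if abs(g) <= tau]
    return gd_coords, ls_coords

grads = [0.01, -0.3, 2.0, -5.0]
all_gd = split_by_threshold(grads, 0.0)           # every nonzero gradient takes a GD step
mixed = split_by_threshold(grads, 1.0)            # only small gradients are searched
all_ls = split_by_threshold(grads, float("inf"))  # every coordinate is searched
```

A coordinate whose gradient is exactly zero always falls on the line-search side, so the pure-GD limit holds for coordinates with nonzero gradient.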

6. Extensions: Composite, Discrete, and Second-Order Problems

For composite objectives with nonsmooth but separable regularization (e.g., an $\ell_1$ penalty), greedy coordinate selection with proximal updates yields linear convergence independent of ambient dimension, with the greedy selection implementable via maximum inner product search (MIPS) (Karimireddy et al., 2018).
For discrete optimization (e.g., binary or $s$-sparse constraints), the hybrid uses greedy/randomized block selection and then global combinatorial search on the block to escape shallow local minima (Yuan et al., 2017).
Block cubic Newton hybrids employ greedy block selection (Gauss–Southwell) and apply regularized higher-order updates within the block, combining rapid decrease (via second-order information) and computational tractability for large problems (Cristofari, 2024).
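For the composite case, greedy selection with a proximal coordinate update can be sketched on an $\ell_1$-regularized least-squares problem, $\tfrac{1}{2}\|X\theta - y\|^2 + \lambda\|\theta\|_1$. The sketch assumes the GS-s criterion in its commonly stated form (pick the coordinate whose minimum-norm subgradient is largest in magnitude) and pairs it with a standard soft-thresholding step; it is not the exact algorithm of Karimireddy et al. (2018), and the function names and the tolerance `tol` are hypothetical.

```python
def soft_threshold(z, t):
    """Proximal operator of t * |.| evaluated at z."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def gs_s_score(g, theta_j, lam):
    """Magnitude of the minimum-norm subgradient of the objective along coordinate j."""
    if theta_j > 0.0:
        return abs(g + lam)
    if theta_j < 0.0:
        return abs(g - lam)
    return max(abs(g) - lam, 0.0)

def greedy_prox_cd(X, y, lam, iters, tol=1e-12):
    """Greedy proximal coordinate descent; assumes every column of X is nonzero."""
    n, d = len(X), len(X[0])
    theta = [0.0] * d
    curv = [sum(X[i][j] ** 2 for i in range(n)) for j in range(d)]
    residual = [-yi for yi in y]                  # X theta - y at theta = 0
    for _ in range(iters):
        grad = [sum(residual[i] * X[i][j] for i in range(n)) for j in range(d)]
        scores = [gs_s_score(grad[j], theta[j], lam) for j in range(d)]
        j = max(range(d), key=scores.__getitem__)
        if scores[j] <= tol:                      # no coordinate violates optimality
            break
        new = soft_threshold(theta[j] - grad[j] / curv[j], lam / curv[j])
        delta = new - theta[j]
        for i in range(n):                        # keep the residual in sync
            residual[i] += delta * X[i][j]
        theta[j] = new
    return theta

# With an identity design the solution is the soft-thresholded target.
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
theta = greedy_prox_cd(X, [3.0, -0.5, 1.5], lam=1.0, iters=10)
```

Recomputing the full gradient at every iteration keeps the sketch short; the cited work's point is that the greedy coordinate can instead be found with a nearest-neighbor style search.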

7. Practical Implementation Issues and Empirical Observations

The key implementation points include:

Per-coordinate/coordinate-block proposals can be computed in parallel, enabling substantial acceleration via hardware parallelism.
For high-dimensional problems, hybrid schemes admit significant speedups by focusing computation only where it is most effective (e.g., only the most informative coordinates per iteration) (Karimireddy et al., 2018, Mangold et al., 2022).
Empirical results on neural networks, sparse regression, logistic regression, and compressed sensing demonstrate that, for fixed computational budgets, the hybrid typically attains lower losses or support recovery error compared to pure methods (Hsiao et al., 2024, Karimireddy et al., 2018, Yuan et al., 2017, Cristofari, 2024).
The choice of threshold $\tau$ (or, for block methods, the block size $k$) critically affects the tradeoff between per-iteration cost and convergence per epoch.

References

"Hybrid Coordinate Descent for Efficient Neural Network Learning Using Line Search and Gradient Descent" (Hsiao et al., 2024)
"Efficient Greedy Coordinate Descent for Composite Problems" (Karimireddy et al., 2018)
"A Hybrid Method of Combinatorial Search and Coordinate Descent for Discrete Optimization" (Yuan et al., 2017)
"Block cubic Newton with greedy selection" (Cristofari, 2024)
"High-Dimensional Private Empirical Risk Minimization by Greedy Coordinate Descent" (Mangold et al., 2022)