Discrete Coordinate Descent (DCD)
- Discrete Coordinate Descent (DCD) is a family of algorithms that solves high-dimensional discrete and combinatorial optimization problems through sequential blockwise updates, exactly optimizing one block of variables at a time while keeping the others fixed.
- The method employs both random and greedy block selection strategies to balance exploration and exploitation, ensuring convergence to block-stationary points.
- DCD delivers significant computational gains and near-optimal accuracy in applications such as sparse optimization, neural network pruning, adaptive filtering, and combinatorial search.
Discrete Coordinate Descent (DCD) refers to a family of optimization algorithms designed to efficiently solve high-dimensional discrete and combinatorial problems by sequentially or blockwise optimizing a subset of variables while keeping the rest fixed. The approach generalizes and extends classical coordinate descent techniques, traditionally applied to smooth or continuous domains, to settings where the single-variable updates are inherently discrete, non-convex, or combinatorial. In practice, DCD and its variants have been applied successfully to sparse and binary optimization, network activation pruning, adaptive filtering, and robust system identification, often delivering near-optimal accuracy and significant computational gains compared to both continuous relaxations and exhaustive combinatorial methods.
1. Algorithmic Framework and Problem Classes
Discrete Coordinate Descent is built around the principle of decomposing a global optimization problem with a separable or block-structured objective into a sequence of subproblems restricted to a subset of variables (typically called a working set) at each iteration. For a broad class of objectives

$$\min_{x} \; F(x) = f(x) + h(x),$$

where $f$ is smooth (frequently convex or strongly convex with an $L$-Lipschitz continuous gradient) and $h$ imposes discrete or structural constraints (e.g., $x_i \in \{0,1\}$ for binary variables, $\|x\|_0 \le s$ for enforced sparsity patterns), DCD proceeds by selecting a block of coordinates $B$ (of size $k$) and solving the subproblem exactly over $x_B$:

$$x_B^{t+1} \in \arg\min_{x_B} \; F(x_B, x_{B^c}^t),$$

where $B^c$ denotes the complement of $B$; the remaining coordinates are held fixed, $x_{B^c}^{t+1} = x_{B^c}^t$. In many applications, such as binary low-rank decompositions, L₀-constrained neural network pruning, or errors-in-variables system identification, this allows for problem-specific discrete update rules that are highly efficient and exploit the structure of the subproblem (Yuan et al., 2017, Rakhlin et al., 14 Nov 2025, Arablouei et al., 2014, Yu et al., 2019).
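To make the framework concrete, the following is a minimal Python sketch of DCD on a binary least-squares instance, $f(x) = \tfrac{1}{2}\|Ax - b\|^2$ with $x \in \{0,1\}^n$; the function name `dcd_binary` and all parameter defaults are illustrative, not taken from the cited papers.

```python
import itertools
import numpy as np

def dcd_binary(A, b, k=3, iters=200, rng=None):
    """Minimal DCD sketch: minimize f(x) = 0.5*||Ax - b||^2 over x in {0,1}^n
    by exactly re-optimizing a random block of k coordinates per iteration."""
    rng = np.random.default_rng(rng)
    n = A.shape[1]
    x = rng.integers(0, 2, size=n).astype(float)   # random binary start

    def f(x):
        r = A @ x - b
        return 0.5 * (r @ r)

    for _ in range(iters):
        B = rng.choice(n, size=k, replace=False)   # random working set
        best_val, best_bits = f(x), x[B].copy()
        for bits in itertools.product([0.0, 1.0], repeat=k):  # all 2^k assignments
            x[B] = bits
            val = f(x)
            if val < best_val:
                best_val, best_bits = val, np.array(bits)
        x[B] = best_bits                           # keep the exact block optimum
    return x
```

Each trial assignment re-evaluates the objective from scratch for readability; a practical implementation would maintain the residual $Ax - b$ incrementally, making each of the $2^k$ trials far cheaper than a full matrix-vector product.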
2. Block Selection and Search Strategies
The efficacy and quality of DCD algorithms are tightly linked to the choice of the working set $B$ and the search method within each block. Two dominant block selection paradigms exist:
- Random Block Selection: Uniformly sample a block of $k$ distinct coordinates at each iteration. This yields unbiased updates and facilitates strong theoretical guarantees for convergence in expectation. Exhaustive search over all $2^k$ block configurations is feasible for moderate $k$.
- Greedy and Hybrid Strategies: Compute for all coordinates the potential gain in objective value if toggled or sparsified; greedily select those with the maximum expected descent (e.g., drawing candidates from both the zero set and the nonzero support in sparsity-constrained problems). A hybrid approach alternates random and greedy steps to balance exploration and exploitation (Yuan et al., 2017).
These selection rules enable DCD to target blocks with the greatest local potential for improvement, mitigating common pathologies in purely greedy or random search and enhancing convergence to stronger stationary points.
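A sketch of the hybrid rule, under the same binary least-squares setting as above: on even iterations it samples a random block, and on odd iterations it scores the exact objective change of every single-coordinate flip and keeps the $k$ most promising. The function name `select_block` and the even/odd alternation schedule are illustrative assumptions.

```python
import numpy as np

def select_block(x, A, b, k, t, rng):
    """Hybrid block selection sketch for f(x) = 0.5*||Ax - b||^2, x binary."""
    n = x.size
    if t % 2 == 0:                       # exploration: uniformly random block
        return rng.choice(n, size=k, replace=False)
    r = A @ x - b
    s = 1.0 - 2.0 * x                    # flip direction: 0 -> +1, 1 -> -1
    # Exact change from flipping coordinate j alone:
    #   f(x + s_j * e_j) - f(x) = s_j * (a_j . r) + 0.5 * ||a_j||^2
    delta = s * (A.T @ r) + 0.5 * np.sum(A * A, axis=0)
    return np.argsort(delta)[:k]         # most negative deltas = steepest descent
```

Single-flip gains only approximate the joint effect of changing several coordinates at once, which is why alternating with random blocks helps escape configurations where greedy scores are jointly misleading.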
3. Update Rules and Computational Complexity
In discrete subproblems, exhaustive search enumerates all blockwise assignments (e.g., all $2^k$ assignments for binary variables), guaranteeing global optimality of the block update in $O(2^k)$ objective evaluations. For special cases, such as the dichotomous coordinate descent (DCD) employed in adaptive filtering and total least squares, update steps are realized via shift-and-add operations without multiplications, using power-of-two step sizes for each coordinate:
- Select the coordinate with the maximum residual.
- Apply the largest allowable additive or subtractive power-of-two step $\pm\alpha$ to that coordinate, halving $\alpha$ on stagnation.
- Repeat until a termination criterion (e.g., max bit-width, max updates) is met.
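A compact sketch of this inner loop in its leading-element (maximum-residual) form, for solving normal equations $Rh = \beta$; floating-point arithmetic stands in for the fixed-point shifts, and the parameter names ($H$ for the initial step amplitude, `Mb` for the bit budget, `Nu` for the update budget) follow common descriptions of dichotomous coordinate descent but should be read as illustrative.

```python
import numpy as np

def dcd_solve(R, beta, H=1.0, Mb=16, Nu=64):
    """Dichotomous coordinate descent sketch for R h = beta (R symmetric
    positive definite). All steps are powers of two, so in fixed-point
    hardware every update reduces to shifts and additions."""
    n = len(beta)
    h = np.zeros(n)
    r = beta.copy()                  # residual of the normal equations
    alpha, updates = H, 0            # current step size, total update count
    for _ in range(Mb):              # at most Mb step halvings (bit-depth)
        while updates < Nu:
            j = int(np.argmax(np.abs(r)))            # max-residual coordinate
            if abs(r[j]) <= (alpha / 2.0) * R[j, j]:
                break                                # stagnation at this step size
            s = 1.0 if r[j] > 0 else -1.0
            h[j] += s * alpha                        # add/subtract a power of two
            r -= s * alpha * R[:, j]                 # keep residual consistent
            updates += 1
        alpha /= 2.0                                 # halve the step and retry
    return h
```

Because every step adds or subtracts a power of two, both the coefficient and residual updates can be realized with shifts and additions in fixed-point hardware, which is the source of the multiplication-free property discussed below.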
For block sizes $k$ of 2–20, the exhaustive search is computationally feasible and the per-iteration cost remains sub-exponential in the problem size $n$. This yields a tradeoff between per-step cost and quality of stationary points; larger $k$ improves stationarity at the cost of increased search (Yuan et al., 2017, Arablouei et al., 2014). In adaptive filtering applications, for a tapped-delay or structured input, DCD-based updates reach $O(n)$ per-sample complexity, matching that of standard LMS filters while retaining the tracking and convergence of RLS (Yu et al., 2019, Arablouei et al., 2014).
| DCD Variant / Application | Main Block Update | Typical Block Size | Per-iteration Complexity |
|---|---|---|---|
| Binary/sparse DCD (Yuan et al., 2017) | Enumeration | $k \approx 2$–$20$ | $O(2^k)$ + block selection |
| DCD-RTLS (Arablouei et al., 2014) | Shift-and-add | 1 (coordinate-wise) | $O(n)$ (if shift-structured) |
| DCD-RLS (Yu et al., 2019) | Shift-and-add | 1 (coordinate-wise) | $O(n)$ |
| Block-DCD NN pruning (Rakhlin et al., 14 Nov 2025) | Block masking | problem-dependent | one surrogate evaluation per candidate block |
4. Optimality, Convergence and Stationarity
DCD algorithms achieve a rigorous form of descent: each block optimization yields a quantifiable decrease in the objective (a sufficient-descent property), and randomized block selection implies convergence in expectation to block-$k$ stationary points. Several stationarity notions are established:
- Basic-stationarity: Support-wise minimization for sparsity patterns.
- L-stationarity: Solution to a surrogate quadratic-penalized subproblem.
- Block-$k$ stationarity: Local minimality with respect to every coordinate block of size $k$.
An explicit hierarchy connects these: global optimality $\Rightarrow$ block-$k$ stationarity $\Rightarrow \cdots \Rightarrow$ block-$1$ stationarity $\Rightarrow$ L-stationarity, so larger blocks certify strictly stronger notions of optimality. For convex $f$, and $h$ encoding binary/sparse plus box constraints, randomized DCD reaches a block-$k$ stationary point in a bounded expected number of iterations (Yuan et al., 2017).
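Block-$k$ stationarity can be verified by brute force on small binary problems, which also makes the hierarchy tangible: any improving move available to a block of size $j < k$ is also available to a size-$k$ block containing it. The checker below is a hypothetical helper, with cost growing as $\binom{n}{k} 2^k$.

```python
import itertools
import numpy as np

def is_block_k_stationary(x, f, k, tol=1e-12):
    """Return True iff no reassignment of any k coordinates of the binary
    vector x strictly decreases f. Brute force: only for small n and k."""
    n = x.size
    fx = f(x)
    for B in itertools.combinations(range(n), k):
        for bits in itertools.product([0.0, 1.0], repeat=k):
            y = x.copy()
            y[list(B)] = bits
            if f(y) < fx - tol:
                return False          # an improving block move exists
    return True
```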
Surrogate function analysis in neural network pruning guarantees an $O(1/T)$ convergence rate for DCD over $T$ block updates, matching known bounds for continuous coordinate descent in convex settings (Rakhlin et al., 14 Nov 2025). In adaptive filtering, the mean-square deviation (MSD) and mean-square stability admit closed-form characterizations, with DCD-converged filters achieving MSDs that match those of exact recursive solutions (Arablouei et al., 2014, Yu et al., 2019).
5. Applications and Empirical Performance
DCD and its specialized instances are effective for high-dimensional problems exhibiting discrete structure or combinatorial constraints:
- Sparse and Binary Optimization: DCD achieves lower objective and estimation error compared to Orthogonal Matching Pursuit (OMP), Proximal Point Algorithms, and other greedy or convex-relaxation methods, both in regularized least-squares and sparsity-constrained regimes. DCD is robust to ill-conditioning and outliers (Yuan et al., 2017).
- Network Pruning and Discrete Neural Architecture Search: Discrete (block) coordinate descent is applied directly to binary mask variables controlling activation function deployment in large neural networks (e.g., ReLU reduction under L₀ constraints). DCD consistently outperforms smooth relaxation and thresholding methods, achieving higher accuracy for a given network sparsity, and provably avoids mask-leakage effects observed in continuous relaxations (Rakhlin et al., 14 Nov 2025); a mask-swap sketch follows this list.
- Adaptive Filtering and System Identification: Dichotomous coordinate descent, as an inner loop in robust recursive least squares (RLS) and recursive total least-squares (RTLS), dramatically reduces computational burden while maintaining or surpassing traditional line-search methods in tracking and steady-state error. It supports robust weighting and variable-forgetting-factor adaptations crucial for stability under impulsive noise and system change (Arablouei et al., 2014, Yu et al., 2019).
- Dense Subgraph and Combinatorial Optimization: DCD yields the densest subgraphs and lowest objectives across a range of real-world network benchmarks using blockwise combinatorial searches, outperforming greedy heuristics and continuous relaxations (Yuan et al., 2017).
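As referenced in the pruning bullet above, a minimal sketch of a budget-preserving mask update follows; the callable `loss` (e.g., validation loss of the masked network) and the one-in/one-out swap schedule are illustrative assumptions, not the cited paper's exact procedure.

```python
import numpy as np

def dcd_mask_swap(loss, m, steps=100, rng=None):
    """Sketch of DCD on a binary mask under a fixed L0 budget: each step
    proposes swapping one active bit with one inactive bit (a size-2 block
    move that preserves ||m||_0) and keeps it only if the loss improves.
    Assumes 0 < ||m||_0 < len(m); `loss` is any callable scoring a mask."""
    rng = np.random.default_rng(rng)
    m = m.copy()
    cur = loss(m)
    for _ in range(steps):
        on = np.flatnonzero(m == 1)
        off = np.flatnonzero(m == 0)
        i, j = rng.choice(on), rng.choice(off)   # random size-2 working set
        cand = m.copy()
        cand[i], cand[j] = 0, 1                  # swap keeps the sparsity budget
        val = loss(cand)
        if val < cur:                            # exact block optimum over the
            m, cur = cand, val                   # two feasible assignments
    return m
```

Within the chosen pair, only two assignments respect the budget (keep or swap), so accepting the better one is exactly the blockwise-exact update of the general framework restricted to the feasible set.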
Empirical studies consistently demonstrate that DCD methods, with moderate block sizes and hybrid block selection, converge in orders-of-magnitude fewer iterations than competing methods, and rapidly recover sparse or binary optima with minimal accuracy loss or even improved accuracy relative to baseline algorithms.
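As a small sanity check of the near-optimality claim on a toy instance (reusing the hypothetical `dcd_binary` sketch from Section 1; sizes chosen arbitrarily), DCD's objective can be compared against the exhaustively computed optimum:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
A, b = rng.normal(size=(20, 10)), rng.normal(size=20)

x = dcd_binary(A, b, k=3, iters=300, rng=0)                # DCD solution
f = lambda v: 0.5 * np.sum((A @ np.asarray(v) - b) ** 2)
best = min(f(v) for v in itertools.product([0.0, 1.0], repeat=10))  # 2^10 brute force
print(f(x), best)   # DCD objective vs. global optimum
```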
6. Theoretical and Practical Considerations
DCD inherits and extends several theoretical advantages of coordinate descent to the discrete or combinatorial domain:
- Descent and Convergence: Each block update guarantees sufficient objective decrease; randomized strategies ensure convergence (in expectation) to block-$k$ stationary points.
- Complexity Control: Explicit block size and per-block search allow practical tradeoffs between computational feasibility and solution quality; randomization provides coverage of the solution space.
- Robustness: DCD integrates seamlessly with robust penalty schemes (e.g., Huber, MCC) and adaptive mechanisms (e.g., variable forgetting factors in RLS).
- Multiplication-free Execution: Specialized update schemes (e.g., shift-and-add DCD) eliminate multiplications, further reducing complexity in high-dimensional or streaming data contexts.
Default parameter settings (such as block size $k$, step control, and averaging rates) are found to be robust across a wide variety of problems. Warm-start and hybrid block selection strategies enhance convergence and efficiency, especially in large-scale or highly non-uniform settings (Rakhlin et al., 14 Nov 2025, Glasmachers et al., 2014, Yu et al., 2019).
7. Impact on Modern Optimization and Future Directions
Discrete Coordinate Descent has advanced the state of the art in both the theoretical understanding and the practical application of optimization in discrete domains. By bridging blockwise combinatorial search and coordinate descent with structural adaptivity and complexity control, DCD frameworks have enabled scalable solutions to previously intractable large-scale problems in machine learning, signal processing, and combinatorial mathematics. The approach's flexibility, descent guarantees, and empirical advantages point to ongoing extensions: integration with mixed-integer programming solvers for larger blocks, adaptive schedules for block size and update frequency, and further generalizations to structured nonconvex domains.
Recent work further demonstrates DCD's utility as a plug-in acceleration module for continuous solvers, a direct attack on L₀-constrained learning tasks, and a robust primitive for online and adaptive algorithms in non-stationary environments (Rakhlin et al., 14 Nov 2025, Yuan et al., 2017, Arablouei et al., 2014, Yu et al., 2019, Glasmachers et al., 2014).