Greedy Block Coordinate Descent (GBCD)
- GBCD is a variable decomposition algorithm that employs greedy active set selection to maximize the per-iteration objective decrease, enforcing sparsity of each update through zero-norm constraints.
- It incrementally builds active sets by solving a constrained quadratic subproblem, thus reducing computational complexity in dense, high-dimensional settings.
- Empirical studies show GBCD achieves significant speedups and competitive predictive accuracy compared to traditional iterative solvers in large-scale Gaussian Process Regression.
Greedy Block Coordinate Descent (GBCD) is a family of variable decomposition algorithms that use a greedy strategy for selecting and updating blocks of variables in high-dimensional optimization, most notably applied to large-scale Gaussian Process Regression (GPR). In contrast to classical block coordinate descent (BCD) methods—which typically select variable blocks cyclically or by magnitude—GBCD incrementally builds active sets by directly maximizing the expected decrease in the objective function, enforcing sparsity through zero-norm constraints and enabling efficient solution of otherwise computationally intractable dense optimization problems (Bo et al., 2012). GBCD is designed to exploit the structure of kernel matrices and is particularly well-suited for problems where dense Cholesky decomposition or traditional iterative solvers such as Conjugate Gradient (CG) and SMO are infeasible due to scale.
1. Algorithmic Structure and Greedy Active Set Selection
GBCD operates on large-scale quadratic objectives arising in regularized GPR of the form
$$f(\alpha) = \tfrac{1}{2}\,\alpha^\top \bar{K}\,\alpha - y^\top \alpha,$$
where $\bar{K} = K + \sigma^2 I$ is the regularized covariance matrix and $\alpha$ are the dual variables. At each iteration, GBCD partitions the variables into (i) an active set $B$ of size $p$ to be updated and (ii) the remainder, which is held fixed.
The selection of $B$, the defining feature of GBCD, solves a zero-norm ($\ell_0$-) constrained optimization problem:
$$\min_{d}\; \tfrac{1}{2}\, d^\top \bar{K}\, d + g^\top d \quad \text{s.t.} \quad \|d\|_0 \le p,$$
where $g = \bar{K}\alpha - y$ is the current gradient [Equation (6), (Bo et al., 2012)]. While exact minimization is combinatorially hard, a greedy incremental approach is used: candidate variables are successively added to $B$ based on their maximal estimated reduction in the objective. For one-dimensional updates, the decrease can be evaluated analytically as
$$\Delta_i = \frac{\tilde{g}_i^{\,2}}{2\,\bar{K}_{ii}},$$
where $\tilde{g}_i$ accounts for the impact of both the current gradient and prior active variable selections [Equation (10), (Bo et al., 2012)].
This greedy active set construction mitigates redundancy found in cyclic or naive gradient-based selection schemes, as correlated directions are less likely to be chosen unless they provide uniquely large improvements.
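To make the selection rule concrete, the following NumPy sketch grows an active set greedily using the one-dimensional decrease score, with the gradient adjusted for the tentative update of coordinates already selected. It is an illustration of the idea rather than the paper's implementation: the function name `select_active_set` is ours, `Kbar` is assumed to be the full regularized covariance held in memory, and the adjustment re-solves the small active-set system directly instead of using the incremental formulas discussed below.

```python
import numpy as np

def select_active_set(Kbar, g, p):
    """Greedily grow an active set of size p.

    At each step the coordinate with the largest estimated one-dimensional
    decrease g_i^2 / (2 * Kbar_ii) is added, where the gradient is adjusted
    for the tentative optimal update of the coordinates selected so far.
    Illustrative sketch only (dense Kbar, no incremental inverse updates).
    """
    selected = []
    g_adj = g.copy()                              # gradient adjusted for tentative updates
    diag = np.diag(Kbar)
    for _ in range(p):
        scores = g_adj**2 / (2.0 * diag)          # estimated 1-D objective decrease
        if selected:
            scores[selected] = -np.inf            # never pick a coordinate twice
        selected.append(int(np.argmax(scores)))
        # Re-solve the small subproblem on the enlarged set and fold its
        # tentative update back into the adjusted gradient.
        B = np.array(selected)
        d_B = np.linalg.solve(Kbar[np.ix_(B, B)], -g[B])
        g_adj = g + Kbar[:, B] @ d_B
    return np.array(selected)
```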
2. Greedy Subproblem Formulation via Zero-Norm Constraint
The architecture of GBCD hinges on casting the selection of the active variable set as a zero-norm constrained (i.e., hard $p$-sparse) quadratic optimization:
$$\min_{d}\; \tfrac{1}{2}\, d^\top \bar{K}\, d + g^\top d \quad \text{s.t.} \quad \|d\|_0 \le p.$$
Explicitly restricting the update to at most $p$ coordinates (rather than, e.g., relying on soft-thresholding or $\ell_1$ proxies) enforces sparsity at the block level and localizes the computational complexity. Though the exact selection problem is combinatorial, the greedy incremental heuristic evaluates only $O(np)$ candidates over an outer iteration, which is tractable for typical block sizes ($p \ll n$).
The resultant subproblem, once $B$ is chosen, consists of updating $\alpha_B$ in the $p$-dimensional subspace. This update is performed by solving the small dense linear system $\bar{K}_{BB}\, d_B = -g_B$ corresponding to the active submatrix $\bar{K}_{BB}$. Efficient management of $\bar{K}_{BB}^{-1}$ via the Woodbury identity is used when growing $B$ incrementally [Equations (11)–(13), (Bo et al., 2012)], keeping the block inversion cost at $O(p^2)$ per addition.
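The incremental inverse maintenance can be illustrated with the standard block-matrix (Schur complement) inverse, which is what Woodbury-style updates amount to when one variable at a time joins the active set. The sketch below is a generic implementation of that identity; the helper name `grow_inverse` and the sanity check are ours, not taken from the paper.

```python
import numpy as np

def grow_inverse(A_inv, b, c):
    """Given A_inv = inv(A), return inv([[A, b], [b.T, c]]) in O(p^2)
    via the block-matrix (Schur complement) inverse.
    A_inv: (p, p), b: (p,) new off-diagonal column, c: new diagonal entry."""
    if A_inv.size == 0:                      # first variable added to the block
        return np.array([[1.0 / c]])
    v = A_inv @ b                            # A^{-1} b, costs O(p^2)
    s = c - b @ v                            # Schur complement (positive for PD Kbar)
    top_left = A_inv + np.outer(v, v) / s
    return np.block([[top_left,        -v[:, None] / s],
                     [-v[None, :] / s, np.array([[1.0 / s]])]])

# sanity check against a direct inverse on a small SPD matrix
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))
Kbar = X @ X.T + 1e-1 * np.eye(6)            # a small SPD "covariance" matrix
idx = [2, 4, 5]
A_inv = np.linalg.inv(Kbar[np.ix_(idx[:-1], idx[:-1])])
grown = grow_inverse(A_inv, Kbar[idx[:-1], idx[-1]], Kbar[idx[-1], idx[-1]])
assert np.allclose(grown, np.linalg.inv(Kbar[np.ix_(idx, idx)]))
```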
3. Quantification of Objective Function Decrease
A critical facet of GBCD's efficiency is the ability to rapidly evaluate the prospective decrease in the objective function upon selecting a candidate variable $i$ for $B$. The objective change restricted to coordinate $i$, after removing constants, is
$$\Delta f(d_i) = \tfrac{1}{2}\,\bar{K}_{ii}\, d_i^2 + g_i\, d_i,$$
which, in the one-dimensional case, is minimized for
$$d_i^{*} = -\frac{g_i}{\bar{K}_{ii}},$$
yielding a decrease of $\frac{g_i^2}{2\,\bar{K}_{ii}}$. Thus, selection is made greedily by maximal reduction.
The use of such "objective-sensitive" measures, as opposed to pure gradient magnitude, yields better per-iteration progress and accounts for conditioning and variable redundancy.
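A quick numerical check of this closed form (illustrative only; the random RBF kernel, the noise level $\sigma^2 = 0.1$, and the problem size are arbitrary choices, not taken from the paper):

```python
import numpy as np

# Verify that the predicted 1-D decrease g_i^2 / (2 Kbar_ii) matches the
# decrease actually achieved by the optimal single-coordinate step.
rng = np.random.default_rng(1)
n = 8
X = rng.standard_normal((n, 2))
K = np.exp(-0.5 * np.sum((X[:, None] - X[None, :])**2, axis=-1))  # RBF kernel
Kbar = K + 0.1 * np.eye(n)                                        # regularized covariance
y = rng.standard_normal(n)
alpha = rng.standard_normal(n)

f = lambda a: 0.5 * a @ Kbar @ a - y @ a
g = Kbar @ alpha - y                                              # current gradient

pred = g**2 / (2.0 * np.diag(Kbar))
actual = np.empty(n)
for i in range(n):
    a_new = alpha.copy()
    a_new[i] -= g[i] / Kbar[i, i]                                 # optimal 1-D step
    actual[i] = f(alpha) - f(a_new)
assert np.allclose(pred, actual)                                  # closed form matches exactly
```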
4. Convergence Guarantees and Computational Complexity
Formal analysis provides two main guarantees:
- Per-iteration decrease: for any active set $B$ with $g_B \neq 0$, each exact block update satisfies $f(\alpha^{t+1}) - f(\alpha^{t}) = -\tfrac{1}{2}\, g_B^\top \bar{K}_{BB}^{-1}\, g_B < 0$, ensuring monotonic descent [Theorem 1, (Bo et al., 2012)]; a short derivation is sketched after this list.
- Global convergence: the entire iterate sequence $\{\alpha^t\}$ converges to the unique minimizer of the dense GPR problem [Theorem 2, (Bo et al., 2012)].
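One way to see the monotone decrease (a standard completion-of-squares argument, included here for convenience rather than quoted from the paper): for a fixed active set $B$, with the inactive variables held constant,
$$f(\alpha + d) - f(\alpha) = \tfrac{1}{2}\, d_B^\top \bar{K}_{BB}\, d_B + g_B^\top d_B,$$
which is minimized at $d_B^{*} = -\bar{K}_{BB}^{-1} g_B$ with value
$$-\tfrac{1}{2}\, g_B^\top \bar{K}_{BB}^{-1}\, g_B \;\le\; -\frac{\|g_B\|^2}{2\,\lambda_{\max}(\bar{K}_{BB})},$$
which is strictly negative whenever $g_B \neq 0$, because $\bar{K}$ (and hence every principal submatrix $\bar{K}_{BB}$) is positive definite.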
Per-iteration computational complexity is dominated by:
Step | Cost per outer iteration (active set size $p$) |
---|---|
Covariance column evaluation | $O(ndp)$, with $d$ the input dimension |
Residual/error vector update | $O(np)$ |
Active set covariance inversion | $O(p^3)$, i.e. $O(p^2)$ per added variable |
Thus, the overall per-active-set cost is $O(ndp + np + p^3)$. When $p \ll n$, the method is scalable to problems with $n$ on the order of $10^5$ or more.
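For a rough sense of scale (illustrative numbers, not figures from the paper): with $n = 10^5$, input dimension $d = 20$, and block size $p = 100$, the per-iteration terms are roughly $ndp \approx 2\times 10^{8}$, $np = 10^{7}$, and $p^3 = 10^{6}$ operations, compared with the $n^3/3 \approx 3\times 10^{14}$ operations of a dense Cholesky factorization and the $n^2 = 10^{10}$ entries (about 80 GB in double precision) needed to store $\bar{K}$ explicitly.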
5. Empirical Performance Relative to CG, SMO, and Sparse Methods
Empirical experiments demonstrate that GBCD substantially reduces training time compared to both dense solvers (CG, SMO, block-cyclic and gradient-based coordinate descent) and standard sparse GP approximations:
- On the Sarcos dataset, GBCD attains a 10× speedup over block cyclic BCD and up to 59× over CG for dense GPR training.
- Predictive accuracy as measured by RMSE is on par with Cholesky-based methods (the gold standard), with GBCD reaching similar error in far less training time.
- Compared to sparse GPs (random subset selection, matching pursuit GP), GBCD achieves lower RMSE (e.g., a 55% improvement in RMSE on the Outaouais dataset), despite sparse methods sometimes having an advantage for test-time mean predictions.
- Convergence diagnostics (infinite-norm gradient, objective value, test error) exhibit substantially faster decrease for GBCD versus all baselines.
These results establish that for large, dense GPR, GBCD is preferable when both accuracy and computational tractability are critical.
6. Mathematical Formalism and Technical Implementation
Key mathematical steps:
- Quadratic objective: $f(\alpha) = \tfrac{1}{2}\,\alpha^\top \bar{K}\,\alpha - y^\top \alpha$, with $\bar{K} = K + \sigma^2 I$.
- Active set optimization: for a fixed active set $B$, solve $\bar{K}_{BB}\, d_B = -g_B$ for the update $d_B$.
- Incremental update of $\bar{K}_{BB}^{-1}$: employ Woodbury rank-1 updates as new variables are added to $B$ [Equations (12)–(13), (Bo et al., 2012)].
- One-dimensional greedy criterion: select the coordinate $i$ maximizing $\tilde{g}_i^{\,2} / (2\,\bar{K}_{ii})$.
- Global convergence rate: based on the positive definiteness of $\bar{K}$ and the block size $p$, with the per-iteration decrease quantified by $\tfrac{1}{2}\, g_B^\top \bar{K}_{BB}^{-1}\, g_B$.
Implementation also requires access to (or efficient computation of) covariance matrix columns $\bar{K}_{:,i}$, and strategies for incremental gradient updates when $n$ is large. Memory requirements are minimized by storing and iteratively updating only the active-set submatrix $\bar{K}_{BB}$ and the corresponding gradient entries.
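Pulling these pieces together, the following compact NumPy sketch shows the shape of an outer GBCD-style loop. It is a simplified illustration rather than the paper's algorithm: the active set is chosen as the top-$p$ coordinates of the one-dimensional score (not the full incremental greedy rule with Woodbury updates), $\bar{K}$ is held fully in memory, and the names `gbcd`, `outer_iters`, and `tol` are ours.

```python
import numpy as np

def gbcd(Kbar, y, p=50, outer_iters=200, tol=1e-4):
    """Minimize 0.5 * a' Kbar a - y' a by block coordinate descent
    over greedily scored blocks (simplified sketch, see lead-in)."""
    n = Kbar.shape[0]
    alpha = np.zeros(n)
    g = Kbar @ alpha - y                      # gradient at the current iterate
    diag = np.diag(Kbar)
    for _ in range(outer_iters):
        if np.max(np.abs(g)) < tol:           # infinite-norm gradient stopping rule
            break
        scores = g**2 / (2.0 * diag)          # estimated 1-D objective decreases
        B = np.argsort(scores)[-p:]           # simplified greedy selection (top-p)
        d_B = np.linalg.solve(Kbar[np.ix_(B, B)], -g[B])  # exact p-dim subproblem
        alpha[B] += d_B
        g += Kbar[:, B] @ d_B                 # O(np) incremental gradient update
    return alpha

# usage on a small synthetic problem
rng = np.random.default_rng(0)
n = 500
X = rng.standard_normal((n, 3))
K = np.exp(-0.5 * ((X[:, None] - X[None, :])**2).sum(-1))
Kbar = K + 0.1 * np.eye(n)
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)
alpha = gbcd(Kbar, y, p=25)
print(np.max(np.abs(Kbar @ alpha - y)))       # residual shrinks as iterations proceed
```

The $O(np)$ incremental gradient update after each block solve is what keeps the per-iteration cost linear in $n$; evaluating kernel columns on demand (rather than storing `Kbar`) removes the remaining $O(n^2)$ memory footprint.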
7. Practical Applications and Scope
GBCD addresses the key bottleneck in dense GPR, namely the $O(n^3)$ cost of direct matrix inversion and the expense of naive iterative solutions, by decomposing the problem and updating small, greedily chosen subsets of variables per iteration. Its particular strengths include:
- Large-scale dense kernel learning: Suited for datasets with $100,000+$ points, where standard GPR solvers or memory-demanding sparse approximations are ineffective.
- Flexible infrastructure: GBCD's greedy selection and small-block focus make it adaptable for parallelism and for use in resource-limited computational environments.
- Direct optimization of predictive errors: The greedy block rule tracks objective decrease explicitly, yielding high prediction accuracy with minimal overfitting.
Large-scale empirical studies validate the methodology as robust and computationally superior to both classical iterative and contemporary sparse GP approaches for dense kernel regression.
In summary, GBCD merges greedy, objective-decrease-based block selection with resource-efficient exact quadratic updates, yielding a method for dense GPR that is globally convergent and empirically competitive with or superior to established dense and sparse solvers for large-scale machine learning applications (Bo et al., 2012).