MeZO-BCD: Block Coordinate Descent
- Block Coordinate Descent (MeZO-BCD) is a method that partitions parameters into blocks and updates one block at a time using gradient or zeroth-order information.
- It reduces computational load and memory usage by updating a single block per iteration, enabling scalable large-model training and distributed optimization.
- Variants like pairwise comparison and Markov chain block selection offer convergence guarantees across convex and nonconvex settings.
Block Coordinate Descent (MeZO-BCD) refers to a family of optimization methods that partition the parameter space into disjoint blocks and perform updates—either via gradient or zeroth-order (black-box) information—restricted to one block at a time. Specialized instantiations of MeZO-BCD have proven effective for large-scale model training, zeroth-order (ZO) optimization, distributed and parallel systems, and settings where memory or computational constraints preclude full-parameter updates. Variants encompass gradient-based BCD, zeroth-order BCD with finite-difference or pairwise comparison oracles, and stochastic/hybrid block selection rules.
1. Mathematical Formulation and Algorithmic Principles
MeZO-BCD formalizes the unconstrained minimization problem
$$\min_{x \in \mathbb{R}^d} f(x)$$
by partitioning $x$ into disjoint blocks $x = (x_{(1)}, \dots, x_{(B)})$, each $x_{(b)} \in \mathbb{R}^{d_b}$ with $\sum_{b=1}^{B} d_b = d$. At each iteration $t$, an active block $b_t$ is selected (cyclically, by random permutation, or via a Markov chain), and only this block is updated using local information.
For gradient-based BCD, the update is
$$x_{(b_t)}^{t+1} = x_{(b_t)}^{t} - \eta_t \nabla_{(b_t)} f(x^t),$$
with all other blocks frozen. For zeroth-order MeZO-BCD (blockwise finite differences), the estimator for coordinates in block $b_t$ is
$$\hat g_{(b_t)}(x) = \frac{f(x + \mu u_{(b_t)}) - f(x - \mu u_{(b_t)})}{2\mu}\, u_{(b_t)},$$
with $u_{(b_t)}$ a random direction supported on the active block, $\mu > 0$ a smoothing parameter, and the update applied in block $b_t$ only. In pairwise-comparison MeZO-BCD, blockwise scalar line searches approximate descent directions without access to function values or gradients (Matsui et al., 2014).
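A single blockwise zeroth-order step of this form can be sketched in a few lines of NumPy (a minimal sketch, not the cited implementations; function and variable names are illustrative):

```python
import numpy as np

def blockwise_zo_step(f, x, block, mu=1e-3, lr=1e-2, rng=None):
    """One MeZO-BCD step: a two-point finite-difference estimate restricted
    to `block`, then a descent update on that block only (names illustrative)."""
    rng = np.random.default_rng() if rng is None else rng
    u = np.zeros_like(x)
    u[block] = rng.standard_normal(len(block))   # perturbation supported on the block
    # Central finite difference along u: exactly 2 loss queries
    coeff = (f(x + mu * u) - f(x - mu * u)) / (2 * mu)
    x_new = x.copy()
    x_new[block] -= lr * coeff * u[block]        # all other blocks stay frozen
    return x_new

# Usage: minimize a simple quadratic, cycling over two blocks
f = lambda x: 0.5 * np.dot(x, x)
x = np.ones(6)
blocks = [np.arange(0, 3), np.arange(3, 6)]
rng = np.random.default_rng(0)
for t in range(500):
    x = blockwise_zo_step(f, x, blocks[t % 2], rng=rng)
```

Note that each step touches only one block's coordinates, so the perturbation vector and update can be stored at block rather than full-model size.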
2. Zeroth-Order Block Coordinate Descent: Theory and Implementation
For black-box objectives, MeZO-BCD restricts finite-difference or Gaussian smoothing to a single block at each iteration, drastically reducing per-step evaluation cost and memory. The method maintains unbiasedness for the smoothed blockwise gradient:
$$\mathbb{E}_{u}\big[\hat g_{(b)}(x)\big] = \nabla_{(b)} f_\mu(x),$$
where $f_\mu$ is the $\mu$-smoothed objective. The per-step estimator variance is bounded on the order of
$$\mathbb{E}\big\|\hat g_{(b)}(x)\big\|^2 \le O(d_b)\,\big\|\nabla_{(b)} f(x)\big\|^2 + O\big(\mu^2 L^2 d_b^3\big),$$
showing that smaller block sizes dramatically decrease variance per coordinate compared to full finite-difference over all $d$ coordinates (Park et al., 31 Jan 2025).
The notion of effective overlap (trace-normalized blockwise Hessian mass, $\gamma_{(b)} = \mathrm{tr}(H_{(b)})/\mathrm{tr}(H)$) quantifies subspace alignment. If $\gamma_{(b)}$ is high, block perturbations align with important curvature directions, preserving descent (Park et al., 31 Jan 2025). Pseudocode for ZO MeZO-BCD iterates over blocks, performs $2k$ loss queries per step (where $k$ is the number of perturbation directions sampled for the active block), and updates only the active block.
| Variant | Iteration Query Cost | Memory Overhead | Block Update Rule |
|---|---|---|---|
| Full ZO MeZO | $2d$ | All coordinates | Full finite-difference |
| Blockwise MeZO-BCD | $2d_b$ | One block per iteration | Blockwise finite-difference |
| Pairwise Comp MeZO-BCD | Comparison queries per iter (oracle-dependent) | Block only | Oracle-based line search |
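The $2k$-query blockwise iteration can be sketched as a full optimization loop (a minimal NumPy sketch under the blockwise estimator above; names, block layout, and hyperparameters are illustrative, not from the cited papers):

```python
import numpy as np

def mezo_bcd_iterate(f, x, blocks, k=4, mu=1e-3, lr=2e-2, steps=200, seed=0):
    """Cyclic ZO MeZO-BCD: each step averages k two-point estimates
    (2k loss queries total) restricted to the active block."""
    rng = np.random.default_rng(seed)
    x = x.copy()
    for t in range(steps):
        idx = blocks[t % len(blocks)]          # active block (cyclic selection)
        g = np.zeros(len(idx))
        for _ in range(k):                     # k directions -> 2k loss queries
            u = np.zeros_like(x)
            u[idx] = rng.standard_normal(len(idx))
            coeff = (f(x + mu * u) - f(x - mu * u)) / (2 * mu)
            g += coeff * u[idx] / k            # average of the k estimates
        x[idx] -= lr * g                       # update the active block only
    return x

# Usage: an ill-conditioned quadratic split into two 3-coordinate blocks
A = np.diag([1.0, 1.0, 1.0, 10.0, 10.0, 10.0])
f = lambda x: 0.5 * x @ A @ x
x0 = np.ones(6)
x_star = mezo_bcd_iterate(f, x0, [np.arange(3), np.arange(3, 6)])
```

Averaging over $k$ directions trades extra queries for lower estimator variance within the block, matching the $2k$-query cost in the table above.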
3. Convergence Guarantees
Convergence analysis for MeZO-BCD methods depends on objective smoothness, block update rule, and block selection strategy.
Zeroth-order MeZO-BCD attains "dimension-free" convergence: under $L$-smoothness and average effective overlap $\bar\gamma$, the expected stationarity gap decays at rate $O(1/T)$ with constants scaling in $r/\bar\gamma$ rather than the ambient dimension $d$, where $r$ is the local Hessian's intrinsic dimension and $k$ the block size (Park et al., 31 Jan 2025). For gradient-based BCD (e.g., LLM training), classical results ensure objective descent and stationarity of limit points when the gradient is blockwise Lipschitz and step sizes satisfy $0 < \eta_b \le 1/L_b$ for the blockwise constants $L_b$ (Liu et al., 23 May 2025).
Pairwise-comparison MeZO-BCD (BlockCD[n,m]) for strongly convex and $L$-smooth $f$ satisfies
$$\mathbb{E}\big[f(x^{t+1}) - f^*\big] \le \rho\, \mathbb{E}\big[f(x^{t}) - f^*\big]$$
with $\rho \in (0,1)$, where $\rho$ is an explicit function of the condition number (Matsui et al., 2014).
Block selection by Markov chain (MC-BCD) with mixing time $\tau$ achieves sublinear rates for nonconvex and convex objectives, and linear convergence under strong convexity, with constants scaling in $\tau$ (Sun et al., 2018).
4. Engineering Optimizations and Large-Scale Training
MeZO-BCD is particularly impactful for training LLMs under hardware constraints (Liu et al., 23 May 2025). Major engineering optimizations include:
- Blockwise pipeline scheduling: Assigning submodel blocks to GPUs enables pipelined execution where only the active block backpropagates, and all others run frozen forward passes.
- Activation caching: Activations of frozen blocks are precomputed and reused, minimizing redundant inference.
- Memory footprint reduction: Only the active block maintains gradient/optimizer state; others are fixed, yielding up to 45% peak memory savings.
- Throughput and cost: Per-iteration throughput on A100/RTX 4090 clusters is up to 1.8× higher than with full-parameter updates; total GPU-hour and monetary costs drop by 53–80%, enabling pre-training of models with up to 12B parameters on a single 24 GB GPU.
Among key findings: on a 7B-parameter LLaMA, MeZO-BCD reduced hardware cost for pre-training by 28–80% compared to full-update on A100/A800 or RTX 4090 clusters, with identical or slightly improved accuracy (e.g., on Wikipedia, BCD PPL = 65.68 vs. full = 66.84) (Liu et al., 23 May 2025).
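The source of these savings can be illustrated with a toy memory-accounting sketch (pure Python; block sizes are hypothetical, and this counts only Adam-style optimizer state, so the overall peak-memory saving, which also includes weights and activations, is smaller, e.g., the 45% figure above):

```python
# Toy accounting of optimizer-state memory under BCD vs. full-parameter Adam.
# Adam keeps ~2 extra floats (first and second moments) per trained parameter.
block_sizes = [4_000, 4_000, 4_000, 4_000]   # parameters per block (hypothetical)
full_params = sum(block_sizes)

adam_state_full = 2 * full_params            # moments for every parameter
adam_state_bcd = 2 * max(block_sizes)        # only the active block is trained

savings = 1 - adam_state_bcd / adam_state_full
print(f"optimizer-state reduction: {savings:.0%}")
```

With four equal blocks, only a quarter of the parameters ever hold optimizer state at once, so the optimizer-state footprint drops by 75%; real peak-memory savings are diluted by frozen weights and cached activations.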
5. Parallel, Distributed, and Markov Chain Extensions
Parallel MeZO-BCD exploits the independence of blockwise finite-difference or pairwise comparison steps. In the direction-estimate phase, each updated coordinate per block can be dispatched to separate processors/cores/GPUs for near-linear wall-clock speedup; experiments report 15× speedup using 48 cores on high-dimensional test functions (Matsui et al., 2014).
Distributed and data-local block updates arise naturally when block selection follows a Markov chain (as in decentralized networks or MDPs). When only neighbor-to-neighbor (rather than global i.i.d.) block selection is feasible, MC-BCD ensures convergence with explicit dependence on the mixing time of the underlying chain (Sun et al., 2018). Extensions incorporate inertial (“heavy-ball”) momentum, maintaining sublinear or linear rates depending on convexity assumptions.
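Neighbor-to-neighbor block selection can be sketched as a token performing a random walk on the communication graph (a minimal sketch; the ring topology and uniform transition rule are illustrative, not from Sun et al., 2018):

```python
import random

def markov_block_walk(neighbors, start=0, steps=20_000, seed=0):
    """Token random walk: the next active block is a uniformly chosen
    neighbor of the current one, so no global i.i.d. sampling is needed."""
    rng = random.Random(seed)
    visits = [0] * len(neighbors)
    b = start
    for _ in range(steps):
        visits[b] += 1                 # block b is the one updated this iteration
        b = rng.choice(neighbors[b])   # token moves to a neighboring node
    return visits

# Usage: 4 blocks on a ring network; visit frequencies approach uniform,
# at a speed governed by the chain's mixing time
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
counts = markov_block_walk([ring[i] for i in range(4)])
```

Because the walk's stationary distribution on this regular graph is uniform, every block is updated with the same long-run frequency even though selection is purely local.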
6. Applications and Empirical Performance
MeZO-BCD finds broad application across multiple regimes:
- Large model pre-training/fine-tuning: Empirical evidence demonstrates that MeZO-BCD attains the same or better accuracy as full-parameter training in LLMs and requires less memory and cost (Liu et al., 23 May 2025, Park et al., 31 Jan 2025).
- Black-box optimization: On benchmark problems (e.g., Rosenbrock, quadratic objective), MeZO-BCD with large blocks outperforms both coordinate-descent and other derivative-free methods in wall-clock time and iteration count (Matsui et al., 2014).
- Distributed optimization: MC-BCD supports asynchronous, decentralized updates where only local block information is updated by a token traversing a communication network (Sun et al., 2018).
- MDPs and ERM under data limitations: Block updates following Markov trajectories enable reinforcement learning and empirical risk minimization from single randomized sample paths.
- Memory- and bandwidth-constrained hardware: Blockwise activation and optimizer state reduction facilitate scalable model training on consumer-grade GPUs or limited-bandwidth clusters (Liu et al., 23 May 2025).
A summary table demonstrating comparative performance in LLM fine-tuning:
| Task | MeZO | LOZO | MeZO-BCD | Full Adam |
|---|---|---|---|---|
| SST-2 acc | 91.3 | 91.7 | 93.0 | 91.8 |
| RTE acc | 68.2 | 70.4 | 72.6 | 70.9 |
Wall-clock per 1000 iters: MeZO/LOZO 120s; MeZO-BCD 57s (2.1× speedup). Performance parity is consistently observed across a range of standard benchmarks (Park et al., 31 Jan 2025).
7. Methodological Variants and Practical Considerations
- Block size and structure: Setting the block size according to model architectural boundaries (e.g., Transformer layers) achieves favorable trade-offs: it lowers per-step cost and preserves effective overlap when the critical subspaces are captured by the block.
- Block selection rules: Selection can be cyclic, random permutation, Markov chain, or “flip-flop” for stratified coverage. Empirical results indicate minimal sensitivity as long as uniform coverage is maintained (Park et al., 31 Jan 2025, Sun et al., 2018).
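The selection rules above can be sketched as simple schedulers (illustrative generators; "flip-flop" is interpreted here as a forward-then-backward sweep, which is an assumption, not a definition from the cited papers):

```python
import random
from itertools import islice

def cyclic(n_blocks):
    """Fixed order 0, 1, ..., n-1, repeated forever."""
    t = 0
    while True:
        yield t % n_blocks
        t += 1

def random_permutation(n_blocks, seed=0):
    """A fresh shuffle of all blocks each epoch: uniform coverage per epoch."""
    rng = random.Random(seed)
    while True:
        order = list(range(n_blocks))
        rng.shuffle(order)
        yield from order

def flip_flop(n_blocks):
    """Forward sweep then backward sweep (assumed interpretation)."""
    while True:
        yield from range(n_blocks)
        yield from range(n_blocks - 1, -1, -1)

# Usage: each rule touches all 4 blocks equally often per epoch/sweep
perm_epoch = list(islice(random_permutation(4), 4))
```

All three rules guarantee the uniform coverage that the empirical results identify as the property that matters.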
- Zeroth-order vs. gradient-based: MeZO-BCD is advantageous for black-box objectives, when gradient computation is unavailable or impractical, and as a memory-efficient alternative in gradient-rich settings with very large models (Liu et al., 23 May 2025).
- Pairwise comparison oracle: For settings where objective values are unavailable and only function orderings can be queried, MeZO-BCD leverages line-search-based update steps for each block without loss of convergence (Matsui et al., 2014).
- Numerical stability and hyperparameters: A small smoothing parameter $\mu$, conservatively scaled learning rates, and blockwise RNG reuse are empirically effective. Theoretical step sizes conform to blockwise Lipschitz or Hessian bounds.
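Blockwise RNG reuse follows the standard MeZO seed trick: instead of storing a perturbation vector, store only its seed and regenerate it on demand, so the extra state per step is O(1) rather than block-sized (a minimal sketch with illustrative names):

```python
import numpy as np

def perturb(x, block, seed, mu, sign):
    """Apply sign * mu * u to `block` in place, regenerating u from `seed`
    instead of keeping it in memory."""
    u = np.random.default_rng(seed).standard_normal(len(block))
    x[block] += sign * mu * u

def seeded_zo_step(f, x, block, mu=1e-3, lr=1e-2, seed=0):
    perturb(x, block, seed, mu, +1)          # x + mu*u   (loss query 1)
    f_plus = f(x)
    perturb(x, block, seed, mu, -2)          # x - mu*u   (loss query 2)
    f_minus = f(x)
    perturb(x, block, seed, mu, +1)          # restore x
    coeff = (f_plus - f_minus) / (2 * mu)
    u = np.random.default_rng(seed).standard_normal(len(block))
    x[block] -= lr * coeff * u               # same u, regenerated from the seed

# Usage: one seeded step on a quadratic
f = lambda z: 0.5 * np.dot(z, z)
x = np.ones(4)
seeded_zo_step(f, x, np.arange(4))
```

Regenerating the same direction from its seed keeps the perturb/restore/update sequence consistent while never materializing more than one block-sized vector at a time.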
A plausible implication is that blockwise updates aligned with architectural structure generalize as an effective strategy for both gradient-rich and black-box optimization landscapes.
References:
- (Liu et al., 23 May 2025): "How to Train a Model on a Cheap Cluster with Low Cost using Block Coordinate Descent"
- (Park et al., 31 Jan 2025): "Elucidating Subspace Perturbation in Zeroth-Order Optimization: Theory and Practice at Scale"
- (Matsui et al., 2014): "Parallel Distributed Block Coordinate Descent Methods based on Pairwise Comparison Oracle"
- (Sun et al., 2018): "Markov Chain Block Coordinate Descent"