MeZO-BCD: Block Coordinate Descent

Updated 27 November 2025
  • Block Coordinate Descent (MeZO-BCD) is a method that partitions parameters into blocks and updates one block at a time using gradient or zeroth-order information.
  • It reduces computational load and memory usage by updating a single block per iteration, enabling scalable large-model training and distributed optimization.
  • Variants like pairwise comparison and Markov chain block selection offer convergence guarantees across convex and nonconvex settings.

Block Coordinate Descent (MeZO-BCD) refers to a family of optimization methods that partition the parameter space into disjoint blocks and perform updates—either via gradient or zeroth-order (black-box) information—restricted to one block at a time. Specialized instantiations of MeZO-BCD have proven effective for large-scale model training, zeroth-order (ZO) optimization, distributed and parallel systems, and settings where memory or computational constraints preclude full-parameter updates. Variants encompass gradient-based BCD, zeroth-order BCD with finite-difference or pairwise comparison oracles, and stochastic/hybrid block selection rules.

1. Mathematical Formulation and Algorithmic Principles

MeZO-BCD formalizes the unconstrained minimization problem

$$\min_{\theta \in \mathbb{R}^d} f(\theta)$$

by partitioning $\theta$ into $N$ blocks: $\theta = \big(\theta_1, \theta_2, \ldots, \theta_N\big)$, each $\theta_j \in \mathbb{R}^{k}$. At each iteration $t$, an active block $B_{j_t}$ is selected (by cyclic, random permutation, or Markov chain), and only this block is updated using local information.

For gradient-based BCD, the update is

$$\theta_{j_t}^{(t+1)} = \theta_{j_t}^{(t)} - \eta \nabla_{\theta_{j_t}} f(\theta^{(t)}),$$

with all other blocks frozen. For zeroth-order MeZO-BCD (blockwise finite differences), the estimator for coordinates in $B_{j_t}$ is

$$g_{t} = \frac{d}{k} \sum_{i \in B_{j_t}} \frac{f(\theta_{t} + \mu e_i) - f(\theta_{t} - \mu e_i)}{2\mu} e_i$$

with $\mu > 0$ a smoothing parameter, and update $\theta_{t+1} = \theta_t - \eta_t g_t$ in block $B_{j_t}$ only. In pairwise-comparison MeZO-BCD, blockwise scalar line searches approximate descent directions without access to function values or gradients (Matsui et al., 2014).
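The blockwise finite-difference estimator above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation of any cited paper: the function name `block_fd_gradient` and the toy quadratic objective are assumptions made for the example.

```python
import numpy as np

def block_fd_gradient(f, theta, block, mu=1e-3):
    """Central finite-difference estimate restricted to the indices in `block`.

    Returns a length-d vector that is zero outside the block and scaled by d/k,
    following the estimator g_t in the text.
    """
    d, k = theta.size, len(block)
    g = np.zeros(d)
    for i in block:
        e = np.zeros(d)
        e[i] = 1.0                      # i-th standard basis vector
        g[i] = (f(theta + mu * e) - f(theta - mu * e)) / (2.0 * mu)
    return (d / k) * g

# Toy objective (an assumption for illustration): f(theta) = 0.5 * ||theta||^2,
# whose exact gradient is theta itself.
f = lambda th: 0.5 * float(th @ th)
theta = np.arange(1.0, 7.0)             # d = 6
block = [2, 3]                          # k = 2
g = block_fd_gradient(f, theta, block)
```

For a quadratic objective the central difference is exact up to rounding, so the nonzero entries of `g` equal $\frac{d}{k}\theta_i$ on the active block and every other entry stays zero.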

2. Zeroth-Order Block Coordinate Descent: Theory and Implementation

For black-box objectives, MeZO-BCD restricts finite-difference or Gaussian smoothing to a block $B$ at each iteration, drastically reducing per-step evaluation cost and memory. The method maintains unbiasedness for the smoothed blockwise gradient: $\mathbb{E}[g_t] = \nabla f_\mu(\theta_t)|_B$, where $f_\mu$ is the $\mu$-smoothed objective. The per-step estimator variance is bounded by

$$\mathbb{E}[\|g_t\|^2] \leq \frac{d}{k}\left(G^2 + \frac{\sigma^2}{B}\right) + O(\mu^2 L^2 d^3 / k^2),$$

showing that smaller block sizes dramatically decrease variance per coordinate compared to full finite-difference (Park et al., 31 Jan 2025).

The notion of effective overlap $\rho_t$ (trace-normalized blockwise Hessian mass) quantifies subspace alignment. If $\rho_t$ is high, block perturbations align with important curvature directions, preserving descent (Park et al., 31 Jan 2025). Pseudocode for ZO MeZO-BCD iterates over blocks, performs $2k$ loss queries per step (where $k = |B|$), and updates only the active block.
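The loop described above (iterate over blocks, spend $2k$ loss queries, update only the active block) might look as follows. This is a hedged sketch under stated assumptions: cyclic block selection, coordinate-wise central differences, a toy quadratic loss, and the $d/k$ scaling folded into the learning rate `lr`. The name `zo_bcd` is hypothetical.

```python
import numpy as np

def zo_bcd(f, theta0, block_size, steps, lr=0.1, mu=1e-3):
    """Cyclic zeroth-order BCD using coordinate-wise central differences."""
    theta = theta0.astype(float).copy()
    d = theta.size
    blocks = [np.arange(s, min(s + block_size, d)) for s in range(0, d, block_size)]
    queries = 0
    for t in range(steps):
        block = blocks[t % len(blocks)]          # cyclic block selection
        g_block = np.zeros(len(block))
        for j, i in enumerate(block):
            e = np.zeros(d)
            e[i] = 1.0
            g_block[j] = (f(theta + mu * e) - f(theta - mu * e)) / (2.0 * mu)
            queries += 2                         # two loss evaluations per coordinate
        theta[block] -= lr * g_block             # only the active block moves
    return theta, queries

# Toy loss (an assumption for illustration).
f = lambda th: 0.5 * float(th @ th)
theta0 = np.ones(8)
theta, queries = zo_bcd(f, theta0, block_size=2, steps=40, lr=0.5)
```

With `block_size=2` each step costs four loss queries, so 40 steps use 160 queries in total, compared with $2d = 16$ per step for a full coordinate-wise finite difference.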

| Variant | Iteration Query Cost | Memory Overhead | Block Update Rule |
|---|---|---|---|
| Full ZO MeZO | $2d$ | All coordinates | Full finite-difference |
| Blockwise MeZO-BCD | $2k \ll 2d$ | Block only | One block per iteration |
| Pairwise Comp MeZO-BCD | $O(m\log \frac{1}{\eta})$ per iter | Block only | Oracle-based line search |

3. Convergence Guarantees

Convergence analysis for MeZO-BCD methods depends on objective smoothness, block update rule, and block selection strategy.

Zeroth-order MeZO-BCD attains the following "dimension-free" convergence (under $L$-smoothness and average effective overlap $\bar\rho$) (Park et al., 31 Jan 2025):

$$\frac{1}{T} \sum_{t=1}^T \mathbb{E} \|\nabla f(\theta_t)\|^2 = O\left(\frac{r^2 + \frac{k^2}{d} + 1}{\bar\rho\, T} + \frac{\sigma^2}{B}\right)$$

where $r$ is the local Hessian's intrinsic dimension and $k$ the block size. For gradient-based BCD (e.g., LLM training), classical results ensure objective descent and stationarity of limits when the gradient is blockwise Lipschitz and step sizes satisfy $\eta_k < 1/L_k$ (Liu et al., 23 May 2025).
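The step-size condition for gradient-based BCD can be checked numerically on a small problem. The sketch below assumes a strongly convex quadratic $f(\theta) = \tfrac12 \theta^\top A \theta - b^\top \theta$ (an illustrative choice, not taken from the cited work), for which the blockwise Lipschitz constant of block $B$ is the largest eigenvalue of the submatrix $A_{BB}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 2
M = rng.standard_normal((d, d))
A = M @ M.T + np.eye(d)                  # symmetric positive definite
b = rng.standard_normal(d)

def f(theta):
    return 0.5 * theta @ A @ theta - b @ theta

blocks = [np.arange(s, s + k) for s in range(0, d, k)]
# Blockwise Lipschitz constant of the gradient: largest eigenvalue of A[B, B].
L = [np.linalg.eigvalsh(A[np.ix_(B, B)]).max() for B in blocks]

theta = np.zeros(d)
history = [f(theta)]
for t in range(200):
    j = t % len(blocks)                  # cyclic block selection
    B = blocks[j]
    grad_B = A[B] @ theta - b[B]         # gradient restricted to block B
    theta[B] -= (0.9 / L[j]) * grad_B    # step size strictly below 1/L_j
    history.append(f(theta))
```

Under these assumptions the recorded objective values in `history` never increase, which is the monotone-descent behavior the classical results describe.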

Pairwise-comparison MeZO-BCD (BlockCD[n,m]) for strongly convex and $L$-smooth $f$ satisfies

$$\mathbb{E}[f(x_T)-f^*] \leq \epsilon$$

with $T = \left\lceil \frac{n}{m \gamma} \log \frac{(f(x_0)-f^*)\left(1 + \frac{n}{m\gamma}\right)}{\epsilon} \right\rceil$, where $\gamma$ is an explicit function of the condition number (Matsui et al., 2014).
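To get a feel for how the iteration bound scales, the expression for $T$ can be evaluated directly. The helper name `blockcd_iterations` and the input values below (in particular $\gamma = 0.5$) are illustrative assumptions; the source does not give the explicit form of $\gamma$, so it is treated here as a given number.

```python
import math

def blockcd_iterations(n, m, gamma, gap0, eps):
    """Evaluate ceil( n/(m*gamma) * log( gap0 * (1 + n/(m*gamma)) / eps ) ).

    n     : problem dimension
    m     : number of coordinates updated per iteration
    gamma : constant depending on the condition number (assumed given)
    gap0  : initial suboptimality f(x_0) - f*
    eps   : target accuracy
    """
    ratio = n / (m * gamma)
    return math.ceil(ratio * math.log(gap0 * (1.0 + ratio) / eps))

T_small_block = blockcd_iterations(n=100, m=10, gamma=0.5, gap0=1.0, eps=1e-3)
T_large_block = blockcd_iterations(n=100, m=20, gamma=0.5, gap0=1.0, eps=1e-3)
```

Holding $\gamma$ fixed, doubling $m$ roughly halves the leading factor $\frac{n}{m\gamma}$, so `T_large_block` comes out smaller than `T_small_block`. In practice $\gamma$ may itself depend on the problem, so this comparison is only indicative.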

Block selection by Markov chain (MC-BCD) with mixing time $\tau$ achieves $O(1/k)$ sublinear rates for nonconvex and convex $f$ (here $k$ counts iterations rather than denoting the block size), and linear convergence under strong convexity, with constants scaling in $\tau$ (Sun et al., 2018).

4. Engineering Optimizations and Large-Scale Training

MeZO-BCD is particularly impactful for training LLMs under hardware constraints (Liu et al., 23 May 2025). Major engineering optimizations include:

  • Blockwise pipeline scheduling: Assigning submodel blocks to GPUs enables pipelined execution where only the active block backpropagates, and all others run frozen forward passes.
  • Activation caching: Activations of frozen blocks are precomputed and reused, minimizing redundant inference.
  • Memory footprint reduction: Only the active block maintains gradient/optimizer state; others are fixed, yielding up to 45% peak memory savings.
  • Throughput and cost: Per-iteration throughput on A100/RTX 4090 clusters is up to 1.8× higher than full-parameter updates; total GPU-hour and monetary costs drop by 53–80%, enabling pre-training of models up to 12 B parameters on single 24GB GPUs.

Among key findings: on a 7B-parameter LLaMA, MeZO-BCD reduced hardware cost for pre-training by 28–80% compared to full-update on A100/A800 or RTX 4090 clusters, with identical or slightly improved accuracy (e.g., on Wikipedia, BCD PPL = 65.68 vs. full = 66.84) (Liu et al., 23 May 2025).
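The memory argument in the list above (only the active block carries gradient and optimizer state) can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes fp32 gradients plus the two Adam moment buffers, and a hypothetical split into 32 equal blocks. It ignores weights and activations, so it is not meant to reproduce the savings figures reported above.

```python
def trainable_state_bytes(num_params, bytes_per_value=4, adam_moments=2):
    """Bytes for gradients plus Adam moment buffers of the trainable parameters."""
    return num_params * bytes_per_value * (1 + adam_moments)

total_params = 7_000_000_000     # illustrative 7B-parameter model
num_blocks = 32                  # hypothetical number of equal-sized blocks

full = trainable_state_bytes(total_params)                   # full-parameter update
block = trainable_state_bytes(total_params // num_blocks)    # one active block

print(f"full update : {full / 1e9:.1f} GB of gradient/optimizer state")
print(f"single block: {block / 1e9:.3f} GB of gradient/optimizer state")
```

Under these assumptions the gradient and optimizer state shrinks by a factor equal to the number of blocks. The overall peak-memory saving is smaller because the frozen weights and the activations still occupy memory.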

5. Parallel, Distributed, and Markov Chain Extensions

Parallel MeZO-BCD exploits the independence of blockwise finite-difference or pairwise comparison steps. In the direction-estimate phase, each updated coordinate per block can be dispatched to separate processors/cores/GPUs for near-linear wall-clock speedup; experiments report 15× speedup using 48 cores on high-dimensional test functions (Matsui et al., 2014).
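The independence of per-coordinate evaluations is what makes this parallelism straightforward. The sketch below dispatches central finite differences for one block to a thread pool. It swaps in finite differences for the pairwise-comparison oracle of the cited work, and the helper names are hypothetical. Threads are used only to keep the example self-contained; a real deployment would more likely use separate processes or devices, since CPU-bound Python code gains little from threads.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def fd_coordinate(f, theta, i, mu):
    """Central difference along coordinate i; independent of all other coordinates."""
    e = np.zeros_like(theta)
    e[i] = 1.0
    return (f(theta + mu * e) - f(theta - mu * e)) / (2.0 * mu)

def parallel_block_fd(f, theta, block, mu=1e-3, workers=4):
    """Evaluate the block's finite differences concurrently, one task per coordinate."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda i: fd_coordinate(f, theta, i, mu), block))
    return np.array(parts)

# Toy objective (an assumption for illustration): gradient equals theta.
f = lambda th: 0.5 * float(th @ th)
theta = np.array([1.0, 2.0, 3.0, 4.0])
g_block = parallel_block_fd(f, theta, block=[1, 3])
```

Because `pool.map` preserves input order, `g_block` lines up with the indices in `block` regardless of which worker finishes first.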

Distributed and data-local block updates arise naturally when block selection follows a Markov chain (as in decentralized networks or MDPs). When only neighbor-to-neighbor (rather than global i.i.d.) block selection is feasible, MC-BCD ensures convergence with explicit dependence on the mixing time $\tau$ of the underlying chain (Sun et al., 2018). Extensions incorporate inertial (“heavy-ball”) momentum, maintaining sublinear or linear rates depending on convexity assumptions.
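Neighbor-to-neighbor block selection can be illustrated with a lazy random walk on a ring of blocks. This is a sketch under assumptions: the ring topology, the generator name `ring_walk_blocks`, and the toy objective are chosen for the example and are not taken from Sun et al. The "stay" move makes the chain aperiodic, which a plain walk on an even-length ring would not be.

```python
import numpy as np

def ring_walk_blocks(num_blocks, steps, rng, start=0):
    """Lazy random walk on a ring: stay, or move to a neighbouring block."""
    j = start
    for _ in range(steps):
        yield j
        j = (j + int(rng.choice([-1, 0, 1]))) % num_blocks

rng = np.random.default_rng(0)
d, k = 8, 2
theta = np.ones(d)
f = lambda th: 0.5 * float(th @ th)      # toy objective; block gradient is theta[B]
f_start = f(theta)

visited = []
for j in ring_walk_blocks(d // k, 300, rng):
    B = slice(j * k, (j + 1) * k)
    theta[B] -= 0.5 * theta[B]           # gradient step on the visited block only
    visited.append(j)
```

Consecutive entries of `visited` differ by at most one position on the ring, mimicking a token that can only be handed to an adjacent node. How quickly every block gets revisited is governed by the mixing time of the walk.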

6. Applications and Empirical Performance

MeZO-BCD finds broad application across multiple regimes:

  • Large model pre-training/fine-tuning: Empirical evidence indicates that MeZO-BCD matches or exceeds the accuracy of full-parameter training in LLMs while requiring less memory and cost (Liu et al., 23 May 2025, Park et al., 31 Jan 2025).
  • Black-box optimization: On benchmark problems (e.g., Rosenbrock, quadratic objective), MeZO-BCD with large blocks outperforms both coordinate-descent and other derivative-free methods in wall-clock time and iteration count (Matsui et al., 2014).
  • Distributed optimization: MC-BCD supports asynchronous, decentralized updates where only local block information is updated by a token traversing a communication network (Sun et al., 2018).
  • MDPs and ERM under data limitations: Block updates following Markov trajectories enable reinforcement learning and empirical risk minimization from single randomized sample paths.
  • Memory- and bandwidth-constrained hardware: Blockwise activation and optimizer state reduction facilitate scalable model training on consumer-grade GPUs or limited-bandwidth clusters (Liu et al., 23 May 2025).

A summary table demonstrating comparative performance in LLM fine-tuning:

| Task | MeZO | LOZO | MeZO-BCD | Full Adam |
|---|---|---|---|---|
| SST-2 acc | 91.3 | 91.7 | 93.0 | 91.8 |
| RTE acc | 68.2 | 70.4 | 72.6 | 70.9 |

Wall-clock per 1000 iters: MeZO/LOZO 120s; MeZO-BCD 57s (2.1× speedup). Performance parity is consistently observed across a range of standard benchmarks (Park et al., 31 Jan 2025).

7. Methodological Variants and Practical Considerations

  • Block size and structure: Setting the block size $k$ according to model architectural boundaries (e.g., Transformer layers) achieves favorable trade-offs: $k \ll d$ improves per-step cost and preserves effective overlap when critical subspaces are captured by the block.
  • Block selection rules: Selection can be cyclic, random permutation, Markov chain, or “flip-flop” for stratified coverage. Empirical results indicate minimal sensitivity as long as uniform coverage is maintained (Park et al., 31 Jan 2025, Sun et al., 2018).
  • Zeroth-order vs. gradient-based: MeZO-BCD is advantageous for black-box objectives, when gradient computation is unavailable or impractical, and as a memory-efficient alternative in gradient-rich settings with very large models (Liu et al., 23 May 2025).
  • Pairwise comparison oracle: For settings lacking objective values, only function orderings, MeZO-BCD leverages line-search-based update steps for each block without loss of convergence (Matsui et al., 2014).
  • Numerical stability and hyperparameters: Smoothing $\mu = 10^{-3}$, learning rates from $5\times 10^{-6}$ to $10^{-5}$, and blockwise RNG reuse are empirically effective. Theoretical step sizes conform to blockwise Lipschitz or Hessian bounds.
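One plausible reading of "blockwise RNG reuse" is the seed trick familiar from MeZO-style optimizers: the Gaussian perturbation of the active block is regenerated from a stored seed each time it is needed instead of being kept in memory. The sketch below is written under that assumption. The function name `seeded_block_step`, the toy objective, and the step size are hypothetical and chosen only for illustration.

```python
import numpy as np

def seeded_block_step(f, theta, block, seed, lr=1e-2, mu=1e-3):
    """In-place two-point ZO step on `block`; z is regenerated from `seed`, never stored."""
    def perturb(scale):
        # Re-creating the generator from the same seed reproduces the same z.
        z = np.random.default_rng(seed).standard_normal(len(block))
        theta[block] += scale * z

    perturb(+mu)
    loss_plus = f(theta)                 # f(theta + mu * z)
    perturb(-2 * mu)
    loss_minus = f(theta)                # f(theta - mu * z)
    perturb(+mu)                         # restore the original block values
    g_scalar = (loss_plus - loss_minus) / (2 * mu)
    perturb(-lr * g_scalar)              # theta_B <- theta_B - lr * g_scalar * z
    return g_scalar

f = lambda th: 0.5 * float(th @ th)      # toy objective (assumption)
theta = np.ones(6)
before = theta.copy()
block = np.array([0, 1, 2])
g_scalar = seeded_block_step(f, theta, block, seed=123)
```

Only a seed and one scalar are kept between the forward evaluations and the update, so the extra memory is independent of the block size. Coordinates outside `block` are never touched.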

A plausible implication is that strided blockwise updates using architectural structure generalize as an effective strategy for both gradient-rich and black-box optimization landscapes.


References:

  • (Liu et al., 23 May 2025): "How to Train a Model on a Cheap Cluster with Low Cost using Block Coordinate Descent"
  • (Park et al., 31 Jan 2025): "Elucidating Subspace Perturbation in Zeroth-Order Optimization: Theory and Practice at Scale"
  • (Matsui et al., 2014): "Parallel Distributed Block Coordinate Descent Methods based on Pairwise Comparison Oracle"
  • (Sun et al., 2018): "Markov Chain Block Coordinate Descent"