MeZO-BCD: Block Coordinate Descent

Updated 27 November 2025

Block Coordinate Descent (MeZO-BCD) is a method that partitions parameters into blocks and updates one block at a time using gradient or zeroth-order information.
It reduces computational load and memory usage by updating a single block per iteration, enabling scalable large-model training and distributed optimization.
Variants like pairwise comparison and Markov chain block selection offer convergence guarantees across convex and nonconvex settings.

Block Coordinate Descent (MeZO-BCD) refers to a family of optimization methods that partition the parameter space into disjoint blocks and perform updates—either via gradient or zeroth-order (black-box) information—restricted to one block at a time. Specialized instantiations of MeZO-BCD have proven effective for large-scale model training, zeroth-order (ZO) optimization, distributed and parallel systems, and settings where memory or computational constraints preclude full-parameter updates. Variants encompass gradient-based BCD, zeroth-order BCD with finite-difference or pairwise comparison oracles, and stochastic/hybrid block selection rules.

1. Mathematical Formulation and Algorithmic Principles

MeZO-BCD formalizes the unconstrained minimization problem

$\min_{\theta \in \mathbb{R}^d} f(\theta)$

by partitioning $\theta$ into $N$ blocks: $\theta = \big(\theta_1, \theta_2, \ldots, \theta_N\big)$ , each $\theta_j \in \mathbb{R}^{k}$ . At each iteration $t$ , an active block $B_{j_t}$ is selected (by cyclic, random permutation, or Markov chain), and only this block is updated using local information.

For gradient-based BCD, the update is

$\theta_{j_t}^{(t+1)} = \theta_{j_t}^{(t)} - \eta \nabla_{\theta_{j_t}} f(\theta^{(t)}),$

with all other blocks frozen. For zeroth-order MeZO-BCD (blockwise finite differences), the estimator for coordinates in $B_{j_t}$ is

$g_{t} = \frac{d}{k} \sum_{i \in B_{j_t}} \frac{f(\theta_{t} + \mu e_i) - f(\theta_{t} - \mu e_i)}{2\mu} e_i$

with $\mu>0$ a smoothing parameter, and update $\theta_{t+1} = \theta_t - \eta_t g_t$ in block $B_{j_t}$ only. In pairwise-comparison MeZO-BCD, blockwise scalar line searches approximate descent directions without access to function values or gradients (Matsui et al., 2014).

2. Zeroth-Order Block Coordinate Descent: Theory and Implementation

For black-box objectives, MeZO-BCD restricts finite-difference or Gaussian smoothing to a block $B$ at each iteration, drastically reducing per-step evaluation cost and memory. The method maintains unbiasedness for the smoothed blockwise gradient: $\mathbb{E}[g_t] = \nabla f_\mu(\theta_t)|_B$ where $f_\mu$ is the $\mu$ -smoothed objective. The per-step estimator variance is bounded by

$\mathbb{E}[\|g_t\|^2] \leq \frac{d}{k}\left(G^2 + \frac{\sigma^2}{B}\right) + O(\mu^2 L^2 d^3 / k^2),$

showing that smaller block sizes dramatically decrease variance per coordinate compared to full finite-difference (Park et al., 31 Jan 2025).

The notion of effective overlap $\rho_t$ (trace-normalized blockwise Hessian mass) quantifies subspace alignment. If $\rho_t$ is high, block perturbations align with important curvature directions, preserving descent (Park et al., 31 Jan 2025). Pseudocode for ZO MeZO-BCD iterates over blocks, performs $2k$ loss queries per step (where $k=|B|$ ), and updates only the active block.

Variant	Iteration Query Cost	Memory Overhead	Block Update Rule
Full ZO MeZO	$2d$	All coordinates	Full finite-difference
Blockwise MeZO-BCD	$2k \ll 2d$	Block only	One block per iteration
Pairwise Comp MeZO-BCD	$O(m\log \frac{1}{\eta})$ per iter	Block only	Oracle-based line search

3. Convergence Guarantees

Convergence analysis for MeZO-BCD methods depends on objective smoothness, block update rule, and block selection strategy.

Zeroth-order MeZO-BCD attains the following "dimension-free" convergence (under $L$ -smoothness and average effective overlap $\bar\rho$ ) (Park et al., 31 Jan 2025): $\frac{1}{T} \sum_{t=1}^T \mathbb{E} \|\nabla f(\theta_t)\|^2 = O\left(\frac{r^2 + \frac{k^2}{d} + 1}{\bar\rho\, T} + \frac{\sigma^2}{B}\right)$ where $r$ is the local Hessian's intrinsic dimension and $k$ the block size. For gradient-based BCD (e.g., LLM training), classical results ensure objective descent and stationarity of limits when the gradient is blockwise Lipschitz and step sizes satisfy $\eta_k < 1/L_k$ (Liu et al., 23 May 2025).

Pairwise-comparison MeZO-BCD (BlockCD[n,m]) for strongly convex and $L$ -smooth $f$ satisfies

$\mathbb{E}[f(x_T)-f^*] \leq \epsilon$

with $T = \lceil \frac{n}{m \gamma} \log \frac{(f(x_0)-f^*)(1 + \frac{n}{m\gamma})}{\epsilon}\rceil$ , where $\gamma$ is an explicit function of the condition number (Matsui et al., 2014).

Block selection by Markov chain (MC-BCD) with mixing time $\tau$ achieves $O(1/k)$ sublinear rates for nonconvex and convex $f$ , and linear convergence under strong convexity, with constants scaling in $\tau$ (Sun et al., 2018).

4. Engineering Optimizations and Large-Scale Training

MeZO-BCD is particularly impactful for training LLMs under hardware constraints (Liu et al., 23 May 2025). Major engineering optimizations include:

Blockwise pipeline scheduling: Assigning submodel blocks to GPUs enables pipelined execution where only the active block backpropagates, and all others run frozen forward passes.
Activation caching: Activations of frozen blocks are precomputed and reused, minimizing redundant inference.
Memory footprint reduction: Only the active block maintains gradient/optimizer state; others are fixed, yielding up to 45% peak memory savings.
Throughput and cost: Per-iteration throughput on A100/RTX 4090 clusters is up to 1.8× higher than full-parameter updates; total GPU-hour and monetary costs drop by 53–80%, enabling pre-training of models up to 12 B parameters on single 24GB GPUs.

Among key findings: on a 7B-parameter LLaMA, MeZO-BCD reduced hardware cost for pre-training by 28–80% compared to full-update on A100/A800 or RTX 4090 clusters, with identical or slightly improved accuracy (e.g., on Wikipedia, BCD PPL = 65.68 vs. full = 66.84) (Liu et al., 23 May 2025).

5. Parallel, Distributed, and Markov Chain Extensions

Parallel MeZO-BCD exploits the independence of blockwise finite-difference or pairwise comparison steps. In the direction-estimate phase, each updated coordinate per block can be dispatched to separate processors/cores/GPUs for near-linear wall-clock speedup; experiments report 15× speedup using 48 cores on high-dimensional test functions (Matsui et al., 2014).

Distributed and data-local block updates arise naturally when block selection follows a Markov chain (as in decentralized networks or MDPs). When only neighbor-to-neighbor (rather than global i.i.d.) block selection is feasible, MC-BCD ensures convergence with explicit dependence on the mixing time $\tau$ of the underlying chain (Sun et al., 2018). Extensions incorporate inertial (“heavy-ball”) momentum, maintaining sublinear or linear rates depending on convexity assumptions.

6. Applications and Empirical Performance

MeZO-BCD finds broad application across multiple regimes:

Large model pre-training/fine-tuning: Empirical evidence demonstrates that MeZO-BCD attains the same or better accuracy as full-parameter training in LLMs and requires less memory and cost (Liu et al., 23 May 2025, Park et al., 31 Jan 2025).
Black-box optimization: On benchmark problems (e.g., Rosenbrock, quadratic objective), MeZO-BCD with large blocks outperforms both coordinate-descent and other derivative-free methods in wall-clock time and iteration count (Matsui et al., 2014).
Distributed optimization: MC-BCD supports asynchronous, decentralized updates where only local block information is updated by a token traversing a communication network (Sun et al., 2018).
MDPs and ERM under data limitations: Block updates following Markov trajectories enable reinforcement learning and empirical risk minimization from single randomized sample paths.
Memory- and bandwidth-constrained hardware: Blockwise activation and optimizer state reduction facilitate scalable model training on consumer-grade GPUs or limited-bandwidth clusters (Liu et al., 23 May 2025).

A summary table demonstrating comparative performance in LLM fine-tuning:

Task	MeZO	LOZO	MeZO-BCD	Full Adam
SST-2 acc	91.3	91.7	93.0	91.8
RTE acc	68.2	70.4	72.6	70.9

Wall-clock per 1000 iters: MeZO/LOZO 120s; MeZO-BCD 57s (2.1× speedup). Performance parity is consistently observed across a range of standard benchmarks (Park et al., 31 Jan 2025).

7. Methodological Variants and Practical Considerations

Block size and structure: Setting the block size $k$ according to model architectural boundaries (e.g., Transformer layers) achieves favorable trade-offs— $k \ll d$ improves per-step cost and preserves effective overlap when critical subspaces are captured by the block.
Block selection rules: Selection can be cyclic, random permutation, Markov chain, or “flip-flop” for stratified coverage. Empirical results indicate minimal sensitivity as long as uniform coverage is maintained (Park et al., 31 Jan 2025, Sun et al., 2018).
Zeroth-order vs. gradient-based: MeZO-BCD is advantageous for black-box objectives, when gradient computation is unavailable or impractical, and as a memory-efficient alternative in gradient-rich settings with very large models (Liu et al., 23 May 2025).
Pairwise comparison oracle: For settings lacking objective values, only function orderings, MeZO-BCD leverages line-search-based update steps for each block without loss of convergence (Matsui et al., 2014).
Numerical stability and hyperparameters: Smoothing $\mu=10^{-3}$ , learning rates $5\times 10^{-6}$ – $10^{-5}$ , and blockwise RNG reuse are empirically effective. Theoretical step sizes conform to blockwise Lipschitz or Hessian bounds.

A plausible implication is that strided blockwise updates using architectural structure generalize as an effective strategy for both gradient-rich and black-box optimization landscapes.

References:

(Liu et al., 23 May 2025): "How to Train a Model on a Cheap Cluster with Low Cost using Block Coordinate Descent"
(Park et al., 31 Jan 2025): "Elucidating Subspace Perturbation in Zeroth-Order Optimization: Theory and Practice at Scale"
(Matsui et al., 2014): "Parallel Distributed Block Coordinate Descent Methods based on Pairwise Comparison Oracle"
(Sun et al., 2018): "Markov Chain Block Coordinate Descent"