Bayesian Subspace Zeroth-Order Optimization
- BSZO is a memory-efficient optimizer that uses Bayesian Kalman filtering to aggregate finite-difference gradient estimates within a low-dimensional subspace for LLM fine-tuning.
- It achieves provably improved convergence rates by operating in a reduced parameter space with adaptive noise scaling, and uses efficient seed caching to keep memory overhead near the inference-only footprint.
- Empirical evaluations on models like RoBERTa-large and OPT-13B demonstrate that BSZO outperforms prior zeroth-order methods in accuracy while maintaining inference-level memory footprints.
Bayesian Subspace Zeroth-Order Optimization (BSZO) is a memory-efficient optimizer for fine-tuning LLMs using only function evaluations, bypassing the prohibitive cost of full backpropagation. BSZO advances the field of zeroth-order (ZO) optimization by using a Bayesian (Kalman-filter) approach to aggregate information from multiple directional finite-difference measurements within a randomly projected low-dimensional subspace. This framework leads to provably improved convergence rates and empirically outperforms prior ZO methods across a range of LLMs and tasks, while maintaining memory footprints nearly identical to inference-only baselines (Feng et al., 4 Jan 2026).
1. Stochastic Optimization and Subspace Projection
The goal in BSZO is to minimize a stochastic objective
$$\min_{\theta \in \mathbb{R}^d} \; \mathcal{L}(\theta) = \mathbb{E}_{\xi}\big[\ell(\theta; \xi)\big],$$
where $\theta \in \mathbb{R}^d$ parameterizes the LLM and direct gradient computation $\nabla \mathcal{L}(\theta)$ is memory-prohibitive. To enable efficient gradient estimation, BSZO samples $k$ Gaussian random directions $u_1, \dots, u_k \in \mathbb{R}^d$ to form a basis $U = [u_1, \dots, u_k]$. The true gradient is projected into this $k$-dimensional subspace, $g = U^\top \nabla \mathcal{L}(\theta) \in \mathbb{R}^k$. Gradient information is estimated via finite differences along the directions $u_i$:
$$y_i = \frac{\mathcal{L}(\theta + \mu u_i) - \mathcal{L}(\theta - \mu u_i)}{2\mu} \approx u_i^\top \nabla \mathcal{L}(\theta),$$
where $\mu > 0$ is the perturbation scale. This approach reduces both computational and memory cost by focusing on a statistically rich, but tractable, subspace.
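To make the measurement step concrete, here is a minimal NumPy sketch that draws Gaussian directions and forms the symmetric finite-difference estimates described above. It is an illustration under the setup just stated, not the reference implementation; the function name `subspace_fd_measurements`, the toy quadratic loss, and the default values of `k` and `mu` are stand-ins for this example.

```python
import numpy as np

def subspace_fd_measurements(loss_fn, theta, k=4, mu=1e-3, seed=None):
    """Sample k Gaussian directions and return finite-difference
    measurements y_i ~= u_i^T grad L(theta), along with the directions."""
    rng = np.random.default_rng(seed)
    d = theta.shape[0]
    U = rng.standard_normal((k, d))  # rows u_1..u_k span the random subspace
    y = np.empty(k)
    for i, u in enumerate(U):
        # Two forward evaluations per direction -- no backpropagation needed.
        y[i] = (loss_fn(theta + mu * u) - loss_fn(theta - mu * u)) / (2.0 * mu)
    return U, y

# Toy check on a quadratic objective standing in for the LLM loss.
A = np.diag(np.linspace(1.0, 5.0, 10))
loss = lambda th: 0.5 * th @ A @ th
theta0 = np.ones(10)
U, y = subspace_fd_measurements(loss, theta0, k=4, mu=1e-3, seed=0)
print(np.max(np.abs(y - U @ (A @ theta0))))  # ~0: y_i matches u_i^T grad L(theta) here
```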
2. Bayesian Kalman Filtering for Gradient Aggregation
BSZO interprets each finite-difference measurement $y_i$ as a noisy linear observation of the latent subspace-projected gradient $g \in \mathbb{R}^k$. The corresponding stochastic state-space and measurement model are:
- State transition (static): $g_t = g_{t-1} + w_t$, with $w_t \sim \mathcal{N}(0, Q)$
- Measurement at step $i$: $y_i = h_i^\top g + \varepsilon_i$, with $\varepsilon_i \sim \mathcal{N}(0, R)$, where $h_i \in \mathbb{R}^k$ expresses the queried direction in the subspace basis
Given such measurements, the posterior over $g$ is Gaussian, $g \mid y_{1:k} \sim \mathcal{N}(m, P)$, with analytic expressions for the mean $m$ and covariance $P$ after batch or sequential (Kalman) updates. The Kalman filter formulation enables efficient sequential incorporation of directional measurements and yields an adaptively improving posterior over the subspace-projected gradient.
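The sequential path can be sketched with the textbook scalar-measurement Kalman update applied to the static state $g$. The code below is a minimal sketch assuming that standard recipe; names such as `kalman_update` are illustrative, not taken from the paper.

```python
import numpy as np

def kalman_update(m, P, h, y, R):
    """One scalar Kalman update for a static latent state g in R^k.

    Prior: g ~ N(m, P).  Measurement: y = h^T g + eps, eps ~ N(0, R).
    Returns the posterior mean, posterior covariance, and the residual."""
    y_pred = h @ m                    # predicted measurement under the prior mean
    S = h @ P @ h + R                 # innovation variance (scalar)
    K = (P @ h) / S                   # Kalman gain, shape (k,)
    residual = y - y_pred
    m_new = m + K * residual
    P_new = P - np.outer(K, h) @ P    # (I - K h^T) P
    return m_new, P_new, residual

# Example: fuse noisy directional measurements of a known projected gradient.
rng = np.random.default_rng(0)
k, R_true = 3, 0.01
g_true = np.array([1.0, -2.0, 0.5])
m, P = np.zeros(k), np.eye(k)         # diffuse prior over g
for _ in range(20):
    h = rng.standard_normal(k)
    y = h @ g_true + rng.normal(0.0, np.sqrt(R_true))
    m, P, _ = kalman_update(m, P, h, y, R_true)
print(np.round(m, 2))                 # close to g_true
```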
3. Adaptive Noise Scaling via Residuals
To address nonstationary or poorly calibrated measurement noise, BSZO computes the residual
$$\nu_i = y_i - h_i^\top m_{i-1}$$
between each new measurement and its prediction under the current posterior mean, and adaptively updates the measurement noise variance via exponential smoothing,
$$R \leftarrow (1 - \beta)\, R + \beta\, \nu_i^2,$$
with smoothing coefficient $\beta \in (0, 1)$. This residual-based adaptation ensures the posterior covariance and downstream learning rate remain sensitive to empirical noise levels, supporting stable and efficient optimization even in heterogeneous regimes.
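One standard way to realize this adaptation is exponentially smoothed covariance matching on the residuals, sketched below. The subtraction of the prior term $h^\top P h$ (so that only the unexplained part of the squared residual is attributed to measurement noise), the smoothing coefficient `beta`, and the floor `R_min` are assumptions of this illustration and may differ from the paper's exact rule.

```python
def update_measurement_noise(R, residual, hPh, beta=0.1, R_min=1e-8):
    """Residual-based exponential smoothing of the measurement noise variance R.

    `residual` is y - h^T m under the prior mean, and `hPh` = h^T P h is the
    prior uncertainty along the measured direction."""
    R_sample = max(residual ** 2 - hPh, R_min)   # instantaneous noise estimate
    return (1.0 - beta) * R + beta * R_sample    # exponential smoothing
```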
4. Full Algorithm and Hyperparameterization
BSZO proceeds by repeatedly constructing the subspace basis $U$, caching the reference seeds, and collecting $k$ directional finite-difference observations per iteration. The algorithm alternates between coordinate axes and adaptively selected directions (chosen by the maximal diagonal entry of the posterior covariance), using the computed directional outputs and the Kalman update equations to refine the posterior mean and covariance. Updates to the LLM parameters are made along all subspace directions, scaled by the posterior mean coefficients.
Default hyperparameter values are found to be robust across a range of architectures. Only $O(k^2)$ memory is needed to store the posterior state (the $k$-dimensional mean and the $k \times k$ covariance), with perturbation directions regenerated deterministically via seed caching rather than storing the full $d \times k$ perturbation matrix.
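Putting the pieces together, one schematic iteration with seed caching might look as follows. It reuses the `kalman_update` and `update_measurement_noise` sketches above; the function names, the coordinate-axis measurement pattern, and the plain SGD-style step are illustrative choices for this sketch, not the paper's API.

```python
import numpy as np

def regenerate_direction(seed, i, d):
    """Deterministically regenerate direction u_i from a cached seed, so the
    d x k perturbation matrix never has to be stored."""
    return np.random.default_rng((seed, i)).standard_normal(d)

def bszo_step(loss_fn, theta, k=4, mu=1e-3, lr=1e-2, R=1.0, seed=0):
    """One schematic BSZO iteration: k finite-difference measurements,
    sequential Kalman fusion with adaptive noise, then a parameter update
    along all subspace directions weighted by the posterior mean."""
    d = theta.shape[0]
    m, P = np.zeros(k), np.eye(k)        # posterior over the projected gradient
    for i in range(k):
        u = regenerate_direction(seed, i, d)
        y = (loss_fn(theta + mu * u) - loss_fn(theta - mu * u)) / (2.0 * mu)
        h = np.zeros(k); h[i] = 1.0      # this measurement observes coordinate i of g
        hPh = h @ P @ h                  # prior uncertainty along h
        m, P, resid = kalman_update(m, P, h, y, R)
        R = update_measurement_noise(R, resid, hPh, beta=0.1)
    for i in range(k):                   # update along all subspace directions
        theta = theta - lr * m[i] * regenerate_direction(seed, i, d)
    return theta, R
```

In a real LLM setting, `loss_fn` would evaluate the model on a minibatch with perturbed weights, and the perturbation and update loops would apply each regenerated direction layer by layer in place rather than materializing full parameter vectors.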
5. Theoretical Guarantees and Convergence Analysis
Under standard smoothness and bounded-variance assumptions, BSZO achieves a provably faster convergence rate than prior ZO schemes. Specifically, under a suitable step-size condition, the expected squared gradient norm averaged over $T$ iterations is bounded by a term that decays in $T$ plus bias and variance terms stemming from the finite-difference approximation and the measurement noise. For suitable parameter choices, this yields an improved iteration complexity relative to standard coordinatewise ZO methods. The update direction is, in expectation, a scaled negative gradient, up to finite-difference bias.
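The last claim can be made plausible by the isotropy of Gaussian directions: ignoring finite-difference bias and the Kalman weighting, the idealized aggregate of directional derivatives has expectation proportional to the true gradient. This is a simplification offered for intuition, not the paper's full argument.

```latex
% For i.i.d. directions u_i ~ N(0, I_d) we have E[u_i u_i^T] = I_d, hence
\mathbb{E}\!\left[\sum_{i=1}^{k} \big(u_i^\top \nabla \mathcal{L}(\theta)\big)\, u_i\right]
  = \sum_{i=1}^{k} \mathbb{E}\big[u_i u_i^\top\big]\, \nabla \mathcal{L}(\theta)
  = k\, \nabla \mathcal{L}(\theta).
```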
6. Empirical Results and Memory Efficiency
BSZO has been evaluated on RoBERTa-large (355M parameters) and on decoder-only LLMs including OPT-1.3B, Mistral-7B, and OPT-13B across multiple GLUE and SuperGLUE classification/reasoning tasks (SST-2, RTE, CB, COPA, WIC, WSC, TREC). Baseline comparisons involve MeZO (SGD), MeZO-Adam, and HiZOO. Average accuracy across tasks (%):
| Model | BSZO | MeZO | HiZOO | MeZO-Adam |
|---|---|---|---|---|
| RoBERTa | 74.94% | 74.82% | 72.98% | 71.33% |
| OPT-1.3B | 74.25% | 71.99% | – | – |
| Mistral-7B | 77.83% | 75.31% | – | – |
| OPT-13B | 73.76% | 67.09% | – | – |
GPU memory usage for BSZO stays within 1.00×–1.08× of MeZO's inference-only footprint, compared to 1.7×–2.7× for HiZOO/MeZO-Adam, which require auxiliary storage for moments or Hessian approximations. On OPT-13B, BSZO achieves up to a 6.67-percentage-point absolute average gain over MeZO, while maintaining low memory consumption (Feng et al., 4 Jan 2026).
7. Limitations, Extensions, and Practical Considerations
BSZO’s subspace dimension $k$ must balance representational capacity against per-step computational burden: small $k$ risks omitting crucial gradient components, while large $k$ increases the number of required function evaluations. Finite-difference bias (governed by the perturbation scale $\mu$) and the residual variance floor set practical accuracy limits. The use of coordinate-axis sampling, while simple, may be suboptimal for objectives with highly anisotropic curvature.
Potential extensions include subspace learning informed by Hessian or gradient-covariance statistics, heavy-tailed non-Gaussian priors for robust estimation, dynamic adjustment of $k$ and of the basis-alternation schedule, and the incorporation of local second-order information. Seed caching remains essential for memory efficiency, and reduced-precision arithmetic may introduce nondeterminism, partially mitigated by reusing cached outputs.
Hyperparameters show stability across models and are not highly sensitive; the default settings provide strong baseline performance. On 24–96 GB GPUs, BSZO enables high-accuracy LLM fine-tuning with memory close to the inference-only footprint, outperforming prior ZO-based methods by margins of 2–7 percentage points across models sized 1B–13B.
In summary, BSZO transforms zeroth-order gradient estimation into Bayesian inference in a random subspace, leveraging multiple directional queries, Kalman-style updates, and adaptive noise estimation, yielding both rapid convergence and empirical state-of-the-art ZO fine-tuning of LLMs with near-minimal memory cost (Feng et al., 4 Jan 2026).