
Bayesian Subspace Zeroth-Order Optimization

Updated 11 January 2026
  • BSZO is a memory-efficient optimizer that uses Bayesian Kalman filtering to aggregate finite-difference gradient estimates within a low-dimensional subspace for LLM fine-tuning.
  • It achieves provably improved convergence rates by operating in a reduced parameter space with adaptive noise scaling and efficient seed caching to minimize computational cost.
  • Empirical evaluations on models like RoBERTa-large and OPT-13B demonstrate that BSZO outperforms prior zeroth-order methods in accuracy while maintaining inference-level memory footprints.

Bayesian Subspace Zeroth-Order Optimization (BSZO) is a memory-efficient optimizer for fine-tuning LLMs using only function evaluations, bypassing the prohibitive cost of full backpropagation. BSZO advances the field of zeroth-order (ZO) optimization by using a Bayesian (Kalman-filter) approach to aggregate information from multiple directional finite-difference measurements within a randomly projected low-dimensional subspace. This framework leads to provably improved convergence rates and empirically outperforms prior ZO methods across a range of LLMs and tasks, while maintaining memory footprints nearly identical to inference-only baselines (Feng et al., 4 Jan 2026).

1. Stochastic Optimization and Subspace Projection

The goal in BSZO is to minimize a stochastic objective

$$\min_{\theta \in \mathbb{R}^n} \mathcal{L}(\theta) = \mathbb{E}_{\xi \sim \mathcal{D}}\left[\mathcal{L}(\theta; \xi)\right]$$

where $\theta$ parameterizes the LLM and direct gradient computation $\nabla_\theta \mathcal{L}(\theta)$ via backpropagation is memory-prohibitive. To enable efficient gradient estimation, BSZO samples $k$ Gaussian random directions $z_1, \ldots, z_k \in \mathbb{R}^n$ to form a basis $B = [z_1 ~ \cdots ~ z_k] \in \mathbb{R}^{n \times k}$. The true gradient $g$ is projected into this $k$-dimensional subspace:

$$\tilde{g} = B^\top g, \qquad g \approx B \tilde{g}$$

Gradient information is estimated via finite differences along directions $d \in \mathbb{R}^k$:

$$\hat{y}(d) = \frac{\mathcal{L}(\theta + \varepsilon B d) - \mathcal{L}(\theta)}{\varepsilon} \approx d^\top \tilde{g}$$

This approach reduces both computational and memory cost by restricting estimation to a statistically rich but tractable subspace.
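
A minimal NumPy sketch of this projection-and-measurement step, using a toy quadratic loss as a stand-in for the LLM objective (the function names, loss, and dimensions below are illustrative, not from the paper):

```python
import numpy as np

def subspace_fd_measurement(loss_fn, theta, B, d, eps=1e-4):
    """Finite-difference estimate of d^T g_tilde, where g_tilde = B^T grad L(theta)."""
    perturbation = B @ d                       # lift the subspace direction to R^n
    return (loss_fn(theta + eps * perturbation) - loss_fn(theta)) / eps

# Toy quadratic loss standing in for the LLM objective (purely illustrative).
rng = np.random.default_rng(0)
n, k = 1000, 8
A = rng.standard_normal((n, n)) / np.sqrt(n)
loss_fn = lambda th: 0.5 * th @ (A.T @ A) @ th
theta = rng.standard_normal(n)

B = rng.standard_normal((n, k))                # k Gaussian random directions as columns
d = np.eye(k)[0]                               # probe the first subspace coordinate
print("d^T g_tilde (finite differences):", subspace_fd_measurement(loss_fn, theta, B, d))
print("d^T g_tilde (exact):             ", d @ (B.T @ (A.T @ A @ theta)))
```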

2. Bayesian Kalman Filtering for Gradient Aggregation

BSZO interprets each finite-difference measurement $\hat{y}(d)$ as a noisy linear observation of the latent subspace-projected gradient $\tilde{g}$. The corresponding stochastic state-space and measurement model are:

  • State transition (static): $\tilde{g}_j = \tilde{g}_{j-1}$, with prior $p(\tilde{g}_0) = \mathcal{N}(0, \sigma_p^2 I_k)$
  • Measurement at step $j$: $y_j = d_j^\top \tilde{g} + \nu_j$, with $\nu_j \sim \mathcal{N}(0, \sigma_e^2 \|d_j\|^2)$

Given $m$ such measurements, the posterior over $\tilde{g}$ is Gaussian, $p(\tilde{g} \mid Y) = \mathcal{N}(\mu^{(m)}, \Sigma^{(m)})$, with analytic expressions for the mean $\mu^{(m)}$ and covariance $\Sigma^{(m)}$ after batch or sequential (Kalman) updates. The Kalman filter formulation enables efficient sequential incorporation of directional measurements and yields an adaptively improving posterior over the subspace-projected gradient.
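
A compact sketch of the sequential (Kalman) form of this posterior update under the static-state model above; the synthetic "true" subspace gradient and the specific noise value are illustrative only:

```python
import numpy as np

def kalman_update(mu, Sigma, d, y, sigma_e2):
    """One sequential Bayesian update of the posterior N(mu, Sigma) over g_tilde,
    given a scalar measurement y = d^T g_tilde + noise with variance sigma_e2 * ||d||^2."""
    R = sigma_e2 * float(d @ d)          # measurement noise variance
    S = float(d @ Sigma @ d) + R         # innovation variance
    K = (Sigma @ d) / S                  # Kalman gain, shape (k,)
    mu_new = mu + K * (y - float(d @ mu))
    Sigma_new = Sigma - np.outer(K, d @ Sigma)
    return mu_new, Sigma_new

# Prior over g_tilde: N(0, sigma_p^2 I_k).
k, sigma_p2, sigma_e2 = 8, 1.0, 0.01
mu, Sigma = np.zeros(k), sigma_p2 * np.eye(k)

# Fold in a few synthetic measurements (the "true" g_tilde here is made up for illustration).
rng = np.random.default_rng(1)
g_tilde_true = rng.standard_normal(k)
for _ in range(k):
    d = rng.standard_normal(k)
    y = d @ g_tilde_true + rng.normal(scale=np.sqrt(sigma_e2) * np.linalg.norm(d))
    mu, Sigma = kalman_update(mu, Sigma, d, y, sigma_e2)
print("posterior mean error:", np.linalg.norm(mu - g_tilde_true))
```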

3. Adaptive Noise Scaling via Residuals

To address nonstationary or poorly calibrated measurement noise, BSZO computes the residual

$$r_j = \frac{y_j - d_j^\top \mu^{(j-1)}}{\|d_j\|}$$

and adaptively updates the measurement noise variance via exponential smoothing:

$$(\sigma_e^{(j)})^2 = (1-\alpha)\,(\sigma_e^{(j-1)})^2 + \alpha\, r_j^2, \qquad \alpha \in (0, 1)$$

This residual-based adaptation ensures the posterior covariance $\Sigma^{(j)}$ and downstream learning rate $\eta$ remain sensitive to empirical noise levels, supporting stable and efficient optimization even in heterogeneous noise regimes.
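
The smoothing rule itself is a one-liner; the sketch below assumes it is applied once per incoming measurement (the exact interleaving with the Kalman update is not spelled out in this summary):

```python
import numpy as np

def update_noise_variance(sigma_e2, y, d, mu, alpha=0.1):
    """Residual-based exponential smoothing of the measurement noise variance
    (alpha = 0.1 matches the reported default)."""
    r = (y - float(d @ mu)) / np.linalg.norm(d)   # normalized innovation residual
    return (1.0 - alpha) * sigma_e2 + alpha * r ** 2
```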

4. Full Algorithm and Hyperparameterization

BSZO proceeds by repeatedly constructing the subspace $B$, caching reference seeds, and collecting $m$ (typically $m = k$) directional finite-difference observations per iteration. The algorithm alternates between coordinate axes and adaptively selected directions (chosen by the maximal diagonal entry of the posterior covariance), feeding the directional measurements through the Kalman update equations to refine the posterior mean and covariance. The LLM parameters $\theta$ are then updated along all $k$ subspace directions, scaled by the posterior mean coefficients.

Default hyperparameter values are found to be robust across a range of architectures ($\varepsilon = 10^{-4}$, $\alpha = 0.1$, $\sigma_p^2 = 1$). Only $O(k^2)$ memory is needed to store the posterior state; full perturbation matrices are regenerated deterministically via seed caching, avoiding $O(nk)$ storage.
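
Putting the pieces together, a schematic single iteration might look like the sketch below; the direction-alternation rule, the per-iteration handling of state, and all helper names are assumptions consistent with the description above, not the paper's reference implementation:

```python
import numpy as np

def bszo_step(loss_fn, theta, k=8, m=None, eps=1e-4, eta=1e-3,
              sigma_p2=1.0, sigma_e2=0.01, alpha=0.1, seed=0):
    """One illustrative BSZO-style iteration: seed-cached subspace, m directional
    finite differences, sequential Kalman updates, parameter step along B @ mu."""
    n = theta.shape[0]
    m = k if m is None else m
    base_loss = loss_fn(theta)                       # single reference evaluation

    def basis():
        # Regenerate B deterministically from the cached seed instead of storing O(nk) floats.
        return np.random.default_rng(seed).standard_normal((n, k))

    mu, Sigma = np.zeros(k), sigma_p2 * np.eye(k)
    for j in range(m):
        # Assumed alternation: even steps walk the coordinate axes, odd steps probe
        # the axis with the largest posterior-covariance diagonal entry.
        if j % 2 == 0:
            d = np.eye(k)[j % k]
        else:
            d = np.eye(k)[int(np.argmax(np.diag(Sigma)))]
        y = (loss_fn(theta + eps * (basis() @ d)) - base_loss) / eps

        # Residual-based noise adaptation (Section 3).
        r = (y - float(d @ mu)) / np.linalg.norm(d)
        sigma_e2 = (1 - alpha) * sigma_e2 + alpha * r ** 2

        # Sequential Kalman update (Section 2).
        S = float(d @ Sigma @ d) + sigma_e2 * float(d @ d)
        K = (Sigma @ d) / S
        mu = mu + K * (y - float(d @ mu))
        Sigma = Sigma - np.outer(K, d @ Sigma)

    # Update theta along all k subspace directions, scaled by the posterior mean.
    return theta - eta * (basis() @ mu), sigma_e2
```

In practice the seed, reference loss, noise estimate, and posterior state would be carried across iterations; the sketch only mirrors the per-iteration structure described above.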

5. Theoretical Guarantees and Convergence Analysis

Under standard smoothness and bounded-variance assumptions, BSZO achieves a provably faster convergence rate than prior ZO schemes. Specifically, if $m = k$, the expected squared gradient norm averaged over $T$ iterations satisfies

$$\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\,\|\nabla\mathcal{L}(\theta_t)\|^2 \;\leq\; \frac{\mathcal{L}(\theta_0) - \mathcal{L}^*}{\beta(\eta)\,\eta\,\gamma\,k\,T} + \frac{L\,\eta\,\gamma\,\bigl(\tilde{n}\,\operatorname{tr}(\Sigma) + n\,\sigma_\varepsilon^2\bigr)}{2\,\beta(\eta)}$$

with $\gamma = \sigma_p^2/(\sigma_p^2 + \sigma_e^2)$, $\tilde{n} = n + k + 1$, and $\beta(\eta) = 1 - (L \eta \gamma \tilde{n})/2$. For suitable $\eta$, this yields an iteration-complexity speedup by a factor of $k/\gamma$ relative to standard coordinatewise ZO methods. The update direction is, in expectation, a scaled negative gradient up to an $O(\varepsilon^3)$ bias.
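
For concreteness, with the default prior variance $\sigma_p^2 = 1$ and an illustrative (not reported) noise level $\sigma_e^2 = 0.25$, the contraction factor evaluates to

$$\gamma = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_e^2} = \frac{1}{1 + 0.25} = 0.8,$$

so cleaner measurements (smaller $\sigma_e^2$) push $\gamma$ toward its maximum of 1.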

6. Empirical Results and Memory Efficiency

BSZO has been evaluated on RoBERTa-large (355M parameters) and on decoder-only LLMs including OPT-1.3B, Mistral-7B, and OPT-13B across multiple GLUE and SuperGLUE classification/reasoning tasks (SST-2, RTE, CB, COPA, WIC, WSC, TREC). Baseline comparisons involve MeZO (SGD), MeZO-Adam, and HiZOO. Empirical findings include:

| Model | BSZO | MeZO | HiZOO | MeZO-Adam |
|---|---|---|---|---|
| RoBERTa-large | 74.94% | 74.82% | 72.98% | 71.33% |
| OPT-1.3B | 74.25% | 71.99% | — | — |
| Mistral-7B | 77.83% | 75.31% | — | — |
| OPT-13B | 73.76% | 67.09% | — | — |

GPU memory usage for BSZO is $O(n) + O(k^2)$, close to MeZO's inference-only footprint (1.00×–1.08×), compared to 1.7×–2.7× for HiZOO and MeZO-Adam, which require auxiliary storage for moment or Hessian approximations. On OPT-13B, BSZO achieves up to a 6.67-percentage-point absolute average gain over MeZO while maintaining low memory consumption (Feng et al., 4 Jan 2026).
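
A rough back-of-the-envelope illustration of why the $O(k^2)$ posterior state is negligible next to the $O(n)$ model weights (the parameter count, precision, and $k$ below are assumed example values, not figures from the paper):

```python
# Hypothetical example: OPT-13B-scale model in 16-bit precision, subspace size k = 16.
n, bytes_per_param, k = 13_000_000_000, 2, 16

weights_gb = n * bytes_per_param / 1e9        # O(n): weights already held for inference
posterior_kb = (k * k + k) * 4 / 1e3          # O(k^2): posterior mean + covariance in fp32

print(f"model weights: ~{weights_gb:.0f} GB")             # ~26 GB
print(f"Kalman posterior state: ~{posterior_kb:.1f} KB")  # ~1 KB, effectively free
```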

7. Limitations, Extensions, and Practical Considerations

BSZO’s subspace dimension $k$ must balance representational capacity against per-step computational burden: small $k$ risks omitting crucial gradient components, while large $k$ increases the number of required function evaluations. Finite-difference bias ($O(\varepsilon L)$) and the variance floor ($\operatorname{tr}(\Sigma) + \sigma_\varepsilon^2$) set practical accuracy limits. Coordinate-axis sampling, while simple, may be suboptimal for objectives with highly anisotropic curvature.

Potential extensions include subspace learning informed by Hessian or gradient-covariance statistics, heavy-tailed non-Gaussian priors for robust estimation, dynamic adjustment of kk and basis alternation, and the incorporation of local second-order information. Seed caching remains essential for efficient memory usage, and reduced-precision arithmetic may introduce nondeterminism, partially ameliorated by reusing cached outputs.

Hyperparameters are stable across models and not highly sensitive: $\varepsilon = 10^{-4}$, $\alpha = 0.1$, $\sigma_p^2 = 1$ provide strong baseline performance. On 24–96 GB GPUs, BSZO enables high-accuracy LLM fine-tuning with memory essentially at inference-level cost, outperforming prior ZO-based methods by margins of 2–7 percentage points across models from 1B to 13B parameters.

In summary, BSZO transforms zeroth-order gradient estimation into Bayesian inference in a random subspace, leveraging multiple directional queries, Kalman-style updates, and adaptive noise estimation, yielding both rapid convergence and empirical state-of-the-art ZO fine-tuning of LLMs with near-minimal memory cost (Feng et al., 4 Jan 2026).
