
Bayesian Subspace Zeroth-Order Optimization

Updated 11 January 2026
  • BSZO is a memory-efficient optimizer that uses Bayesian Kalman filtering to aggregate finite-difference gradient estimates within a low-dimensional subspace for LLM fine-tuning.
  • It achieves provably improved convergence rates by operating in a reduced parameter space with adaptive noise scaling and efficient seed caching to minimize computational cost.
  • Empirical evaluations on models like RoBERTa-large and OPT-13B demonstrate that BSZO outperforms prior zeroth-order methods in accuracy while maintaining inference-level memory footprints.

Bayesian Subspace Zeroth-Order Optimization (BSZO) is a memory-efficient optimizer for fine-tuning LLMs using only function evaluations, bypassing the prohibitive cost of full backpropagation. BSZO advances the field of zeroth-order (ZO) optimization by using a Bayesian (Kalman-filter) approach to aggregate information from multiple directional finite-difference measurements within a randomly projected low-dimensional subspace. This framework leads to provably improved convergence rates and empirically outperforms prior ZO methods across a range of LLMs and tasks, while maintaining memory footprints nearly identical to inference-only baselines (Feng et al., 4 Jan 2026).

1. Stochastic Optimization and Subspace Projection

The goal in BSZO is to minimize a stochastic objective

$$\min_{\theta \in \mathbb{R}^n} \mathcal{L}(\theta) = \mathbb{E}_{\xi \sim \mathcal{D}}\left[\mathcal{L}(\theta; \xi)\right]$$

where $\theta$ parameterizes the LLM and direct gradient computation $\nabla_\theta \mathcal{L}(\theta)$ via backpropagation is memory-prohibitive. To enable efficient gradient estimation, BSZO samples $k$ Gaussian random directions $z_1, \ldots, z_k \in \mathbb{R}^n$ to form a basis $B = [z_1 ~ \cdots ~ z_k] \in \mathbb{R}^{n \times k}$. The true gradient $g$ is projected into this $k$-dimensional subspace:

$$\tilde{g} = B^\top g, \qquad g \approx B \tilde{g}$$

Gradient information is estimated via finite differences along directions $d \in \mathbb{R}^k$:

$$\hat{y}(d) = \frac{\mathcal{L}(\theta + \varepsilon B d) - \mathcal{L}(\theta)}{\varepsilon} \approx d^\top \tilde{g}$$

This approach reduces both computational and memory cost by restricting estimation to a statistically rich but tractable subspace.
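
A minimal NumPy sketch of this projection-and-measurement step, using a toy quadratic loss as a stand-in for the LLM objective (the function names, loss, and dimensions below are illustrative, not from the paper):

```python
import numpy as np

def subspace_fd_measurement(loss_fn, theta, B, d, eps=1e-4):
    """Finite-difference estimate of d^T g_tilde, where g_tilde = B^T grad L(theta)."""
    perturbation = B @ d                       # lift the subspace direction to R^n
    return (loss_fn(theta + eps * perturbation) - loss_fn(theta)) / eps

# Toy quadratic loss standing in for the LLM objective (purely illustrative).
rng = np.random.default_rng(0)
n, k = 1000, 8
A = rng.standard_normal((n, n)) / np.sqrt(n)
loss_fn = lambda th: 0.5 * th @ (A.T @ A) @ th
theta = rng.standard_normal(n)

B = rng.standard_normal((n, k))                # k Gaussian random directions as columns
d = np.eye(k)[0]                               # probe the first subspace coordinate
print("d^T g_tilde (finite differences):", subspace_fd_measurement(loss_fn, theta, B, d))
print("d^T g_tilde (exact):             ", d @ (B.T @ (A.T @ A @ theta)))
```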

2. Bayesian Kalman Filtering for Gradient Aggregation

BSZO interprets each finite-difference measurement $\hat{y}(d)$ as a noisy linear observation of the latent subspace-projected gradient $\tilde{g}$. The corresponding stochastic state-space and measurement model are:

  • State transition (static): $\tilde{g}_j = \tilde{g}_{j-1}$, with prior $p(\tilde{g}_0) = \mathcal{N}(0, \sigma_p^2 I_k)$
  • Measurement at step $j$: $y_j = d_j^\top \tilde{g} + \nu_j$, with $\nu_j \sim \mathcal{N}(0, \sigma_e^2 \|d_j\|^2)$

Given $m$ such measurements, the posterior over $\tilde{g}$ is Gaussian, $p(\tilde{g} \mid Y) = \mathcal{N}(\mu^{(m)}, \Sigma^{(m)})$, with analytic expressions for the mean $\mu^{(m)}$ and covariance $\Sigma^{(m)}$ after batch or sequential (Kalman) updates. The Kalman filter formulation enables efficient sequential incorporation of directional measurements and yields an adaptively improving posterior over the subspace-projected gradient.
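
A compact sketch of the sequential (Kalman) form of this posterior update under the static-state model above; the synthetic "true" subspace gradient and the specific noise value are illustrative only:

```python
import numpy as np

def kalman_update(mu, Sigma, d, y, sigma_e2):
    """One sequential Bayesian update of the posterior N(mu, Sigma) over g_tilde,
    given a scalar measurement y = d^T g_tilde + noise with variance sigma_e2 * ||d||^2."""
    R = sigma_e2 * float(d @ d)          # measurement noise variance
    S = float(d @ Sigma @ d) + R         # innovation variance
    K = (Sigma @ d) / S                  # Kalman gain, shape (k,)
    mu_new = mu + K * (y - float(d @ mu))
    Sigma_new = Sigma - np.outer(K, d @ Sigma)
    return mu_new, Sigma_new

# Prior over g_tilde: N(0, sigma_p^2 I_k).
k, sigma_p2, sigma_e2 = 8, 1.0, 0.01
mu, Sigma = np.zeros(k), sigma_p2 * np.eye(k)

# Fold in a few synthetic measurements (the "true" g_tilde here is made up for illustration).
rng = np.random.default_rng(1)
g_tilde_true = rng.standard_normal(k)
for _ in range(k):
    d = rng.standard_normal(k)
    y = d @ g_tilde_true + rng.normal(scale=np.sqrt(sigma_e2) * np.linalg.norm(d))
    mu, Sigma = kalman_update(mu, Sigma, d, y, sigma_e2)
print("posterior mean error:", np.linalg.norm(mu - g_tilde_true))
```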

3. Adaptive Noise Scaling via Residuals

To address nonstationary or poorly calibrated measurement noise, BSZO computes the residual

$$r_j = \frac{y_j - d_j^\top \mu^{(j-1)}}{\|d_j\|}$$

and adaptively updates the measurement noise variance via exponential smoothing:

$$(\sigma_e^{(j)})^2 = (1-\alpha)\,(\sigma_e^{(j-1)})^2 + \alpha\, r_j^2, \qquad \alpha \in (0, 1)$$

This residual-based adaptation ensures the posterior covariance $\Sigma^{(j)}$ and downstream learning rate $\eta$ remain sensitive to empirical noise levels, supporting stable and efficient optimization even in heterogeneous noise regimes.
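
The smoothing rule itself is a one-liner; the sketch below assumes it is applied once per incoming measurement (the exact interleaving with the Kalman update is not spelled out in this summary):

```python
import numpy as np

def update_noise_variance(sigma_e2, y, d, mu, alpha=0.1):
    """Residual-based exponential smoothing of the measurement noise variance
    (alpha = 0.1 matches the reported default)."""
    r = (y - float(d @ mu)) / np.linalg.norm(d)   # normalized innovation residual
    return (1.0 - alpha) * sigma_e2 + alpha * r ** 2
```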

4. Full Algorithm and Hyperparameterization

BSZO proceeds by repeatedly constructing the subspace $B$, caching reference seeds, and collecting $m$ (typically $m = k$) directional finite-difference observations per iteration. The algorithm alternates between coordinate axes and adaptively selected directions (chosen by the maximal diagonal entry of the posterior covariance), feeding the directional measurements through the Kalman update equations to refine the posterior mean and covariance. The LLM parameters $\theta$ are then updated along all $k$ subspace directions, scaled by the posterior mean coefficients.

Default hyperparameter values are found to be robust across a range of architectures ($\varepsilon = 10^{-4}$, $\alpha = 0.1$, $\sigma_p^2 = 1$). Only $O(k^2)$ memory is needed to store the posterior state; full perturbation matrices are regenerated deterministically via seed caching, avoiding $O(nk)$ storage.
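
Putting the pieces together, a schematic single iteration might look like the sketch below; the direction-alternation rule, the per-iteration handling of state, and all helper names are assumptions consistent with the description above, not the paper's reference implementation:

```python
import numpy as np

def bszo_step(loss_fn, theta, k=8, m=None, eps=1e-4, eta=1e-3,
              sigma_p2=1.0, sigma_e2=0.01, alpha=0.1, seed=0):
    """One illustrative BSZO-style iteration: seed-cached subspace, m directional
    finite differences, sequential Kalman updates, parameter step along B @ mu."""
    n = theta.shape[0]
    m = k if m is None else m
    base_loss = loss_fn(theta)                       # single reference evaluation

    def basis():
        # Regenerate B deterministically from the cached seed instead of storing O(nk) floats.
        return np.random.default_rng(seed).standard_normal((n, k))

    mu, Sigma = np.zeros(k), sigma_p2 * np.eye(k)
    for j in range(m):
        # Assumed alternation: even steps walk the coordinate axes, odd steps probe
        # the axis with the largest posterior-covariance diagonal entry.
        if j % 2 == 0:
            d = np.eye(k)[j % k]
        else:
            d = np.eye(k)[int(np.argmax(np.diag(Sigma)))]
        y = (loss_fn(theta + eps * (basis() @ d)) - base_loss) / eps

        # Residual-based noise adaptation (Section 3).
        r = (y - float(d @ mu)) / np.linalg.norm(d)
        sigma_e2 = (1 - alpha) * sigma_e2 + alpha * r ** 2

        # Sequential Kalman update (Section 2).
        S = float(d @ Sigma @ d) + sigma_e2 * float(d @ d)
        K = (Sigma @ d) / S
        mu = mu + K * (y - float(d @ mu))
        Sigma = Sigma - np.outer(K, d @ Sigma)

    # Update theta along all k subspace directions, scaled by the posterior mean.
    return theta - eta * (basis() @ mu), sigma_e2
```

In practice the seed, reference loss, noise estimate, and posterior state would be carried across iterations; the sketch only mirrors the per-iteration structure described above.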

5. Theoretical Guarantees and Convergence Analysis

Under standard smoothness and bounded-variance assumptions, BSZO achieves a provably faster convergence rate than prior ZO schemes. Specifically, if $m = k$, the expected squared gradient norm averaged over $T$ iterations satisfies

$$\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\,\|\nabla\mathcal{L}(\theta_t)\|^2 \;\leq\; \frac{\mathcal{L}(\theta_0) - \mathcal{L}^*}{\beta(\eta)\,\eta\,\gamma\,k\,T} + \frac{L\,\eta\,\gamma\,\bigl(\tilde{n}\,\operatorname{tr}(\Sigma) + n\,\sigma_\varepsilon^2\bigr)}{2\,\beta(\eta)}$$

with $\gamma = \sigma_p^2/(\sigma_p^2 + \sigma_e^2)$, $\tilde{n} = n + k + 1$, and $\beta(\eta) = 1 - (L \eta \gamma \tilde{n})/2$. For suitable $\eta$, this yields an iteration-complexity speedup by a factor of $k/\gamma$ relative to standard coordinatewise ZO methods. The update direction is, in expectation, a scaled negative gradient up to an $O(\varepsilon^3)$ bias.
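
For concreteness, with the default prior variance $\sigma_p^2 = 1$ and an illustrative (not reported) noise level $\sigma_e^2 = 0.25$, the contraction factor evaluates to

$$\gamma = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_e^2} = \frac{1}{1 + 0.25} = 0.8,$$

so cleaner measurements (smaller $\sigma_e^2$) push $\gamma$ toward its maximum of 1.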

6. Empirical Results and Memory Efficiency

BSZO has been evaluated on RoBERTa-large (355M parameters) and on decoder-only LLMs including OPT-1.3B, Mistral-7B, and OPT-13B across multiple GLUE and SuperGLUE classification/reasoning tasks (SST-2, RTE, CB, COPA, WIC, WSC, TREC). Baseline comparisons involve MeZO (SGD), MeZO-Adam, and HiZOO. Empirical findings include:

| Model | BSZO | MeZO | HiZOO | MeZO-Adam |
|---|---|---|---|---|
| RoBERTa-large | 74.94% | 74.82% | 72.98% | 71.33% |
| OPT-1.3B | 74.25% | 71.99% | — | — |
| Mistral-7B | 77.83% | 75.31% | — | — |
| OPT-13B | 73.76% | 67.09% | — | — |

GPU memory usage for BSZO is $O(n) + O(k^2)$, close to MeZO's inference-only footprint (1.00×–1.08×), compared to 1.7×–2.7× for HiZOO and MeZO-Adam, which require auxiliary storage for moment or Hessian approximations. On OPT-13B, BSZO achieves up to a 6.67-percentage-point absolute average gain over MeZO while maintaining low memory consumption (Feng et al., 4 Jan 2026).
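
A rough back-of-the-envelope illustration of why the $O(k^2)$ posterior state is negligible next to the $O(n)$ model weights (the parameter count, precision, and $k$ below are assumed example values, not figures from the paper):

```python
# Hypothetical example: OPT-13B-scale model in 16-bit precision, subspace size k = 16.
n, bytes_per_param, k = 13_000_000_000, 2, 16

weights_gb = n * bytes_per_param / 1e9        # O(n): weights already held for inference
posterior_kb = (k * k + k) * 4 / 1e3          # O(k^2): posterior mean + covariance in fp32

print(f"model weights: ~{weights_gb:.0f} GB")             # ~26 GB
print(f"Kalman posterior state: ~{posterior_kb:.1f} KB")  # ~1 KB, effectively free
```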

7. Limitations, Extensions, and Practical Considerations

BSZO’s subspace dimension $k$ must balance representational capacity against per-step computational burden: small $k$ risks omitting crucial gradient components, while large $k$ increases the number of required function evaluations. Finite-difference bias ($O(\varepsilon L)$) and the variance floor ($\operatorname{tr}(\Sigma) + \sigma_\varepsilon^2$) set practical accuracy limits. Coordinate-axis sampling, while simple, may be suboptimal for objectives with highly anisotropic curvature.

Potential extensions include subspace learning informed by Hessian or gradient-covariance statistics, heavy-tailed non-Gaussian priors for robust estimation, dynamic adjustment of kk and basis alternation, and the incorporation of local second-order information. Seed caching remains essential for efficient memory usage, and reduced-precision arithmetic may introduce nondeterminism, partially ameliorated by reusing cached outputs.

Hyperparameters are stable across models and not highly sensitive: $\varepsilon = 10^{-4}$, $\alpha = 0.1$, $\sigma_p^2 = 1$ provide strong baseline performance. On 24–96 GB GPUs, BSZO enables high-accuracy LLM fine-tuning with memory essentially at inference-level cost, outperforming prior ZO-based methods by margins of 2–7 percentage points across models from 1B to 13B parameters.

In summary, BSZO transforms zeroth-order gradient estimation into Bayesian inference in a random subspace, leveraging multiple directional queries, Kalman-style updates, and adaptive noise estimation, yielding both rapid convergence and empirical state-of-the-art ZO fine-tuning of LLMs with near-minimal memory cost (Feng et al., 4 Jan 2026).
