KerZOO: Kernel-Informed ZO Fine-Tuning
- KerZOO is a framework that applies kernel-based bias correction to enable gradient-free fine-tuning of large language models.
- It employs a novel kernel weighting scheme to cancel the dominant O(ε²) bias in gradient estimation, thereby accelerating convergence.
- Empirical evaluations report up to a 2.9% accuracy gain, 74% fewer iterations, and a matching 74% reduction in GPU hours relative to the MeZO baseline.
Kernel-Function-Informed Zeroth-Order Optimization (KerZOO) is a framework for fine-tuning LLMs that addresses the slow convergence of classical zeroth-order (ZO) optimization. It uses a kernel function, constructed from an analytical characterization of the estimation bias, to improve gradient estimation. This allows LLM training with forward passes only, without backpropagation, improving memory efficiency and speed of convergence (Mi et al., 24 May 2025).
1. Zeroth-Order Optimization in LLM Fine-Tuning
Zeroth-order (ZO) optimization is a gradient-free paradigm in which gradients are not computed via backpropagation but are approximated from function evaluations through finite-difference queries. For fine-tuning LLMs, the standard two-point estimator is
$$\hat g(\theta) = \frac{L(\theta + \epsilon u) - L(\theta - \epsilon u)}{2\epsilon}\,u,$$
where $L$ is the loss, $u$ is a random perturbation direction, and $\epsilon$ is the perturbation scale. Because only forward passes are needed, ZO optimization dramatically reduces memory requirements compared to first-order methods.
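As a minimal sketch of such a two-point estimate, assuming Gaussian perturbation directions and the symmetric finite-difference form (the names `zo_gradient`, `mean_estimate`, and the toy linear loss are illustrative and not taken from the paper):

```python
import random


def zo_gradient(loss, theta, eps, rng):
    """Two-point zeroth-order gradient estimate at theta (a list of floats)."""
    # Assumption: perturbation direction u ~ N(0, I).
    u = [rng.gauss(0.0, 1.0) for _ in theta]
    plus = [t + eps * ui for t, ui in zip(theta, u)]
    minus = [t - eps * ui for t, ui in zip(theta, u)]
    # Two loss evaluations (forward passes), no backpropagation.
    scale = (loss(plus) - loss(minus)) / (2.0 * eps)
    return [scale * ui for ui in u]


def mean_estimate(loss, theta, eps, samples, seed=0):
    """Average many independent estimates to expose the estimator's mean."""
    rng = random.Random(seed)
    total = [0.0] * len(theta)
    for _ in range(samples):
        g = zo_gradient(loss, theta, eps, rng)
        total = [a + b for a, b in zip(total, g)]
    return [a / samples for a in total]


# Toy linear loss whose true gradient is the vector a.
a = [1.0, -2.0, 0.5]
linear_loss = lambda th: sum(ai * ti for ai, ti in zip(a, th))
estimate = mean_estimate(linear_loss, [0.3, 0.1, -0.4], eps=1e-3, samples=20000)
```

Each call costs two loss evaluations and stores no activations, which is where the memory saving comes from. A single estimate is very noisy, so the sketch averages many of them to show that the mean lines up with the true gradient.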
Despite these resource advantages, classical ZO converges substantially more slowly than first-order methods. The key limitations are:
- Heterogeneous curvature: High-dimensional parameter spaces of LLMs exhibit widely varying curvature, which affects the informativeness of randomly chosen perturbation directions.
- Lower-order estimation bias: The ZO estimator incurs an $O(\epsilon^2)$ bias due to non-ideal sampling of $u$, impeding convergence and dominating the error as training progresses.
2. Analytical Characterization of Estimation Bias
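A Taylor expansion of $L(\theta \pm \epsilon u)$ around $\theta$ up to fourth order yields:
$$L(\theta \pm \epsilon u) = L(\theta) \pm \epsilon\,\nabla L(\theta)^\top u + \frac{\epsilon^2}{2}\,u^\top \nabla^2 L(\theta)\,u \pm \frac{\epsilon^3}{6}\,\nabla^3 L(\theta)[u,u,u] + \frac{\epsilon^4}{24}\,\nabla^4 L(\theta)[u,u,u,u] + O(\epsilon^5),$$
so that, after taking the expectation of the two-point estimator $\hat g(\theta)$ over $u$ with $\mathbb{E}[uu^\top] = I$,
$$\mathbb{E}_u[\hat g(\theta)] = \nabla L(\theta) + \frac{\epsilon^2}{6}\,\mathbb{E}_u\big[\nabla^3 L(\theta)[u,u,u]\,u\big] + O(\epsilon^4).$$
Here, the second term is the second-order bias (in $\epsilon$), which depends on the distribution of $u$. This bias persists across steps, slowing or destabilizing optimization.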
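A one-dimensional worked example makes the $\epsilon^2$ term concrete. It assumes the symmetric two-point estimator and $u \sim \mathcal{N}(0, 1)$, neither of which is fixed by the text above, and takes $L(\theta) = \theta^3$:

```latex
\begin{aligned}
\hat g(\theta)
  &= \frac{(\theta + \epsilon u)^3 - (\theta - \epsilon u)^3}{2\epsilon}\,u
   = 3\theta^2 u^2 + \epsilon^2 u^4, \\
\mathbb{E}_u[\hat g(\theta)]
  &= 3\theta^2\,\mathbb{E}[u^2] + \epsilon^2\,\mathbb{E}[u^4]
   = \underbrace{3\theta^2}_{L'(\theta)} + \underbrace{3\epsilon^2}_{\text{bias}}.
\end{aligned}
```

The bias $3\epsilon^2$ does not average out with more samples; in this example only a smaller $\epsilon$ reduces it.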
3. Kernel-Based Correction in Gradient Estimation
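KerZOO introduces an estimator that uses a kernel weighting scheme to eliminate the dominant bias term. The bias-corrected estimator is constructed using a random scalar $r$ and a kernel function $K$:
$$\hat g_K(\theta) = \frac{L(\theta + \epsilon r u) - L(\theta - \epsilon r u)}{2\epsilon}\,K(r)\,u.$$
With $r$ independent of $u$ and $\mathbb{E}[uu^\top] = I$, the expectation of this estimator expands as:
$$\mathbb{E}[\hat g_K(\theta)] = \mathbb{E}[rK(r)]\,\nabla L(\theta) + \frac{\epsilon^2}{6}\,\mathbb{E}[r^3 K(r)]\,\mathbb{E}_u\big[\nabla^3 L(\theta)[u,u,u]\,u\big] + O(\epsilon^4).$$
By designing $K$ such that $\mathbb{E}[rK(r)]$ is a nonzero constant and $\mathbb{E}[r^3 K(r)] = 0$, the leading bias term is cancelled.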
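A minimal single-probe sketch of a kernel-weighted estimate, assuming the symmetric two-point form with the perturbation scaled by a scalar `r` and the result weighted by `kernel(r)`. The function name `kernel_zo_gradient` and the example kernel are illustrative assumptions; `u` and `r` are passed in explicitly so the arithmetic can be checked by hand:

```python
def kernel_zo_gradient(loss, theta, u, r, eps, kernel):
    """Single-probe kernel-weighted two-point gradient estimate."""
    plus = [t + eps * r * ui for t, ui in zip(theta, u)]
    minus = [t - eps * r * ui for t, ui in zip(theta, u)]
    # Finite difference along u at scale eps * r, reweighted by K(r).
    scale = (loss(plus) - loss(minus)) / (2.0 * eps) * kernel(r)
    return [scale * ui for ui in u]


# Example kernel (an assumption in this sketch): K(r) = (15 r / 4) (5 - 7 r^2).
kernel = lambda r: 15.0 * r / 4.0 * (5.0 - 7.0 * r * r)

# For L(theta) = 0.5 * ||theta||^2 the symmetric difference is exact:
# (L(theta + eps r u) - L(theta - eps r u)) / (2 eps) = r * <theta, u>.
quadratic = lambda th: 0.5 * sum(t * t for t in th)
theta, u, r = [1.0, 2.0], [0.5, -1.0], 0.4
g = kernel_zo_gradient(quadratic, theta, u, r, eps=1e-2, kernel=kernel)
expected = [r * kernel(r) * (-1.5) * ui for ui in u]  # <theta, u> = -1.5
```

The cost is the same two loss evaluations as the classical estimate; the only additions are the scalar `r` inside the perturbation and the weight `kernel(r)` outside it.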
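A practical family of polynomial kernels is given by
$$K(r) = \sum_{m=0}^{\ell} p_m'(0)\,p_m(r),$$
with $\{p_m\}$ the Legendre polynomials, orthonormal with respect to the uniform distribution on $[-1, 1]$. For $\ell = 3$,
$$K(r) = \frac{15r}{4}\,(5 - 7r^2),$$
with $r$ typically drawn uniformly from $[-1, 1]$, which satisfies the moment constraints required to nullify the $O(\epsilon^2)$ bias.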
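The moment constraints can be checked exactly for one concrete kernel. The sketch below assumes the kernel $K(r) = \frac{15r}{4}(5 - 7r^2)$ and $r$ uniform on $[-1, 1]$; the helper name `uniform_moment` is illustrative:

```python
from fractions import Fraction


def uniform_moment(coeffs, power):
    """E[r**power * P(r)] for r ~ Uniform[-1, 1], with P given as {degree: coeff}."""
    total = Fraction(0)
    for degree, coeff in coeffs.items():
        k = degree + power
        # For r ~ Uniform[-1, 1]: E[r^k] = 0 if k is odd, 1 / (k + 1) if k is even.
        if k % 2 == 0:
            total += Fraction(coeff) / (k + 1)
    return total


# K(r) = (15/4) r (5 - 7 r^2) = (75/4) r - (105/4) r^3
K = {1: Fraction(75, 4), 3: Fraction(-105, 4)}

zeroth = uniform_moment(K, 0)  # E[K(r)]
first = uniform_moment(K, 1)   # E[r K(r)]
third = uniform_moment(K, 3)   # E[r^3 K(r)]
```

Here `first` evaluates to exactly 1 and `third` to exactly 0, so under these assumptions the gradient term keeps unit weight while the $\epsilon^2$ term is multiplied by zero.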
4. KerZOO Algorithm and Implementation
KerZOO adopts a Nesterov-style accelerated mirror descent, supporting both full and parameter-efficient fine-tuning (e.g., LoRA):
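- Inputs: initial parameters $\theta_0$, number of iterations $T$, number of probes $n$, perturbation scale $\epsilon$, kernel $K$, learning rate $\eta$, and a gradient clip radius.
- For each iteration $t$:
  - Set the acceleration coefficient.
  - Compute the mirror descent point.
  - For each probe $i = 1, \dots, n$, sample a direction $u_i$ and a kernel scalar $r_i$.
  - Estimate the batched gradient by averaging the kernel-weighted estimates over the probes, with $\theta$ the current evaluation point:
    $$\hat g_t = \frac{1}{n} \sum_{i=1}^{n} \frac{L(\theta + \epsilon r_i u_i) - L(\theta - \epsilon r_i u_i)}{2\epsilon}\,K(r_i)\,u_i.$$
  - Apply the SGD-like update.
  - Apply the accelerated update.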
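The loop can be sketched in simplified form. This is not the accelerated mirror-descent scheme itself: a plain gradient step stands in for the acceleration and mirror-descent updates, and the Gaussian directions, uniform kernel scalar, example kernel, toy quadratic loss, and all names and hyperparameter values are assumptions made for illustration:

```python
import math
import random


def kernel(r):
    # Example kernel (assumption): K(r) = (15 r / 4) (5 - 7 r^2).
    return 15.0 * r / 4.0 * (5.0 - 7.0 * r * r)


def batched_estimate(loss, theta, eps, n_probes, rng):
    """Average of n_probes kernel-weighted two-point estimates."""
    grad = [0.0] * len(theta)
    for _ in range(n_probes):
        u = [rng.gauss(0.0, 1.0) for _ in theta]  # probe direction (assumed Gaussian)
        r = rng.uniform(-1.0, 1.0)                # kernel scalar (assumed uniform)
        plus = [t + eps * r * ui for t, ui in zip(theta, u)]
        minus = [t - eps * r * ui for t, ui in zip(theta, u)]
        scale = (loss(plus) - loss(minus)) / (2.0 * eps) * kernel(r)
        grad = [g + scale * ui / n_probes for g, ui in zip(grad, u)]
    return grad


def clip(grad, radius):
    """Rescale grad so that its Euclidean norm is at most radius."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(1.0, radius / norm) if norm > 0.0 else 1.0
    return [g * factor for g in grad]


def fine_tune(loss, theta, steps, n_probes, eps, lr, radius, seed=0):
    # Simplification: a plain gradient step replaces the accelerated updates.
    rng = random.Random(seed)
    for _ in range(steps):
        grad = clip(batched_estimate(loss, theta, eps, n_probes, rng), radius)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


# Toy stand-in for a fine-tuning loss.
quadratic = lambda th: 0.5 * sum(t * t for t in th)
start = [1.0, -2.0, 0.5, 1.5, -1.0]
end = fine_tune(quadratic, start, steps=300, n_probes=8, eps=1e-3, lr=0.05, radius=5.0)
```

Each iteration costs `2 * n_probes` loss evaluations, so the probe count trades per-step cost against the variance of the averaged estimate. The clipping step bounds the size of any single update, which matters because the kernel weight can be large for `r` near the ends of its range.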
Empirical guidance gives recommended hyperparameter settings for full fine-tuning and variant-specific values for LoRA (Mi et al., 24 May 2025). Gradient norm clipping is used to constrain updates.
5. Theoretical Properties
Given smoothness and $\beta$th-order Hölder continuity of the loss $L$, and a kernel $K$ that satisfies the moment conditions, the $O(\epsilon^2)$ term in KerZOO's estimation bias vanishes, leaving only higher-order terms in $\epsilon$. By contrast, classical ZO estimators suffer from an irreducible $O(\epsilon^2)$ bias. The variance of the estimator remains controlled for both approaches, but KerZOO's improvement in bias enables either a larger $\epsilon$ for faster mixing or fewer iterations to convergence.
The analysis also bounds the total number of iterations needed to reach a target accuracy, with constants that include a bound on the noise at the optimum. To reach the same accuracy, classical ZO requires a smaller $\epsilon$ or substantially more iterations.
6. Empirical Evaluation
KerZOO demonstrates quantitative improvements over standard ZO baselines. On OPT-2.7B fine-tuning for WSC and MultiRC tasks:
- Accuracy Gain: +2.9% (WSC), +2.6% (MultiRC) relative to MeZO.
- Iteration Reduction: –74% (WSC), –44% (MultiRC).
- GPU-hour Savings: –74% (WSC), –44% (MultiRC).
Across a broader suite of seven SuperGLUE classification tasks and two QA tasks, KerZOO outperforms MeZO and HiZOO in 6 out of 7 classification cases and matches or exceeds performance on generation benchmarks.
| Scenario | Accuracy Gain vs. MeZO | Iteration Reduction | GPU-hour Savings |
|---|---|---|---|
| WSC/OPT-2.7B Fine-tuning | +2.9% | –74% | –74% |
| MultiRC/OPT-2.7B Fine-tuning | +2.6% | –44% | –44% |
Performance is consistent for both full-parameter and parameter-efficient (LoRA) settings.
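The iteration reductions can also be read as speedup factors. As a small arithmetic sketch (the helper name `speedup_from_reduction` is illustrative, and the conversion assumes the cost per iteration is unchanged), a fractional reduction $p$ corresponds to a speedup of $1/(1-p)$:

```python
def speedup_from_reduction(reduction):
    """Speedup factor implied by a fractional reduction in iterations."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


wsc = speedup_from_reduction(0.74)      # roughly 3.85x
multirc = speedup_from_reduction(0.44)  # roughly 1.79x
```

Under that assumption, the 74% reduction on WSC is close to a fourfold speedup, while the 44% reduction on MultiRC is a little under twofold.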
7. Applications, Guidance, and Future Directions
KerZOO is directly applicable to both full and parameter-efficient fine-tuning of LLMs, with the kernel-corrected estimator enabling rapid adaptation in resource-constrained environments. The approach is compatible with low-rank adapter methods such as LoRA and is well suited to practitioners seeking to avoid the high memory cost of backpropagation.
Implementation guidance emphasizes the choice of kernel $K$, the probe count $n$, adaptive tuning of the perturbation scale $\epsilon$, and careful learning rate selection. The algorithm is reported to be robust to varying batch sizes and architectural settings.
Potential extensions include adaptation to vision-language models, structured model pruning, model quantization, and other domains where gradient computation is a computational bottleneck. A plausible implication is that kernel-informed zeroth-order techniques may generalize to a range of gradient-free optimization problems with high-dimensional parameterizations (Mi et al., 24 May 2025).