
Zero-Finetuning Framework Overview

Updated 14 August 2025
  • Zero-Finetuning Framework is a family of methods that leverages gradient-free optimization and adapter-based tuning to adapt large models without traditional backpropagation.
  • It employs techniques like zeroth-order optimization, adapter modules, and quantization to significantly enhance memory efficiency and scalability on resource-constrained hardware.
  • Practical applications span vision-language tasks, language modeling, and federated learning, demonstrating robust generalization and state-of-the-art performance.

The Zero-Finetuning Framework refers to a family of methodologies that enable practical and efficient adaptation of large-scale neural models, including vision-language models and LLMs, to downstream tasks without reliance on traditional first-order backpropagation. These approaches employ gradient-free or “zeroth-order” optimization, adapter-based parameter-efficient fine-tuning, ensemble strategies, and quantization, among others, to achieve low memory overhead, robust generalization, and scalability, even for extremely large models or on resource-constrained hardware. Recent advances demonstrate theoretical and empirical convergence, superior memory efficiency, and state-of-the-art task performance in vision, language, and multi-modal domains.

1. Principles of Zeroth-Order and Zero-Finetuning Optimization

Zeroth-order optimization (ZO) replaces backpropagation with forward-pass-based gradient estimation. Formally, ZO algorithms estimate directional derivatives by perturbing the model parameters $\theta$:

$$\hat{\nabla}_\theta \mathcal{L}(\theta; \mathcal{B}) = \frac{\mathcal{L}(\theta + \epsilon z; \mathcal{B}) - \mathcal{L}(\theta - \epsilon z; \mathcal{B})}{2\epsilon} \cdot z$$

where $z$ is typically sampled from a Gaussian or Rademacher distribution and $\epsilon$ is a small perturbation magnitude (Shang et al., 19 May 2025). Unlike first-order methods, ZO enables fine-tuning with inference-level memory usage and avoids costly storage of intermediate activations and gradients, making it suitable for exceedingly large models and edge devices (Wang et al., 16 Mar 2025, Guo et al., 5 Jun 2024).
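
As an illustration, the two-point estimator above can be implemented with only forward passes, regenerating the perturbation $z$ from a fixed random seed so it never needs to be stored. The following is a minimal sketch in the spirit of MeZO-style methods; the model, loss_fn, and batch arguments are hypothetical placeholders:

```python
import torch

@torch.no_grad()
def zo_step(model, loss_fn, batch, eps=1e-3, lr=1e-6, seed=0):
    """One two-point zeroth-order update using only forward passes."""
    params = [p for p in model.parameters() if p.requires_grad]

    def perturb(scale):
        # Regenerate the same Gaussian z from the fixed seed instead of storing it.
        gen = torch.Generator().manual_seed(seed)
        for p in params:
            z = torch.randn(p.shape, generator=gen).to(p.device)
            p.add_(scale * eps * z)

    perturb(+1.0)                           # evaluate at theta + eps * z
    loss_plus = loss_fn(model, batch)
    perturb(-2.0)                           # evaluate at theta - eps * z
    loss_minus = loss_fn(model, batch)
    perturb(+1.0)                           # restore theta

    grad_proj = (loss_plus - loss_minus) / (2.0 * eps)   # scalar directional derivative

    gen = torch.Generator().manual_seed(seed)
    for p in params:
        z = torch.randn(p.shape, generator=gen).to(p.device)
        p.add_(-lr * grad_proj * z)         # theta <- theta - lr * grad_proj * z
    return float(loss_plus)
```

Because no computational graph or optimizer state is kept, the memory footprint of such a step matches that of inference.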

Zero-finetuning frameworks further integrate adapter modules, contrastive objectives, sparsity masks, and quantization techniques to maintain task adaptation capabilities with minimal parameter updates and computational overhead (Kim et al., 11 Aug 2024, Zhou et al., 17 Feb 2025).

2. Parameter Efficiency and Adapter Design

Adapter-based zero-finetuning frameworks insert lightweight modules (“R-Adapter”) into each transformer layer of a pretrained model, typically as

$$h(X) = XW_{\operatorname{adp}} + X = X(W_{\operatorname{adp}} + I),$$

where $W_{\operatorname{adp}} \in \mathbb{R}^{d \times d}$; low-rank factorizations $W_{\operatorname{adp}} = BA$ with rank $r \ll d$ reduce the parameter count (Kim et al., 11 Aug 2024). At inference, re-parameterization allows merging the adapter into the original layer ($W_{\text{rep}} = W_{\text{org}} (W_{\operatorname{adp}} + I)$), incurring no additional runtime cost. Such parameter-efficient approaches substantially reduce overfitting risk and storage, tuning only a small fraction (ca. 13%) of the model (Kim et al., 11 Aug 2024).
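
The adapter computation and its re-parameterization can be verified directly in matrix form. The following sketch uses illustrative dimensions and initializations rather than the reference R-Adapter implementation:

```python
import torch

d, r = 16, 4
W_org = torch.randn(d, d)            # frozen pretrained weight (row convention: y = x @ W_org)
B = torch.randn(d, r) * 0.01         # low-rank factors, W_adp = B @ A
A = torch.zeros(r, d)                # zero-init so the adapter starts as the identity map

def adapted_forward(x):
    y = x @ W_org                    # frozen base layer
    return y @ (B @ A) + y           # h(Y) = Y W_adp + Y = Y (W_adp + I)

# Re-parameterization for inference: W_rep = W_org (W_adp + I)
W_rep = W_org @ (B @ A + torch.eye(d))

x = torch.randn(2, d)
assert torch.allclose(adapted_forward(x), x @ W_rep, atol=1e-5)
```

The merged weight W_rep replaces the original layer at deployment, so the adapter adds no inference-time latency.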

3. Memory and Computational Efficiency via ZO and Quantization

Memory optimization is achieved by quantizing weights (e.g., bfloat16 → int4), splitting the network, and removing the need for optimizer states (Shang et al., 19 May 2025, Guo et al., 5 Jun 2024). Direct ZO on quantized weights is infeasible due to the precision gap; Quantized Zeroth-order Optimization (QZO) resolves this by perturbing the quantization scale $\Delta$ instead of the weights:

$$\hat{\nabla}_\Delta \mathcal{L}(\Delta \odot \overline{\theta}; \mathcal{B}) = \frac{\mathcal{L}((\Delta + \epsilon z) \odot \overline{\theta}; \mathcal{B}) - \mathcal{L}((\Delta - \epsilon z) \odot \overline{\theta}; \mathcal{B})}{2\epsilon} \cdot z$$

Directional derivative clipping is used to stabilize updates. Empirical studies demonstrate reductions in total memory cost by more than $18\times$ for 4-bit LLMs (Shang et al., 19 May 2025), enabling adaptation of Llama-2-13B and Stable Diffusion 3.5 Large within a single 24GB GPU.
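
A hedged sketch of such a scale-space update is given below: the frozen integer codes $\overline{\theta}$ stay fixed, only the continuous scales $\Delta$ are perturbed and updated, and the directional derivative is clipped. The loss function, clipping threshold, and toy shapes are stand-ins:

```python
import torch

def qzo_step(theta_bar, delta, loss_fn, eps=1e-3, lr=1e-5, clip=1.0, seed=0):
    """One QZO-style update on the quantization scales only."""
    gen = torch.Generator().manual_seed(seed)
    z = torch.randn(delta.shape, generator=gen)

    def loss_at(scale):
        weights = scale * theta_bar              # dequantize: Delta ⊙ theta_bar
        return loss_fn(weights)

    loss_plus = loss_at(delta + eps * z)
    loss_minus = loss_at(delta - eps * z)

    dd = (loss_plus - loss_minus) / (2.0 * eps)  # scalar directional derivative
    dd = torch.clamp(dd, -clip, clip)            # clipping stabilizes the update

    return delta - lr * dd * z, loss_plus        # only the scales move

# Toy usage: 4-bit codes in [-8, 7] with one scale per output channel.
theta_bar = torch.randint(-8, 8, (32, 16)).float()
delta = torch.full((32, 1), 0.05)
stand_in_loss = lambda w: (w.sum() ** 2) * 1e-6
delta, loss = qzo_step(theta_bar, delta, stand_in_loss)
```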

Sparsity is leveraged by identifying “sensitive parameters”—typically the top 0.1% by empirical Fisher information or gradient squared magnitude—and quantizing the remainder (Guo et al., 5 Jun 2024). This achieves superior wall-clock speedup and performance compared to full ZO fine-tuning.
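
A possible selection rule, using squared gradient magnitude as a proxy for the empirical Fisher information, is sketched below; the 0.1% ratio follows the description above, and the gradient tensor is a stand-in:

```python
import torch

def sensitivity_mask(grad: torch.Tensor, ratio: float = 0.001) -> torch.Tensor:
    """Boolean mask marking the top `ratio` fraction of entries by squared gradient."""
    scores = grad.pow(2).flatten()
    k = max(1, int(ratio * scores.numel()))
    threshold = scores.topk(k).values.min()    # k-th largest score
    return grad.pow(2) >= threshold

grad = torch.randn(4096, 4096)    # stand-in for an accumulated gradient estimate
mask = sensitivity_mask(grad)     # ~0.1% True: kept in full precision and tuned
# Entries where the mask is False would be quantized and frozen.
```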

4. Ensemble Methods and Robustness Strategies

Ensuring model robustness and generalization, ensemble strategies are embedded within zero-finetuning adapters (Kim et al., 11 Aug 2024). Three self-ensemble techniques are prominent:

  • Dynamic Ensemble via Adapter Dropping: Each adapter module is stochastically dropped during training via Bernoulli masking, balancing pretrained and fine-tuned features.
  • Temporal Ensemble via Accumulation: Adapter weights maintain an EMA (exponential moving average), implicitly averaging over parameter histories.
  • Weight-space Ensemble via Re-parameterization: Evaluation uses a convex combination of adapter and pretrained weights, interpolating between zero-shot and full fine-tuned states.

These strategies promote OOD generalization (e.g., boosting ImageNet-A and -R robustness by $\approx 1.5$ points) (Kim et al., 11 Aug 2024), with the ensemble effect obtained in-place without storage of multiple model copies.
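
The three mechanisms can be summarized on a single adapter weight matrix as follows; the drop rate, EMA momentum, and interpolation coefficient are illustrative defaults rather than the reference hyperparameters:

```python
import torch

d = 64
w_adp = torch.zeros(d, d)     # adapter weight being trained
w_ema = torch.zeros(d, d)     # temporal ensemble: EMA copy of the adapter
w_org = torch.randn(d, d)     # frozen pretrained weight

def adapter_forward(x, p_drop=0.1, training=True):
    # Dynamic ensemble: stochastically drop the adapter, keeping the residual path.
    if training and torch.rand(()) < p_drop:
        return x
    return x @ w_adp + x

def update_ema(momentum=0.999):
    # Temporal ensemble: exponential moving average over the adapter's history.
    w_ema.mul_(momentum).add_((1.0 - momentum) * w_adp)

def eval_weight(alpha=0.5):
    # Weight-space ensemble: the convex combination
    #   alpha * W_org (W_ema + I) + (1 - alpha) * W_org = W_org (alpha * W_ema + I)
    # interpolates between zero-shot (alpha = 0) and fine-tuned (alpha = 1) behavior.
    return w_org @ (alpha * w_ema + torch.eye(d))
```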

5. Advanced ZO Algorithms and Convergence

Recent studies have extended ZO with curvature-aware updates, low-rank estimators, and distributed parallel computation:

  • Hessian-informed ZO (HiZOO): Diagonal Hessian estimates via Taylor expansion scale updates per parameter, adapting step size to local sharpness (Zhao et al., 23 Feb 2024). Theoretical guarantees match classical stochastic optimization in convergence.
  • Low-Rank ZO (LOZO/LOZO-M): Perturbations with low-rank structure ($U_l V_l^\top$) capture the natural gradient subspaces of LLMs. Momentum terms are projected to the current subspace, incurring negligible memory overhead (Chen et al., 10 Oct 2024). Convergence bounds decrease as $O(T^{-1/2})$ with appropriate hyperparameters.
  • Fast ZO (FZOO): Batched one-sided gradient estimates with Rademacher perturbations, combined with adaptive normalized-SGD steps, achieve Adam-scale convergence speed while maintaining inference-level memory (Dang et al., 10 Jun 2025). Empirically, FZOO requires $3\times$ fewer forward passes and achieves a $+3\%$ accuracy improvement over MeZO.
  • Distributed ZO (DistZO2): Combines perturbation parallelism (PertP), ZO-adapted distributed data parallelism (DDP), and hardware-aware communication via NVLink slicing to enable fine-tuning of 175B-parameter models at $3\times$ higher throughput (Wang et al., 3 Jul 2025).
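
As one concrete illustration of the low-rank idea, a LOZO-style step can probe a weight matrix along a rank-$r$ direction $UV^\top$, so only $(m+n)r$ random numbers are drawn instead of $mn$. The rank, step sizes, and objective below are placeholders, and details such as lazy subspace resampling and momentum projection are omitted:

```python
import torch

def lozo_step(W, loss_fn, r=4, eps=1e-3, lr=1e-6, seed=0):
    """One zeroth-order step with a low-rank probe direction Z = U @ V.T."""
    m, n = W.shape
    gen = torch.Generator().manual_seed(seed)
    U = torch.randn(m, r, generator=gen)
    V = torch.randn(n, r, generator=gen)
    Z = U @ V.T                                   # rank-r perturbation direction

    loss_plus = loss_fn(W + eps * Z)
    loss_minus = loss_fn(W - eps * Z)
    grad_proj = (loss_plus - loss_minus) / (2.0 * eps)

    return W - lr * grad_proj * Z                 # update stays in the rank-r subspace

W = torch.randn(128, 64)
W = lozo_step(W, lambda w: (w ** 2).sum())        # stand-in objective
```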

6. Practical Applications and Benchmarks

Zero-finetuning frameworks have been validated on diverse tasks across vision-language, language-modeling, and federated-learning settings.

Adapter-based frameworks, quantized ZO, and distributed ZO have demonstrated competitive or superior accuracy compared to full first-order fine-tuning, with drastic reductions in memory and computation, robust generalization to OOD distributions, and practical feasibility on limited-resource hardware.

7. Limitations and Future Directions

ZO generally exhibits slower convergence than first-order methods, a gap addressed through low-rank estimation, adaptive batching, and exploitation of natural gradient structure (Chen et al., 10 Oct 2024, Dang et al., 10 Jun 2025). Extreme sparsity and quantization demand careful hardware-aware implementation, especially for efficient sparse matrix operations (Guo et al., 5 Jun 2024, Shang et al., 19 May 2025). Integration of zero-finetuning with adaptive optimizers, structured low-bit communication, and further theoretical analysis of multitask and non-differentiable objectives represent key research frontiers (Dang et al., 10 Jun 2025, Shirkavand et al., 5 Feb 2025).

Zero-finetuning stands as a scalable, memory-efficient paradigm for robust model adaptation—covering vision, language, multimodal, and edge deployment scenarios—with a growing suite of rigorously analyzed algorithms and practical open-source implementations.