Fine-Tuning LLMs with Just Forward Passes
The paper addresses the growing computational challenge of fine-tuning large language models (LMs), which traditionally relies on backpropagation and its substantial memory demands. As models scale up, the memory required for backpropagation grows prohibitively, creating a bottleneck on all but the largest hardware. To tackle this, the authors propose a memory-efficient zeroth-order optimizer (MeZO), an adaptation of the classical zeroth-order stochastic gradient descent (ZO-SGD) algorithm that can fine-tune large LMs using roughly the same memory as inference.
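To make the ZO-SGD starting point concrete, below is a minimal PyTorch sketch of the classical two-point (SPSA-style) gradient estimate that such optimizers build on. The function name `zo_sgd_step`, the hyperparameter values, and the `loss_fn` callable are illustrative assumptions, not the authors' code; note that this naive version materializes the full perturbation `z`, an extra parameter-sized buffer that MeZO's in-place variant avoids.

```python
import torch

def zo_sgd_step(params, loss_fn, lr=1e-6, eps=1e-3):
    """One two-point zeroth-order (SPSA-style) SGD step.

    params  -- list of torch.Tensor parameters, updated in place
    loss_fn -- callable returning a scalar loss; no backward pass is ever run
    """
    # Random search direction z ~ N(0, I), one tensor per parameter tensor.
    z = [torch.randn_like(p) for p in params]

    with torch.no_grad():
        for p, zi in zip(params, z):          # theta + eps * z
            p.add_(eps * zi)
        loss_plus = float(loss_fn())

        for p, zi in zip(params, z):          # theta - eps * z
            p.sub_(2 * eps * zi)
        loss_minus = float(loss_fn())

        for p, zi in zip(params, z):          # restore theta
            p.add_(eps * zi)

        # Projected gradient estimate (L(+) - L(-)) / (2 * eps) along z.
        grad_scale = (loss_plus - loss_minus) / (2 * eps)
        for p, zi in zip(params, z):
            p.sub_(lr * grad_scale * zi)
```

The only signal used is two forward-pass loss values per step, which is what makes the approach compatible with inference-level memory budgets.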
Key Contributions
- Memory Efficiency: The MeZO algorithm operates entirely in place, matching the memory footprint of inference rather than the far larger footprint of backpropagation. This reduction allows much larger models to be fine-tuned on a single GPU than was previously feasible: MeZO can train a 30-billion-parameter LM on a single Nvidia A100 80GB GPU, whereas backpropagation with Adam tops out at roughly 2.7 billion parameters under the same constraints (a sketch of the in-place, seed-based update appears after this list).
- Empirical Validation: Comprehensive experiments show that MeZO performs comparably to standard backpropagation-based fine-tuning across a diverse set of model architectures, scales, and tasks, while achieving up to a 12x reduction in memory and, in some settings, up to a 2x reduction in GPU-hours. The experiments span multiple LM families, model sizes up to 66 billion parameters, and tasks ranging from classification to generation.
- Compatibility & Extensibility: MeZO is versatile, supporting both full-parameter tuning and parameter-efficient techniques such as LoRA (Low-Rank Adaptation) and prefix tuning. This compatibility indicates potential utility in resource-constrained environments or applications requiring parameter-efficient adaptations.
- Optimization for Non-Differentiable Objectives: The approach is notably effective in optimizing non-differentiable objectives like accuracy and F1 score, showcasing its broader applicability beyond tasks limited to differentiable loss functions.
- Theoretical Insights: The authors provide theoretical underpinnings for MeZO's efficiency, arguing that convergence is governed by the effective rank of the loss landscape rather than by the raw parameter count. Adequate pre-training and task prompts facilitate this low effective rank, running counter to the classical expectation that zeroth-order methods scale poorly as the number of parameters grows.
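The sketch below, referenced from the memory-efficiency item above, illustrates the in-place trick in the spirit of MeZO: rather than storing the perturbation, only a random seed is kept and the same Gaussian noise is regenerated whenever it is needed. The names, hyperparameters, and use of the global PyTorch RNG are simplifying assumptions, not the paper's exact implementation.

```python
import torch

def mezo_step(params, loss_fn, lr=1e-6, eps=1e-3):
    """One in-place zeroth-order step in the spirit of MeZO.

    Only a random seed is stored; the perturbation z is regenerated from it
    on demand, so peak memory stays close to the inference footprint.
    """
    seed = torch.randint(0, 2**31 - 1, (1,)).item()

    def perturb(scale):
        # Re-create the same z ~ N(0, I) from the seed instead of keeping it.
        torch.manual_seed(seed)
        for p in params:
            p.add_(scale * torch.randn_like(p))

    with torch.no_grad():
        perturb(+eps)                 # theta + eps * z
        loss_plus = float(loss_fn())
        perturb(-2 * eps)             # theta - eps * z
        loss_minus = float(loss_fn())
        perturb(+eps)                 # restore theta

        grad_scale = (loss_plus - loss_minus) / (2 * eps)

        # Final pass: regenerate z once more and take the SGD step along it.
        torch.manual_seed(seed)
        for p in params:
            p.sub_(lr * grad_scale * torch.randn_like(p))
```

Two of the contributions above fall out of this structure almost directly: because `loss_fn` only has to return a scalar, it could just as well return 1 - accuracy or a negative F1 score, which is roughly how non-differentiable objectives fit into the same loop; and restricting `params` to a LoRA or prefix-tuning subset yields the parameter-efficient variants.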
Implications and Future Directions
The introduction of MeZO opens several avenues for research and practical applications:
- Scalability in Resource-Constrained Settings: MeZO helps democratize access to fine-tuning of powerful models, enabling teams without vast computational resources to adapt state-of-the-art models effectively.
- Opportunities in Non-Differentiable Optimization: By enabling the optimization of non-differentiable objectives, MeZO offers new methodologies for tailoring models to align more closely with human preferences or operational criteria that resist easy formulation in differentiable terms.
- Model Interpretability and Pruning: As a gradient estimation technique, MeZO might provide novel insights into model interpretability or pruning strategies by highlighting responsive parameters directly through forward pass evaluations.
- Integration with Other Memory-Efficient Techniques: Future work could investigate the combination of MeZO with other advancements like FlashAttention and quantization to push the boundaries of memory efficiency further without sacrificing model fidelity.
Overall, this work positions MeZO as an efficient, memory-conscious alternative to traditional fine-tuning methods, bringing the adaptation of massive LMs within reach of a broader range of researchers and practitioners in artificial intelligence.