
Scaling Sparse Fine-Tuning to Large Language Models (2401.16405v2)

Published 29 Jan 2024 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs are difficult to fully fine-tune (e.g., with instructions or human feedback) due to their sheer number of parameters. A family of parameter-efficient sparse fine-tuning methods have proven promising in terms of performance but their memory requirements increase proportionally to the size of the LLMs. In this work, we scale sparse fine-tuning to state-of-the-art LLMs like LLaMA 2 7B and 13B. We propose SpIEL, a novel sparse fine-tuning method which, for a desired density level, maintains an array of parameter indices and the deltas of these parameters relative to their pretrained values. It iterates over: (a) updating the active deltas, (b) pruning indices (based on the change of magnitude of their deltas) and (c) regrowth of indices. For regrowth, we explore two criteria based on either the accumulated gradients of a few candidate parameters or their approximate momenta estimated using the efficient SM3 optimizer. We experiment with instruction-tuning of LLMs on standard dataset mixtures, finding that SpIEL is often superior to popular parameter-efficient fine-tuning methods like LoRA (low-rank adaptation) in terms of performance and comparable in terms of run time. We additionally show that SpIEL is compatible with both quantization and efficient optimizers, to facilitate scaling to ever-larger model sizes. We release the code for SpIEL at https://github.com/AlanAnsell/peft and for the instruction-tuning experiments at https://github.com/ducdauge/sft-LLM.

Parameter-Efficient Sparse Fine-Tuning

The substantial parameter count of LLMs such as Falcon, LLaMA 2, and Mistral necessitates fine-tuning approaches that avoid updating the entirety of an LLM's parameters. Sparse Fine-Tuning (SFT) has been recognized for striking a balance between parameter economy and strong model performance. Its memory demands, however, grow in proportion to LLM size, which has limited its scalability. Addressing this constraint, this work scales SFT to state-of-the-art LLMs with SpIEL, a memory-efficient sparse fine-tuning method.
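Conceptually, a sparse fine-tuning adapter stores only a list of parameter indices and the corresponding deltas relative to the pretrained values (as described in the abstract). The minimal PyTorch sketch below, with illustrative names and shapes that are not taken from the paper's code, shows how such a sparse delta can be materialized into a dense weight:

```python
import torch

def apply_sparse_delta(weight: torch.Tensor,
                       indices: torch.Tensor,
                       deltas: torch.Tensor) -> torch.Tensor:
    """Return W' = W + delta, where delta is nonzero only at the flat
    positions in `indices`. Only `indices` and `deltas` need to be stored
    per fine-tuned task; the pretrained weight stays frozen."""
    tuned = weight.clone()
    tuned.view(-1)[indices] += deltas
    return tuned

# Illustrative usage: 1% density on a 4096x4096 projection matrix.
W = torch.randn(4096, 4096)
k = int(0.01 * W.numel())
idx = torch.randperm(W.numel())[:k]
delta = torch.zeros(k)          # would be learned during fine-tuning
W_tuned = apply_sparse_delta(W, idx, delta)
```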

Memory-Efficient SFT

SpIEL introduces an iterative paradigm for SFT that cycles through (a) updating the active parameter deltas, (b) pruning indices based on the change in magnitude of their deltas, and (c) regrowing indices. Regrowth is driven either by the accumulated gradients of candidate parameters or by their approximate momenta estimated with the memory-efficient SM3 optimizer, which distinguishes the procedure from pruning schemes designed for dense pretraining. From an operational standpoint the model remains dense throughout, side-stepping the sparse tensor operations that current hardware handles inefficiently.
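To make the three-step cycle concrete, here is a minimal, hypothetical PyTorch sketch of one iteration. It uses a plain SGD update and a simplified magnitude-based pruning rule rather than the paper's exact change-of-magnitude criterion, and the gradient-based regrowth score stands in for either of the paper's options (accumulated gradients or SM3-estimated momenta):

```python
import torch

def spiel_style_step(indices, deltas, grad, accumulated_grad,
                     lr=1e-4, prune_frac=0.05):
    """One illustrative iteration of the update/prune/regrow cycle.
    `indices`/`deltas` hold the active sparse entries (flat positions into a
    weight matrix), `grad` is the dense gradient of that weight, and
    `accumulated_grad` holds gradient statistics used as a regrowth score."""
    # (a) update the active deltas with a plain SGD step
    deltas = deltas - lr * grad.view(-1)[indices]

    # (b) prune: drop the entries with the smallest-magnitude deltas
    #     (a simplification of the paper's change-of-magnitude criterion)
    n_prune = int(prune_frac * indices.numel())
    keep = torch.argsort(deltas.abs(), descending=True)[:indices.numel() - n_prune]
    indices, deltas = indices[keep], deltas[keep]

    # (c) regrow: activate the inactive positions with the largest
    #     accumulated gradient (the alternative criterion uses SM3 momenta)
    scores = accumulated_grad.view(-1).clone()
    scores[indices] = float("-inf")            # exclude already-active entries
    new_idx = torch.topk(scores, n_prune).indices
    indices = torch.cat([indices, new_idx])
    deltas = torch.cat([deltas, torch.zeros(n_prune)])  # new deltas start at zero
    return indices, deltas
```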

Experimental Validation

The method is tested via instruction-tuning on standard dataset mixtures. Experiments show that SpIEL often outperforms established parameter-efficient fine-tuning methods such as Low-Rank Adaptation (LoRA) while remaining comparable in run time. Compatibility with quantization and efficient optimizers is also demonstrated, extending SFT to LLM sizes previously deemed impractical due to memory constraints.
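For intuition on trainable-parameter budgets, here is a back-of-the-envelope comparison using illustrative shapes and settings, not the paper's configurations: LoRA with rank r adds r·(d_in + d_out) parameters per weight matrix, while sparse fine-tuning at density ρ adds ρ·d_in·d_out deltas.

```python
# Rough parameter-budget comparison for a single d_out x d_in weight matrix.
# Shapes and settings are illustrative, not taken from the paper.
d_in, d_out = 4096, 4096
lora_rank = 16
density = 0.001  # 0.1% of entries fine-tuned

lora_params = lora_rank * (d_in + d_out)   # A (d_out x r) + B (r x d_in)
sft_params = int(density * d_in * d_out)   # one delta per active index

print(f"LoRA (r={lora_rank}): {lora_params:,} trainable parameters")
print(f"SFT  (density={density:.1%}): {sft_params:,} trainable deltas")
# LoRA (r=16): 131,072 trainable parameters
# SFT  (density=0.1%): 16,777 trainable deltas
```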

Quantization and Efficiency Results

Combining quantization with SFT, denoted "qSFT", underscores the method's adaptability to severely memory-constrained environments: qSFT remains competitive with 4-bit quantized LLMs fine-tuned using other parameter-efficient techniques. The approach also combines well with activation checkpointing, offering guidance on which techniques to prioritize when optimizing the memory footprint of LLM fine-tuning.
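A conceptual sketch of how quantization and sparse deltas can coexist, assuming (this is an assumption, not the paper's implementation) that the 4-bit base weights stay frozen and the sparse deltas are stored in higher precision and applied after dequantization:

```python
import torch

def qsft_forward_weight(q_weight, scale, zero_point, indices, deltas):
    """Conceptual sketch (not the paper's code): dequantize the frozen
    low-bit base weight with a simple affine scheme, then add the
    higher-precision sparse deltas at their active indices."""
    w = (q_weight.float() - zero_point) * scale   # dequantize frozen base
    w.view(-1)[indices] += deltas                 # apply sparse learned deltas
    return w
```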

This research positions SFT as a leading strategy for both parameter- and memory-efficient LLM adaptation. Further work could refine the regrowth criteria and extend SFT to all model parameters, including the embedding layers, continuing the field's ongoing refinement of LLM fine-tuning methodologies.

Authors (5)
  1. Alan Ansell (7 papers)
  2. Ivan Vulić (130 papers)
  3. Hannah Sterz (5 papers)
  4. Anna Korhonen (90 papers)
  5. Edoardo M. Ponti (24 papers)
Citations (8)