- The paper introduces TokenTune, a fine-tuning technique that computes gradients for only a subset of input tokens, sharply cutting memory usage during transformer fine-tuning.
- It details how restricting backpropagation to the selected tokens reduces the intermediate activations that must be retained, and how the method combines with approaches such as LoRA and QLoRA, cutting GPU memory use by up to 79% without compromising accuracy.
- Empirical tests on benchmarks such as GLUE with models like BERT and Llama2-7B demonstrate TokenTune’s scalability and effectiveness in resource-constrained environments.
The paper "Memory-Efficient Fine-Tuning of Transformers via Token Selection" introduces TokenTune, an innovative fine-tuning technique aimed at reducing memory consumption in transformer models. Transformer models are prominent for their effectiveness in processing language tasks; however, their fine-tuning can be demanding in memory, especially for models with billions of parameters. TokenTune addresses this challenge by selectively leveraging only a subset of input tokens during backpropagation, thereby minimizing memory usage involved in storing intermediate activations.
Methodology
TokenTune selects a subset of input tokens for which gradients are computed during backpropagation. Because the backward pass touches only these positions, far fewer intermediate activations need to be retained in memory. Whereas conventional fine-tuning caches activations for every token, TokenTune substantially lowers memory requirements without adversely affecting model quality. Notably, TokenTune can be combined with existing memory-reduction techniques such as Low-Rank Adaptation (LoRA), further improving memory efficiency.
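To make the mechanism concrete, the sketch below applies the token-selection idea to a position-wise feed-forward sublayer: a random subset of k token positions is run with autograd enabled, while the remaining positions are processed under `torch.no_grad()` so their activations are never cached. The function name, helper structure, and the choice of random selection are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

def tokentune_ffn_forward(ffn: nn.Module, hidden_states: torch.Tensor, k: int):
    """Sketch: run a position-wise feed-forward sublayer so that only k
    randomly chosen token positions keep activations for backpropagation."""
    _, seq_len, _ = hidden_states.shape
    sel = torch.randperm(seq_len, device=hidden_states.device)[:k]
    keep = torch.zeros(seq_len, dtype=torch.bool, device=hidden_states.device)
    keep[sel] = True

    # Non-selected positions: forward pass only, no autograd graph is built,
    # so their intermediate activations are never cached.
    with torch.no_grad():
        frozen_out = ffn(hidden_states[:, ~keep, :])

    # Selected positions: computed normally; only these activations are stored
    # and only they contribute gradients to the sublayer's weights.
    tuned_out = ffn(hidden_states[:, keep, :])

    # Reassemble the full sequence in its original order.
    out = hidden_states.new_empty(hidden_states.shape)
    out[:, ~keep, :] = frozen_out
    out[:, keep, :] = tuned_out
    return out, sel


# Example usage with an arbitrary two-layer MLP as the sublayer.
ffn = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
x = torch.randn(2, 128, 768, requires_grad=True)
y, sel = tokentune_ffn_forward(ffn, x, k=16)
y.sum().backward()  # gradients flow only through the 16 selected positions
```

With k much smaller than the sequence length, the cached-activation footprint of this sublayer shrinks roughly in proportion to k.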
The paper details how TokenTune is applied to the different sublayer types of a transformer block. The underlying premise is that gradients computed from a subset of token positions suffice to adapt the model, so activations need to be cached only for the selected tokens. By concentrating the backward pass on that subset, TokenTune achieves performance comparable to full fine-tuning with a fraction of the memory overhead.
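For the attention sublayer, where tokens interact, one natural way to realize the same idea is to compute queries only for the selected positions while keys and values still span the whole sequence, with the non-selected positions detached so they add no gradient paths. The sketch below is a simplified, single-head illustration under those assumptions; the name `tokentune_attention_forward` is ours, and the paper's exact treatment of attention may differ.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def tokentune_attention_forward(q_proj: nn.Linear, k_proj: nn.Linear,
                                v_proj: nn.Linear,
                                hidden_states: torch.Tensor,
                                sel: torch.Tensor) -> torch.Tensor:
    """Single-head self-attention in which only the selected query positions
    open gradient paths; keys/values span the full sequence so the forward
    computation is unchanged (simplified: no multi-head split, no masking)."""
    _, seq_len, dim = hidden_states.shape
    keep = torch.zeros(seq_len, dtype=torch.bool, device=hidden_states.device)
    keep[sel] = True

    # Detach non-selected positions: they still shape keys and values in the
    # forward pass, but gradients flow back only through the selected ones.
    mixed = torch.where(keep[None, :, None], hidden_states,
                        hidden_states.detach())

    q = q_proj(hidden_states[:, sel, :])   # queries for selected tokens only
    k = k_proj(mixed)                      # keys over the full sequence
    v = v_proj(mixed)                      # values over the full sequence

    scores = q @ k.transpose(-2, -1) / math.sqrt(dim)
    ctx = F.softmax(scores, dim=-1) @ v    # outputs for the selected tokens

    # A full implementation would also produce outputs for the remaining
    # positions under torch.no_grad() and scatter both parts back together,
    # as in the feed-forward sketch above.
    return ctx
```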
Empirical Analysis
The empirical evaluation covers multiple benchmarks and models, including BERT and Llama models ranging from hundreds of millions to billions of parameters. On text classification tasks from the GLUE benchmark and on question answering, TokenTune maintains accuracy comparable to full fine-tuning. The experiments further show that TokenTune reduces GPU memory usage by up to 79% when combined with QLoRA.
The analysis extends to larger models such as Llama2-7B, showing that TokenTune scales. Through instruction tuning and few-shot evaluations on datasets such as MMLU and ARC, the paper demonstrates that TokenTune remains effective for fine-tuning large language models. Combined with existing fine-tuning methods, it maintains or improves accuracy while substantially reducing memory load.
Technical Contributions
The TokenTune methodology contributes to the landscape of memory-efficient model adaptation in several ways:
- Novel Approach: TokenTune introduces token selection as a route to memory-efficient fine-tuning of transformers, reducing the need to cache intermediate activations for every token.
- Combinability: The approach complements techniques such as LoRA and QLoRA. Those methods mainly shrink the parameter, gradient, and optimizer-state footprint, whereas TokenTune shrinks the activation footprint, so combining them addresses several sources of transformer memory use at once (see the sketch after this list).
- Effectiveness: Across diverse tasks, TokenTune delivers substantial memory savings while achieving accuracy competitive with full fine-tuning.
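As an illustration of this combinability, the sketch below sets up LoRA adapters with the Hugging Face `peft` library for a BERT classifier; the model name and hyperparameters are illustrative, not taken from the paper. A TokenTune-style token split (as sketched in the Methodology section) would then be applied inside each layer's forward pass to add activation savings on top of LoRA's parameter savings.

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

# Base model and LoRA settings are illustrative, not taken from the paper.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

lora_cfg = LoraConfig(
    r=8,                                  # rank of the adapter matrices
    lora_alpha=16,
    target_modules=["query", "value"],    # BERT attention projections
    lora_dropout=0.05,
    task_type="SEQ_CLS",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()        # only adapter weights are trainable

# A TokenTune-style integration would additionally restrict backpropagation
# to a subset of token positions in each layer (see the earlier sketches),
# reducing cached activations on top of LoRA's reduction in trainable
# parameters and optimizer state.
```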
Implications and Future Directions
TokenTune has significant implications for developing and deploying LLMs where computational resources are limited. By lowering memory requirements, it makes building domain-specific models more accessible and scalable without requiring extensive hardware infrastructure.
Future work could focus on refining token-selection algorithms, exploring dynamic selection mechanisms, and extending the approach to other domains such as vision transformers. Further experiments on large-scale, real-world language applications could also surface additional optimizations, potentially unlocking new use cases and efficiencies in AI deployment.
In summary, TokenTune stands as a promising strategy for advancing the memory efficiency of transformer model fine-tuning, aligning with the trend toward more accessible and resource-conscious AI technologies.