LLaMA-Adapter: A Lightweight Approach to Fine-tuning LLaMA for Instruction-following Tasks
Introduction
The advent of instruction-following models like ChatGPT and Alpaca has highlighted the impressive generative capabilities of LLMs when tailored to understand and respond to natural-language commands. However, adapting these models to specific tasks has traditionally been resource-intensive, both in computation and in time. Addressing this challenge, we introduce LLaMA-Adapter, a method that efficiently turns LLaMA into a capable instruction-following model with only a small number of additional parameters and a short training time.
Efficient Fine-tuning Strategy
At the heart of LLaMA-Adapter is a novel approach that builds on the frozen LLaMA 7B model: learnable adaption prompts are prepended to the word tokens at the higher transformer layers. A distinctive feature of the method is its zero-initialized attention mechanism with a learnable zero gating factor, which adaptively injects new instructional cues without overwhelming the model's pre-trained knowledge. This design preserves the original abilities of LLaMA while allowing new instruction-following capabilities to be integrated seamlessly.
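To make this concrete, here is a minimal PyTorch sketch of zero-initialized attention with a learnable zero gate, following the description above. The module name, the shapes, and hyperparameters such as prompt_len are illustrative assumptions rather than the reference implementation, and the causal mask used in LLaMA is omitted for brevity.

```python
import torch
import torch.nn as nn


class ZeroGatedPromptAttention(nn.Module):
    """Sketch of attention over word tokens plus a gated, learnable adaption prompt."""

    def __init__(self, dim: int, n_heads: int, prompt_len: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wk = nn.Linear(dim, dim, bias=False)
        self.wv = nn.Linear(dim, dim, bias=False)
        self.wo = nn.Linear(dim, dim, bias=False)
        # Learnable adaption prompt, injected only at this (higher) layer.
        self.prompt = nn.Parameter(torch.randn(1, prompt_len, dim) * 0.02)
        # Zero-initialized gate: the prompt branch contributes nothing at step 0.
        self.gate = nn.Parameter(torch.zeros(1, n_heads, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        p = self.prompt.expand(b, -1, -1)  # (b, prompt_len, dim)

        def heads(z: torch.Tensor) -> torch.Tensor:
            # (b, n, dim) -> (b, n_heads, n, head_dim)
            return z.view(b, -1, self.n_heads, self.head_dim).transpose(1, 2)

        q = heads(self.wq(x))
        k_x, v_x = heads(self.wk(x)), heads(self.wv(x))  # word tokens
        k_p, v_p = heads(self.wk(p)), heads(self.wv(p))  # adaption prompt

        scale = self.head_dim ** -0.5
        scores_x = (q @ k_x.transpose(-2, -1)) * scale   # (b, h, t, t)
        scores_p = (q @ k_p.transpose(-2, -1)) * scale   # (b, h, t, prompt_len)

        # The two branches are softmaxed independently; the prompt branch is
        # scaled by a gate that starts at zero, so the pre-trained attention
        # pattern is reproduced exactly at the beginning of training.
        out = scores_x.softmax(dim=-1) @ v_x \
            + torch.tanh(self.gate) * (scores_p.softmax(dim=-1) @ v_p)
        return self.wo(out.transpose(1, 2).reshape(b, t, d))
```

Because the gate is initialized to zero, the layer behaves exactly like the frozen model at the start of training; the prompt's influence grows only as the gate is learned, which is what protects the pre-trained knowledge.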
Key highlights include:
- 1.2M Parameters: The approach reduces the learnable components to roughly 1.2M parameters, in stark contrast to the full 7B-parameter update required by fully fine-tuned models such as Alpaca (see the parameter-freezing sketch after this list).
- One-hour Fine-tuning: Thanks to the small number of trainable parameters, fine-tuning with LLaMA-Adapter completes in under one hour on 8 A100 GPUs.
- Adaptable Expertise: Different domain-specific skills can be captured by distinct adapters, so switching expertise only requires swapping a small set of adapter weights rather than storing multiple full-model copies.
- Multi-modal Instruction Capability: Extending beyond text, LLaMA-Adapter adeptly handles image-based instructions, paving the way for multi-modal reasoning and applications.
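The sketch below illustrates the parameter-efficiency claim from the first bullet: the 7B backbone stays frozen and only the adapter tensors are trained. The name filter ("prompt"/"gate") is an assumed naming convention matching the attention sketch above, not the official codebase.

```python
import torch.nn as nn


def freeze_backbone_except_adapters(model: nn.Module) -> int:
    """Freeze every parameter except adapter prompts/gates; return the trainable count."""
    trainable = 0
    for name, param in model.named_parameters():
        # Assumed convention: adapter tensors have "prompt" or "gate" in their names.
        is_adapter = ("prompt" in name) or ("gate" in name)
        param.requires_grad = is_adapter
        if is_adapter:
            trainable += param.numel()
    return trainable


# Usage (assuming `llama` is a LLaMA 7B module with ZeroGatedPromptAttention
# layers inserted at its top blocks):
#   n = freeze_backbone_except_adapters(llama)
#   print(f"trainable parameters: {n / 1e6:.1f}M")  # on the order of ~1.2M
```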
Generalization Across Tasks
Beyond language understanding and generation, LLaMA-Adapter's zero-initialized attention also serves as a general fine-tuning strategy for other pre-trained models. Our experiments fine-tune Vision Transformers (ViTs) and RoBERTa on downstream vision and language tasks, showing strong performance and generalization across domains.
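As a hedged illustration of this generality, the same zero-gated prompt module sketched earlier can be dropped into a standard (non-causal) encoder block, standing in for a higher ViT or RoBERTa layer; the block layout below is illustrative rather than the paper's code.

```python
import torch
import torch.nn as nn


class AdapterEncoderBlock(nn.Module):
    """Pre-norm encoder block using the zero-gated prompt attention sketched above."""

    def __init__(self, dim: int = 768, n_heads: int = 12, prompt_len: int = 10):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Reuses the ZeroGatedPromptAttention module from the earlier sketch.
        self.attn = ZeroGatedPromptAttention(dim, n_heads, prompt_len)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Standard residual layout; only the prompt and gate inside `self.attn`
        # would be trained, with every other weight frozen.
        x = x + self.attn(self.norm1(x))
        return x + self.mlp(self.norm2(x))


# e.g. ViT patch tokens: AdapterEncoderBlock()(torch.randn(2, 197, 768))
```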
Evaluation and Results
LLaMA-Adapter has been evaluated on benchmark datasets including ScienceQA, COCO Caption, SQuAD, and VTAB-1k, covering instruction following, multi-modal reasoning, and traditional vision and language tasks. Comparisons with Alpaca and other fine-tuning methods confirm LLaMA-Adapter's efficiency, accuracy, and versatility.
Discussion and Future Directions
LLaMA-Adapter represents a significant step toward the efficient adaptation of LLMs for specialized tasks and multi-modal instructions. With its small parameter footprint and short adaptation time, it offers a promising avenue for deploying sophisticated AI capabilities on constrained hardware or in scenarios that require rapid model updates. Looking ahead, we envision extending this framework to a broader range of modalities, such as audio, video, and 3D data, toward a general recipe for versatile, instruction-following AI models.
Conclusion
In conclusion, LLaMA-Adapter offers an efficient and effective approach to fine-tuning LLMs for instruction-following tasks. It balances efficiency with performance and opens new possibilities for applying LLMs across varied domains, including those that require quick model adaptation and multi-modal reasoning. This work demonstrates the feasibility of building more adaptable and responsive AI systems and lays the groundwork for future innovations in the field.