LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models (2309.12307v3)

Published 21 Sep 2023 in cs.CL, cs.AI, and cs.LG

Abstract: We present LongLoRA, an efficient fine-tuning approach that extends the context sizes of pre-trained LLMs with limited computation cost. Typically, training LLMs with long context sizes is computationally expensive, requiring extensive training hours and GPU resources. For example, training on a context length of 8192 requires 16x the computational cost in self-attention layers compared to a context length of 2048. In this paper, we speed up the context extension of LLMs in two aspects. On the one hand, although dense global attention is needed during inference, fine-tuning the model can be done effectively and efficiently with sparse local attention. The proposed shifted sparse attention (S²-Attn) effectively enables context extension, leading to non-trivial computation savings with performance similar to fine-tuning with vanilla attention. In particular, it can be implemented with only two lines of code in training, and it is optional at inference. On the other hand, we revisit the parameter-efficient fine-tuning regime for context expansion. Notably, we find that LoRA for context extension works well under the premise of trainable embedding and normalization. LongLoRA combines this improved LoRA with S²-Attn. LongLoRA demonstrates strong empirical results on various tasks with Llama2 models from 7B/13B to 70B. LongLoRA extends Llama2 7B from 4k context to 100k, or Llama2 70B to 32k, on a single 8x A100 machine. LongLoRA extends models' context while retaining their original architectures, and it is compatible with most existing techniques, like Flash-Attention2. In addition, we further conduct supervised fine-tuning with LongLoRA and our long instruction-following LongAlpaca dataset.

Overview of LongLoRA: Efficient Fine-tuning of Long-Context LLMs

The paper presents LongLoRA, a novel approach to efficiently fine-tune LLMs for extended context lengths while minimizing computational overhead. This method addresses the prohibitive computational costs traditionally associated with training LLMs on long-context sequences, such as those required for processing extensive documents or handling complex queries.

Contributions and Techniques

LongLoRA introduces several innovations to achieve efficient and effective fine-tuning:

  1. Shifted Sparse Attention (S²-Attn): To reduce the computational burden during fine-tuning, LongLoRA splits the context into several groups and computes attention within each group. Half of the attention heads use a grouping shifted by half the group size, which allows information to flow between adjacent groups. This pattern approximates the effect of full attention at a fraction of the cost, and it can be implemented with only two lines of code during training while remaining optional at inference (see the sketch after this list).
  2. Improved Parameter-Efficient Fine-Tuning: The authors extend the LoRA (Low-Rank Adaptation) framework for long-context fine-tuning by additionally making the embedding and normalization layers trainable. This adaptation, referred to as LoRA+ in the paper, proves crucial for effective long-context adaptation: it significantly narrows the performance gap between LoRA and full fine-tuning while adding only a small number of trainable parameters (see the second sketch after this list).
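
The shifting mechanism can be illustrated with a short PyTorch-style sketch. This is a minimal illustration of the grouped-and-shifted attention pattern described in item 1, not the authors' exact implementation; the tensor layout, the use of scaled_dot_product_attention, and the simplified causal masking are assumptions made for clarity.

```python
import torch
import torch.nn.functional as F

def s2_attn_sketch(q, k, v, group_size):
    """Illustrative shifted sparse attention (S²-Attn style).

    q, k, v: (batch, seq_len, num_heads, head_dim); seq_len must be divisible
    by group_size. The first half of the heads attends within contiguous
    groups of tokens; the second half attends within groups shifted by half
    the group size, so information can flow between neighbouring groups.
    """
    B, N, H, D = q.shape
    G = group_size

    def grouped_attention(q, k, v):
        # Fold each run of G consecutive tokens into the batch dimension so
        # that attention is computed independently inside every group.
        q, k, v = (x.reshape(B * N // G, G, -1, D).transpose(1, 2) for x in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return out.transpose(1, 2).reshape(B, N, -1, D)

    # First half of the heads: attention within plain, contiguous groups.
    out_plain = grouped_attention(q[:, :, : H // 2], k[:, :, : H // 2], v[:, :, : H // 2])

    # Second half: shift the tokens by half a group before grouping and shift
    # the result back afterwards, so each shifted group straddles two adjacent
    # plain groups. (A faithful implementation would also adjust the causal
    # mask for the wrapped-around first group; omitted here for brevity.)
    def shift(x):
        return x[:, :, H // 2 :].roll(-G // 2, dims=1)

    out_shift = grouped_attention(shift(q), shift(k), shift(v)).roll(G // 2, dims=1)

    # Concatenate the two halves of the heads back together.
    return torch.cat([out_plain, out_shift], dim=2)
```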
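
The second ingredient, making the embedding and normalization layers trainable alongside the low-rank adapters, can be approximated in plain PyTorch as sketched below. The parameter-name matching ("lora_", "embed", "norm") assumes Llama-style module names and a LoRA wrapper that tags its adapter weights accordingly; it is not the paper's exact code.

```python
import torch.nn as nn

def mark_longlora_trainable(model: nn.Module) -> None:
    """Freeze the base model, then unfreeze LoRA adapter weights plus the
    embedding and normalization parameters, mirroring the improved-LoRA
    recipe described in item 2."""
    trainable_keywords = ("lora_", "embed", "norm")
    for name, param in model.named_parameters():
        param.requires_grad = any(key in name for key in trainable_keywords)

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable parameters: {trainable:,} / {total:,} "
          f"({100.0 * trainable / total:.2f}%)")
```

Although the embedding and normalization layers contribute only a small fraction of additional trainable parameters, the paper reports that unfreezing them is what closes most of the gap between LoRA and full fine-tuning at long context lengths.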

Empirical Evaluation

The paper provides extensive empirical evaluations demonstrating the efficacy of LongLoRA. Key results include:

  • Context Extension: LongLoRA successfully extends the context window of Llama2 7B from 4k to 100k tokens and Llama2 70B to 32k tokens using only a single 8× A100 machine. The models retain the original architectures and support optimizations such as Flash-Attention2, making them highly compatible with existing techniques.
  • Performance Metrics: Evaluation on datasets like PG19 and proof-pile shows that models fine-tuned with LongLoRA achieve perplexity values comparable to fully fine-tuned models. For instance, a Llama2 7B model fine-tuned to 32k context length achieves a perplexity of 2.50 on proof-pile, closely matching full attention fine-tuned models.
  • Efficiency: LongLoRA fine-tuning of Llama2 7B to 100k context length demonstrates up to 1.8× lower memory cost and reduced training hours compared to conventional full fine-tuning approaches.

Implications and Future Directions

LongLoRA represents a significant advancement in the domain of efficient fine-tuning for LLMs. Its ability to handle much longer context lengths with reduced computational resources opens doors for various practical applications. These include summarizing extensive documents, handling long-form question answering, and other tasks requiring substantial context comprehension.

Theoretically, the introduction of S²-Attn and the enhancements to the LoRA framework suggest promising avenues for further research into efficient attention mechanisms and parameter-efficient training strategies. Future work could explore the application of LongLoRA to other LLM architectures and position encoding schemes, further broadening its utility and impact.

Conclusion

The LongLoRA method offers a pragmatic solution to the challenge of extending the context lengths of LLMs while balancing computational efficiency and performance. The combination of S²-Attn for efficient attention and the improved LoRA+ framework exemplifies a thoughtful approach to addressing the limitations of conventional fine-tuning methods. This work lays a solid foundation for future research aimed at optimizing LLMs for long-context applications, ensuring scalability and accessibility for broader research communities.

Authors (7)
  1. Yukang Chen
  2. Shengju Qian
  3. Haotian Tang
  4. Xin Lai
  5. Zhijian Liu
  6. Song Han
  7. Jiaya Jia
Citations (122)