LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models (2309.12307v3)

Published 21 Sep 2023 in cs.CL, cs.AI, and cs.LG

Abstract: We present LongLoRA, an efficient fine-tuning approach that extends the context sizes of pre-trained LLMs with limited computation cost. Typically, training LLMs with long context sizes is computationally expensive, requiring extensive training hours and GPU resources. For example, training on a context length of 8192 requires 16x the computational cost in self-attention layers compared to a context length of 2048. In this paper, we speed up the context extension of LLMs in two aspects. On the one hand, although dense global attention is needed during inference, fine-tuning the model can be done effectively and efficiently with sparse local attention. The proposed shifted sparse attention (S²-Attn) effectively enables context extension, leading to non-trivial computation savings with performance similar to fine-tuning with vanilla attention. In particular, it can be implemented with only two lines of code in training, and it is optional at inference. On the other hand, we revisit the parameter-efficient fine-tuning regime for context expansion. Notably, we find that LoRA for context extension works well under the premise of trainable embedding and normalization. LongLoRA combines this improved LoRA with S²-Attn. LongLoRA demonstrates strong empirical results on various tasks with Llama2 models from 7B/13B to 70B. LongLoRA extends Llama2 7B from 4k context to 100k, or Llama2 70B to 32k, on a single 8x A100 machine. LongLoRA extends models' context while retaining their original architectures, and it is compatible with most existing techniques, like Flash-Attention2. In addition, we further conduct supervised fine-tuning with LongLoRA and our long instruction-following LongAlpaca dataset.

Overview of LongLoRA: Efficient Fine-tuning of Long-Context LLMs

The paper presents LongLoRA, a novel approach to efficiently fine-tune LLMs for extended context lengths while minimizing computational overhead. This method addresses the prohibitive computational costs traditionally associated with training LLMs on long-context sequences, such as those required for processing extensive documents or handling complex queries.

Contributions and Techniques

LongLoRA introduces several innovations to achieve efficient and effective fine-tuning:

  1. Shifted Sparse Attention (S²-Attn): To reduce the computational burden during fine-tuning, LongLoRA splits the context into several groups and computes attention within each group. Half of the attention heads use a grouping shifted by half the group size, which allows information to flow between adjacent groups. This pattern approximates the effect of full attention at a fraction of the cost, and it can be implemented with only two lines of code during training while remaining optional at inference (see the sketch after this list).
  2. Improved Parameter-Efficient Fine-Tuning: The authors extend the LoRA (Low-Rank Adaptation) framework for long-context fine-tuning by additionally making the embedding and normalization layers trainable. This adaptation, referred to as LoRA+ in the paper, proves crucial for effective long-context adaptation: it significantly narrows the performance gap between LoRA and full fine-tuning while adding only a small number of trainable parameters (see the second sketch after this list).
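
The shifting mechanism can be illustrated with a short PyTorch-style sketch. This is a minimal illustration of the grouped-and-shifted attention pattern described in item 1, not the authors' exact implementation; the tensor layout, the use of scaled_dot_product_attention, and the simplified causal masking are assumptions made for clarity.

```python
import torch
import torch.nn.functional as F

def s2_attn_sketch(q, k, v, group_size):
    """Illustrative shifted sparse attention (S²-Attn style).

    q, k, v: (batch, seq_len, num_heads, head_dim); seq_len must be divisible
    by group_size. The first half of the heads attends within contiguous
    groups of tokens; the second half attends within groups shifted by half
    the group size, so information can flow between neighbouring groups.
    """
    B, N, H, D = q.shape
    G = group_size

    def grouped_attention(q, k, v):
        # Fold each run of G consecutive tokens into the batch dimension so
        # that attention is computed independently inside every group.
        q, k, v = (x.reshape(B * N // G, G, -1, D).transpose(1, 2) for x in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return out.transpose(1, 2).reshape(B, N, -1, D)

    # First half of the heads: attention within plain, contiguous groups.
    out_plain = grouped_attention(q[:, :, : H // 2], k[:, :, : H // 2], v[:, :, : H // 2])

    # Second half: shift the tokens by half a group before grouping and shift
    # the result back afterwards, so each shifted group straddles two adjacent
    # plain groups. (A faithful implementation would also adjust the causal
    # mask for the wrapped-around first group; omitted here for brevity.)
    def shift(x):
        return x[:, :, H // 2 :].roll(-G // 2, dims=1)

    out_shift = grouped_attention(shift(q), shift(k), shift(v)).roll(G // 2, dims=1)

    # Concatenate the two halves of the heads back together.
    return torch.cat([out_plain, out_shift], dim=2)
```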
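
The second ingredient, making the embedding and normalization layers trainable alongside the low-rank adapters, can be approximated in plain PyTorch as sketched below. The parameter-name matching ("lora_", "embed", "norm") assumes Llama-style module names and a LoRA wrapper that tags its adapter weights accordingly; it is not the paper's exact code.

```python
import torch.nn as nn

def mark_longlora_trainable(model: nn.Module) -> None:
    """Freeze the base model, then unfreeze LoRA adapter weights plus the
    embedding and normalization parameters, mirroring the improved-LoRA
    recipe described in item 2."""
    trainable_keywords = ("lora_", "embed", "norm")
    for name, param in model.named_parameters():
        param.requires_grad = any(key in name for key in trainable_keywords)

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable parameters: {trainable:,} / {total:,} "
          f"({100.0 * trainable / total:.2f}%)")
```

Although the embedding and normalization layers contribute only a small fraction of additional trainable parameters, the paper reports that unfreezing them is what closes most of the gap between LoRA and full fine-tuning at long context lengths.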

Empirical Evaluation

The paper provides extensive empirical evaluations demonstrating the efficacy of LongLoRA. Key results include:

  • Context Extension: LongLoRA successfully extends the context window of Llama2 7B from 4k to 100k tokens and Llama2 70B to 32k tokens using only a single 8× A100 machine. The models retain the original architectures and support optimizations such as Flash-Attention2, making them highly compatible with existing techniques.
  • Performance Metrics: Evaluation on datasets like PG19 and proof-pile shows that models fine-tuned with LongLoRA achieve perplexity values comparable to fully fine-tuned models. For instance, a Llama2 7B model fine-tuned to 32k context length achieves a perplexity of 2.50 on proof-pile, closely matching full attention fine-tuned models.
  • Efficiency: LongLoRA fine-tuning of Llama2 7B to 100k context length demonstrates up to 1.8× lower memory cost and reduced training hours compared to conventional full fine-tuning approaches.

Implications and Future Directions

LongLoRA represents a significant advancement in the domain of efficient fine-tuning for LLMs. Its ability to handle much longer context lengths with reduced computational resources opens doors for various practical applications. These include summarizing extensive documents, handling long-form question answering, and other tasks requiring substantial context comprehension.

Theoretically, the introduction of S²-Attn and the enhancements to the LoRA framework suggest promising avenues for further research into efficient attention mechanisms and parameter-efficient training strategies. Future work could explore the application of LongLoRA to other LLM architectures and position encoding schemes, further broadening its utility and impact.

Conclusion

The LongLoRA method offers a pragmatic solution to the challenge of extending the context lengths of LLMs while balancing computational efficiency and performance. The combination of S²-Attn for efficient attention and the improved LoRA+ framework exemplifies a thoughtful approach to addressing the limitations of conventional fine-tuning methods. This work lays a solid foundation for future research aimed at optimizing LLMs for long-context applications, ensuring scalability and accessibility for broader research communities.

Authors (7)
  1. Yukang Chen
  2. Shengju Qian
  3. Haotian Tang
  4. Xin Lai
  5. Zhijian Liu
  6. Song Han
  7. Jiaya Jia
Citations (122)