PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training (2309.10400v3)

Published 19 Sep 2023 in cs.CL and cs.LG

Abstract: LLMs are trained with a pre-defined context length, restricting their use in scenarios requiring long inputs. Previous efforts for adapting LLMs to a longer length usually requires fine-tuning with this target length (Full-length fine-tuning), suffering intensive training cost. To decouple train length from target length for efficient context window extension, we propose Positional Skip-wisE (PoSE) training that smartly simulates long inputs using a fixed context window. This is achieved by first dividing the original context window into several chunks, then designing distinct skipping bias terms to manipulate the position indices of each chunk. These bias terms and the lengths of each chunk are altered for every training example, allowing the model to adapt to all positions within target length. Experimental results show that PoSE greatly reduces memory and time overhead compared with Full-length fine-tuning, with minimal impact on performance. Leveraging this advantage, we have successfully extended the LLaMA model to 128k tokens using a 2k training context window. Furthermore, we empirically confirm that PoSE is compatible with all RoPE-based LLMs and position interpolation strategies. Notably, our method can potentially support infinite length, limited only by memory usage in inference. With ongoing progress for efficient inference, we believe PoSE can further scale the context window beyond 128k.

PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training

The paper "PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training" by Dawei Zhu et al. presents a novel methodology for extending the context window of LLMs effectively and efficiently. This research addresses a crucial limitation in LLMs: the predefined context length, which poses significant constraints in applications requiring the processing of extremely long sequences.

Summary and Key Concepts

Background and Motivation

Traditional LLMs are constrained by their pre-defined context lengths; transformer models such as GPT and LLaMA typically operate within a window of a few thousand tokens. This limitation hinders tasks that require long-sequence processing, such as document summarization, long-text retrieval, and in-context learning with many examples. Existing methods for extending the context length often rely on full-length fine-tuning, which is computationally prohibitive because the cost of self-attention grows quadratically with sequence length.
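
As a rough illustration of that quadratic cost (a back-of-the-envelope estimate, not a figure from the paper): self-attention in each layer scales as

$$\text{cost}_{\text{attention}} = O(n^2 d),$$

so naively extending the window from $n = 2\text{k}$ to $n = 128\text{k}$ tokens multiplies the per-layer attention cost by roughly $(128\text{k}/2\text{k})^2 = 64^2 = 4096$.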

Proposed Approach: Positional Skip-wise Training (PoSE)

The paper introduces PoSE, a fine-tuning strategy that decouples the training context size from the target context length. The method manipulates positional indices to simulate long inputs within a fixed context window: the original context window is partitioned into several chunks, and each chunk's position indices are shifted by a distinct skipping bias term, exposing the model to a diverse range of relative positions within the target context window. By varying these bias terms and chunk lengths for every training example, PoSE lets the model adapt to all positions up to the target length without the intensive computational demands of full-length fine-tuning.
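
The core index manipulation can be sketched in a few lines. The following is a minimal, illustrative Python sketch under simple assumptions (uniform sampling of chunk cut points and skip biases, two chunks by default); the function name and exact sampling scheme are ours for illustration, not the authors' implementation:

```python
import random

def pose_position_ids(train_len: int, target_len: int, num_chunks: int = 2):
    """Return train_len position indices spread over [0, target_len).

    Sketch of positional skip-wise training: split the training window into
    chunks of random length, then shift each chunk forward by a random
    skipping bias so the indices cover positions far beyond train_len.
    """
    # Random cut points give chunk lengths that sum to train_len.
    cuts = sorted(random.sample(range(1, train_len), num_chunks - 1))
    lengths = [b - a for a, b in zip([0] + cuts, cuts + [train_len])]

    position_ids, start = [], 0
    budget = target_len - train_len          # total positions we may skip
    for length in lengths:
        skip = random.randint(0, budget)     # skipping bias for this chunk
        budget -= skip
        start += skip
        position_ids.extend(range(start, start + length))
        start += length
    return position_ids

# Example: a 2k training window simulating a 16k target context.
ids = pose_position_ids(train_len=2048, target_len=16384)
assert len(ids) == 2048 and max(ids) < 16384
```

Because only the position indices change, attention is still computed over `train_len` tokens per example, which is where the memory and time savings come from.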

Experimental Results

The empirical evaluation demonstrates PoSE's efficacy on multiple fronts:

  1. Memory and Time Efficiency:
    • PoSE retains the fixed context window size during fine-tuning, significantly reducing both time and memory overhead compared to full-length fine-tuning. For example, scaling LLaMA-7B from a 2k to a 16k context window requires substantially less compute and memory with PoSE.
  2. Language Modeling Performance:
    • Evaluation on datasets such as GovReport and Proof-pile shows that PoSE achieves comparable perplexity scores to full-length fine-tuning while using a fraction of the computational resources. This indicates that PoSE's position manipulation scheme effectively preserves the model's language comprehension and generation capabilities for extended context lengths.
  3. Compatibility and Versatility:
    • PoSE has been empirically validated across different RoPE-based LLMs, including LLaMA, LLaMA2, GPT-J, and Baichuan. It also combines with position interpolation strategies such as Linear, NTK, and YaRN, demonstrating its robustness and adaptability (a brief interpolation sketch follows this list).
  4. Potential for Extreme Context Lengths:
    • PoSE shows promising results in extending context windows to extreme lengths, such as 96k and 128k tokens. The researchers tested models on long document datasets like Books3 and Gutenberg, achieving reasonable perplexity scores and demonstrating the potential to process exceedingly long sequences effectively.

Theoretical and Practical Implications

Theoretical Implications

The theoretical foundation of PoSE lies in its innovative use of positional indices to simulate longer contexts within a fixed window. This approach challenges the traditional dependency on long-sequence fine-tuning, providing a generalizable mechanism that can potentially support infinite lengths, given advancements in efficient inference techniques. The paper also contributes to the understanding of positional embedding schemes and their adaptation for length extrapolation, particularly in the context of RoPE-based models.
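
For reference, the property that makes pure index manipulation meaningful in RoPE-based models is that attention scores depend only on relative offsets. In the notation of the RoPE paper (Su et al., 2021),

$$\big\langle f_q(\mathbf{x}_m, m),\; f_k(\mathbf{x}_n, n) \big\rangle = g(\mathbf{x}_m, \mathbf{x}_n, m - n),$$

where $f_q$ and $f_k$ apply position-dependent rotations to the query and key vectors. What PoSE must therefore cover during fine-tuning is the range of relative distances $m - n$ up to the target length, which its skipping biases achieve without materializing every absolute position.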

Practical Implications

Practically, PoSE offers a scalable solution for deploying LLMs in real-world applications where long-context processing is crucial. Its efficiency in memory and computation enables the utilization of large models on limited hardware, making it accessible for a broader range of applications. This capability is particularly beneficial for industries dealing with large volumes of textual data, such as legal document analysis, academic research, and large-scale digital archiving.

Future Directions

Given the promising results of PoSE, future research can explore several avenues:

  1. Optimizing Positional Interpolation: While the paper explores linear, NTK, and YaRN interpolation methods, further optimization of these techniques could enhance performance for even longer contexts.
  2. Broader Model Applicability: Extending PoSE’s applicability to other types of LLM architectures beyond RoPE-based models could be another fruitful direction.
  3. Inference Efficiency: Continued advances in efficient inference techniques such as FlashAttention and xFormers can further reduce memory usage during inference, pushing the boundaries of context length.
  4. Real-World Applications: Implementing PoSE in diverse real-world applications can validate its practical utility and inspire further enhancements based on application-specific requirements.

Conclusion

The PoSE methodology presents an efficient and effective approach for extending the context windows of LLMs. It addresses the computational challenges of full-length fine-tuning and demonstrates substantial potential for both theoretical advancements and practical applications in processing long sequences. The research's implications suggest a significant step forward in making LLMs more versatile and scalable, thereby expanding their utility in complex, real-world text processing tasks.

References (41)
  1. Baichuan. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305, 2023. URL https://arxiv.org/abs/2309.10305.
  2. Piqa: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.
  3. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  4. Recurrent memory transformer. Advances in Neural Information Processing Systems, 35:11079–11091, 2022.
  5. Extending context window of large language models via positional interpolation. ArXiv, abs/2306.15595, 2023a.
  6. Longlora: Efficient fine-tuning of long-context large language models. arXiv preprint arXiv:2309.12307, 2023b.
  7. Boolq: Exploring the surprising difficulty of natural yes/no questions. In NAACL, 2019.
  8. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
  9. Transformer-xl: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.  2978–2988, 2019.
  10. Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. 2023.
  11. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022.
  12. Hugging Face. Open llm leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023.
  13. The Pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
  14. Transformer language models without positional encodings still learn positional information. In Findings of the Association for Computational Linguistics: EMNLP 2022, pp.  1382–1390, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-emnlp.99.
  15. Efficient attentions for long document summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  1419–1436, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.112.
  16. kaiokendev. Things i’m learning while training superhot. https://kaiokendev.github.io/til#extending-context-to-8k, 2023.
  17. Winogrande: An adversarial winograd schema challenge at scale. 2019.
  18. Efficient memory management for large language model serving with pagedattention, 2023.
  19. xformers: A modular and hackable transformer modelling library. https://github.com/facebookresearch/xformers, 2022.
  20. In-context learning with many demonstration examples. arXiv preprint arXiv:2302.04931, 2023.
  21. Truthfulqa: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  3214–3252, 2022.
  22. Landmark attention: Random-access infinite context length for transformers, 2023.
  23. Ntk-aware scaled rope allows llama models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation. https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have, 2023.
  24. Yarn: Efficient context window extension of large language models, 2023.
  25. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409, 2021.
  26. Shawn Presser. https://twitter.com/theshawwn/status/1320282149329784833, 2020.
  27. Jeffrey Quesnelle. Dynamically scaled rope further increases performance of long context llama with zero fine-tuning. https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/, 2023.
  28. Compressive transformers for long-range sequence modelling. arXiv preprint, 2019.
  29. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp.  1–16. IEEE, 2020.
  30. Randomized positional encodings boost length generalization of transformers. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp.  1889–1903, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-short.161.
  31. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021.
  32. A length-extrapolatable transformer. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  14590–14604, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.816.
  33. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
  34. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
  35. Focused transformer: Contrastive training for context scaling. arXiv preprint arXiv:2307.03170, 2023.
  36. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax, May 2021.
  37. Augmenting language models with long-term memory. arXiv preprint arXiv:2306.07174, 2023.
  38. Memorizing transformers. arXiv preprint arXiv:2203.08913, 2022.
  39. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.  4791–4800, 2019.
  40. Proof-pile. https://github.com/zhangir-azerbayev/proof-pile, 2022.
  41. Fine-grained distillation for long document retrieval. arXiv preprint arXiv:2212.10423, 2022.
Authors (7)
  1. Dawei Zhu (46 papers)
  2. Nan Yang (182 papers)
  3. Liang Wang (512 papers)
  4. Yifan Song (48 papers)
  5. Wenhao Wu (71 papers)
  6. Furu Wei (291 papers)
  7. Sujian Li (82 papers)
Citations (63)