
Extending LLMs' Context Window with 100 Samples (2401.07004v1)

Published 13 Jan 2024 in cs.CL

Abstract: LLMs are known to have limited extrapolation ability beyond their pre-trained context window, constraining their application in downstream tasks with lengthy inputs. Recent studies have sought to extend LLMs' context window by modifying rotary position embedding (RoPE), a popular position encoding method adopted by well-known LLMs such as LLaMA, PaLM, and GPT-NeoX. However, prior works like Position Interpolation (PI) and YaRN are resource-intensive and lack comparative experiments to assess their applicability. In this work, we identify the inherent need for LLMs' attention entropy (i.e. the information entropy of attention scores) to maintain stability and introduce a novel extension to RoPE which combines adjusting RoPE's base frequency and scaling the attention logits to help LLMs efficiently adapt to a larger context window. We validate the superiority of our method in both fine-tuning performance and robustness across different context window sizes on various context-demanding tasks. Notably, our method extends the context window of LLaMA-2-7B-Chat to 16,384 with only 100 samples and 6 training steps, showcasing extraordinary efficiency. Finally, we also explore how data compositions and training curricula affect context window extension for specific downstream tasks, suggesting fine-tuning LLMs with lengthy conversations as a good starting point. We release our code and SFT data at https://github.com/GAIR-NLP/Entropy-ABF.

Introduction to LLM Context Window Extension

LLMs like GPT-3 have shown exceptional ability in generating coherent and contextually relevant text. However, their capability is inherently constrained by the size of their context window, the amount of text they can consider at any given time. While LLMs are pre-trained with a fixed context window, real-world applications often require processing much longer texts. This research focuses on overcoming that limitation, which is crucial for tasks that demand a broader understanding of context, such as summarizing long documents or maintaining lengthy conversations.

Rotary Position Embedding (RoPE)

A critical aspect of current LLMs is position encoding, which tells the model the order of tokens in a sequence. Rotary Position Embedding (RoPE) is a popular position encoding method in state-of-the-art LLMs such as LLaMA, PaLM, and GPT-NeoX. It encodes each token's position by rotating the query and key vectors in the complex plane by an angle proportional to that position, so the resulting attention scores depend only on the relative distance between tokens, an essential property for generating coherent outputs.
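To make the rotation concrete, the following is a minimal NumPy sketch of the standard RoPE formulation from the RoFormer paper: each pair of dimensions (2i, 2i+1) of a query or key vector at position m is rotated by the angle m * base^(-2i/d), with the conventional base of 10,000. The function name and array shapes are illustrative, not taken from the authors' released code.

```python
# A minimal NumPy sketch of standard RoPE (RoFormer formulation); names and
# shapes are illustrative, not taken from the paper's released code.
import numpy as np

def apply_rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate x of shape (seq_len, head_dim) by position; head_dim must be even."""
    seq_len, head_dim = x.shape
    # Per-pair rotation frequencies: theta_i = base^(-2i / head_dim)
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    # Rotation angle for every (position m, frequency theta_i) pair: m * theta_i
    angles = np.outer(np.arange(seq_len), inv_freq)    # (seq_len, head_dim // 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    # Apply a 2-D rotation to each (even, odd) dimension pair
    out = np.empty_like(x, dtype=float)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out
```

Because each angle depends only on the absolute position, the dot product between two rotated vectors depends only on their relative distance, which is the property the base-frequency adjustment described next builds on.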

Extending the Context Window

Previous efforts to extend the context window of LLMs, such as Position Interpolation (PI) and YaRN, have been resource-intensive and lack comprehensive comparative analysis. In this work, the researchers present a method that extends the context window beyond the pre-trained limit by adjusting RoPE's base frequency and scaling the attention logits, enabling LLMs to adapt to larger context windows efficiently. The key idea is to keep attention entropy, the information entropy of the attention score distribution, stable as the context grows. The method, termed 'entropy-aware ABF' (ABF referring to the adjusted base frequency), is remarkably efficient: using only 100 samples and six training steps, it extends the context window of LLaMA-2-7B-Chat to 16,384 tokens. It outperforms existing methods across different window sizes and on various context-demanding tasks.
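The sketch below illustrates the attention-logit scaling half of this recipe under stated assumptions: the scale factor log(seq_len)/log(train_len) is a log-length heuristic common in related length-extrapolation work, not necessarily the paper's exact formula, and the ABF half amounts to computing the RoPE angles in the earlier sketch with a larger base (the exact value is a tuned hyperparameter).

```python
# A minimal sketch of attention-logit scaling, under stated assumptions: the
# scale factor log(seq_len) / log(train_len) is a log-length heuristic from
# related length-extrapolation work, not necessarily the paper's exact formula.
# In practice q and k would first be rotated with RoPE computed from an
# enlarged base (the ABF step), as in the earlier sketch.
import numpy as np

def scaled_attention_logits(q: np.ndarray, k: np.ndarray,
                            train_len: int = 4096) -> np.ndarray:
    """Scaled dot-product attention logits for q, k of shape (seq_len, head_dim)."""
    seq_len, head_dim = q.shape
    # Grow the logit scale logarithmically with context length so the softmax
    # does not flatten (i.e. attention entropy stays near its pre-training level)
    # when the model attends over far more tokens than it saw during training.
    entropy_scale = max(1.0, np.log(seq_len) / np.log(train_len))
    return entropy_scale * (q @ k.T) / np.sqrt(head_dim)

# Example: a context twice the assumed training length gets a scale just above 1.
rng = np.random.default_rng(0)
q = rng.normal(size=(2048, 64))
k = rng.normal(size=(2048, 64))
logits = scaled_attention_logits(q, k, train_len=1024)   # entropy_scale = 1.1
```

Intuitively, attending over more tokens spreads the softmax probability mass more thinly and raises its entropy; multiplying the logits by a slowly growing factor sharpens the distribution just enough to keep that entropy roughly stable, which is the condition the authors identify as necessary for adapting to longer contexts.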

Practical Implications and Dataset Efficiency

This paper's findings have potential implications for real-world applications of LLMs that require handling long texts. Notably, the method demonstrates extraordinary efficiency with minimal training samples, which significantly reduces the computational resources required for fine-tuning. The researchers also explore how data composition and training curricula affect context window extension for specific downstream tasks, suggesting fine-tuning on lengthy conversations as a good starting point.

By addressing performance, robustness across context window sizes, and resource efficiency, this research makes a significant contribution to enhancing the applicability of LLMs. The released code and supervised fine-tuning (SFT) data further enable replication and adoption of the proposed method by the broader research community.

Authors (3)
  1. Yikai Zhang
  2. Junlong Li
  3. Pengfei Liu