Extending LLMs' Context Window with 100 Samples

Published 13 Jan 2024 in cs.CL (arXiv:2401.07004v1)

Abstract: LLMs are known to have limited extrapolation ability beyond their pre-trained context window, constraining their application in downstream tasks with lengthy inputs. Recent studies have sought to extend LLMs' context window by modifying rotary position embedding (RoPE), a popular position encoding method adopted by well-known LLMs such as LLaMA, PaLM, and GPT-NeoX. However, prior works like Position Interpolation (PI) and YaRN are resource-intensive and lack comparative experiments to assess their applicability. In this work, we identify the inherent need for LLMs' attention entropy (i.e. the information entropy of attention scores) to maintain stability and introduce a novel extension to RoPE which combines adjusting RoPE's base frequency and scaling the attention logits to help LLMs efficiently adapt to a larger context window. We validate the superiority of our method in both fine-tuning performance and robustness across different context window sizes on various context-demanding tasks. Notably, our method extends the context window of LLaMA-2-7B-Chat to 16,384 with only 100 samples and 6 training steps, showcasing extraordinary efficiency. Finally, we also explore how data compositions and training curricula affect context window extension for specific downstream tasks, suggesting fine-tuning LLMs with lengthy conversations as a good starting point. We release our code and SFT data at https://github.com/GAIR-NLP/Entropy-ABF.

References (57)
  1. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245.
  2. Etc: Encoding long and structured inputs in transformers. arXiv preprint arXiv:2004.08483.
  3. Palm 2 technical report. arXiv preprint arXiv:2305.10403.
  4. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508.
  5. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.
  6. Gpt-neox-20b: An open-source autoregressive language model. arXiv preprint arXiv:2204.06745.
  7. bloc97. 2023a. Add NTK-Aware interpolation "by parts" correction.
  8. bloc97. 2023b. NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation.
  9. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  10. Scaling transformer to 1m tokens and beyond with rmt. arXiv preprint arXiv:2304.11062.
  11. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595.
  12. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174.
  13. David Chiang and Peter Cholak. 2022. Overcoming a theoretical limitation of self-attention. arXiv preprint arXiv:2202.12172.
  14. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.
  15. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.
  16. Rethinking attention with performers. arXiv preprint arXiv:2009.14794.
  17. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113.
  18. Tri Dao. 2023. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691.
  19. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359.
  20. Longnet: Scaling transformers to 1,000,000,000 tokens. arXiv preprint arXiv:2307.02486.
  21. How abilities in large language models are affected by supervised fine-tuning data composition. arXiv preprint arXiv:2310.05492.
  22. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.
  23. Convolutional sequence to sequence learning. In International conference on machine learning, pages 1243–1252. PMLR.
  24. Retrieval augmented language model pre-training. In International conference on machine learning, pages 3929–3938. PMLR.
  25. Efficient attentions for long document summarization. arXiv preprint arXiv:2104.02112.
  26. kaiokendev. 2023. Things I’m learning while training superhot.
  27. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156–5165. PMLR.
  28. The impact of positional encoding on length generalization in transformers. arXiv preprint arXiv:2305.19466.
  29. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451.
  30. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474.
  31. How long can open-source llms truly promise on context length?
  32. Repobench: Benchmarking repository-level code auto-completion systems. arXiv preprint arXiv:2306.03091.
  33. Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
  34. Amirkeivan Mohtashami and Martin Jaggi. 2023. Landmark attention: Random-access infinite context length for transformers. arXiv preprint arXiv:2305.16300.
  35. Giraffe: Adventures in expanding context lengths in llms. arXiv preprint arXiv:2308.10882.
  36. Yarn: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071.
  37. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409.
  38. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.
  39. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE.
  40. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506.
  41. ZeRO-Offload: Democratizing billion-scale model training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21), pages 551–564.
  42. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155.
  43. Noam Shazeer. 2019. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150.
  44. Jianlin Su. 2023. Rectified rotary position embeddings. https://github.com/bojone/rerope.
  45. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864.
  46. Do long-range language models actually use long-range context? arXiv: Computation and Language.
  47. A length-extrapolatable transformer. arXiv preprint arXiv:2212.10554.
  48. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  49. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  50. Focused transformer: Contrastive training for context scaling. arXiv preprint arXiv:2307.03170.
  51. Attention is all you need. Advances in neural information processing systems, 30.
  52. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768.
  53. Memorizing transformers. arXiv preprint arXiv:2203.08913.
  54. Effective long-context scaling of foundation models. arXiv preprint arXiv:2309.16039.
  55. Bp-transformer: Modelling long-range context via binary partitioning. arXiv: Computation and Language.
  56. Big bird: Transformers for longer sequences. Advances in neural information processing systems, 33:17283–17297.
  57. Judging llm-as-a-judge with mt-bench and chatbot arena.

Summary

  • The paper proposes an innovative entropy-aware ABF method that extends the LLM context window to 16,384 tokens using only 100 samples.
  • It employs dynamic attention scaling and layer-specific adjustments to stabilize attention entropy across extended inputs.
  • Experiments on LLaMA-2-7B-Chat across 12 long-context tasks demonstrate superior efficiency and minimal resource requirements.

Extending LLMs' Context Window with Limited Data

Introduction

The paper addresses a critical limitation in LLMs: their restricted context window, which hampers performance in tasks needing extended input sequences. The authors propose an innovative extension to Rotary Position Embedding (RoPE), enhancing LLMs' ability to work with larger context windows. This study demonstrates a novel method that efficiently enlarges the context window, using minimal training data and computational resources while maintaining robust performance.

Methodology

The core contribution of the paper is the introduction of "entropy-aware ABF," a technique that combines adjusted base frequency (ABF) with a dynamic attention scalar. This approach aims to stabilize the information entropy of attention scores, which is crucial for maintaining model focus over longer inputs. The technique involves modifying RoPE's base frequency and scaling attention logits dynamically based on input position and layer-specific characteristics. This method is validated through fine-tuning and robustness tests across diverse context sizes and tasks.
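
To make the stability criterion concrete, the quantity the method seeks to keep steady is the Shannon entropy of each query's attention distribution. The snippet below is a minimal illustration of how that entropy can be computed from attention logits; the tensor shapes and function name are assumptions of this sketch, not code from the paper's release.

```python
import torch

def attention_entropy(attn_logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of each query's attention distribution.

    attn_logits: (..., q_len, k_len) pre-softmax attention scores.
    Returns a tensor of shape (..., q_len) with one entropy value per query;
    as the context grows, an unmodified model tends to see this value drift,
    which is the instability the entropy-aware scaling is meant to counter.
    """
    probs = torch.softmax(attn_logits, dim=-1)
    return -(probs * torch.log(probs.clamp_min(1e-12))).sum(dim=-1)
```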

Implementation and Experiments

The implementation of the proposed method involves:

  1. Dynamic Attention Scaling: Unlike the fixed scaling factors used in previous methods, the technique introduces a position-dependent attention scalar that grows with the number of context tokens, so attention weights adapt as inputs lengthen.
  2. Layer-Dependent Adjustment: The scaling factor is not applied uniformly across all layers; it is applied only to the layers where it helps keep attention entropy stable, preserving the model's inherent attention patterns elsewhere.
  3. Integration with ABF: Raising RoPE's base frequency to a larger value improves the model's ability to generalize to longer sequences (a sketch combining these components follows this list).
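
The sketch below shows, under stated assumptions, how these ingredients can fit together. The specific constants (an enlarged RoPE base of 500,000, a 4,096-token pre-training window inside the log scaling, two unscaled early layers) and the max(1, log) form of the scalar are illustrative choices for this sketch, not values confirmed by the paper.

```python
import math
import torch

# RoPE inverse frequencies theta_i = base^(-2i/d); ABF simply raises the base
# (500,000 here is an illustrative value) so rotations advance more slowly and
# the model can adapt to positions beyond its original window.
def rope_inv_freq(head_dim: int, base: float = 500_000.0) -> torch.Tensor:
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

def scaled_attention_logits(q: torch.Tensor, k: torch.Tensor, layer_idx: int,
                            train_ctx: int = 4096, unscaled_layers: int = 2) -> torch.Tensor:
    """Position-dependent scaling of attention logits, skipped on early layers.

    q, k: (batch, heads, seq_len, head_dim). `train_ctx` is the original
    pre-training window and `unscaled_layers` the number of early layers left
    untouched; both are assumptions of this sketch.
    """
    logits = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if layer_idx >= unscaled_layers:
        # Scale query position t by max(1, log_{train_ctx}(t)): a no-op inside
        # the original window, a gentle sharpening beyond it, keeping attention
        # entropy roughly stable as the context grows.
        pos = torch.arange(1, logits.size(-2) + 1, dtype=logits.dtype, device=logits.device)
        scale = torch.clamp(torch.log(pos) / math.log(train_ctx), min=1.0)
        logits = logits * scale.view(1, 1, -1, 1)
    return logits
```

In a setup along these lines, the enlarged inverse frequencies would replace the stock RoPE buffer before the brief fine-tuning run, and the scaled logits would be substituted into each attention layer's forward pass.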

The experiments were conducted on LLaMA-2-7B-Chat, extending its context window to 16,384 tokens with only 100 training samples and 6 training steps. The method's efficiency is demonstrated by strong performance on 12 long-context tasks from the LongBench benchmark.

Results

The results show that models using the entropy-aware ABF method achieve higher long-context performance across different amounts of training data (Figure 1), outperforming other extension methods such as PI, YaRN, and the NTK variants in both effectiveness and data efficiency.

Figure 1: Long-Context Performance of RoPE-extending Methods with Different Amounts of Training Data.

Moreover, the research highlights the robustness of the method across varying context window sizes (Figure 2), demonstrating consistent improvements and maintaining performance even when directly applied to extended contexts not seen during training.

Figure 2: Long-Context Performance of RoPE-extending Methods with Different Context Window Sizes.

Practical Implications and Future Work

The study's implications are significant for applications requiring long-context understanding, such as document summarization, code completion, and few-shot learning. By dramatically reducing the training resources needed to extend the context window, this method opens avenues for more practical deployments in resource-constrained environments.

Future research can explore integrating this scaling approach with other attention-efficient architectures and investigate its applicability to more complex tasks involving multi-document processing or long collaborative inputs.

Conclusion

The proposed "entropy-aware ABF" method marks a substantial advancement in addressing the context window limitation of LLMs. By ensuring efficient use of data and minimal resource requirements, this approach not only extends the usability of LLMs but also sets the stage for future innovations in large-scale contextual processing.
