Long-Context Language Modeling with Parallel Context Encoding (2402.16617v2)

Published 26 Feb 2024 in cs.CL

Abstract: Extending LLMs to process longer inputs is crucial for a wide range of applications. However, the substantial computational cost of transformers and limited generalization of positional encoding restrict the size of their context window. We introduce Context Expansion with Parallel Encoding (CEPE), a framework that can be applied to any existing decoder-only LLMs to extend their context window. CEPE employs a small encoder to process long inputs chunk by chunk, enabling the frozen decoder to utilize additional contexts via cross-attention. CEPE is efficient, generalizable, and versatile: trained with 8K-token documents, it extends the context window of LLAMA-2 to 128K tokens, offering 10x the throughput with only 1/6 of the memory. CEPE yields strong performance on language modeling and in-context learning. CEPE also excels in retrieval-augmented applications, while existing long-context models degenerate with retrieved contexts. We further introduce a CEPE variant that can extend the context window of instruction-tuned models using only unlabeled data, and showcase its effectiveness on LLAMA-2-CHAT, leading to a strong instruction-following model that can leverage very long contexts on downstream tasks.

Enhancing the Context Window of LLMs with the CEPE Framework

Introduction

The paper introduces Context Expansion with Parallel Encoding (CEPE), a framework for extending the context-handling capabilities of existing LLMs. It addresses the need for LLMs to process long inputs, which is essential for tasks ranging from summarizing lengthy documents to answering questions over large collections of web pages. However, the substantial (quadratic) cost of transformer self-attention and the limited generalization of positional encodings beyond the training length have traditionally made it difficult to process long sequences efficiently.

CEPE Architecture

CEPE introduces a two-part design: a small encoder that processes long inputs chunk by chunk, and cross-attention modules inserted into the decoder layers so that the frozen decoder can attend to the encoded context. This diverges from purely decoder-only models: the encoder processes the context chunks in parallel, and the decoder consumes their representations through cross-attention, allowing the model to scale to longer inputs without a drastic increase in computational cost.
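
To make the architecture concrete, below is a minimal PyTorch sketch of a decoder layer augmented with a cross-attention path over encoder outputs. The class name, module layout, dimensions, and pre-norm placement are illustrative assumptions, not the authors' released implementation.

```python
import torch.nn as nn


class CEPEStyleDecoderLayer(nn.Module):
    """Decoder layer with an added cross-attention path to encoder outputs.

    Names, dimensions, and pre-norm layout are assumptions for illustration,
    not the paper's exact implementation.
    """

    def __init__(self, d_model=4096, n_heads=32):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_self = nn.LayerNorm(d_model)
        self.norm_cross = nn.LayerNorm(d_model)
        self.norm_mlp = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x, encoder_states, causal_mask=None):
        # Frozen causal self-attention over the decoder's local input.
        h = self.norm_self(x)
        h, _ = self.self_attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + h
        # Newly trained cross-attention over the chunk representations
        # produced by the small encoder (i.e., the long context).
        h, _ = self.cross_attn(self.norm_cross(x), encoder_states, encoder_states,
                               need_weights=False)
        x = x + h
        # Frozen feed-forward block.
        x = x + self.mlp(self.norm_mlp(x))
        return x
```

In this setup, only the encoder and the cross-attention modules would be trained, while the original decoder weights stay frozen.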

Efficiency and Versatility

CEPE substantially improves the efficiency of context-window extension. Although it is trained only on 8K-token documents, it extends LLaMA-2's context window to 128K tokens while delivering roughly 10x the throughput with about 1/6 of the memory. This contrasts with standard decoding, whose key-value cache grows linearly with input length. Because the context chunks are encoded in parallel and only the encoder and cross-attention modules are tuned, the computational overhead stays modest, making CEPE a practical solution for large-scale deployment.
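
The savings come from encoding the long context in bounded-size chunks rather than extending the decoder's self-attention window. Below is a minimal sketch of such chunked encoding, assuming a generic `encoder` callable; the `encode_in_chunks` helper and the chunk size are hypothetical, not part of the paper's code.

```python
import torch
import torch.nn.functional as F


def encode_in_chunks(encoder, input_ids, chunk_len=4096):
    """Split a long token sequence into fixed-size chunks and encode them as a
    batch, so each chunk is processed independently (and in parallel) rather
    than attending over the full input. `encoder` is assumed to map
    (num_chunks, chunk_len) token ids to (num_chunks, chunk_len, d_model)
    hidden states.
    """
    pad = (-input_ids.size(0)) % chunk_len      # pad up to a multiple of chunk_len
    padded = F.pad(input_ids, (0, pad))
    chunks = padded.view(-1, chunk_len)         # (num_chunks, chunk_len)
    with torch.no_grad():
        states = encoder(chunks)                # encode all chunks at once
    # Flatten chunk states into one key/value sequence for decoder cross-attention.
    return states.reshape(1, -1, states.size(-1))
```

Since the decoder's self-attention window never grows, memory scales only with the stored encoder states rather than with a full-length key-value cache.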

Practical Applications and Performance

CEPE's utility is demonstrated across a range of tasks, with notable gains in language modeling, in-context learning, and retrieval-augmented applications. For language modeling, CEPE processes longer inputs with far better efficiency than existing methods. In retrieval-augmented settings, where the model must leverage external documents, CEPE incorporates many more retrieved documents without degradation in output quality, whereas existing long-context models tend to degenerate with retrieved contexts. The paper also introduces the CEPE-Distilled (CEPED) variant, which extends instruction-tuned models such as LLaMA-2-CHAT using only unlabeled data, improving performance on downstream tasks that involve long texts.
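
As an illustration of the retrieval-augmented setting, the sketch below routes retrieved passages through the encoder while only the question occupies the decoder's own window. The `tokenizer(..., return_tensors="pt")` call follows the Hugging Face convention; `generate_with_cross_attention` and the overall wiring are hypothetical placeholders rather than an API from the paper.

```python
import torch


def answer_with_retrieval(decoder, encoder, tokenizer, question, retrieved_docs):
    """Hypothetical usage sketch: retrieved passages are encoded as chunks,
    while only the question sits in the decoder's native context window.
    `decoder.generate_with_cross_attention` stands in for whatever generation
    loop feeds the encoder states into the cross-attention modules.
    """
    doc_states = [
        encoder(tokenizer(doc, return_tensors="pt").input_ids)
        for doc in retrieved_docs
    ]
    # Adding more documents lengthens only the cross-attention memory,
    # not the decoder's self-attention window.
    encoder_states = torch.cat(doc_states, dim=1)
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    return decoder.generate_with_cross_attention(prompt_ids, encoder_states)
```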

Future Directions

The paper positions CEPE as an enabling technology for future LLM research, offering a cheap and effective strategy for context extension. While CEPE markedly improves an existing model's ability to handle long contexts efficiently, the authors note room for further work on encoder sizes, learning rates, and data mixtures. Applying CEPE to a broader range of instruction-tuned models is another promising direction.

Conclusion

The CEPE framework represents a substantial advance in the ability of LLMs to process and understand extended contexts. By augmenting the transformer architecture with a parallel encoding mechanism, CEPE improves efficiency, reduces computational cost, and extends the practical usability of LLMs on complex tasks involving large amounts of data. As LLM applications continue to expand, frameworks like CEPE will play a pivotal role in unlocking new capabilities and overcoming existing limitations.

Authors (3)
  1. Howard Yen
  2. Tianyu Gao
  3. Danqi Chen