Long-Context Language Modeling with Parallel Context Encoding (2402.16617v2)

Published 26 Feb 2024 in cs.CL

Abstract: Extending LLMs to process longer inputs is crucial for a wide range of applications. However, the substantial computational cost of transformers and limited generalization of positional encoding restrict the size of their context window. We introduce Context Expansion with Parallel Encoding (CEPE), a framework that can be applied to any existing decoder-only LLMs to extend their context window. CEPE employs a small encoder to process long inputs chunk by chunk, enabling the frozen decoder to utilize additional contexts via cross-attention. CEPE is efficient, generalizable, and versatile: trained with 8K-token documents, it extends the context window of LLAMA-2 to 128K tokens, offering 10x the throughput with only 1/6 of the memory. CEPE yields strong performance on language modeling and in-context learning. CEPE also excels in retrieval-augmented applications, while existing long-context models degenerate with retrieved contexts. We further introduce a CEPE variant that can extend the context window of instruction-tuned models using only unlabeled data, and showcase its effectiveness on LLAMA-2-CHAT, leading to a strong instruction-following model that can leverage very long contexts on downstream tasks.

Enhancing the Context Window of LLMs with the CEPE Framework

Introduction

The paper introduces Context Expansion with Parallel Encoding (CEPE), a framework for extending the context-handling capabilities of existing LLMs. It addresses the need for LLMs to process long inputs, which is essential for tasks ranging from summarizing lengthy documents to answering questions over large collections of web pages. However, the substantial (quadratic) cost of transformer self-attention and the limited generalization of positional encodings beyond the training length have traditionally made it difficult to process long sequences efficiently.

CEPE Architecture

CEPE introduces a two-part design: a small encoder that processes long inputs chunk by chunk, and cross-attention modules inserted into the decoder layers so that the frozen decoder can attend to the encoded context. This diverges from purely decoder-only models: the encoder processes the context chunks in parallel, and the decoder consumes their representations through cross-attention, allowing the model to scale to longer inputs without a drastic increase in computational cost.
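
To make the architecture concrete, below is a minimal PyTorch sketch of a decoder layer augmented with a cross-attention path over encoder outputs. The class name, module layout, dimensions, and pre-norm placement are illustrative assumptions, not the authors' released implementation.

```python
import torch.nn as nn


class CEPEStyleDecoderLayer(nn.Module):
    """Decoder layer with an added cross-attention path to encoder outputs.

    Names, dimensions, and pre-norm layout are assumptions for illustration,
    not the paper's exact implementation.
    """

    def __init__(self, d_model=4096, n_heads=32):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_self = nn.LayerNorm(d_model)
        self.norm_cross = nn.LayerNorm(d_model)
        self.norm_mlp = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x, encoder_states, causal_mask=None):
        # Frozen causal self-attention over the decoder's local input.
        h = self.norm_self(x)
        h, _ = self.self_attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + h
        # Newly trained cross-attention over the chunk representations
        # produced by the small encoder (i.e., the long context).
        h, _ = self.cross_attn(self.norm_cross(x), encoder_states, encoder_states,
                               need_weights=False)
        x = x + h
        # Frozen feed-forward block.
        x = x + self.mlp(self.norm_mlp(x))
        return x
```

In this setup, only the encoder and the cross-attention modules would be trained, while the original decoder weights stay frozen.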

Efficiency and Versatility

CEPE substantially improves the efficiency of context-window extension. Although it is trained only on 8K-token documents, it extends LLaMA-2's context window to 128K tokens while delivering roughly 10x the throughput with about 1/6 of the memory. This contrasts with standard decoding, whose key-value cache grows linearly with input length. Because the context chunks are encoded in parallel and only the encoder and cross-attention modules are tuned, the computational overhead stays modest, making CEPE a practical solution for large-scale deployment.
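
The savings come from encoding the long context in bounded-size chunks rather than extending the decoder's self-attention window. Below is a minimal sketch of such chunked encoding, assuming a generic `encoder` callable; the `encode_in_chunks` helper and the chunk size are hypothetical, not part of the paper's code.

```python
import torch
import torch.nn.functional as F


def encode_in_chunks(encoder, input_ids, chunk_len=4096):
    """Split a long token sequence into fixed-size chunks and encode them as a
    batch, so each chunk is processed independently (and in parallel) rather
    than attending over the full input. `encoder` is assumed to map
    (num_chunks, chunk_len) token ids to (num_chunks, chunk_len, d_model)
    hidden states.
    """
    pad = (-input_ids.size(0)) % chunk_len      # pad up to a multiple of chunk_len
    padded = F.pad(input_ids, (0, pad))
    chunks = padded.view(-1, chunk_len)         # (num_chunks, chunk_len)
    with torch.no_grad():
        states = encoder(chunks)                # encode all chunks at once
    # Flatten chunk states into one key/value sequence for decoder cross-attention.
    return states.reshape(1, -1, states.size(-1))
```

Since the decoder's self-attention window never grows, memory scales only with the stored encoder states rather than with a full-length key-value cache.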

Practical Applications and Performance

CEPE's utility is demonstrated across a range of tasks, with notable gains in language modeling, in-context learning, and retrieval-augmented applications. For language modeling, CEPE processes longer inputs with far better efficiency than existing methods. In retrieval-augmented settings, where the model must leverage external documents, CEPE incorporates many more retrieved documents without degradation in output quality, whereas existing long-context models tend to degenerate with retrieved contexts. The paper also introduces the CEPE-Distilled (CEPED) variant, which extends instruction-tuned models such as LLaMA-2-CHAT using only unlabeled data, improving performance on downstream tasks that involve long texts.
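
As an illustration of the retrieval-augmented setting, the sketch below routes retrieved passages through the encoder while only the question occupies the decoder's own window. The `tokenizer(..., return_tensors="pt")` call follows the Hugging Face convention; `generate_with_cross_attention` and the overall wiring are hypothetical placeholders rather than an API from the paper.

```python
import torch


def answer_with_retrieval(decoder, encoder, tokenizer, question, retrieved_docs):
    """Hypothetical usage sketch: retrieved passages are encoded as chunks,
    while only the question sits in the decoder's native context window.
    `decoder.generate_with_cross_attention` stands in for whatever generation
    loop feeds the encoder states into the cross-attention modules.
    """
    doc_states = [
        encoder(tokenizer(doc, return_tensors="pt").input_ids)
        for doc in retrieved_docs
    ]
    # Adding more documents lengthens only the cross-attention memory,
    # not the decoder's self-attention window.
    encoder_states = torch.cat(doc_states, dim=1)
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    return decoder.generate_with_cross_attention(prompt_ids, encoder_states)
```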

Future Directions

The paper positions CEPE as an enabling technology for future LLM research, offering a cheap and effective strategy for context extension. While CEPE markedly improves an existing model's ability to handle long contexts efficiently, the authors note room for further work on encoder sizes, learning rates, and data mixtures. Applying CEPE to a broader range of instruction-tuned models is another promising direction.

Conclusion

The CEPE framework represents a substantial advance in the ability of LLMs to process and understand extended contexts. By augmenting the transformer architecture with a parallel encoding mechanism, CEPE improves efficiency, reduces computational cost, and extends the practical usability of LLMs on complex tasks involving large amounts of data. As LLM applications continue to expand, frameworks like CEPE will play a pivotal role in unlocking new capabilities and overcoming existing limitations.

Authors (3)
  1. Howard Yen
  2. Tianyu Gao
  3. Danqi Chen