Overview of QwenLong-CPRS
The paper introduces QwenLong-CPRS, a context compression framework designed to improve both the efficiency and the accuracy of LLMs on long sequences. It targets two well-known problems: the prohibitive computational overhead of the prefill stage, and the "lost in the middle" phenomenon, in which models underuse information located in the middle of long inputs. Through dynamic context optimization, QwenLong-CPRS compresses contexts at multiple granularities under the guidance of natural language instructions, yielding efficiency gains alongside improved performance.
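To make the interface concrete, here is a minimal, runnable Python sketch of query-conditioned compression. The keyword-overlap scorer is a toy stand-in for the learned compressor, and the `compress` function and its `keep_ratio` parameter are illustrative assumptions, not the paper's released API.

```python
# Toy stand-in for query-conditioned context compression. QwenLong-CPRS uses
# a learned model; the keyword-overlap heuristic below only shows the shape
# of the interface: (long context, natural-language query) -> shorter context.

def compress(context: str, query: str, keep_ratio: float = 0.25) -> str:
    """Keep the sentences most lexically related to the query."""
    sentences = [s.strip() for s in context.split(".") if s.strip()]
    query_terms = set(query.lower().split())

    def score(sentence: str) -> int:
        return len(query_terms & set(sentence.lower().split()))

    keep_n = max(1, int(len(sentences) * keep_ratio))
    kept = set(sorted(sentences, key=score, reverse=True)[:keep_n])
    # Preserve the original order so the compressed context stays readable.
    return ". ".join(s for s in sentences if s in kept) + "."

doc = ("The treaty was signed in 1998. The weather was mild that year. "
       "Its third clause covers data sharing. Delegates stayed in Geneva.")
print(compress(doc, "What does the treaty's third clause cover?"))
# -> "Its third clause covers data sharing."
```

The downstream LLM then receives only the compressed context plus the query, which shrinks the prefill cost and removes distracting mid-context material.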
Key Innovations
Evolved from the Qwen architecture series, QwenLong-CPRS introduces four key innovations:
- Natural Language-Guided Dynamic Optimization: Compression is conditioned on the user's query, so critical information is preserved while non-essential content is dropped (illustrated by the toy interface above).
- Bidirectional Reasoning Layers: The upper layers attend in both directions, sharpening the model's awareness of content boundaries within long contexts and mitigating the positional biases behind "lost in the middle" failures (see the first sketch after this list).
- Token Critic Mechanisms: Language modeling (LM) heads are repurposed to score the importance of each token, letting the model compress aggressively while retaining precision (see the second sketch after this list).
- Window-Parallel Inference: The context is partitioned into windows that are processed in parallel, significantly reducing latency and computational overhead (see the third sketch after this list).
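For the bidirectional reasoning layers, one plausible minimal sketch is to drop the causal mask in the top layers so they attend in both directions. This assumes a standard PyTorch encoder stack; the layer count and the `bidirectional_top_k` switch are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch: make the top layers of a stack bidirectional by dropping the
# causal mask there. Standard encoder layers stand in for Qwen blocks.
import torch
import torch.nn as nn

layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
    for _ in range(4)
)
bidirectional_top_k = 1            # how many top layers attend bidirectionally

x = torch.randn(2, 10, 64)         # (batch, seq_len, hidden)
causal = nn.Transformer.generate_square_subsequent_mask(10)

for i, layer in enumerate(layers):
    # Lower layers keep the causal mask; the top-k layers see the full
    # window in both directions, improving boundary awareness for scoring.
    mask = None if i >= len(layers) - bidirectional_top_k else causal
    x = layer(x, src_mask=mask)
```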
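The token-critic idea can be sketched as a small head over the backbone's hidden states that emits a per-token retention score. This minimal PyTorch version assumes a toy hidden size and a 2-way keep/drop head; the class name `TokenCritic` and the 0.5 threshold are illustrative, not values from the paper.

```python
# Sketch of a token-critic head: per-token keep probabilities from a
# language-modeling-style head over backbone hidden states.
import torch
import torch.nn as nn

class TokenCritic(nn.Module):
    """Toy critic head: maps each token's hidden state to a keep probability."""
    def __init__(self, hidden_size: int = 64):
        super().__init__()
        self.head = nn.Linear(hidden_size, 2)    # keep/drop logits per token

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size)
        logits = self.head(hidden_states)        # (batch, seq_len, 2)
        return logits.softmax(dim=-1)[..., 1]    # probability of "keep"

critic = TokenCritic()
hidden = torch.randn(1, 10, 64)     # stand-in for backbone hidden states
keep_prob = critic(hidden)          # (1, 10) per-token retention scores
selected = (keep_prob > 0.5)        # tokens kept in the compressed context
print(selected)
```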
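Finally, window-parallel inference can be sketched by splitting the token sequence into fixed-size windows and scoring them in one batched pass, so attention cost grows with the window size rather than the full sequence length. The `encoder` and `critic` stand-ins below are toys, not the actual model components.

```python
# Sketch of window-parallel scoring: one batched forward over all windows,
# so attention cost scales with num_windows * window**2, not seq_len**2.
import torch
import torch.nn as nn
import torch.nn.functional as F

def window_parallel_scores(encoder, critic, token_ids, window=8):
    """Score every token by processing fixed-size windows in one batch."""
    pad = (-token_ids.numel()) % window               # pad to a window multiple
    windows = F.pad(token_ids, (0, pad)).view(-1, window)

    hidden = encoder(windows)                         # (num_windows, window, 64)
    scores = critic(hidden).squeeze(-1)               # (num_windows, window)
    return scores.flatten()[: token_ids.numel()]      # drop padding scores

encoder = nn.Embedding(1000, 64)    # toy stand-in for the compressor backbone
critic = nn.Linear(64, 1)           # toy stand-in for the token-critic head
ids = torch.randint(0, 1000, (30,))
print(window_parallel_scores(encoder, critic, ids).shape)  # torch.Size([30])
```

In the full system, these per-token scores would feed the selection step that assembles the compressed context.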
Evaluation and Results
Extensive evaluation across five benchmarks demonstrates the effectiveness of QwenLong-CPRS. The framework achieves:
- Consistent superiority, in both accuracy and efficiency, over other context management methods such as Retrieval-Augmented Generation (RAG) and sparse attention.
- Architecture-agnostic integration with several flagship LLMs, achieving significant context compression ratios and notable performance improvements.
- New state-of-the-art (SOTA) performance metrics on Ruler-128K and InfiniteBench.
Notably, smaller short-context LLMs equipped with QwenLong-CPRS can outperform larger long-context counterparts, underscoring the framework's efficiency and its potential for scalability.
Practical and Theoretical Implications
Practically, QwenLong-CPRS offers a scalable path for deploying current LLMs in applications that demand long-context processing, allowing models to run with reduced computational resources without sacrificing performance. This is particularly valuable in domains where extensive text processing is the norm, such as legal document analysis, retrieval of specific knowledge from large corpora, and multi-step reasoning tasks.
Theoretically, the framework contributes to the broader discourse on dynamic context management in LLMs, suggesting a potential paradigm shift towards more adaptive, instruction-guided context handling. Future developments in this area could lead to increasingly sophisticated models capable of processing complex contexts dynamically and efficiently.
Future Directions
The paper identifies several directions for future work, including optimizing kernel operations for greater computational efficiency and integrating global context awareness. QwenLong-CPRS also has potential as a foundational component in applications beyond the current benchmarks, such as enhancing reasoning in agent systems and advancing long-chain reasoning compression.
Overall, QwenLong-CPRS presents a robust, adaptable approach to the current limitations of long-context LLM processing, and it is well positioned to contribute to future advances in contextual comprehension and computational efficiency.