Qwen2.5-1M Technical Report (2501.15383v1)

Published 26 Jan 2025 in cs.CL

Abstract: We introduce Qwen2.5-1M, a series of models that extend the context length to 1 million tokens. Compared to the previous 128K version, the Qwen2.5-1M series have significantly enhanced long-context capabilities through long-context pre-training and post-training. Key techniques such as long data synthesis, progressive pre-training, and multi-stage supervised fine-tuning are employed to effectively enhance long-context performance while reducing training costs. To promote the use of long-context models among a broader user base, we present and open-source our inference framework. This framework includes a length extrapolation method that can expand the model context lengths by at least four times, or even more, without additional training. To reduce inference costs, we implement a sparse attention method along with chunked prefill optimization for deployment scenarios and a sparsity refinement method to improve precision. Additionally, we detail our optimizations in the inference engine, including kernel optimization, pipeline parallelism, and scheduling optimization, which significantly enhance overall inference performance. By leveraging our inference framework, the Qwen2.5-1M models achieve a remarkable 3x to 7x prefill speedup in scenarios with 1 million tokens of context. This framework provides an efficient and powerful solution for developing applications that require long-context processing using open-source models. The Qwen2.5-1M series currently includes the open-source models Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, as well as the API-accessed model Qwen2.5-Turbo. Evaluations show that Qwen2.5-1M models have been greatly improved in long-context tasks without compromising performance in short-context scenarios. Specifically, the Qwen2.5-14B-Instruct-1M model significantly outperforms GPT-4o-mini in long-context tasks and supports contexts eight times longer.

The Qwen2.5-1M technical report details the design, training, and deployment strategies of a series of LLMs engineered to handle up to one million tokens of context. The report systematically covers modifications to the training regime, architectural adaptations, and inference optimizations necessary for scaling context length while preserving performance on both long and short sequences.

The technical approach is divided into two main parts:

1. Long-Context Training and Post-Training

  • Pre-Training Strategy: The models leverage a combination of natural and synthetic long-text data. Synthetic tasks such as Fill-in-the-Middle, keyword-based retrieval, and paragraph reordering are integrated into the corpus to enforce long-range dependency learning. A progressive context expansion schedule is adopted: training begins with 4,096-token sequences, and the context window is then increased to 32K, 65K, 131K, and finally 262K tokens. During each phase, the model's Rotary Positional Embedding (RoPE) base frequency is adjusted so that the attention mechanism can effectively capture longer-range dependencies (see the RoPE sketch after this list). Quantitative improvements on the RULER benchmark across these successive stages show a marked enhancement in the models' ability to handle extensive context lengths.
  • Post-Training Enhancements: To further bolster long-context performance without degrading capabilities on shorter inputs, the report describes a two-stage supervised fine-tuning (SFT) process: the model is first fine-tuned exclusively on shorter sequences, and a mixed dataset containing sequences of up to 262K tokens is then introduced. Complementing this, an offline reinforcement learning (RL) phase akin to Direct Preference Optimization (DPO) is applied; it trains on preference pairs from short-context samples yet generalizes to long-context tasks, as evidenced by improvements on the LongBench-Chat benchmark (a sketch of a DPO-style objective follows this list). To address the scarcity of human-annotated long instructions, long instruction data is synthesized with an agent-based framework that uses retrieval-augmented generation and multi-hop reasoning.
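
As a concrete illustration of the progressive RoPE adjustment described above, the following is a minimal sketch of how the rotary base frequency can be raised as the context window grows. The per-stage base values below are illustrative assumptions, not figures from the report.

```python
import numpy as np

def rope_angles(seq_len: int, head_dim: int, base: float) -> np.ndarray:
    """Rotation angle for every (position, dimension-pair) under RoPE.

    theta_i = base^(-2i/head_dim); angle[p, i] = p * theta_i.
    """
    freqs = base ** (-np.arange(0, head_dim, 2) / head_dim)   # (head_dim/2,)
    return np.outer(np.arange(seq_len), freqs)                # (seq_len, head_dim/2)

# Illustrative progressive schedule: as the training context grows, the RoPE
# base is raised so that far-apart positions still map to distinguishable
# angles. (Base values here are assumptions, not from the report.)
schedule = [(4_096, 10_000), (32_768, 1_000_000), (262_144, 10_000_000)]
for ctx_len, base in schedule:
    angles = rope_angles(ctx_len, head_dim=128, base=base)
    print(f"ctx={ctx_len:>7}  base={base:>10,}  "
          f"slowest-dim angle at last position: {angles[-1, -1]:.4f} rad")
```

Raising the base slows the rotation of the lowest-frequency dimensions, so positions hundreds of thousands of tokens apart still receive distinguishable angles.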
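
The offline RL stage is described only as akin to Direct Preference Optimization, so the following is a hedged sketch of the standard DPO objective rather than the report's exact training code. It assumes summed per-response log-probabilities from the policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective over a batch of preference pairs.

    Each tensor holds the summed log-probability of a full response
    under the policy or the frozen reference model.
    """
    # How much more the policy prefers each response than the reference does.
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * margin difference), averaged over the batch.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy usage with random log-probs for four preference pairs.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
print(f"toy DPO loss: {loss.item():.4f}")
```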

2. Efficient Inference and Deployment

  • Length Extrapolation via Dual Chunk Attention (DCA): The report introduces a training-free length extrapolation method based on Dual Chunk Attention, which decomposes the input into manageable chunks and remaps relative positions. By ensuring that inter-token distances never exceed the maximum length seen during training, the method preserves the effectiveness of rotary position embeddings even when processing contexts of up to one million tokens (a simplified sketch of the remapping idea follows this list). Complementary attention scaling techniques (referred to as YaRN) are applied to maintain precision and stability.
  • Sparse Attention and Memory Optimizations: Because full attention is prohibitively expensive for ultra-long contexts, the report adopts a sparse attention mechanism inspired by MInference. This approach dynamically selects critical tokens following a "Vertical-Slash" pattern, reducing computational complexity while maintaining nearly identical accuracy to full attention (see the sketch after this list). Integration with chunked prefill, which processes the input in smaller segmented chunks, dramatically reduces VRAM usage (up to 96.7% savings for activation storage in MLP layers). Additionally, a sparsity refinement method based on an attention recall metric calibrates the sparse configuration for 1M-token sequences.
  • Inference Engine Optimizations: The deployment framework (open-sourced as part of BladeLLM and integrated with vLLM) incorporates kernel-level optimizations, including highly tuned sparse attention kernels and MoE kernel enhancements that exploit modern GPU architectures (e.g., NVIDIA Ampere/Hopper and AMD MI300) to reach up to 90% of peak FLOPs utilization. Further efficiency comes from dynamic chunked pipeline parallelism and the Totally Asynchronous Generator (TAG) scheduling system. Together, these system-level optimizations reduce time-to-first-token by 3× to 7× in ultra-long-context scenarios across different model sizes and hardware platforms.
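
To make the DCA idea concrete, here is a simplified, training-free illustration of chunk-based position remapping: relative distances within a chunk are kept exact, nearby cross-chunk tokens keep their true small distances, and all remaining distances are capped so nothing exceeds the window seen in training. This is a sketch of the general mechanism, not the exact three-pattern formulation used in DCA.

```python
import numpy as np

def dca_relative_positions(seq_len: int, chunk_size: int,
                           local_window: int) -> np.ndarray:
    """Illustrative DCA-style remapping of relative positions.

    rel[i, j] is the effective distance fed to RoPE for query i and
    key j (j <= i). Every value stays below chunk_size, i.e. within
    the range the model saw during training.
    """
    rel = np.full((seq_len, seq_len), -1)          # -1 marks masked (j > i)
    for i in range(seq_len):
        for j in range(i + 1):
            d = i - j
            if i // chunk_size == j // chunk_size:
                rel[i, j] = d                      # intra-chunk: exact distance
            elif d <= local_window:
                rel[i, j] = d                      # cross-chunk locality preserved
            else:
                rel[i, j] = chunk_size - 1         # distant keys: capped distance
    return rel

rel = dca_relative_positions(seq_len=12, chunk_size=4, local_window=2)
assert rel[rel >= 0].max() < 4                     # never exceeds trained window
```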
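
The "Vertical-Slash" selection can likewise be sketched. The idea, following MInference, is to probe a short suffix of queries against all keys, then keep the key columns (vertical lines) and query-key offsets (slash diagonals) that accumulate the most attention mass; full attention is then computed only at those positions plus a local window. The probe length and top-k values below are illustrative, not the report's settings.

```python
import torch

def vertical_slash_indices(q: torch.Tensor, k: torch.Tensor,
                           last_q: int = 64, top_v: int = 16, top_s: int = 16):
    """Pick a Vertical-Slash sparse pattern for one attention head.

    q, k: (seq_len, head_dim). Returns the key indices (verticals) and
    query-key offsets (slashes) that collect the most attention mass in
    a cheap probe using only the last `last_q` queries.
    """
    n, d = q.shape
    probe_rows = torch.arange(n - last_q, n)                    # probed query ids
    scores = q[-last_q:] @ k.T / d ** 0.5                       # (last_q, n)
    causal = probe_rows.unsqueeze(1) >= torch.arange(n).unsqueeze(0)
    probe = torch.softmax(scores.masked_fill(~causal, float("-inf")), dim=-1)

    # Vertical lines: keys receiving high mass across the probe queries.
    vertical = probe.sum(dim=0).topk(top_v).indices

    # Slash lines: diagonals (constant i - j offsets) with high mass.
    # Masked (non-causal) cells carry zero mass, so clamping is harmless.
    offsets = probe_rows.unsqueeze(1) - torch.arange(n).unsqueeze(0)
    diag_mass = torch.zeros(n)
    diag_mass.index_add_(0, offsets.clamp(min=0).flatten(), probe.flatten())
    slash = diag_mass.topk(top_s).indices
    return vertical, slash

q, k = torch.randn(1024, 64), torch.randn(1024, 64)
v_idx, s_idx = vertical_slash_indices(q, k)        # 16 columns, 16 diagonals
```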

Evaluation and Performance

  • Long-Context Benchmarks: The Qwen2.5-1M models are evaluated on demanding tasks such as Passkey Retrieval over 1M-token documents and extended versions of RULER and LV-Eval. Notably, the Qwen2.5-14B-Instruct-1M variant achieves accuracy above 90% on 128K-token samples, and retrieval accuracy remains competitive even in the 1M-token regime, demonstrating the effectiveness of the long-context training and inference strategies (a minimal passkey-retrieval harness is sketched after this list).
  • Short-Context Performance: Despite the focus on extended context lengths, the models maintain competitive performance on standard benchmarks for natural language understanding, coding, mathematics, and reasoning. This balanced performance is crucial to ensuring that the enhancements for long-context processing do not impair the baseline capabilities.
  • Inference Speed Gains: Extensive speed comparisons on GPUs (including NVIDIA H20 and A100) show dramatic reductions in inference latency. For example, the Qwen2.5-14B-Instruct-1M model reduces its prefill time from approximately 12 minutes with full attention to around 109 seconds with sparse attention and optimized kernels, roughly a 6.6× speedup.
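
For reference, a passkey-retrieval evaluation of the kind cited above is typically built by hiding a random passkey inside long filler text and checking whether the model can recover it. The harness below is a minimal sketch; `model_generate` is a hypothetical stand-in for any model call, and the filler text and prompt wording are assumptions rather than the report's exact setup.

```python
import random

FILLER = "The grass is green. The sky is blue. The sun is yellow. "

def build_passkey_prompt(total_words: int, passkey: int) -> str:
    """Hide a passkey sentence at a random depth inside filler text."""
    needle = f"The pass key is {passkey}. Remember it. "
    filler_units = total_words // len(FILLER.split())
    insert_at = random.randint(0, filler_units)
    body = FILLER * insert_at + needle + FILLER * (filler_units - insert_at)
    return body + "\nWhat is the pass key mentioned above?"

def evaluate(model_generate, trials: int = 10, total_words: int = 5000) -> float:
    """Fraction of trials where the model's answer contains the passkey.

    `model_generate` is a hypothetical callable: prompt str -> answer str.
    """
    hits = 0
    for _ in range(trials):
        passkey = random.randint(10_000, 99_999)
        prompt = build_passkey_prompt(total_words, passkey)
        if str(passkey) in model_generate(prompt):
            hits += 1
    return hits / trials

# Toy check with a fake "model" that simply echoes the whole prompt.
print(evaluate(lambda prompt: prompt, trials=3, total_words=600))  # -> 1.0
```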

In summary, the report presents a comprehensive framework for extending the context length of LLMs to one million tokens by innovating across the data synthesis, training, and inference pipelines. The work demonstrates that with carefully designed progressive training, specialized synthetic data, and advanced inference optimizations, it is feasible to significantly expand the operational context window without compromising accuracy or short-sequence performance.

Authors (28)
  1. An Yang
  2. Bowen Yu
  3. Chengyuan Li
  4. Dayiheng Liu
  5. Fei Huang
  6. Haoyan Huang
  7. Jiandong Jiang
  8. Jianhong Tu
  9. Jianwei Zhang
  10. Jingren Zhou
  11. Junyang Lin
  12. Kai Dang
  13. Kexin Yang
  14. Le Yu
  15. Mei Li
  16. Minmin Sun
  17. Qin Zhu
  18. Rui Men
  19. Tao He
  20. Weijia Xu