Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

Published 13 May 2026 in cs.CV | (2605.13831v1)

Abstract: Long-context modeling is becoming a core capability of modern large vision-LLMs (LVLMs), enabling sustained context management across long-document understanding, video analysis, and multi-turn tool use in agentic workflows. Yet practical training recipes remain insufficiently explored, particularly for designing and balancing long-context data mixtures. In this work, we present a systematic study of long-context continued pre-training for LVLMs, extending a 7B model from 32K to 128K context with extensive ablations on long-document data. We first show that long-document VQA is substantially more effective than OCR transcription. Building on this observation, our ablations further yield three key findings: i) for sequence-length distribution, balanced data outperforms target-length-focused data (e.g., 128K), suggesting that long-context ability requires generalizable key-information retrieval across various lengths and positions; ii) retrieval remains the primary bottleneck, favoring retrieval-heavy mixtures with modest reasoning data for task diversity; and iii) pure long-document VQA largely preserves short-context capabilities, suggesting that instruction-formatted long data reduces the need for short-data mixing. Based on these findings, we introduce MMProLong, obtained by long-context continued pre-training from Qwen2.5-VL-7B with only a 5B-token budget. MMProLong improves long-document VQA scores by 7.1% and maintains strong performance at 256K and 512K contexts beyond its 128K training window, without additional training. It further generalizes to webpage-based multimodal needle retrieval, long-context vision-text compression, and long-video understanding without task-specific supervision. Overall, our study establishes a practical LongPT recipe and an empirical foundation for advancing long-context vision-LLMs.

Abstract PDF Upgrade to Chat

Authors (12)

Summary

The paper introduces continued pre-training (LongPT) to extend LVLM context from 32K to 128K tokens, achieving a 7.1% absolute improvement in VQA scores.
It leverages a long-document VQA synthesis pipeline and a balanced extraction-to-reasoning ratio to enhance retrieval and multimodal reasoning capabilities.
The approach preserves short-context performance while generalizing robustly across diverse tasks and even operating beyond its training context window.

Effective Training of Long-Context Vision-LLMs with Generalization Beyond 128K Context

Introduction

Scaling large vision-LLMs (LVLMs) to handle long multimodal contexts is key for enabling advanced applications such as long-document understanding, extended video analysis, and agentic workflows that require persistent context management. The methodological core of this study is a systematic investigation into continued pre-training (“LongPT”) for extending the context window of an open-weight LVLM (Qwen2.5-VL-7B) from 32K to 128K tokens, aiming to delineate how data composition, mixture strategy, and training protocols affect long-context generalization.

Design and Curation of Long-Context Multimodal Data

Document Pool and Token-Length Distribution

The success of long-context capability in LVLMs critically depends on constructing and curating large-scale, diverse, high-fidelity multimodal document pools. The pool used here comprises over 1.5 million PDF documents, spanning numerous domains such as science, engineering, and the social sciences. The domain distribution is illustrated in Figure 1.

Figure 1: The document domain distribution of the document pool used for data synthesis.

These documents are rendered into images and parsed using an OCR expert model that produces fine-grained, layout-aware blocks, facilitating granular and structure-aware downstream data synthesis.

Long-Document VQA Synthesis Pipeline

The authors identify long-document visual question-answering (VQA) as a superior data synthesis strategy relative to dense OCR transcription. The long-document VQA pipeline takes document segments (8–15 pages), generates targeted QA pairs using an LVLM (with evidence page annotation), and embeds the QA back in the full document context, creating instances where answer evidence is localized but retrieval requires global context scanning.

The distribution of the VQA training data in the pool-native setup is detailed in Figure 2.

Figure 2: Long-document VQA data in the pool-native distribution with lengths spanning 32K–128K tokens.

Three VQA task classes are synthesized: (i) single-page extraction, (ii) multi-page extraction, and (iii) reasoning, with the latter requiring non-trivial aggregation or computation across multiple pages or document modalities.

Comparison of Supervision Sources

Direct comparison demonstrates that long-document VQA supervision leads to substantially higher downstream VQA performance than OCR transcription, even when the latter is followed by short-context SFT. The instruction-formatted, evidence-localized, and task-diverse signals in VQA enable large retrieval and reasoning gains, and, critically, show that high-quality long-context data can preserve short-context capabilities even in the absence of explicit short-context mixing.

Training Methodology

The model is extended to 128K using the Dynamic-NTK approach for scaling RoPE base frequency and trained under a fixed 5B-token budget. Experiments focus on three axes:

Sequence-Length Distribution

A key finding is that balanced, naturally occurring (“pool-native”) sequence length distributions outperform long-biased distributions where the majority of sequences are near the upper context bound (≥100K tokens). Empirically, generalizability across arbitrary lengths and positions is superior, with pool-native sampling yielding up to 1.7 points higher VQA scores in some settings.

Multi-Task Long-Context Mixtures

Grid search over extraction-to-reasoning ratios reveals that moderately extraction-heavy balancing (8:2 extraction to reasoning) produces optimal long-context performance. This supports the claim that key-information retrieval, not purely complex reasoning, is the limiting factor for long-context LVLMs, but modest reasoning data is necessary to provide regularization and avoid overfitting to pure retrieval.

Short-Context Data Mixing

Contrary to practices in LLM long-context pre-training, ablations show that pure long-document VQA training preserves short-context skills with negligible performance degradation (∼1 point drop on short-context benchmarks), given that the data itself is instruction-following and diverse.

Empirical Results

Long-Document VQA

The final recipe yields 7.1% absolute improvement over the Qwen2.5-VL-7B baseline (from 50.59 to 57.70 average VQA score) in the 64K–128K range, outperforming open-source LVLMs up to 14B parameters and rivaling some 32–72B models. Notably, the model generalizes beyond its training context window, with only mild performance decay at 256K and 512K contexts (retaining over 52% macro-average scores at 512K, compared to baseline collapse).

Generalization to Diverse Multimodal Long-Context Tasks

The model demonstrates robust transfer to various evaluation settings:

On the MM-NIAH benchmark (webpage-based multimodal needle retrieval), average performance more than doubles the 7B baseline (from 20.0 to 49.4 at 128K).
On VTCBench (long-context vision-text compression), overall scores jump by nearly 5 points, with improvements in reasoning and memory under compressed contexts.
For long-video understanding (Video-MME, MLVU, LongVideoBench), the approach yields consistent gains over strong baselines without using video-specific data.

Practical and Theoretical Implications

This work provides a data-efficient blueprint for extending LVLM context beyond 128K using a modest (5B-token) budget, emphasizing that:

Instruction-centric, retrieval-heavy, and heterogeneous VQA signals are the most effective for robust long-context generalization.
Long-context ability is not a discrete feature acquired only at the maximum training length; it emerges from broad calibration over variable sequence lengths and positions.
Short-context capability is retained without explicit mixture if the long-context data is well-formatted and diverse. This calls into question prevailing practices of maintaining high proportions of short data during continued pre-training.
The recipe demonstrates strong backbone transferability: when applied to Qwen3-VL-8B (already containing native long-context training), performance still improves on unseen evaluation, indicating architectural robustness.

Future Directions

Open directions include scaling these ablation findings to larger model sizes (e.g., 30B, 70B LVLMs), pushing context length to 1M or more, and devising lower-cost, robust evaluation protocols capable of handling the diversity of open-ended multimodal tasks.

Conclusion

This study provides clear empirical and methodological evidence for how to extend LVLMs to very long contexts (128K and beyond), achieving substantial improvements on document VQA, retrieval, and multimodal reasoning tasks, while maintaining or improving generalization. The data synthesis and mixture strategies developed here set a rigorously validated foundation for future multimodal long-context model design and training (2605.13831).

Markdown Report Issue