Docopilot: Long-Context Multimodal Document Understanding

Updated 4 July 2026

Docopilot is a native long-context multimodal model designed for comprehensive document-level understanding using both text and image inputs.
It employs a ViT–MLP–LLM architecture enhanced by multimodal data packing, Ring Attention, and Liger Kernel for efficient cross-page reasoning.
Evaluations demonstrate that Docopilot outperforms traditional RAG methods in multi-page tasks while preserving document continuity and reducing latency.

Docopilot is a native long-context multimodal model for document-level understanding, introduced together with Doc-750K, a document-level dataset designed for complex, multi-page comprehension without retrieval-augmented generation (RAG). It targets a specific weakness of multimodal LLMs: unreliable reasoning over long, structured documents with cross-page dependencies, multi-turn follow-up questions, and mixed text–image evidence. The system is positioned as an end-to-end alternative to document RAG pipelines. Its central claim is that high-quality document-level supervision and long-context multimodal modeling can preserve document continuity better than fragmented retrieval contexts, while avoiding retrieval-stage latency and error accumulation (Duan et al., 19 Jul 2025).

1. Problem setting and motivation

Docopilot addresses document-level multimodal understanding rather than image-level or single-page document QA. The motivating problem is that most open multimodal LLMs are trained primarily on image-level tasks and degrade beyond approximately 8K tokens, which limits reasoning over multi-page documents and multi-turn interactions. The paper identifies long scientific and technical documents as a particularly difficult regime because answers may depend on cross-page references, backward queries across sections, counting across pages, and section-specific retrieval within super-long inputs (Duan et al., 19 Jul 2025).

A central design decision is the rejection of retrieval-augmented generation as the primary solution for this setting. Three limitations are stated explicitly. First, retrieval can fragment the original document context and destroy document structure. Second, multi-stage pipelines accumulate errors across retrieval and generation. Third, retrieval introduces extra time costs that are particularly unfavorable for multi-turn production use. The model is therefore constructed to process the document in one pass, using multimodal inputs and an extended context window rather than a retrieve-then-generate decomposition.

The paper attributes the main bottleneck not to the absence of more elaborate retrieval systems, but to the lack of sufficiently large and high-quality document-level multimodal supervised fine-tuning data. Existing multi-page datasets are described as either injecting irrelevant synthetic questions or underrepresenting cross-page dependency coverage. This framing is essential to the project: Docopilot is not presented merely as an architecture, but as the joint outcome of a new dataset, a long-context training recipe, and inference optimizations.

2. Doc-750K dataset

Doc-750K is the data substrate on which Docopilot is trained. It aggregates multimodal documents predominantly from OpenReview, arXiv, and Sci-Hub, with content concentrated in scientific articles and reviews containing text, figures, tables, charts, and forms. The dataset provides two input modalities. In the interleaved text-image format, text is parsed directly from PDF or HTML using MinerU and image references are inserted inline as sequences such as "<text>\n<image>\n…". In the multi-image format, each page is rendered as a page image in sequences such as "<image>\n<image>\n…". The first format prioritizes text semantics; the second preserves layout, pagination, and visual structure (Duan et al., 19 Jul 2025).
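The two input layouts can be illustrated with a minimal sketch. It assumes that `<image>` is a literal placeholder string and that segments are joined with newlines, as the quoted patterns suggest; the function names and the segment dictionary layout are hypothetical and are not taken from the Doc-750K tooling.

```python
from typing import Dict, List

IMAGE_TOKEN = "<image>"  # placeholder string, as quoted in the format description


def build_interleaved(segments: List[Dict[str, str]]) -> str:
    """Interleaved text-image format: parsed text with inline image placeholders.

    Each segment is {"type": "text", "value": "..."} or {"type": "image"}
    (a hypothetical layout chosen for this sketch).
    """
    parts = []
    for seg in segments:
        if seg["type"] == "text":
            parts.append(seg["value"])
        elif seg["type"] == "image":
            parts.append(IMAGE_TOKEN)
        else:
            raise ValueError(f"unknown segment type: {seg['type']}")
    return "\n".join(parts)


def build_multi_image(num_pages: int) -> str:
    """Multi-image format: one placeholder per rendered page, in page order."""
    return "\n".join([IMAGE_TOKEN] * num_pages)
```

For a three-page document, `build_multi_image(3)` yields three placeholders separated by newlines, while `build_interleaved` keeps the parsed text and marks only where figures occur.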

The reported scale is 758K questions across 251K conversations, with 3.1M document images. Of these conversations, 87K are multi-turn and 164K are single-turn. The average conversation contains 11,245 text tokens and 6,178 image tokens. The dataset also emphasizes document realism: 31.6% of QA pairs are described as “real”, in-depth pairs derived from the original documents rather than from injected or generic prompts. Only 4.8% of the total data is synthetic; in a reviewed sample of 500 synthetic QA pairs, 498 were reported as relevant (Duan et al., 19 Jul 2025).
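A few derived quantities follow from the reported figures. The sketch below only restates arithmetic on the rounded numbers above, so the results are approximate; the variable names are illustrative.

```python
# Reported figures (rounded to thousands in the source).
questions = 758_000
conversations = 251_000
multi_turn, single_turn = 87_000, 164_000
text_tokens, image_tokens = 11_245, 6_178   # averages per conversation
relevant, sampled = 498, 500                # synthetic QA relevance check

# The multi-turn and single-turn counts add up to the conversation total.
turn_total = multi_turn + single_turn

# Roughly three questions per conversation on average.
questions_per_conversation = questions / conversations

# Average total context per conversation and the share contributed by images.
avg_total_tokens = text_tokens + image_tokens
image_token_share = image_tokens / avg_total_tokens

# Relevance rate of the reviewed synthetic sample.
relevance_rate = relevant / sampled

print(f"questions per conversation: {questions_per_conversation:.2f}")
print(f"average tokens per conversation: {avg_total_tokens}")
print(f"image token share: {image_token_share:.1%}")
print(f"synthetic relevance rate: {relevance_rate:.1%}")
```

The average conversation of about 17.4K tokens already sits well above the roughly 8K-token regime in which open multimodal LLMs are said to degrade.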

The dataset’s composition and annotation procedures are heterogeneous but rule-governed. For OpenReview, “reliable QA” review-reply pairs are extracted and formatted into conversations. For arXiv and Sci-Hub, deterministic rules aligned with document hierarchy are used to compile tasks such as abstract writing, paper titling, caption writing for figures and tables, experiments writing, and translation. For other documents, GPT-4o generates QA pairs under a strict prompt requiring specificity, figure references such as “Figure 1,” and explicit grounding to the original text. LLM-generated content is tagged in metadata.

Cross-page dependency is one of the defining dataset properties. The paper highlights multi-page reading, backward queries, multi-page counting, and section-specific retrieval within super-long documents, and supplements these claims with qualitative examples. Domain coverage is nonetheless skewed: academic papers account for approximately 32.6% of the total, and overall multimodal data accounts for approximately 88.8%. The paper further notes that bounding boxes and reading order are not provided as metadata; instead, the multi-image representation preserves layout visually and delegates OCR to the model, while the interleaved representation preserves semantics but sacrifices some layout fidelity.

Property	Value	Notes
Questions	758K	Across 251K conversations
Conversations	251K	87K multi-turn, 164K single-turn
Document images	3.1M	Multi-image pages
Average tokens/conversation	11,245 text; 6,178 image	Reported in Table 1
“Real” QA pairs	31.6%	Derived from original documents
Synthetic share	4.8%	Explicitly marked

A plausible implication is that Doc-750K functions as a proxy-task mixture for document-scale supervision rather than as a single benchmark-style QA corpus. That interpretation is supported by the coexistence of review-reply dialogue, summarization-like tasks, captioning, translation, and question answering within the same training source.

3. Model architecture and long-context training

Docopilot follows a ViT–MLP–LLM pattern: a pre-trained Vision Transformer, a two-layer MLP projector that aligns visual features to the LLM token space, and a pre-trained LLM. Two concrete variants are reported. Docopilot-2B combines InternViT-300M with InternLM2-1.8B. Docopilot-8B combines InternViT-300M with InternLM2.5-7B (Duan et al., 19 Jul 2025).

The model does not introduce a dedicated hierarchical document encoder, sparse attention mechanism, or explicit 2D layout embeddings. Instead, page order and layout are modeled implicitly through multi-image input and concatenation. Cross-page reasoning is delegated to standard Transformer attention in a long context, while systems optimizations make that regime tractable. The paper is explicit that attention remains standard Transformer attention with $O(n^2)$ complexity in sequence length; Ring Attention reduces memory pressure by distributing key-value blocks across devices but does not change that asymptotic cost.
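The point about memory versus complexity can be made concrete with a single-process sketch of blockwise attention. This is not the Ring Attention implementation: there are no devices and no communication, only one query vector consuming key-value blocks one at a time with a running softmax. The function names are illustrative.

```python
import math
from typing import Iterable, List, Sequence, Tuple

Vector = List[float]


def _scores(q: Vector, keys: Sequence[Vector]) -> List[float]:
    """Scaled dot-product scores of one query against a set of keys."""
    scale = math.sqrt(len(q))
    return [sum(a * b for a, b in zip(q, k)) / scale for k in keys]


def full_attention(q: Vector, keys: Sequence[Vector], values: Sequence[Vector]) -> Vector:
    """Reference result: softmax over all keys at once."""
    scores = _scores(q, keys)
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


def blockwise_attention(
    q: Vector, kv_blocks: Iterable[Tuple[Sequence[Vector], Sequence[Vector]]]
) -> Vector:
    """Same result, but only one key-value block is held at a time.

    Only a running maximum, a running normalizer, and a running weighted sum
    are carried between blocks.
    """
    running_max = -math.inf
    normalizer = 0.0
    acc: List[float] = []
    for keys, values in kv_blocks:
        scores = _scores(q, keys)
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)  # 0.0 on the first block
        weights = [math.exp(s - new_max) for s in scores]
        if not acc:
            acc = [0.0] * len(values[0])
        acc = [rescale * a + sum(w * v[j] for w, v in zip(weights, values))
               for j, a in enumerate(acc)]
        normalizer = rescale * normalizer + sum(weights)
        running_max = new_max
    return [a / normalizer for a in acc]
```

Every query still scores every key, so the total work stays quadratic in sequence length; what changes is that no step needs the full key-value set in memory at once.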

Three implementation components are central. Multimodal Data Packing packs multiple samples into long sequences to reduce padding and balance compute between vision and language modules while enforcing sample-local attention isolation during training. Packing respects thresholds on image count $T_{\text{img}}$, token count $T_{\text{tok}}$, and maximum number of samples $M$ per packed item. A priority-queue packing strategy sorts samples by image and token counts and merges them when thresholds are respected. Ring Attention partitions sequences into blocks across devices and overlaps communication of key-value blocks with attention computation. Liger Kernel provides Triton kernels for kernel fusion, in-place operations, and chunking to improve throughput and reduce memory use.
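The packing step can be sketched as a greedy procedure over a priority queue. This is an illustrative reading of the description, not the paper's algorithm: the class and function names are hypothetical, the heap is keyed on token count, and only the least-filled pack is tried before a new one is opened. The mask helper additionally assumes causal attention within each sample.

```python
import heapq
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sample:
    sample_id: int
    num_images: int
    num_tokens: int


@dataclass(order=True)
class Pack:
    num_tokens: int = 0   # heap key: the least-filled pack (by tokens) comes first
    num_images: int = 0
    samples: List[Sample] = field(default_factory=list, compare=False)

    def fits(self, s: Sample, t_img: int, t_tok: int, max_samples: int) -> bool:
        return (self.num_images + s.num_images <= t_img
                and self.num_tokens + s.num_tokens <= t_tok
                and len(self.samples) < max_samples)

    def add(self, s: Sample) -> None:
        self.samples.append(s)
        self.num_images += s.num_images
        self.num_tokens += s.num_tokens


def pack_samples(samples: List[Sample], t_img: int, t_tok: int, max_samples: int) -> List[Pack]:
    """Greedily merge samples into packs without exceeding T_img, T_tok, or M."""
    heap: List[Pack] = []
    # Largest samples first, ordered by image count and then token count.
    for s in sorted(samples, key=lambda x: (x.num_images, x.num_tokens), reverse=True):
        if s.num_images > t_img or s.num_tokens > t_tok:
            raise ValueError(f"sample {s.sample_id} exceeds the thresholds on its own")
        # Simplification: only the least-filled pack is tried before opening a new one.
        if heap and heap[0].fits(s, t_img, t_tok, max_samples):
            pack = heapq.heappop(heap)
        else:
            pack = Pack()
        pack.add(s)
        heapq.heappush(heap, pack)
    return heap


def isolation_mask(sample_lengths: List[int]) -> List[List[bool]]:
    """Block-diagonal causal mask: a token attends only to earlier tokens of its own sample."""
    owner = [i for i, n in enumerate(sample_lengths) for _ in range(n)]
    size = len(owner)
    return [[owner[q] == owner[k] and k <= q for k in range(size)] for q in range(size)]
```

The mask is what keeps packed samples from leaking into each other: positions belonging to different samples never attend to one another, even though they share one long sequence.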

Training mixes next-token prediction on interleaved contexts with supervised QA on Doc-750K proxy tasks. The reported generative objective is the standard cross-entropy loss

$L_{QA} = -\sum_t \log p(y_t \mid x).$
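As a minimal numerical sketch of this objective, the function below sums the negative log-probability of each reference answer token. It assumes teacher forcing, meaning that the logits at step $t$ were produced with the document $x$ and the preceding answer tokens as context; that conditioning is a standard assumption spelled out here rather than a detail taken from the paper. The function names are illustrative.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary distribution."""
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(v - top) for v in logits))
    return [v - log_sum for v in logits]


def qa_loss(step_logits: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Sum over answer positions of -log p(y_t | context).

    step_logits[t] holds the vocabulary logits at answer position t;
    targets[t] is the index of the reference token y_t.
    """
    if len(step_logits) != len(targets):
        raise ValueError("one target token is required per step")
    return -sum(log_softmax(logits)[y] for logits, y in zip(step_logits, targets))
```

With uniform logits over a four-token vocabulary, every step contributes $\log 4$, so a three-token answer has loss $3 \log 4$.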

The paper does not report contrastive alignment or InfoNCE in supervised fine-tuning. Both model sizes are fine-tuned for one epoch with batch size 128, AdamW, learning rate $1\times10^{-5}$, a cosine schedule, and weight decay 0.01 for the 2B model and 0.05 for the 8B model. The training limits include maximum sequence length 32K tokens, maximum tile count 24, tile resolution 448, and image threshold 48. Dynamic high-resolution tiling is used to enhance OCR on documents (Duan et al., 19 Jul 2025).
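The cosine schedule can be sketched as follows. The peak learning rate and the weight-decay values come from the text; the absence of warmup, the decay to a floor of zero, and the step count in the usage note are assumptions made for illustration, since the source does not state them.

```python
import math

# Weight decay per model size, as reported.
WEIGHT_DECAY = {"2B": 0.01, "8B": 0.05}


def cosine_lr(step: int, total_steps: int, peak_lr: float = 1e-5, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps.

    min_lr and the lack of warmup are assumptions of this sketch.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

For a hypothetical run of 1,000 optimizer steps, `cosine_lr(500, 1000)` is half the peak rate, and the rate reaches the floor at the final step.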

At inference time, Docopilot accepts both multi-image input and interleaved text-image input. For multi-page benchmarks, adjacent pages are vertically concatenated into a single image, with a maximum total of 18 images. This reduces image patch overhead while preserving page order. The architecture therefore remains comparatively conservative at the modeling level and aggressive at the systems level: the paper’s novelty lies less in a new attention operator than in the pairing of long-context end-to-end processing with document-scale multimodal supervision.
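The page-capping step can be sketched as a grouping problem: choose how many adjacent pages share one concatenated image so that the total stays within the cap. The even grouping rule below is an assumption, since the text states only that adjacent pages are concatenated vertically and that at most 18 images are used. The actual pixel-level concatenation would be done with an imaging library and is not shown.

```python
import math
from typing import List


def group_pages(num_pages: int, max_images: int = 18) -> List[List[int]]:
    """Assign consecutive page indices to at most max_images groups.

    Each group stands for one vertically concatenated image. Pages keep
    their original order within and across groups.
    """
    if max_images <= 0:
        raise ValueError("max_images must be positive")
    if num_pages <= 0:
        return []
    pages_per_image = math.ceil(num_pages / max_images)
    return [
        list(range(start, min(start + pages_per_image, num_pages)))
        for start in range(0, num_pages, pages_per_image)
    ]
```

A 10-page document keeps one page per image, while a 40-page document is grouped three pages at a time into 14 images.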

4. Evaluation and empirical results

The evaluation covers multi-page VQA, interleaved long-context QA, and single-page document QA. Reported benchmarks include MP-DocVQA with ANLS, MMLongBench-Doc with Accuracy and F1 judged by GPT-4o, DocGenome with classification accuracy and edit distances, MM-NIAH with accuracy across Short $(0\text{–}8K]$, Medium $(8K\text{–}32K]$, and Long $(32K\text{–}64K]$ context regimes, and single-page benchmarks DocVQA, ChartQA, and InfoVQA (Duan et al., 19 Jul 2025).
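ANLS (Average Normalized Levenshtein Similarity), the metric used for MP-DocVQA, can be sketched as follows. The sketch uses the common definition with threshold 0.5 and lowercased, stripped strings; official evaluation scripts may normalize answers differently, so this is an approximation rather than the benchmark's scorer.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def answer_score(prediction: str, truth: str, tau: float = 0.5) -> float:
    """1 - normalized edit distance, or 0 when the distance reaches tau."""
    p, t = prediction.strip().lower(), truth.strip().lower()
    longest = max(len(p), len(t))
    if longest == 0:
        return 1.0
    distance = levenshtein(p, t) / longest
    return 1.0 - distance if distance < tau else 0.0


def anls(predictions: Sequence[str], references: Sequence[List[str]], tau: float = 0.5) -> float:
    """Mean over questions of the best score against any accepted answer."""
    if not predictions or len(predictions) != len(references):
        raise ValueError("need one non-empty reference list per prediction")
    scores = [max(answer_score(p, r, tau) for r in refs)
              for p, refs in zip(predictions, references)]
    return sum(scores) / len(scores)
```

The threshold makes the metric tolerant of small recognition slips while giving no credit to answers that are mostly wrong, which suits OCR-heavy document QA.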

The strongest headline result is on MM-NIAH. Docopilot-8B reaches 61.8 overall accuracy versus 42.9 for InternVL2-8B, and surpasses InternVL2-26B at 52.8 while using less than 31% of its inference latency. The Short, Medium, and Long MM-NIAH scores for Docopilot-8B are 71.2, 57.4, and 55.3 respectively. On MMLongBench-Doc, Docopilot-8B obtains Accuracy 28.8 and F1 23.0, compared with 17.4 and 16.5 for InternVL2-8B and 24.2 and 24.5 for InternVL2-8B+RAG. On DocGenome, Docopilot-8B reports Class Acc 93.8, Title ED 2.0, Abstract ED 19.7, Single-Page Acc 53.9, and Multi-Page Acc 51.9; the multi-page improvement over InternVL2-8B is particularly emphasized, since the latter reaches 46.1 on the same metric (Duan et al., 19 Jul 2025).

On MP-DocVQA, Docopilot-8B reaches 81.3 ANLS versus 79.3 for InternVL2-8B; Docopilot-2B reaches 76.2 versus 71.8 for InternVL2-2B. On single-page benchmarks, performance is described as comparable rather than dominant: DocVQA 92.0 versus 91.6, ChartQA 83.3 versus 83.3, and InfoVQA 73.3 versus 74.8 for Docopilot-8B relative to InternVL2-8B. This pattern supports the paper’s claim that the method improves document-level reasoning without sacrificing standard page-level competence.

Benchmark	Docopilot-8B	Comparator
MM-NIAH overall	61.8	InternVL2-8B: 42.9
MMLongBench-Doc Acc / F1	28.8 / 23.0	InternVL2-8B: 17.4 / 16.5
MP-DocVQA ANLS	81.3	InternVL2-8B: 79.3
DocVQA	92.0	InternVL2-8B: 91.6
ChartQA	83.3	InternVL2-8B: 83.3

The ablation on dataset construction is important because it ties performance gains to Doc-750K rather than to architectural scaling alone. Starting from an InternVL2-2B baseline at 10.5 Accuracy and 10.8 F1 on MMLongBench-Doc, adding an open-source SFT corpus without Doc-750K gives 18.4 and 9.4. Adding Sci-Hub yields 18.5 and 15.2; adding arXiv yields 20.5 and 15.5; adding OpenReview for the full Doc-750K gives 21.8 and 16.0. This progressive trend is offered as evidence that the document-level task mixture, especially the inclusion of OpenReview review-reply conversations, contributes materially to long-document performance (Duan et al., 19 Jul 2025).

5. Relation to retrieval, layout-aware models, and later document systems

Docopilot is defined in part by its opposition to RAG-style document processing. The paper formalizes the comparison conceptually as

$T_{\text{total,end2end}} = T_{\text{encode}} + T_{\text{attn}} + T_{\text{decode}}$

for the end-to-end model, versus

$T_{\text{total,RAG}} = T_{\text{retrieve}} + T_{\text{encode}} + T_{\text{attn}} + T_{\text{decode}}$

for RAG-based systems. The empirical latency data on MMLongBench-Doc follows this framing: Docopilot-8B and InternVL2-8B both report 81.0 ms per-token latency, whereas InternVL2-8B+RAG reports 113.4 ms; Docopilot-2B reports 35.9 ms, matching InternVL2-2B and substantially below InternVL2-2B+RAG at 82.9 ms (Duan et al., 19 Jul 2025).
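The overhead implied by these figures can be computed directly. The sketch below only subtracts the reported per-token latencies; attributing the whole difference to the RAG pipeline is an interpretation, since retrieval also changes what the model is given to process.

```python
# Per-token latency in milliseconds on MMLongBench-Doc, as reported.
LATENCY_MS = {
    "Docopilot-8B": 81.0,
    "InternVL2-8B": 81.0,
    "InternVL2-8B+RAG": 113.4,
    "Docopilot-2B": 35.9,
    "InternVL2-2B": 35.9,
    "InternVL2-2B+RAG": 82.9,
}


def rag_overhead(end_to_end: str, rag: str) -> tuple:
    """Return (extra milliseconds, relative increase) of the RAG variant."""
    base, with_rag = LATENCY_MS[end_to_end], LATENCY_MS[rag]
    extra = with_rag - base
    return round(extra, 1), round(extra / base, 3)


extra_8b, rel_8b = rag_overhead("Docopilot-8B", "InternVL2-8B+RAG")
extra_2b, rel_2b = rag_overhead("Docopilot-2B", "InternVL2-2B+RAG")
print(f"8B: +{extra_8b} ms per token ({rel_8b:.1%})")
print(f"2B: +{extra_2b} ms per token ({rel_2b:.1%})")
```

The relative penalty is larger for the smaller model, because a similar absolute overhead is added to a much lower base latency.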

In the broader document-AI landscape, Docopilot occupies a distinct position. Earlier systems such as UDOP unify text, image, and layout explicitly through a Vision-Text-Layout Transformer, token-level bounding boxes, and 2D relative attention bias, and they extend into document generation and editing as well as understanding (Tang et al., 2022). UDoc similarly emphasizes region-level multimodal pretraining, gated cross-attention, and OCR-derived semantic regions (Gu et al., 2022). By contrast, Docopilot deliberately avoids explicit layout embeddings, bounding-box supervision, and new hierarchical attention layers. It relies instead on long-context sequence modeling over page images and interleaved content.

This difference is not merely architectural; it reflects different problem formulations. Layout-aware foundation models such as UDOP are optimized for page-level structure grounding with explicit OCR and coordinate signals. Docopilot is optimized for continuity across pages, conversational context retention, and cross-page reasoning over long sequences. This suggests a methodological trade-off: explicit layout supervision yields stronger page-grounded structural modeling, whereas end-to-end long-context processing may better preserve cross-page coherence when retrieval fragmentation would otherwise intervene.

Subsequent work sharpens this contrast. DocSLM introduces a 2B-parameter small vision-LLM with a Hierarchical Multimodal Compressor and Streaming Abstention for long-document understanding under constrained memory resources. In DocSLM’s comparative table, Docopilot-8B retains the strongest MMLongBench-Doc Accuracy at 28.8, while DocSLM-2B emphasizes token efficiency, constant-memory streaming, and lower latency at 32.1 ms relative to Docopilot-8B’s 81.0 ms. DocSLM also reports that Docopilot-2B uses approximately 3,133 tokens per image and reaches 21.8 on MMLongBench-Doc and 76.2 on MP-DocVQA, providing a later point of reference for Docopilot’s efficiency-accuracy position within the field (Hannan et al., 14 Nov 2025).

A plausible implication is that Docopilot established a baseline for native long-context document understanding, while later systems explored two divergent responses to its limitations: stronger explicit layout modeling on one side, and stronger compression or streaming efficiency on the other.

6. Limitations, deployment, and future directions

The paper identifies two systematic limitations. First, domain coverage is concentrated in academic and scientific documents, which may restrict generalization to business forms, invoices, medical records, and related settings. Second, the two supported input formats impose a trade-off: multi-image inputs skip external OCR and leave text recognition to the model, which can introduce recognition errors, whereas the interleaved text-image format sacrifices layout fidelity (Duan et al., 19 Jul 2025).

The absence of bounding boxes and explicit reading-order metadata is a further structural limitation. Layout is preserved visually in the multi-image format but is not encoded as dedicated structured supervision. The paper also does not specify train/validation/test splits or licensing details for the dataset, although data, code, and models are released through the project repository. Training hardware and compute budget are not reported, beyond the use of Ring Attention and Liger Kernel, which imply memory-constrained large-context optimization.

For deployment, the paper recommends using Docopilot when document continuity matters, specifically for scientific papers, RFCs and specifications, contracts, and other settings where preserving pagination and layout is useful and multi-turn follow-ups require a consistent state. It also recommends controlling context size by concatenating adjacent pages and capping total images; the evaluation configuration uses a cap of 18 images. When text accuracy is primary and layout is secondary, the interleaved format is preferred; when layout is crucial, the multi-image format is preferred.

Future directions named in the paper include extending domain coverage beyond academic documents, incorporating layout-aware training signals such as bounding-box or reading-order embeddings, scaling context further, exploring sparse or hierarchical attention tailored to pages, and combining retrieval with contiguous page-range preservation when corpus breadth requires RAG. These directions indicate that Docopilot is best understood as a baseline for document-level multimodal understanding rather than as a complete endpoint. Its core contribution is to show that a native long-context multimodal model, trained on a high-quality document-level corpus and supported by packing, Ring Attention, and Liger Kernel, can outperform retrieval-based counterparts in multi-page reasoning while maintaining competitive single-page performance (Duan et al., 19 Jul 2025).