Cross-Page Reasoning in Document Intelligence

Updated 25 May 2026

Cross-page reasoning is a paradigm where AI systems synthesize and integrate evidence spread across multiple pages to answer complex queries.
It draws on multimodal inputs such as text, figures, tables, and formulas, and uses structured memory and hierarchical decomposition to manage context.
Benchmarks document measurable gains from these methods, though evidence retrieval under sparsity and context management remain open challenges.

Cross-page reasoning refers to the capacity of computational systems, particularly multimodal LLMs (MLLMs) and vision-language models (VLMs), to locate, link, and synthesize information spread across multiple pages of a document in order to answer complex queries. This paradigm extends beyond single-page visual question answering (VQA) or text-based QA by addressing the challenges of evidence sparsity, semantic drift, multimodal fusion, and computational efficiency that are inherent to long, dense documents such as scientific papers, slide decks, ESG reports, and textbooks. Robust cross-page reasoning enables multi-hop integration over heterogeneous evidence (e.g., text, table, figure, formula) and underpins progress in document intelligence, retrieval-augmented generation, and machine understanding of structured or narrative documents.

1. Formal Definitions and Cross-Page Benchmarking

Several recent benchmarks formalize cross-page reasoning to enable objective evaluation and systematic comparison:

MMESGBench defines a cross-page question as any QA pair for which the set of supporting evidence pages $S(q) \subseteq \mathcal{P}$ satisfies $|S(q)| \geq 2$, with $\mathcal{P}$ denoting all document pages. Accuracy and macro-F1 scores are reported separately for cross-page, single-page ($|S(q)|=1$), and unanswerable ($|S(q)|=0$) questions (Zhang et al., 25 Jul 2025).
MMCR frames cross-page (or cross-source) reasoning in scientific papers as the ability to answer questions requiring multi-modal evidence (text, figures, tables, formulas) distributed across multiple pages or modalities. Evaluation protocols employ category-specific exact match metrics and circular option permutation to guard against answers that are correct only by chance (Tian et al., 21 Mar 2025).
FlipVQA-Miner constructs QA/VQA supervision data with explicit cross-page question–answer and text–figure–solution alignment, tracking how labels are distributed across pages, documents, and modalities (Wong et al., 20 Nov 2025).
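The circular option permutation mentioned for MMCR can be illustrated with a minimal sketch. It assumes the common form of circular evaluation, in which a multiple-choice question counts as correct only if the model selects the right option under every rotation of the option list; MMCR's exact protocol may differ. The names `rotations`, `circular_correct`, and the `predict` callable are hypothetical.

```python
from typing import Callable, Sequence


def rotations(options: Sequence[str]) -> list[list[str]]:
    """Return every circular shift of the option list."""
    n = len(options)
    return [list(options[i:]) + list(options[:i]) for i in range(n)]


def circular_correct(
    question: str,
    options: Sequence[str],
    answer: str,
    predict: Callable[[str, Sequence[str]], str],
) -> bool:
    """Count the question as correct only if the predicted option text
    matches the answer under every rotation of the options."""
    return all(predict(question, opts) == answer for opts in rotations(options))


# Hypothetical position-biased model: it always picks the first option,
# so it is right for one ordering but fails the circular check.
always_first = lambda question, opts: opts[0]
result = circular_correct(
    "Which table reports accuracy?", ["T1", "T2", "T3", "T4"], "T1", always_first
)
```

Under this scheme a model that guesses by position cannot score above zero on a question, which is the sense in which the protocol guards against chance-level performance.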

Such definitions center the notion that the answer requires resolving and integrating at least two distinct document locations or sources.
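The evidence-set definition translates directly into a bucketing rule. The sketch below assumes each question carries its set of supporting page numbers and a correctness flag; the function names and the record layout are illustrative, not part of any benchmark's tooling.

```python
from collections import defaultdict
from typing import Iterable


def category(evidence_pages: set[int]) -> str:
    """Classify a question by the size of its evidence set S(q)."""
    n = len(evidence_pages)
    if n == 0:
        return "unanswerable"
    if n == 1:
        return "single-page"
    return "cross-page"


def accuracy_by_category(records: Iterable[tuple[set[int], bool]]) -> dict[str, float]:
    """records: (evidence_pages, is_correct) pairs, one per question."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for pages, ok in records:
        c = category(pages)
        totals[c] += 1
        hits[c] += int(ok)
    return {c: hits[c] / totals[c] for c in totals}


# Made-up records for illustration only.
records = [({3}, True), ({3, 7}, False), ({2, 5}, True), (set(), True)]
scores = accuracy_by_category(records)
```

Reporting the three buckets separately keeps strong single-page performance from masking weak cross-page performance.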

2. Methodological Foundations and Current Architectures

Both domain-specific and general frameworks have been proposed to realize cross-page reasoning:

Structured Memory and Evidence Spaces

VISOR maintains a persistent evidence space $\mathcal{E}$ comprising distilled observations $e_k^{\mathrm{pre}}, e_k^{\mathrm{post}}$ for each retrieved page $I_k$, injected at every reasoning step to support joint cross-page synthesis. This evidence-centric memory enables the agent to accumulate multi-page evidence and supports context-bounded reasoning via a dynamic sliding window, preventing semantic drift and context overload (Shen et al., 10 Apr 2026).
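A minimal sketch of this pattern, not VISOR's implementation: distilled evidence is kept for the whole episode, while raw interaction turns live in a bounded sliding window. The class and field names are hypothetical, the pre/post observations are modeled simply as two strings per page, and a fixed window size stands in for the dynamic window described above.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    page: int
    pre: str   # stands in for e_k^pre
    post: str  # stands in for e_k^post


class EvidenceContext:
    def __init__(self, window_size: int = 2) -> None:
        self.evidence: list[Evidence] = []                    # persistent, never evicted
        self.recent: deque[str] = deque(maxlen=window_size)   # sliding window of raw turns

    def step(self, raw_turn: str, ev: Optional[Evidence] = None) -> None:
        """Record one reasoning step; old raw turns fall out of the window."""
        self.recent.append(raw_turn)
        if ev is not None:
            self.evidence.append(ev)

    def render(self, question: str) -> str:
        """Build the context injected at every step: all evidence, recent turns only."""
        lines = [f"Question: {question}", "Evidence:"]
        lines += [f"- p{e.page}: {e.pre} | {e.post}" for e in self.evidence]
        lines.append("Recent turns:")
        lines += list(self.recent)
        return "\n".join(lines)


ctx = EvidenceContext(window_size=2)
ctx.step("turn-1 raw", Evidence(1, "title slide", "states fiscal year"))
ctx.step("turn-2 raw")
ctx.step("turn-3 raw", Evidence(4, "revenue chart", "value read from bar"))
prompt = ctx.render("What was revenue in the stated fiscal year?")
```

The point of the split is that context length stays bounded by the window, while observations from early pages remain available for synthesis at later steps.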

Doc-$V^*$ implements a working memory $W_t = \mathrm{Concat}(S_0, S_1, \ldots, S_{t-1})$ to aggregate turn-by-turn evidence summaries, explicitly recording which pages have contributed to the agent's current knowledge state. This facilitates selective attention over a dynamically constructed evidence set and supports coarse-to-fine evidence aggregation via agentic navigation (Zheng et al., 15 Apr 2026).
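The working-memory formula can be read as a plain concatenation of earlier turn summaries. The sketch below assumes each summary records the pages it drew on; `TurnSummary` and both helper functions are illustrative names, and the string format is an arbitrary choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnSummary:
    turn: int
    pages: tuple[int, ...]  # pages that contributed to this summary
    text: str


def working_memory(summaries: list[TurnSummary], t: int) -> str:
    """W_t = Concat(S_0, ..., S_{t-1}): only summaries from turns before t."""
    return "\n".join(
        f"[turn {s.turn} | pages {list(s.pages)}] {s.text}" for s in summaries[:t]
    )


def visited_pages(summaries: list[TurnSummary], t: int) -> set[int]:
    """Pages that have contributed to the knowledge state at turn t."""
    return {p for s in summaries[:t] for p in s.pages}


summaries = [
    TurnSummary(0, (1,), "Thumbnail overview suggests results are on pages 5 and 6."),
    TurnSummary(1, (5, 6), "Table on page 5 gives the baseline; page 6 gives the ablation."),
    TurnSummary(2, (9,), "Appendix confirms the metric definition."),
]
w2 = working_memory(summaries, 2)
```

Here `working_memory(summaries, 0)` is empty, matching the formula's convention that the memory at turn $t$ contains nothing from turn $t$ itself.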

Hierarchical Agentic Decomposition

SlideAgent explicitly decomposes document reasoning into global, page, and element levels, with dedicated agents for each granularity. The knowledge base is constructed in a two-phase pipeline (knowledge construction, then inference), recursively linking summary and fine-grained evidence into a structured form conducive to multi-hop cross-page inference (Jin et al., 30 Oct 2025).
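One way to picture a three-level knowledge base is as a small tree of global, page, and element nodes. The sketch below is an assumption-laden illustration rather than SlideAgent's data model: all class names are hypothetical, and simple keyword matching stands in for the agents that would select evidence at inference time.

```python
from dataclasses import dataclass, field


@dataclass
class ElementNode:
    kind: str         # e.g. "chart", "table", "text"
    description: str


@dataclass
class PageNode:
    number: int
    summary: str
    elements: list[ElementNode] = field(default_factory=list)


@dataclass
class KnowledgeBase:
    global_summary: str
    pages: list[PageNode] = field(default_factory=list)


def gather(kb: KnowledgeBase, terms: list[str]) -> dict:
    """Collect evidence at all three levels for a query (keyword match only)."""
    lowered = {t.lower() for t in terms}

    def hit(text: str) -> bool:
        return any(t in text.lower() for t in lowered)

    pages = [
        p for p in kb.pages
        if hit(p.summary) or any(hit(e.description) for e in p.elements)
    ]
    return {
        "global": kb.global_summary,
        "pages": [p.number for p in pages],
        "elements": [
            (p.number, e.description)
            for p in pages for e in p.elements if hit(e.description)
        ],
    }


kb = KnowledgeBase(
    "Quarterly results deck",
    [
        PageNode(2, "Revenue overview", [ElementNode("chart", "Revenue by region bar chart")]),
        PageNode(5, "Hiring plan", [ElementNode("table", "Headcount table")]),
        PageNode(7, "Outlook", [ElementNode("text", "Revenue guidance for next year")]),
    ],
)
found = gather(kb, ["revenue"])
```

Because the tree is built before any question arrives, the same structure can serve many queries, and a single lookup can return evidence from non-adjacent pages.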

Multimodal Global-Local Attention

GRAM couples single-page encoders with stacked document-level “doc token” layers, using decaying attention bias to promote global token interaction. Tokens from all pages are jointly propagated through document sub-layers, facilitating global message passing while maintaining page-local representations. Optional compression transformers (C-Former) enable scalable decoding for long input sequences (Blau et al., 2024).
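To make the idea of a decaying attention bias concrete, the sketch below adds a linear distance penalty to attention scores before the softmax. The linear form and the `slope` parameter are assumptions chosen for illustration; the text above does not specify the exact bias GRAM uses.

```python
import math


def decaying_bias(n: int, slope: float = 0.5) -> list[list[float]]:
    """Bias that grows more negative with the distance between positions i and j."""
    return [[-slope * abs(i - j) for j in range(n)] for i in range(n)]


def softmax(row: list[float]) -> list[float]:
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention(scores: list[list[float]], slope: float = 0.5) -> list[list[float]]:
    """Attention weights after adding the distance-decaying bias to raw scores."""
    n = len(scores)
    bias = decaying_bias(n, slope)
    return [
        softmax([scores[i][j] + bias[i][j] for j in range(n)])
        for i in range(n)
    ]


# With uniform raw scores, the weights depend only on distance.
weights = biased_attention([[0.0] * 4 for _ in range(4)])
```

Distant positions still receive nonzero weight, so global interaction remains possible, but nearby positions dominate unless the raw scores say otherwise.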

Coarse-to-Fine Retrieval-Augmented Pipelines

DocR1 and VISOR introduce policies that first recover supporting pages using a coarse retrieval mechanism, then attend locally and globally to relevant content. Reward functions explicitly penalize incomplete or excessive retrieval and reward answer accuracy conditioned on correct evidence chaining (Xiong et al., 10 Aug 2025, Shen et al., 10 Apr 2026).
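A reward of this shape can be sketched with set-based F1 over pages, which lowers the score for both missing and superfluous pages, plus an answer term that only pays out when all gold evidence was retrieved. This is an illustrative assumption, not the EviGRPO objective or VISOR's reward; the function names and the `w_evidence` weight are hypothetical.

```python
def evidence_reward(predicted: set[int], gold: set[int]) -> float:
    """F1 between predicted and gold page sets.

    Missing pages reduce recall (incomplete retrieval);
    extra pages reduce precision (excessive retrieval).
    """
    if not predicted and not gold:
        return 1.0  # design choice: correctly retrieving nothing is rewarded
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def total_reward(
    predicted: set[int],
    gold: set[int],
    answer_correct: bool,
    w_evidence: float = 0.5,
) -> float:
    """Combine evidence quality with answer accuracy conditioned on full evidence."""
    answer_term = 1.0 if (answer_correct and gold <= predicted) else 0.0
    return w_evidence * evidence_reward(predicted, gold) + (1 - w_evidence) * answer_term


complete = total_reward({1, 2}, {1, 2}, answer_correct=True)
incomplete = total_reward({1}, {1, 2}, answer_correct=True)
excessive = total_reward({1, 2, 3, 4}, {1, 2}, answer_correct=True)
```

In this sketch a correct answer reached from incomplete evidence earns no answer credit, which discourages lucky guesses that skip the evidence chain.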

3. Algorithmic Techniques: Retrieval, Memory, and Reasoning Flow

Recent cross-page systems operationalize their reasoning by integrating several algorithmic components:

Component	Purpose	Representative Systems
Evidence Space / Working Memory	Accumulate and reference multi-page summaries	VISOR, Doc-$V^*$
Global-Local Layering (Doc tokens)	Bi-level information exchange, reduce context	GRAM
Hierarchical Agent Decomposition	Structured, query-agnostic multi-level memory	SlideAgent
Coarse-to-Fine Page Selection	Page and element selection, reduce noise	DocR1, Doc-$V^*$
Action Correction/Gating	Crop, retrieval evaluation and correction	VISOR

Retrieval methods include explicit page-level retrievers (ColPali, ColQwen), query expansion via subqueries, and dual retriever + fetch strategies that balance precision and recall (Zheng et al., 15 Apr 2026, Zhang et al., 25 Jul 2025). Intent injection and memory-pinning stabilize agent behavior over long trajectories (Shen et al., 10 Apr 2026).
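Query expansion via subqueries can be sketched as running a page retriever once per subquery and taking the ordered union of the results. The word-overlap scorer below is a toy stand-in for a learned page retriever such as ColPali, and all function names are hypothetical.

```python
def score(query: str, page_text: str) -> int:
    """Toy relevance: number of distinct query words that appear on the page."""
    words = set(page_text.lower().split())
    return sum(1 for w in set(query.lower().split()) if w in words)


def retrieve(query: str, pages: dict[int, str], k: int) -> list[int]:
    """Top-k pages by score, ties broken by page number, zero-score pages dropped."""
    ranked = sorted(pages, key=lambda p: (-score(query, pages[p]), p))
    return [p for p in ranked[:k] if score(query, pages[p]) > 0]


def retrieve_with_subqueries(subqueries: list[str], pages: dict[int, str], k: int) -> list[int]:
    """Union of per-subquery top-k results, preserving first-seen order."""
    seen: list[int] = []
    for q in subqueries:
        for p in retrieve(q, pages, k):
            if p not in seen:
                seen.append(p)
    return seen


pages = {
    1: "revenue 2023 table",
    2: "methodology overview",
    3: "emissions 2023 chart",
}
single = retrieve("revenue emissions 2023", pages, k=1)
expanded = retrieve_with_subqueries(["revenue 2023", "emissions 2023"], pages, k=1)
```

With a budget of one page per query, the single combined query can surface only one of the two evidence pages, while the two subqueries together recover both.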

4. Experimental Evidence, Performance, and Error Analysis

Cross-page reasoning remains a notable bottleneck for current models, with empirical evaluation showing both progress and persistent challenges:

VISOR achieves 72.37% overall and 53.62% multi-hop accuracy on SlideVQA, outperforming strong RAG and agentic baselines by significant margins. Ablation studies confirm that removing the evidence space disables multi-page synthesis, with over 20-point drops, and omitting context management (sliding window + intent) induces severe noise accumulation (Shen et al., 10 Apr 2026).
DocR1, using its EviGRPO objective and curriculum of single- then multi-page training, increases multi-page answer recall to 91.5% on MP-DocVQA and yields +38.8 accuracy points over pre-curriculum baselines. It achieves +10 ANLS improvement on benchmarks requiring explicit cross-page linkage (Xiong et al., 10 Aug 2025).
SlideAgent reports an increase from 67.4% to 77.2% on multi-hop (cross-page) SlideVQA questions, attributing its gains to the agentic decomposition pipeline. Removing the page agent layer drops performance by up to 8.8 points, underlining the necessity of sequential intra- and inter-page reasoning (Jin et al., 30 Oct 2025).
Doc-$V^*$ demonstrates that interactive agentic search outperforms static top-$k$ RAG, with up to +49% accuracy gain on LongDocURL, and ablations quantify component contributions (e.g., +4.9 points for thumbnail overview, +4.7 for page-by-page analysis, +3.4 for working memory) (Zheng et al., 15 Apr 2026).

Despite these advances, major benchmarks such as MMCR and MMESGBench show that cross-page (and cross-source) tasks remain challenging even for state-of-the-art VLMs. GPT-4o, the top proprietary model, achieves only 48.55% accuracy on MMCR, and 20% on multi-table tasks; open-source models generally lag further behind. Multimodal and RAG models substantially outperform text-only baselines, yet exhibit a residual 3–6 percentage point drop from single-page to cross-page questions (Tian et al., 21 Mar 2025, Zhang et al., 25 Jul 2025).

Additional observed phenomena include a negative effect of chain-of-thought prompting on small models in cross-page tasks, errors concentrated in perception and extraction failures, and a disproportionate rate of incorrect or incomplete cross-page evidence selection.

5. Data Construction, Annotation, and Supervision Pipelines

High-fidelity cross-page benchmarks and training corpora depend on annotation, QA extraction, and semantic alignment pipelines:

FlipVQA-Miner employs layout-aware OCR (MinerU), global block-graph parsing, and LLM-driven semantic grouping to mine cross-page QA pairs from textbooks, achieving near-perfect F1 and sub-2% error rates for both text and visual QA alignment (Wong et al., 20 Nov 2025).
MMESGBench leverages a multi-stage, human-AI collaborative validation framework to curate 239 cross-page QA pairs, representing 25.6% of the benchmark and spanning evidence distributed across text, tables, charts, layout, and images (Zhang et al., 25 Jul 2025).
DocR1 annotates evidence pages and validates multi-page alignment through LLM self-verification and curriculum-driven annotation, ultimately supporting fine-grained evidence-supervisory RL objectives (Xiong et al., 10 Aug 2025).

These pipelines serve as both training resources and evaluation yardsticks, revealing model weaknesses in complex question decompositions and multi-modal context coordination.

6. Open Challenges and Future Directions

Core technical obstacles for cross-page reasoning include:

Evidence retrieval under sparsity: Identifying and aggregating sparse, multi-modal evidence accurately and efficiently remains unreliable, particularly when semantic cues are ambiguous or distributed.
Multimodal fusion and layout reasoning: Integrating tabular, chart, and free-text evidence across pages challenges both vision and language subsystems, especially in layout-dependent or multi-hop queries.
Long-range memory and context management: Effective memory is needed to avoid “lost in the middle” and semantic drift; straightforward context extension is inefficient and susceptible to noise accumulation.
Annotation and benchmarking complexity: Robust benchmarks demand fine-grained, multi-step annotation and rigorous validation to ensure true cross-page dependency.
Over-prediction of “unanswerable” classes: Smaller models may avoid challenging cross-page reasoning by defaulting to unanswerability, inflating negative-class performance but reducing informative F1.
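The last item can be made concrete with a small worked example. It uses a generic macro-F1 over a two-way answerable/unanswerable decision, which is an illustrative simplification and not any benchmark's scoring script; the data is made up.

```python
def f1_for(gold: list[str], pred: list[str], label: str) -> float:
    """F1 for one class, treating `label` as the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def recall_for(gold: list[str], pred: list[str], label: str) -> float:
    """Fraction of gold `label` items that were predicted as `label`."""
    relevant = [p for g, p in zip(gold, pred) if g == label]
    return sum(1 for p in relevant if p == label) / len(relevant)


def macro_f1(gold: list[str], pred: list[str]) -> float:
    """Unweighted mean of per-class F1 over the classes present in gold."""
    labels = sorted(set(gold))
    return sum(f1_for(gold, pred, label) for label in labels) / len(labels)


# Three answerable questions, one unanswerable; the model always abstains.
gold = ["answerable", "answerable", "answerable", "unanswerable"]
pred = ["unanswerable"] * 4

negative_recall = recall_for(gold, pred, "unanswerable")
overall_macro_f1 = macro_f1(gold, pred)
```

The always-abstain model reaches perfect recall on the unanswerable class, yet its macro-F1 is 0.2 because it never gets an answerable question right, which is why the negative class should not be read in isolation.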

Future research directions highlighted across the literature include: hierarchical or memory-augmented architectures for global evidence tracking, explicit multi-hop supervision, dynamic and context-adaptive retrieval, and community-driven expansion of high-difficulty cross-page benchmarks (Tian et al., 21 Mar 2025, Zhang et al., 25 Jul 2025).

7. Significance in Document Understanding and Multimodal AI

Cross-page reasoning constitutes a critical capability for AI systems aspiring to match the fidelity and flexibility of human document understanding. It underlies key use cases in scientific literature analysis, enterprise reporting, education (textbook VQA), and regulatory compliance. Systematic progress in this domain will rely on architectural advances in structured memory and retrieval, improved annotation and data pipelines for cross-page benchmarks, and principled evaluation across varied document types and evidence complexity. The progressive gains documented in recent agentic and evidence-aggregation methods demonstrate tangible advances, but substantial headroom remains before robust, scalable, and general cross-page reasoning is achieved across the full spectrum of visually rich, long, and multi-modal documents.