Reason-in-Documents Module: Coarse-to-Fine Reasoning
- The paper introduces a cohesive, two-phase reasoning module that first selects salient document pages in a fast reading phase and then generates detailed answers through focused high-resolution analysis.
- The module integrates low-resolution and high-resolution processing via a unified multimodal Transformer, bridging text and visual features for robust document understanding.
- Empirical results show state-of-the-art performance on challenging benchmarks, demonstrating superior factuality and noise robustness with direct RL training.
A Reason-in-Documents module is a specialized neural (typically Transformer-based) subsystem designed for solving complex reasoning tasks grounded in multi-page, possibly multimodal documents. Its function is to efficiently locate, select, and deeply process relevant document content in order to generate high-fidelity answers to queries that require nontrivial integration of distant information, especially when input size or multimodal detail presents prohibitive computational challenges. Such modules have become integral in settings that demand robust factuality, interpretability, and noise robustness across long-context, visually rich, or logic-intensive documents.
1. Coarse-to-Fine Reasoning Framework
Central to the CogDoc Reason-in-Documents module is a two-phase coarse-to-fine thinking architecture that mirrors human document engagement (Xu et al., 14 Dec 2025). The system first processes the entire document in a lightweight, low-resolution "Fast Reading" mode that rapidly identifies a subset of salient pages. This phase operates over downsampled images or compressed textual content, enabling tractable processing of long documents. The identified candidate pages are then subjected to a high-resolution "Focused Thinking" phase, in which the model attends to full-resolution content and generates detailed reasoning chains and answers. Both phases are implemented through a unified multimodal Transformer policy; the mode switch occurs via a system prompt update, maintaining architectural alignment and state continuity.
Workflow Summary:
| Phase | Input | Operation | Output |
|---|---|---|---|
| Fast Reading | Full document (low-res) + query | Page selection | Relevant page subset, selection trajectory |
| Focused Thinking | Retrieved page subset (high-res) + query | Detailed reasoning/answer | Chain-of-thought, final answer |
The unification of both stages within a single policy is critical for memory and representation sharing, allowing the model to internalize and reuse context across reasoning modes.
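The control flow above can be sketched as a single function over a shared policy object. Everything here is illustrative: `MockPolicy` and its methods (`encode_low_res`, `select_pages`, `encode_high_res`, `generate_answer`) are hypothetical stand-ins for the unified multimodal policy, not an interface from the paper.

```python
class MockPolicy:
    """Toy stand-in for the unified multimodal policy (illustration only)."""
    def encode_low_res(self, page):
        return page.lower()           # crude proxy for downsampling
    def encode_high_res(self, page):
        return page                   # full-detail view
    def select_pages(self, low_res_pages, question):
        # Fast Reading: keep pages sharing a token with the question.
        q = set(question.lower().split())
        return [i for i, p in enumerate(low_res_pages) if q & set(p.split())]
    def generate_answer(self, high_res_pages, question, system_prompt):
        # Focused Thinking: here just concatenates the selected evidence.
        return " | ".join(high_res_pages)

def reason_in_document(policy, pages, question):
    # Phase I: Fast Reading over low-resolution views of every page.
    low = [policy.encode_low_res(p) for p in pages]
    selected = policy.select_pages(low, question)
    # Phase II: Focused Thinking over the selected pages at high resolution;
    # the mode switch is just a system-prompt update on the same policy.
    high = [policy.encode_high_res(pages[i]) for i in selected]
    return policy.generate_answer(high, question,
                                  system_prompt="focused_thinking")
```

The key design point is that both phases call the same `policy` object; only the inputs (low- vs. high-resolution views) and the system prompt change.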
2. Multimodal Transformer Architecture
The neural backbone consists of a sequence-to-sequence multimodal Transformer similar to Qwen2.5-VL (Xu et al., 14 Dec 2025). The architecture incorporates:
- Patch-based Vision Encoder: A ViT-style encoder maps each page (low or high resolution, depending on the phase) into patch embeddings.
- Text Embedding: Input question and prompts are tokenized and embedded.
- Cross-modal Projector: Vision embeddings are aligned to the LLM token space via linear mapping layers.
- Unified Transformer Stack: 12–24 layers support joint attention across text and vision streams, enabling complex interaction between textual clues and visual features.
- Dual Heads: A page-selection head (softmax over page IDs) governs the localization phase, while an autoregressive language-modeling head produces answer tokens during focused reasoning.
Page representation varies by phase: a single low-resolution CLS embedding per page for selection, and dense high-resolution patch embeddings for detailed analysis.
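A minimal numerical sketch of the dual heads, assuming a shared trunk has already produced one pooled embedding per page (Phase I) and one embedding per token position (Phase II); all sizes and random weights below are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16          # shared hidden size (illustrative)
N_PAGES = 8     # pages in the document
VOCAB = 100     # toy vocabulary size

W_page = rng.standard_normal((D,))        # page-selection head weights
W_lm = rng.standard_normal((D, VOCAB))    # language-modeling head weights

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def page_selection_head(page_embs):
    # Phase I: softmax over page IDs, one CLS embedding per page.
    return softmax(page_embs @ W_page)

def lm_head(token_embs):
    # Phase II: next-token distribution at each sequence position.
    return softmax(token_embs @ W_lm)

page_probs = page_selection_head(rng.standard_normal((N_PAGES, D)))
token_probs = lm_head(rng.standard_normal((5, D)))
```

Both heads read from the same trunk representation, which is what lets the two phases share memory and context within one policy.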
3. Reinforcement Learning Formulation
This module is trained end-to-end as a Markov Decision Process (MDP) that encompasses both phases:
- State Space: encapsulates the model-internal state together with the environment state (document, question).
- Action Space: discrete page selection in Phase I; next-token prediction for answer generation in Phase II.
- Transition Function: advances the system state given user input and model output, switching modality via a prompt update when Phase I terminates.
- Rewards: Sparse, terminal-reward regime:
- Localization reward: average of page selection accuracy and recall.
- Reasoning reward: output of an LLM-based evaluator scoring detailed answer fidelity.
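A possible reading of the localization reward (the mean of per-page selection accuracy and recall of the gold pages) can be written as follows; the paper's exact formulation may differ in detail.

```python
def localization_reward(selected, gold, n_pages):
    """Terminal reward for Phase I: mean of per-page selection accuracy
    (treating selection as binary classification over all n_pages) and
    recall of the gold page set. Sketch of the description above."""
    sel, gld = set(selected), set(gold)
    correct = sum((i in sel) == (i in gld) for i in range(n_pages))
    accuracy = correct / n_pages
    recall = len(sel & gld) / len(gld) if gld else 1.0
    return 0.5 * (accuracy + recall)
```

Because the reward is terminal and sparse, it is only computed once the full page-selection trajectory has been emitted.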
Policy optimization employs a clipped-surrogate objective with KL-regularization, of the standard form

$$\mathcal{J}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\right)\right] - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big),$$

where $r_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ is the importance ratio and $\hat{A}_t$ is the advantage.
Batch rewards are standardized; no explicit value baseline is used. An optional entropy bonus encourages exploratory policies.
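The update can be sketched as follows, with batch-standardized terminal rewards used directly as advantages (no learned value baseline) and a sample-based KL estimate; `eps`, `beta`, and the estimator choice are illustrative, not values from the paper.

```python
import numpy as np

def clipped_surrogate_loss(logp_new, logp_old, rewards, eps=0.2, beta=0.01):
    """GRPO-style loss sketch: batch-standardized rewards serve as
    advantages, the importance ratio is clipped to [1-eps, 1+eps],
    and a per-sample KL penalty regularizes toward the old policy."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    # Non-negative "k3" estimate of KL(pi_new || pi_old) per sample.
    kl = np.exp(logp_old - logp_new) - (logp_old - logp_new) - 1.0
    return -(np.minimum(unclipped, clipped).mean()) + beta * kl.mean()
```

When the new and old log-probabilities coincide, the ratio is 1 and the KL term vanishes, so the loss reduces to the negated mean advantage (zero after standardization).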
4. Post-Training Strategy: Direct RL vs. SFT+RL and Policy Conflict
Empirical investigation reveals that supervised fine-tuning (SFT) on phase-specific tasks followed by RL suffers from "policy conflict": the model's shared representation is driven into an imbalanced regime, such that subsequent RL in one phase degrades performance in the other. The conflict is observable as rising perplexity on one phase's validation set while fine-tuning for the alternate phase. Direct RL applied to the pretrained base model, skipping the SFT stage and using mixed-phase batch sampling, circumvents this: the policy jointly discovers representations that are well calibrated for both fast localization and deep reasoning, switching behavior simply by changing the prompt context. Ablation studies demonstrate stable performance in both phases with no antagonistic degradation for direct RL, and superior results compared to SFT+RL (Xu et al., 14 Dec 2025).
5. Algorithmic Implementation
Training Procedure: Each iteration of the direct RL algorithm performs the following steps:
- Sample mini-batches of queries.
- For each sample:
- Run low-res encoding and sample Phase I trajectory (page selection).
- Retrieve selected pages, encode at high-res, then sample Phase II trajectory (answer tokens).
- Compute respective terminal rewards.
- Standardize batch rewards and compute policy gradient updates as per the clipped surrogate objective.
- Apply periodic reference policy update for KL regularization.
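A toy, self-contained run of this loop, reduced to the Phase I page-selection head with a recall-style terminal reward; every class, hyperparameter, and reward here is a simplified stand-in for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyPolicy:
    """Stand-in for the page-selection head of the unified policy."""
    def __init__(self, n_pages=4):
        self.page_logits = np.zeros(n_pages)

    def sample_pages(self, k=2):
        p = np.exp(self.page_logits)
        p /= p.sum()
        # Sample a Phase I trajectory: k distinct pages.
        return rng.choice(len(p), size=k, replace=False, p=p)

    def update(self, chosen, advantage, lr=0.1):
        # REINFORCE-style update: push logits of the chosen pages
        # in the direction of their (standardized) advantage.
        grad = np.zeros_like(self.page_logits)
        grad[chosen] = advantage
        self.page_logits += lr * grad

def train(policy, gold_pages=(0, 1), batch_size=16, steps=200):
    gold = set(gold_pages)
    for _ in range(steps):
        picks, rewards = [], []
        for _ in range(batch_size):
            sel = policy.sample_pages()
            # Terminal reward: recall of the gold pages.
            rewards.append(len(set(sel) & gold) / len(gold))
            picks.append(sel)
        r = np.array(rewards)
        adv = (r - r.mean()) / (r.std() + 1e-8)  # batch standardization
        for sel, a in zip(picks, adv):
            policy.update(sel, a)
    return policy

policy = train(ToyPolicy())
```

In this toy setting the gold pages' logits come to dominate the others, i.e. the policy learns localization from the sparse terminal reward alone, without a supervised warm-up.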
Inference: On a new input, the model deterministically selects pages (greedy sampling), retrieves them at high resolution, and generates an answer, switching modes via a single system-prompt update.
6. Empirical Results and Scope
The CogDoc Reason-in-Documents module, instantiated as a 7B-parameter model, establishes state-of-the-art results within its class on challenging multimodal document benchmarks, surpassing even proprietary large-scale models such as GPT-4o (Xu et al., 14 Dec 2025). Its two-phase, RL-trained, unified approach efficiently bridges the scale–fidelity trade-off that previously hampered long-document reasoning. The framework is directly applicable to settings involving visually-rich and multi-page documents, enabling reliable document localization, robust fine-grained reasoning, and interpretable answer chains via a single adaptable, RL-honed multimodal policy.
For further reading on decoupling retrieval and reasoning capabilities, regime-specific supervision, and error analysis in Reason-in-Documents modules, see DeR2 (Ying et al., 29 Jan 2026); for rubric-driven relevance with RL optimization in document retrieval, Retro* (Lan et al., 29 Sep 2025). These works collectively elaborate the breadth and criticality of Reason-in-Documents modules in modern document-grounded question answering and retrieval-intensive, multi-stage neural pipelines.