UniDoc-RL: Hierarchical RL for Visual Document QA

Updated 4 July 2026

The paper demonstrates that UniDoc-RL integrates coarse retrieval, precise selection, and active cropping via hierarchical reinforcement learning to boost visual document QA accuracy by up to 17.7%.
The framework formulates visual information acquisition as a sequential Thought-Action-Observation process, using stage-specific dense rewards to optimize retrieval, selection, and reasoning.
The unified method improves retrieval recall, crop precision, and final answer correctness by treating evidence acquisition as an interactive, multi-step RL problem.

Searching arXiv for UniDoc-RL and closely related RL/document-reasoning papers to ground the article with current citations. UniDoc-RL is a unified reinforcement learning framework for coarse-to-fine visual retrieval-augmented generation in large vision-LLMs. It addresses document question answering and reasoning settings in which an agent must not only retrieve relevant document images from a large-scale visual corpus, but also refine those candidates, localize evidence inside them, and generate a final answer. The framework formulates visual information acquisition as a sequential decision-making problem with a hierarchical action space spanning retrieval, reranking, active visual perception, and reasoning, and it is trained with a dense multi-reward scheme under Group Relative Policy Optimization (GRPO) (Wang et al., 16 Apr 2026). In this formulation, visual RAG is treated not as a fixed preprocessing pipeline but as an interactive Thought-Action-Observation process in which the model progressively suppresses irrelevant content and attends to information-dense regions (Wang et al., 16 Apr 2026).

1. Conceptual framing and problem setting

UniDoc-RL studies coarse-to-fine visual RAG for large vision-LLMs in a setting with a query $Q$ , a large-scale visual corpus $\mathcal{C} = \{c_1, c_2, \dots, c_N\}$ , and a target answer $y$ . The task is to retrieve relevant images, identify evidence inside them, and generate the final answer (Wang et al., 16 Apr 2026). The motivating assumption is that visually rich documents such as slides, scanned pages, charts, tables, and reports require more than coarse retrieval. In this setting, successful reasoning depends on progressively refining evidence from document-level search to region-level inspection.

The framework is motivated by three stated difficulties in visual RAG. First, retrieval noise is more harmful because visual corpora contain dense content and much redundancy, so coarse similarity search often returns images that are only superficially related. Second, important information is local: a full page or slide may contain many irrelevant regions, so feeding the entire image to the model can waste context and obscure key details. Third, optimization suffers from credit assignment if training relies only on sparse final-answer rewards, because failures can originate in retrieval, selection, cropping, or reasoning (Wang et al., 16 Apr 2026).

This places UniDoc-RL in a different design space from text-only RAG or single-pass document QA. A plausible implication is that the framework treats evidence acquisition itself as the core learning problem rather than as a static upstream component. That emphasis aligns it with broader attempts to unify retrieval and reasoning through RL, although UniDoc-RL is specifically document-visual and action-hierarchical rather than general-purpose RAG-RL (Li et al., 8 Aug 2025).

2. Hierarchical action space and Thought-Action-Observation dynamics

UniDoc-RL introduces a coarse-to-fine hierarchical action space with four explicit stages: Image Search, Precise Selection, Visual Perception, and Answer generation (Wang et al., 16 Apr 2026). The interaction is modeled as a Thought-Action-Observation process in which, at each step $t$ , the policy $\pi_\theta$ produces a thought $T_t$ and action $A_t$ , receives observation $O_t$ , and repeats until it emits an answer.

In the Image Search stage, the model emits a search query using <search> ... </search>, and the environment performs retrieval: $O_t = \text{Search}(q, \mathcal{C}).$ This returns a candidate pool of images or document pages (Wang et al., 16 Apr 2026).

In the Precise Selection stage, the model uses <select> ... </select> to choose relevant images from the retrieved candidates: $O_{t+1} = \text{Select}(O_t, \mathcal{I}),$ where $\mathcal{C} = \{c_1, c_2, \dots, c_N\}$ 0 denotes selected indices. This stage is presented as LVLM-based reranking that narrows the semantic gap left by generic retrieval (Wang et al., 16 Apr 2026).

In the Visual Perception stage, the model performs active region-level perception with <bbox> ... </bbox>, and the environment crops or zooms the specified region: $\mathcal{C} = \{c_1, c_2, \dots, c_N\}$ 1 This is the framework’s active perception mechanism: instead of passively consuming full images, the agent identifies an information-dense subregion and reasons over a higher-resolution crop (Wang et al., 16 Apr 2026).

When sufficient evidence has been collected, the model emits <answer> ... </answer> and terminates. The paper interprets this hierarchy as mirroring human information seeking: search broadens coverage, selection suppresses irrelevant hits, perception zooms into key regions, and reasoning extracts the answer (Wang et al., 16 Apr 2026).

The rollout procedure is iterative. The policy generates a response, the system parses the special action tags, executes one of search, select, crop, or answer, and injects the resulting observation back into the dialogue history as a new “User” message. This insertion strategy is explicitly justified as aligning the interaction format with the pretraining distribution of LVLMs, where visual inputs often appear in user prompts (Wang et al., 16 Apr 2026). This suggests that UniDoc-RL is not only an RL architecture but also a prompt-format engineering choice designed to preserve compatibility with pretrained multimodal conversational behavior.

3. Reward design and GRPO optimization

A central contribution of UniDoc-RL is its dense multi-reward supervision. Rather than relying only on outcome reward from final answer correctness, the framework assigns stage-specific rewards for action syntax, retrieval quality, image selection, region localization, and answer quality. The total reward is

$\mathcal{C} = \{c_1, c_2, \dots, c_N\}$ 2

where $\mathcal{C} = \{c_1, c_2, \dots, c_N\}$ 3 is the pattern or format reward, $\mathcal{C} = \{c_1, c_2, \dots, c_N\}$ 4 is the image retrieval reward, $\mathcal{C} = \{c_1, c_2, \dots, c_N\}$ 5 is the selection reward, $\mathcal{C} = \{c_1, c_2, \dots, c_N\}$ 6 is the visual perception or cropping reward, and $\mathcal{C} = \{c_1, c_2, \dots, c_N\}$ 7 is the answer correctness reward (Wang et al., 16 Apr 2026).

The pattern reward checks whether the agent follows the required action syntax, including proper use of <search>, <select>, and <bbox> tags. The retrieval reward is defined with NDCG over the trajectory’s retrieved candidates: $\mathcal{C} = \{c_1, c_2, \dots, c_N\}$ 8 where $\mathcal{C} = \{c_1, c_2, \dots, c_N\}$ 9 is the merged ranked candidate list over retrieval steps and $y$ 0 is the set of gold relevant images. The trajectory-level candidate list is defined by interleaving candidates from multiple retrieval rounds: $y$ 1 This reward is intended to encourage both retrieval of relevant evidence and favorable ranking of that evidence (Wang et al., 16 Apr 2026).

Selection reward is supervised by whether a chosen image belongs to the relevant set: $y$ 2 with

$y$ 3

The paper identifies a failure mode in which no gold image appears in a retrieved candidate set, making the reward uniformly zero. To address this, it uses a pseudo-supervision strategy that treats the top-ranked retrieval candidate $y$ 4 as a pseudo-positive target in such cases (Wang et al., 16 Apr 2026).

The crop reward is defined by intersection-over-union between predicted and annotated regions: $y$ 5 where $y$ 6 are ground-truth boxes and $y$ 7 are predicted boxes. The answer reward is scored by a model-based reward model,

$y$ 8

and the appendix specifies Qwen2.5-72B-Instruct as a binary judge producing 0/1 correctness (Wang et al., 16 Apr 2026).

Optimization uses GRPO. UniDoc-RL employs a two-stage pipeline consisting of full-parameter supervised finetuning as a warm start, followed by GRPO RL finetuning that optimizes the dense multi-reward objective (Wang et al., 16 Apr 2026). The appendix reports a group size of 5, a KL loss coefficient of 0.01, actor learning rate $y$ 9, one RL epoch, max prompt length 40000, and max response length 1024 (Wang et al., 16 Apr 2026). The paper emphasizes GRPO’s appeal in avoiding a separate value network.

This reward design is notable against a broader methodological backdrop. A critique of many GRPO-style LLM RL systems is that common LLM-as-MDP assumptions can collapse RL into outcome-driven supervised learning when states are token prefixes and terminal reward is uniformly distributed across tokens (Samineni et al., 19 May 2025). UniDoc-RL’s action space and reward structure differ materially from that degenerate setup: it supervises retrieval, selection, and cropping as distinct subtasks with dense rewards tied to external tool interaction and annotated intermediate actions (Wang et al., 16 Apr 2026). This suggests that UniDoc-RL is closer to a genuine sequential control problem than outcome-only RL post-training.

4. Data curation and annotated reasoning trajectories

UniDoc-RL relies on a curated dataset of reasoning trajectories with fine-grained action annotations. The source datasets are SlideVQA, DoubleBench, VisR-Bench, DocBench, and DUDE, covering slides, PDFs, scanned documents, multilingual settings, tables, charts, figures, and long-context document understanding (Wang et al., 16 Apr 2026).

Trajectory synthesis is performed by a strong teacher model, Qwen3-VL-235B, sampling

$t$ 0

The teacher generates search, selection, and crop decisions. For visual perception, the paper describes a more involved pipeline: layout parsing with Mineru generates candidate bounding boxes, the teacher chooses the most useful region among those candidates, and that region is stored as the ground-truth crop action (Wang et al., 16 Apr 2026). This procedure is intended to produce crop annotations more accurate than naive box generation.

The filtering pipeline contains three components. Quality filtering discards trajectories whose final answer is wrong. Difficulty-aware filtering retains examples that are not trivial for an intermediate SFT model. RL data curation keeps challenging but solvable samples where retrieval works but reasoning or perception can still fail (Wang et al., 16 Apr 2026). The paper states that this matters because RL works best on data where improvement is possible but not too easy.

The final curated dataset contains 12,621 samples for SFT and 5,537 samples for RL (Wang et al., 16 Apr 2026). The SFT breakdown is SlideVQA 8271, DoubleBench 1274, VisR-Bench 1518, DocBench 657, and DUDE 901. The RL breakdown is SlideVQA 2613, DoubleBench 993, VisR-Bench 1630, DocBench 301, and DUDE not used for RL in the table (Wang et al., 16 Apr 2026).

This dataset design is important for interpreting the method. UniDoc-RL is not trained from scratch on undifferentiated reward signals; it is initialized with structured trajectories that expose the full action hierarchy. A plausible implication is that the framework depends heavily on annotation quality and curriculum design, which the paper later acknowledges indirectly through its practical caveats (Wang et al., 16 Apr 2026).

5. Experimental setup, baselines, and quantitative results

The framework is evaluated on three benchmarks: SlideVQA, ViDoSeek, and MMLongBench. SlideVQA includes single-hop and multi-hop slide reasoning; ViDoSeek emphasizes extraction and logic on visually rich documents; MMLongBench focuses on long-context document understanding with strong reliance on visual content (Wang et al., 16 Apr 2026). Experiments use two LVLM backbones: Qwen2.5-VL-3B-Instruct and Qwen2.5-VL-7B-Instruct (Wang et al., 16 Apr 2026).

The baseline set includes Vanilla RAG, ReAct, Search-R1(-VL), and VRAG-RL. The paper distinguishes OCR-based RAG methods from purely visual RAG methods and uses an external evaluation model returning binary correctness, with accuracy as the main metric (Wang et al., 16 Apr 2026).

On Qwen2.5-VL-3B-Instruct, the paper reports overall accuracy of 53.5 for VRAG-RL and 71.0 for UniDoc-RL (Wang et al., 16 Apr 2026). On Qwen2.5-VL-7B-Instruct, it reports 57.1 for VRAG-RL and 74.8 for UniDoc-RL (Wang et al., 16 Apr 2026). The paper highlights gains of up to 17.5% on the 3B model and 17.7% on the 7B model over the strongest prior RL baseline (Wang et al., 16 Apr 2026). It states that UniDoc-RL consistently surpasses state-of-the-art baselines across all benchmarks and model sizes.

The interpretation offered in the paper is threefold. OCR-only pipelines are weaker because they lose layout and spatial cues. Purely visual retrieval is better than OCR-only retrieval. RL with hierarchical actions performs best because it jointly improves retrieval quality, candidate filtering, local evidence extraction, and final reasoning (Wang et al., 16 Apr 2026).

In relation to neighboring work, UniDoc-RL differs from CogDoc, which also studies unified document reasoning but organizes it as “Fast Reading” and “Focused Thinking” and reports that direct RL can outperform SFT+RL because SFT induces “policy conflict” between global localization and local grounding (Xu et al., 14 Dec 2025). UniDoc-RL instead uses SFT warm start plus GRPO finetuning and a more explicit action hierarchy of search, select, crop, and answer (Wang et al., 16 Apr 2026). The difference suggests that “unified document RL” is not a single design pattern: some systems unify coarse and fine reasoning through mode switching, whereas UniDoc-RL unifies them through executable hierarchical actions.

6. Ablations, behavioral analysis, and limitations

The action-space ablation studies three variants: no selection and no perception, selection only, and selection plus perception. On Qwen2.5-VL-3B-Instruct, the reported scores are 66.6, 70.0, and 71.0, respectively (Wang et al., 16 Apr 2026). The stated conclusion is that both actions help, with selection giving the larger boost on reasoning-heavy tasks and perception helping most on detail-heavy tasks.

The reward ablation incrementally adds retrieval reward, selection reward, and visual perception reward on top of a vanilla sparse-reward setup. The best results come from the full dense setup, and the paper interprets each reward as aligned with its corresponding subtask: retrieval reward improves candidate quality, selection reward improves semantic filtering, and perception reward improves fine-grained localization (Wang et al., 16 Apr 2026).

The paper also reports behavioral analyses beyond end-task accuracy. Adding the selection step increases retrieval hit rate of ground-truth images from 79.7% to 85.0% on SlideVQA, from 74.8% to 85.7% on ViDoSeek, and from 48.9% to 52.7% on MMLongBench (Wang et al., 16 Apr 2026). It further states that SFT models are conservative, using crop actions too rarely and often defaulting to passive full-image reading, whereas after RL training crop frequency rises and becomes closer to the teacher’s behavior (Wang et al., 16 Apr 2026). Qualitative examples reportedly show that SFT often produces trivial crops close to the full image, while UniDoc-RL learns more precise crops around the true region of interest (Wang et al., 16 Apr 2026). Case studies indicate adaptive strategy: crop when evidence is small and hidden, skip crop when the selected image is already readable (Wang et al., 16 Apr 2026).

These observations are significant because they suggest that the method changes action behavior rather than merely answer distribution. In that respect, UniDoc-RL resembles broader RL systems that seek to learn when to invoke tools, not only how to produce final responses. UR $t$ 1, for example, frames retrieval as a dynamic policy choice and argues that the model should learn to reason directly when it can and search when it must (Li et al., 8 Aug 2025). UniDoc-RL instantiates a comparable principle in the visual-document domain, extending the tool-use policy down to image reranking and crop selection (Wang et al., 16 Apr 2026).

The paper does not foreground limitations, but several are stated or implied. The method depends on high-quality action annotations and curated trajectories. It uses external tools such as search and layout parsing, so performance is partly tool-dependent. The action space and reward design are task-specific and may require re-annotation or reward redesign for new domains. The system relies on a strong reward model for answer correctness, which can introduce evaluation bias. The crop reward assumes access to ground-truth regions during training, which is not always available in real applications (Wang et al., 16 Apr 2026).

Taken together, UniDoc-RL presents visual RAG as a hierarchical RL control problem in which retrieval, semantic filtering, active perception, and reasoning are jointly optimized. Its main empirical claim is that this integration improves retrieval recall, crop precision, tool-use behavior, and final accuracy relative to prior visual RAG and RL baselines (Wang et al., 16 Apr 2026). A plausible broader implication is that document reasoning systems benefit when intermediate evidence-acquisition steps are made explicit, executable, and directly rewarded, rather than treated as latent or purely prompt-induced behavior.