DocLens: Vision-Language & Medical QA
- DocLens is an architectural framework for evidence-based document QA and medical text generation, enabling granular claim decomposition and precise answer selection.
- It utilizes a multi-agent design with specialized OCR, layout detection, and sampling-adjudication modules to isolate and process relevant visual and textual evidence.
- DocLens demonstrates state-of-the-art performance by integrating claim-level evaluation metrics and reinforcement learning to enhance factuality and mitigate hallucinations.
DocLens is an architectural framework and evaluation methodology designed for fine-grained, evidence-based understanding and assessment in both vision-language document QA and medical text generation. Originating in medical NLP as a claim- and citation-level metric suite, DocLens has since been developed into a tool-augmented, multi-agent retrieval and reasoning system for long visual document comprehension. Its methodological core is precise evidence localization, granular claim decomposition, and entailment-driven answer selection, supporting robust performance on vision-centric questions, unanswerable-question detection, and factuality-driven tasks (Zhu et al., 14 Nov 2025; Jhaveri et al., 26 Sep 2025; Xie et al., 2023).
1. Multi-Agent Framework for Long Visual Document Understanding
DocLens addresses the challenge of information dispersion in long, complex documents containing extensive textual and visual elements by decomposing the QA process into specialized multi-agent modules. Its architecture, formalized as the factorization $P(a \mid Q, D) = P(E \mid Q, D)\,P(a \mid Q, E)$ over question $Q$, document $D$, evidence $E$, and answer $a$, splits into:
- Extraction stage: $E = f_{\mathrm{Lens}}(Q, D)$, which distills the document into a compact evidence set conditioned on the question.
- Generation stage: $a = f_{\mathrm{Reason}}(Q, E)$, which produces the final answer from the distilled evidence.
The extraction stage consists of the Lens Module (Page Navigator, Element Localizer); the generation stage consists of the Reasoning Module (Answer Sampler, Adjudicator). The operational workflow is:
- Optical Character Recognition (OCR) is run on all document pages to yield pagewise text segments $t_i$ for pages $p_i$, $i = 1, \dots, N$.
- The Page Navigator agent uses the set $\{t_i\}_{i=1}^{N}$ to select relevant page indices $\mathcal{I}$ through repeated LLM-based sampling.
- The Element Localizer parses each selected page $p_j$, $j \in \mathcal{I}$, detects and crops all salient visual elements (tables, charts, figures), and combines them with the text $t_j$ into the condensed evidence $E$.
- The Answer Sampler agent generates several candidate answer–reason chains from the condensed evidence $E$.
- The Adjudicator agent selects the final answer from the candidate outputs based on consensus scoring (Zhu et al., 14 Nov 2025); the pipeline is sketched below.
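A minimal orchestration sketch of this workflow, assuming the tool and LLM calls (`ocr`, `crop_elements`, `select_pages`, `sample_answer`, `adjudicate`) are supplied as callables; these names are illustrative placeholders, not the system's actual API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Evidence:
    page_texts: dict    # {page index j: OCR text t_j}
    visual_crops: list  # cropped tables/charts/figures from selected pages

def doclens_qa(question: str, pages: list,
               ocr: Callable, crop_elements: Callable,
               select_pages: Callable, sample_answer: Callable,
               adjudicate: Callable, k_nav: int = 8, m_ans: int = 5):
    """End-to-end DocLens-style QA: extraction stage, then generation stage."""
    # Extraction stage (Lens Module)
    texts = {i: ocr(p) for i, p in enumerate(pages)}    # pagewise OCR text t_i
    selected = set()
    for _ in range(k_nav):                              # Page Navigator: union of
        selected |= set(select_pages(question, texts))  # page indices over K samples
    crops = [c for j in selected for c in crop_elements(pages[j])]  # Element Localizer
    evidence = Evidence({j: texts[j] for j in selected}, crops)

    # Generation stage (Reasoning Module)
    candidates = [sample_answer(question, evidence) for _ in range(m_ans)]  # Answer Sampler
    return adjudicate(candidates)                       # Adjudicator picks by consensus
```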
2. Evidence Localization: Page and Element Level
DocLens’s “zoom-in” localization paradigm operates at macro (page) and micro (element) levels to isolate minimal sufficient evidence:
- Page Navigator: Given $t_i$ for each page, repeated LLM sampling (temperature $\tau > 0$, $K$ trials) produces a union of probable evidentiary pages. Large contexts are handled via parallel chunking (sketched below).
- Element Localizer: Each selected page $p_j$ is passed through a layout detector (e.g., MinerU), which identifies bounding boxes for tables, charts, and figures. Visual snippets are cropped to construct the final evidence set $E$ (Zhu et al., 14 Nov 2025).
This explicit two-tier navigation yields high-recall, high-precision evidence selection, critical for mitigating hallucinations and ensuring traceability in vision-centric QA.
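A minimal sketch of the page-level union sampling with parallel chunking, assuming a hypothetical `llm_propose_pages(question, chunk_texts, temperature)` call that returns a list of candidate page indices:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def navigate_pages(question, page_texts, llm_propose_pages,
                   k=8, temperature=1.0, chunk_size=20):
    """Union of page indices proposed across K stochastic LLM samples.

    Long documents are split into `chunk_size`-page chunks so each prompt
    fits the LLM context window; chunks are queried in parallel.
    """
    items = sorted(page_texts.items())
    chunks = [dict(items[i:i + chunk_size]) for i in range(0, len(items), chunk_size)]
    selected = set()
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(llm_propose_pages, question, chunk, temperature)
                   for chunk in chunks for _ in range(k)]
        for f in as_completed(futures):
            selected |= set(f.result())
    return sorted(selected)
```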
3. Sampling–Adjudication Reasoning
The reasoning pipeline consists of stochastic candidate answer sampling and adjudication:
- Answer Sampler: For every question–evidence pair $(Q, E)$, $M$ candidate answer–reason chains $\{(a_m, r_m)\}_{m=1}^{M}$ are generated at sampling temperature $\tau > 0$, each drawn from the LLM distribution $p_\theta(a, r \mid Q, E)$.
- Adjudicator: All candidate chains are reviewed for logical consistency and factual consensus. The final answer is selected as $a^{*} = \arg\max_{m} \operatorname{consensus}(a_m \mid \{a_1, \dots, a_M\})$, where the consensus score is derived via LLM prompting (Zhu et al., 14 Nov 2025).
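As a concrete illustration, the consensus step can be approximated by majority voting over normalized answers; this is a simplified proxy, since the actual Adjudicator is an LLM prompt that also weighs logical consistency:

```python
from collections import Counter

def adjudicate(candidates):
    """Consensus selection over sampled (answer, reason) chains.

    Simple proxy: majority vote over normalized answer strings. The real
    Adjudicator prompts an LLM to weigh factual consensus and logic.
    """
    votes = Counter(answer.strip().lower() for answer, _reason in candidates)
    winner, _ = votes.most_common(1)[0]
    for answer, reason in candidates:   # return first chain matching the consensus
        if answer.strip().lower() == winner:
            return answer, reason
```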
4. Tool Augmentation and System Agents
DocLens employs external tools to preprocess and contextualize evidence for its agents:
- OCR tool (MinerU): Extracts pagewise text segments $t_i$ from page images.
- Layout-Detection tool (MinerU): Identifies visual element locations.
- Crop tool: Segments visual elements from pages.
The Page Navigator and Element Localizer orchestrate these tools, while the Answer Sampler and Adjudicator consume the distilled evidence and reason purely in the LLM domain (Zhu et al., 14 Nov 2025).
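A sketch of the tool interfaces the agents orchestrate; the class and method names below are illustrative assumptions, not MinerU's actual API:

```python
from typing import Protocol

class OCRTool(Protocol):
    def extract_text(self, page_image: bytes) -> str: ...

class LayoutDetector(Protocol):
    def detect(self, page_image: bytes) -> list[tuple[int, int, int, int]]:
        """Bounding boxes (x0, y0, x1, y1) of tables, charts, and figures."""
        ...

class CropTool(Protocol):
    def crop(self, page_image: bytes, box: tuple[int, int, int, int]) -> bytes: ...
```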
5. Metrics and Empirical Evaluation
DocLens has achieved state-of-the-art results on benchmarks emphasizing long-document and vision-centric reasoning:
| Benchmark | DocLens w/ Gemini-2.5-Pro | Human Expert Baseline | Next-best Model |
|---|---|---|---|
| MMLongBench-Doc (overall acc.) | 67.6% | 65.8% | — |
| Unanswerable questions (UNA) | 72.2% (+13.8% over baseline) | 59.9% | — |
| FinRAGBench-V (overall acc.) | 70.4% | — | 64.9% (best OCR-augmented) |
| FinRAGBench-V (chart questions) | +10.9% over baseline | — | — |
- Page Navigator: Recall 97.3% (vs. 68% for direct VLM), precision 55.1%, answer accuracy 67.6%.
- Element-level citation F1: +6.7% over other systems (precision +4.9%, recall +9.3%).
- Ablation: Removing the Lens Module degrades accuracy by up to 5.3% on FinRAGBench-V; the Reasoning Module is critical for reliable detection of unanswerable questions (Zhu et al., 14 Nov 2025).
6. Application to Medical Text Generation
Originally, DocLens was developed as a fine-grained, claim-level evaluator for medical text generation:
- Claim Recall: Fraction of reference clinical facts entailed by the generated text, $\mathrm{ClaimRecall} = |\{c \in \mathcal{C}_{\mathrm{ref}} : \hat{y} \models c\}| / |\mathcal{C}_{\mathrm{ref}}|$, where $\mathcal{C}_{\mathrm{ref}}$ is the set of claims extracted from the reference and $\hat{y}$ is the model output.
- Claim Precision: Fraction of generated subclaims supported by the reference, $\mathrm{ClaimPrecision} = |\{c \in \mathcal{C}_{\mathrm{gen}} : y \models c\}| / |\mathcal{C}_{\mathrm{gen}}|$, where $\mathcal{C}_{\mathrm{gen}}$ is extracted from the output and $y$ is the reference.
- Attribution: Citation recall and precision evaluate whether the cited input sentences support each generated statement and whether each citation is necessary.
Evaluators can be instruction-following LLMs (GPT-4, Mistral) or supervised NLI models (TRUE). Across clinical note generation, radiology report summarization, and patient question summarization, DocLens metrics exhibit higher agreement with human expert judges than traditional metrics (ROUGE, BLEU, MEDCON) (Xie et al., 2023).
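A hedged sketch of the claim-level computation, with hypothetical `extract_claims` and `entails` helpers standing in for the LLM or NLI evaluator:

```python
def claim_scores(generated, reference, extract_claims, entails):
    """Claim-level recall and precision in the DocLens style.

    extract_claims: text -> list of atomic claims (LLM-decomposed in practice)
    entails:        (premise_text, claim) -> bool (LLM judge or NLI model)
    """
    ref_claims = extract_claims(reference)
    gen_claims = extract_claims(generated)
    recall = sum(entails(generated, c) for c in ref_claims) / max(len(ref_claims), 1)
    precision = sum(entails(reference, c) for c in gen_claims) / max(len(gen_claims), 1)
    return recall, precision
```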
7. Role as Objective Critic in RL Training
DocLens has also served as a deterministic “critic” in reinforcement learning for medical text generation tasks. The evaluation-integrated RL framework employs claim-level extraction, LLM-based entailment, and a deterministic F1-based reward without needing a separate learned reward model. Integrated within Group Relative Policy Optimization (GRPO), this yields improved completeness and factual grounding, with rapid convergence and measurable reduction in hallucination and omission rates. For instance, in dialogue-to-SOAP summarization, F1 improves from 0.7317 (base) to 0.7819 (GRPO, +6.9%), and GPT-5 qualitative assessment confirms superior factuality and brevity (Jhaveri et al., 26 Sep 2025).
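A minimal sketch of the deterministic, evaluation-integrated reward inside a GRPO-style update; the group-relative normalization follows the standard GRPO recipe rather than code from the cited work, and the `claim_scores` callable is the kind of helper sketched in Section 6:

```python
import statistics

def grpo_advantages(group_outputs, reference, claim_scores):
    """Deterministic claim-level F1 rewards, normalized within one sampled group.

    claim_scores: (generated, reference) -> (recall, precision); no learned
    reward model is required, since the evaluator itself is the critic.
    """
    rewards = []
    for out in group_outputs:                   # G completions for the same prompt
        r, p = claim_scores(out, reference)
        f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
        rewards.append(f1)
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0   # guard against zero variance
    return [(x - mu) / sigma for x in rewards]  # group-relative advantages
```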
8. Limitations and Outlook
DocLens’s current challenges include handling difficult vision-centric cases (e.g., tiny map symbols, subtle trend comparisons), reliance on proprietary LLMs for best evaluation accuracy, and open-source model limitations in entailment and medical inference. Recommended directions include domain-specific pre-training, instruction-tuning, and extension to multimodal tasks beyond text (e.g., visual question answering). Expansion of evaluated dimensions to coherence and error risk is also proposed (Zhu et al., 14 Nov 2025, Xie et al., 2023).
DocLens integrates granular evidence retrieval, principled evaluation metrics, tool-augmented vision-text processing, and multi-agent reasoning. Its cross-domain impact is established in both vision-language understanding of long documents (Zhu et al., 14 Nov 2025) and clinical natural language generation (Xie et al., 2023, Jhaveri et al., 26 Sep 2025).