
PR² Pipeline: Perceiver, Reasoner, Reviewer

Updated 1 January 2026
  • PR² is a multi-agent annotation system that unifies forensic analysis by coordinating a Perceiver for initial drafts, a Reasoner for structured logic, and a Reviewer for quality control.
  • The system applies cross-cue chain-of-thought reasoning to jointly handle image forgery detection, precise grounding of tampered regions, and generation of free-form explanations.
  • It leverages GRPO-based optimization with weighted rewards for format, grounding, and explanation, achieving state-of-the-art performance on benchmarks like RealText.

LogicLens is a unified framework for text-centric image forgery analysis that jointly addresses detection, grounding, and explanation of forgeries via deep visual-logical co-reasoning. It introduces an end-to-end generative objective, advanced chain-of-thought reasoning, novel alignment mechanisms, and a new large-scale dataset, establishing state-of-the-art performance in the automated analysis of text-based image manipulations (Zeng et al., 25 Dec 2025).

1. Joint Visual-Logical Forgery Analysis: Problem and Objective

LogicLens reformulates text-centric forgery analysis as a single generative task. Given an image $I$ (such as a scene or document containing text) and a prompt $T$, the system must:

  • Classify whether $I$ is authentic or forged (detection),
  • Predict pixel/bounding-box regions of manipulated texts (grounding),
  • Generate a free-form rationale for its verdict (explanation).

The model outputs an “analysis report” $R = (c, B, E)$, where $c \in \{\text{authentic}, \text{forged}\}$ is the verdict, $B = \{b_1, \ldots, b_n\}$ is the set of detected tampered regions, and $E$ is the textual explanation. The objective maximizes the conditional log-likelihood:

$$\hat{\theta} = \arg\max_{\theta} \; \mathbb{E}_{(I, T, R) \sim \mathcal{D}}\bigl[\log P_\theta(R \mid I, T)\bigr]$$

where $R$ is generated autoregressively, decomposing the likelihood across all output tokens. Unified evaluation aggregates F₁ for detection, mean IoU/mF₁ for grounding, and semantic similarity for explanation into a macro-average score:

$$\text{M-F}_1 = \tfrac{1}{3}\bigl(F_1^{\text{detection}} + F_1^{\text{grounding}} + F_1^{\text{explanation}}\bigr)$$

This formulation eliminates the siloed handling of these subtasks, instead modeling their interdependence within a unified decoding process.
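
As a worked check of this aggregation, a minimal sketch in Python (the helper name is illustrative; scores are on the 0–100 scale used in the results below):

```python
def macro_f1(f1_detection: float, f1_grounding: float, f1_explanation: float) -> float:
    """Unweighted mean of the three per-task F1 scores (0-100 scale)."""
    return (f1_detection + f1_grounding + f1_explanation) / 3.0

# Worked example with LogicLens's reported RealText scores, assuming the
# BS-F1 value (76.9) enters the explanation term:
print(round(macro_f1(93.2, 36.7, 76.9), 1))  # -> 68.9
```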

2. Annotation Pipeline: The PR² (Perceiver, Reasoner, Reviewer) System

High-quality, cognitively aligned annotations are essential for end-to-end training and evaluation. LogicLens introduces the PR² (Perceiver, Reasoner, Reviewer) multi-agent pipeline:

  • Perceiver: Consumes the image (and RGB-mask overlays) plus OCR transcripts; drafts an unstructured forensic analysis noting anomalies, candidate bounding boxes, and preliminary rationales.
  • Reasoner: Refines drafts into a structured six-stage Cross-Cues-aware Chain-of-Thought (CCT) outline, assembles a coarse report, and scores format correctness, logical completeness, and grounding consistency.
  • Reviewer: Accepts the annotation if the quality score $Q$ meets a threshold (e.g., $Q \geq 98/100$); otherwise issues feedback and loops back, for up to 3 review cycles.

Accepted outputs are serialized as triples $(c, B, E)$ for model training, ensuring consistency and high annotation quality at scale.
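
A minimal sketch of this control flow, assuming LLM-backed agents exposed as plain callables; the `Report` container, argument shapes, and acceptance logic are illustrative stand-ins, not the paper's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Report:
    verdict: str                               # c: "authentic" or "forged"
    boxes: list = field(default_factory=list)  # B: tampered-region bounding boxes
    explanation: str = ""                      # E: free-form rationale

def annotate(image, ocr_tokens, perceiver, reasoner, reviewer,
             threshold: float = 98.0, max_cycles: int = 3):
    """Perceiver drafts, Reasoner structures and scores, Reviewer gates."""
    feedback = None
    for _ in range(max_cycles):
        draft = perceiver(image, ocr_tokens, feedback)  # unstructured forensic draft
        report, quality = reasoner(draft)               # CCT outline -> (Report, Q)
        if quality >= threshold:                        # accept when Q >= 98/100
            return report
        feedback = reviewer(report, quality)            # Reviewer critique; loop again
    return None  # still rejected after max_cycles: discard or escalate
```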

3. Cross-Cues-aware Chain of Thought (CCT) Reasoning

At the core of LogicLens is a six-stage Cross-Cues-aware Chain of Thought (CCT), orchestrating deep visual-logical reasoning via iterative cross-modal validation:

  1. Knowledge Preparation: Aggregates image context, OCR tokens, world-knowledge triggers, and forensic-knowledge cues into $h^1$.
  2. Visual Cue Extraction: Extracts features via a vision backbone, producing global consistency cues ($v_g$) and local anomaly cues ($v_l$). These are merged into $h^2$.
  3. Logical Cue Extraction: Analyzes OCR tokens for internal consistency (e.g., arithmetic, dates) and semantic plausibility, assembling symbolic logical cues $L$, further aggregated into $h^3$.
  4. Cross-Cue Validation & Filtering: Applies a learned scoring function $s(\cdot)$ to select high-value cues from the union $V \cup L$ of visual and logical cues.
  5. Grounding: Aligns each high-value cue with the most relevant OCR token and its box, collecting tampered regions $B$ into $h^5$.
  6. Report Synthesis: Decides the verdict $c$ and synthesizes the rationale $E$ from the filtered cues, emitting the structured report.

Each stage updates the hidden state $h^k$ by aggregating modality-specific extractions. This pipeline explicitly cross-validates visual and logical evidence, increasing the robustness and interpretability of the system’s verdicts.
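
The stage sequence can be read as a straight-line program over the evolving hidden state; every method on the hypothetical `model` object below is a placeholder for the corresponding CCT stage, and the threshold `tau` is an assumed scalar cutoff:

```python
def cct_reason(image, ocr_tokens, model, tau: float = 0.5):
    """Six-stage CCT sketch: each stage folds new cues into the hidden state."""
    h = model.prepare_knowledge(image, ocr_tokens)         # 1: context + OCR + knowledge -> h^1
    v_global, v_local = model.extract_visual(image)        # 2: consistency (v_g), anomaly (v_l)
    h = model.aggregate(h, v_global + v_local)             #    -> h^2
    logical = model.extract_logical(ocr_tokens)            # 3: arithmetic/date/semantic cues L
    h = model.aggregate(h, logical)                        #    -> h^3
    strong = [c for c in v_global + v_local + logical      # 4: score s(.) over V ∪ L,
              if model.score(c, h) > tau]                  #    keep high-value cues
    boxes = [model.ground(c, ocr_tokens) for c in strong]  # 5: cue -> OCR token/box, giving B
    h = model.aggregate(h, boxes)                          #    -> h^5
    verdict, rationale = model.synthesize(h, strong)       # 6: verdict c, explanation E
    return verdict, boxes, rationale
```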

4. Weighted Multi-Task Reward and GRPO-Based Optimization

Model alignment and performance are driven by a weighted multi-task reward during RL fine-tuning, combining:

  • Format Adherence ($R_\text{format}$): Fraction of required structural tags present in the output.
  • Grounding Reward ($R_\text{ground}$): Composite of classification match, count match, and a tiered IoU term; e.g., $R_\text{cls} = \mathbb{I}(c = \hat{c})$, $R_\text{num}$ depending on whether the predicted and ground-truth region counts match, and $R_\text{iou}$ tiered by mean IoU.
  • Explanation Reward ($R_\text{explain}$): Cosine similarity between predicted and ground-truth rationale embeddings.

Total reward:

$$R_\text{total} = \lambda_f R_\text{format} + \lambda_g R_\text{ground} + \lambda_e R_\text{explain}$$

with empirically set weights $(\lambda_f, \lambda_g, \lambda_e) = (0.15, 0.75, 0.10)$.
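
A hedged sketch of how these terms might combine; the tier boundaries in the IoU term and the equal-weight composite inside the grounding reward are assumptions, while the (0.15, 0.75, 0.10) weights are the paper's reported values:

```python
import numpy as np

def explanation_reward(pred_emb: np.ndarray, gt_emb: np.ndarray) -> float:
    """Cosine similarity between predicted and ground-truth rationale embeddings."""
    return float(pred_emb @ gt_emb /
                 (np.linalg.norm(pred_emb) * np.linalg.norm(gt_emb) + 1e-8))

def grounding_reward(c_pred, c_true, n_pred: int, n_true: int, mean_iou: float) -> float:
    r_cls = float(c_pred == c_true)  # R_cls = 1[c == c_hat]
    r_num = float(n_pred == n_true)  # count match (exact-match rule assumed)
    # Tiered IoU term; the tier edges below are illustrative, not from the paper:
    r_iou = 1.0 if mean_iou >= 0.75 else 0.5 if mean_iou >= 0.5 else 0.0
    return (r_cls + r_num + r_iou) / 3.0  # equal-weight composite (assumed)

def total_reward(r_format: float, r_ground: float, r_explain: float,
                 weights: tuple = (0.15, 0.75, 0.10)) -> float:
    """R_total = lambda_f * R_format + lambda_g * R_ground + lambda_e * R_explain."""
    lam_f, lam_g, lam_e = weights
    return lam_f * r_format + lam_g * r_ground + lam_e * r_explain
```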

Fine-tuning employs Group Relative Policy Optimization (GRPO), a critic-free policy-gradient method: the model parameters $\theta$ are updated according to each sampled report's reward advantage relative to the mean reward of its sampling group, rather than a separately learned value baseline. This facilitates holistic optimization of all sub-components.
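
A minimal sketch of the group-relative advantage at the core of GRPO (group size and reward values are illustrative):

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """rewards: shape (G,), R_total for G reports sampled from one prompt.
    Each sample's advantage is its reward standardized within the group,
    so no learned value network is required."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

print(group_relative_advantages(np.array([0.9, 0.6, 0.3])))  # ~[ 1.22  0.  -1.22]
```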

5. Dataset Construction: RealText Benchmark

RealText is a large-scale dataset constructed to support unified training and evaluation, with the following properties:

  • Sourced from COCO-Text, ReCTS, LSVT, and OSFT, then filtered via GPT-4o (scene screening) and DINOv2 (visual quality) down to ~50,000 candidates.
  • Final selection: 5,397 images (905 dense-text, i.e., ≥10 text lines) with rich annotations: authenticity label $c$, pixel-level/box masks $B$ for tampered regions, and free-form textual rationales $E$ (see the record sketch after this list).
  • Manipulation types: copy-move, AIGC inpainting, text replacement, logo swapping, arithmetic/date tampering.
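
For concreteness, a hypothetical shape for one serialized $(c, B, E)$ record; the field names, path, and values below are illustrative, not the released RealText schema:

```python
record = {
    "image": "realtext/000123.jpg",  # illustrative path
    "verdict": "forged",             # c in {authentic, forged}
    "regions": [                     # B: tampered-text regions (pixel boxes)
        {"bbox": [412, 118, 540, 152], "type": "arithmetic_tampering"},
    ],
    "explanation": (                 # E: free-form rationale
        "The line-item total is inconsistent with the listed prices, and the "
        "edited digits show resampling artifacts along their strokes."
    ),
}
```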

Unlike T-IC13 or T-SROIE, RealText is the first public benchmark at this scale to provide unified detection, grounding, and explanation labels for forensic model training and validation.

6. Empirical Performance

LogicLens demonstrates substantial empirical gains across multiple benchmarks, outperforming generalist and specialist baselines:

| Dataset / Model | Detection F₁ | mF₁ (grounding) | BS-F₁ / CSS (explanation) | Macro-F₁ (M-F₁) |
|---|---|---|---|---|
| LogicLens on RealText (fine-tuned) | 93.2 | 36.7 | 76.9 / 85.0 | 68.9 |
| LogicLens on T-IC13 (zero-shot) | 93.2 | 67.6 | 78.5 / – | 79.8 |
| LogicLens on T-SROIE (zero-shot, dense) | 99.4 | 11.0 | 77.0 / – | 62.5 |
| FakeShield | – | – | – | 40.6 |
| GPT-4o | – | – | – | 55.8 |
| Gemini-2.5-Pro | – | – | – | 53.9 |
| InternVL-3.5 | – | – | – | 48.8 |
| Qwen2.5-VL (SFT) | – | – | – | 36.3 |

LogicLens delivers pronounced gains in joint reasoning and grounding, particularly in zero-shot dense-text scenarios.

7. Limitations and Future Directions

LogicLens exhibits several strengths: state-of-the-art macro-F₁, unified reasoning, and interpretable outputs. Limitations include dependence on OCR quality, the need to balance the reward weights $(\lambda_f, \lambda_g, \lambda_e)$, and the computational cost of a ~7B-parameter model, which precludes real-time deployment.

Possible future directions include:

  • End-to-end trainable OCR and vision backbones for reduced error propagation,
  • Extension to video forgeries and temporal consistency reasoning,
  • Increased adversarial robustness via CCT augmentation,
  • Model distillation for efficient deployment at the edge.

A plausible implication is that LogicLens’s unified, chain-of-thought, and multi-task aligned approach provides a scalable paradigm for multimodal forensic reasoning well beyond current detection-only systems (Zeng et al., 25 Dec 2025).
