LogicLens is a unified framework for text-centric image forgery analysis that jointly addresses detection, grounding, and explanation of forgeries via deep visual-logical co-reasoning. It introduces an end-to-end generative objective, advanced chain-of-thought reasoning, novel alignment mechanisms, and a new large-scale dataset, establishing state-of-the-art performance in the automated analysis of text-based image manipulations (Zeng et al., 25 Dec 2025).
1. Joint Visual-Logical Forgery Analysis: Problem and Objective
LogicLens reformulates text-centric forgery analysis as a single generative task. Given an image I (such as a scene or document containing text) and a prompt p, the system must:
- Classify whether I is authentic or forged (detection),
- Predict pixel/bounding-box regions of manipulated texts (grounding),
- Generate a free-form rationale for its verdict (explanation).
The model outputs an “analysis report” R = (v, G, e), where v is the verdict, G is the set of detected tampered regions, and e is the textual explanation. The objective maximizes the conditional log-likelihood:

$$\max_{\theta} \; \log P_{\theta}(R \mid I, p) \;=\; \sum_{t=1}^{|R|} \log P_{\theta}(r_t \mid r_{<t}, I, p),$$

where R = (r₁, …, r_{|R|}) is generated autoregressively, decomposing the likelihood across all output tokens. Unified evaluation aggregates F₁ for detection, mean IoU/mF₁ for grounding, and semantic similarity for explanation into a macro-average score:

$$\text{M-F}_1 \;=\; \tfrac{1}{3}\left(F_1^{\text{det}} + mF_1^{\text{grd}} + S^{\text{exp}}\right).$$
This formulation eliminates the siloed handling of these subtasks, instead modeling their interdependence within a unified decoding process.
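The macro-average evaluation can be sketched in a few lines. This is a minimal illustration assuming an unweighted mean of the three sub-task metrics, each normalized to [0, 1]; the function name and signature are illustrative, not the paper's API.

```python
def macro_score(det_f1: float, grd_mf1: float, exp_sim: float) -> float:
    """Macro-average of the three sub-task metrics (all in [0, 1]).

    det_f1:  detection F1 (authentic vs. forged verdict)
    grd_mf1: mean F1 / IoU-based score over tampered regions
    exp_sim: semantic similarity between predicted and reference rationale
    """
    return (det_f1 + grd_mf1 + exp_sim) / 3.0
```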
2. Annotation Pipeline: The PR² (Perceiver, Reasoner, Reviewer) System
High-quality, cognitively-aligned annotations are essential for end-to-end training and evaluation. LogicLens introduces the PR² (Perceiver, Reasoner, Reviewer) multi-agent pipeline:
- Perceiver: Consumes the image (and RGB-mask overlays) plus OCR transcripts; drafts an unstructured forensic analysis noting anomalies, candidate bounding boxes, and preliminary rationales.
- Reasoner: Refines drafts into a structured six-stage Cross-Cue Tree-of-Thought (CCT) outline, assembles a coarse report, and scores format correctness, logical completeness, and grounding consistency.
- Reviewer: Accepts the annotation if the quality score meets a threshold (e.g., ≥ 98/100), or issues feedback and loops back for up to 3 review cycles.
Accepted outputs are serialized as (verdict, regions, explanation) triples for model training, ensuring consistency and high annotation quality at scale.
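The three-agent loop above can be sketched as a simple control flow. The agent callables (`perceiver`, `reasoner`, `reviewer`) are placeholders for whatever models play those roles; only the threshold and the 3-cycle limit come from the pipeline description.

```python
# Sketch of the PR2 Perceiver -> Reasoner -> Reviewer annotation loop.
# The agent functions are assumptions, not the paper's actual API.

MAX_CYCLES = 3          # up to 3 review cycles
ACCEPT_THRESHOLD = 98   # e.g., a 98/100 quality score

def annotate(image, ocr_transcript, perceiver, reasoner, reviewer):
    feedback = None
    for _ in range(MAX_CYCLES):
        draft = perceiver(image, ocr_transcript, feedback)  # unstructured forensic analysis
        report, score = reasoner(draft)                     # structured report + quality score
        if score >= ACCEPT_THRESHOLD:
            return report                                   # accepted annotation
        feedback = reviewer(report)                         # issue feedback, loop again
    return None                                             # rejected after 3 cycles
```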
3. Cross-Cues-aware Chain of Thought (CCT) Reasoning
At the core of LogicLens is a six-stage Cross-Cues-aware Chain of Thought (CCT), orchestrating deep visual-logical reasoning via iterative cross-modal validation:
- Knowledge Preparation: Aggregates image context, OCR tokens, world-knowledge triggers, and forensic-knowledge cues into a knowledge state K.
- Visual Cue Extraction: Extracts features via a vision backbone, producing global consistency cues and local anomaly cues. These are merged into a visual cue set C_vis.
- Logical Cue Extraction: Analyzes OCR tokens for internal consistency (e.g., arithmetic, dates) and semantic plausibility, assembling symbolic logical cues that are aggregated into a logical cue set C_log.
- Cross-Cue Validation & Filtering: Applies a learned scoring function s(·) to select high-value cues from C_vis ∪ C_log.
- Grounding: Aligns each high-value cue with the most relevant OCR token and its bounding box, collecting tampered regions into G.
- Report Synthesis: Decides the verdict v and synthesizes the rationale e from the filtered cues, emitting the structured report R = (v, G, e).
Each stage updates the hidden state by aggregating modality-specific extractions. This pipeline explicitly cross-validates visual and logical evidence, increasing the robustness and interpretability of the system’s verdicts.
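The six stages can be sketched as a sequential skeleton. The stage functions, cue scorer `score`, and threshold `tau` are all illustrative assumptions; the point is only to show how visual and logical cues are pooled, filtered, grounded, and then synthesized into a report.

```python
# Minimal skeleton of the six-stage Cross-Cues-aware Chain of Thought (CCT).
# Stage implementations, the cue scorer, and the threshold are assumptions.

def cct_pipeline(image, ocr_tokens, stages, score, tau=0.5):
    knowledge = stages.prepare(image, ocr_tokens)              # 1. knowledge preparation
    visual = stages.visual_cues(image, knowledge)              # 2. global + local visual cues
    logical = stages.logical_cues(ocr_tokens, knowledge)       # 3. arithmetic/date/semantic checks
    cues = [c for c in visual + logical if score(c) >= tau]    # 4. cross-cue validation & filtering
    regions = [stages.ground(c, ocr_tokens) for c in cues]     # 5. align cues to OCR boxes
    verdict, rationale = stages.synthesize(cues)               # 6. report synthesis
    return {"verdict": verdict, "regions": regions, "explanation": rationale}
```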
4. Weighted Multi-Task Reward and GRPO-Based Optimization
Model alignment and performance are driven by a weighted multi-task reward during RL fine-tuning, combining:
- Format Adherence (R_fmt): Fraction of required structural tags present in the output.
- Grounding Reward (R_grd): Composite of classification match, count match, and an IoU-tiered component, with partial credit depending on match cardinality and tiered by mean IoU.
- Explanation Reward (R_exp): Cosine similarity between predicted and ground-truth rationale embeddings.
Total reward:

$$R \;=\; w_{\text{fmt}} R_{\text{fmt}} + w_{\text{grd}} R_{\text{grd}} + w_{\text{exp}} R_{\text{exp}},$$

with empirically set weights w_fmt, w_grd, w_exp.
Fine-tuning employs Group Relative Policy Optimization (GRPO), a policy-gradient method that computes each sampled output's advantage relative to the mean reward of its sampled group, avoiding a separately learned value baseline. This facilitates holistic optimization of all sub-components.
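The reward combination and the group-relative advantage computation can be sketched as follows. The weight values, the embedding-based cosine similarity, and the standard-deviation normalization are assumptions for illustration; the paper's exact weights are not reproduced here.

```python
# Sketch of the weighted multi-task reward and GRPO-style group advantages.
# Weight values and the similarity backend are illustrative assumptions.

import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors (for R_exp)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def total_reward(r_fmt, r_grd, r_exp, w=(1.0, 1.0, 1.0)):
    """R = w_fmt*R_fmt + w_grd*R_grd + w_exp*R_exp (placeholder weights)."""
    return w[0] * r_fmt + w[1] * r_grd + w[2] * r_exp

def grpo_advantages(rewards):
    """Group-relative advantages: each reward minus the group mean, scaled by std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # avoid division by zero for uniform groups
    return [(r - mean) / std for r in rewards]
```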
5. Dataset Construction: RealText Benchmark
RealText is a large-scale dataset constructed to support unified training and evaluation, with the following properties:
- Sourced from COCO-Text, ReCTS, LSVT, OSFT, then filtered via GPT-4o (scene) and DINOv2 (visual quality) to ~50,000 candidates.
- Final selection: 5,397 images (905 dense-text, i.e., ≥10 text lines) with rich annotations: authenticity label v, pixel-level/box masks G for tampered regions, and free-form textual rationales e.
- Manipulation types: copy-move, AIGC inpainting, text replacement, logo swapping, arithmetic/date tampering.
Unlike T-IC13 or T-SROIE, RealText is the first public benchmark at this scale to provide unified detection, grounding, and explanation labels for forensic model training and validation.
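A RealText-style annotation record might be laid out as below. The field names are assumptions based on the labels described above (authenticity label, tampered-region boxes, rationale, manipulation type), not the dataset's actual schema.

```python
# Illustrative record layout for a RealText-style (v, G, e) annotation triple.
# Field names are assumptions; the dataset's real schema may differ.

from dataclasses import dataclass, field

@dataclass
class RealTextSample:
    image_path: str
    authentic: bool                                  # authenticity label v
    regions: list = field(default_factory=list)      # tampered-region boxes G: (x1, y1, x2, y2)
    rationale: str = ""                              # free-form explanation e
    manipulation: str = "none"                       # e.g., copy-move, AIGC inpainting
```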
6. Empirical Performance
LogicLens demonstrates substantial empirical gains across multiple benchmarks, outperforming generalist and specialist baselines (baseline rows report macro-F₁ only):
| Dataset | Detection F₁ | mF₁ (grounding) | BS-F₁ / CSS (explanation) | Macro-F₁ (M-F₁) |
|---|---|---|---|---|
| RealText (fine-tuned) | 93.2 | 36.7 | 76.9 / 85.0 | 68.9 |
| T-IC13 (zero-shot) | 93.2 | 67.6 | 78.5 | 79.8 |
| T-SROIE (zero-shot dense) | 99.4 | 11.0 | 77.0 | 62.5 |
| FakeShield (M-F₁) | – | – | – | 40.6 |
| GPT-4o (M-F₁) | – | – | – | 55.8 |
| Gemini-2.5-Pro (M-F₁) | – | – | – | 53.9 |
| InternVL-3.5 (M-F₁) | – | – | – | 48.8 |
| Qwen2.5-VL (M-F₁ SFT) | – | – | – | 36.3 |
LogicLens delivers pronounced gains in joint reasoning and grounding, particularly in zero-shot dense-text scenarios.
7. Limitations and Future Directions
LogicLens exhibits several strengths: state-of-the-art macro-F₁, unified reasoning, and interpretable outputs. Limitations include dependency on OCR quality, the need to balance the reward weights (w_fmt, w_grd, w_exp), and a computational cost (~7B parameters) that precludes real-time deployment.
Possible future directions include:
- End-to-end trainable OCR and vision backbones for reduced error propagation,
- Extension to video forgeries and temporal consistency reasoning,
- Increased adversarial robustness via CCT augmentation,
- Model distillation for efficient deployment at the edge.
A plausible implication is that LogicLens’s unified, chain-of-thought, and multi-task aligned approach provides a scalable paradigm for multimodal forensic reasoning well beyond current detection-only systems (Zeng et al., 25 Dec 2025).