LogicLens: Unified Forgery Analysis
- LogicLens is a unified framework for advanced text-centric forgery analysis that integrates visual and logical reasoning to detect and explain image manipulations.
- It employs a Cross-Cues-aware Chain of Thought mechanism and a weighted multi-task reinforcement learning strategy to optimize detection, region grounding, and explanation synthesis.
- The system is validated on the RealText benchmark, curated via the hierarchical PR² annotation pipeline, achieving state-of-the-art Macro-F1 scores in both fine-tuned and zero-shot settings.
LogicLens is a unified multimodal framework designed for advanced text-centric forgery analysis, integrating visual and logical reasoning within a single generative process. Developed to address the escalating sophistication of text-centric image manipulations enabled by generative AI, LogicLens jointly optimizes detection of forgeries, region grounding, and explanation synthesis. The architecture is distinguished by its Cross-Cues-aware Chain of Thought (CCT) mechanism for deep visual-logical co-reasoning, a weighted multi-task reward system for reinforcement alignment, and the hierarchical PR² annotation pipeline, validated on the large-scale RealText benchmark (Zeng et al., 25 Dec 2025).
1. Joint Problem Formulation
LogicLens reframes text-centric forgery analysis as a joint generative task over an image $I$ and prompt $T$, producing a structured report $y = (d, B, E)$, where $d$ is the authenticity verdict, $B$ localizes forged regions, and $E$ is a natural-language rationale. Model parameters $\theta$ are learned by maximizing the expected log-likelihood:

$$\mathcal{L}(\theta) = \mathbb{E}_{(I, T, y)}\big[\log P_\theta(y \mid I, T)\big],$$

where $y$ is generated token-wise in an autoregressive fashion, $P_\theta(y \mid I, T) = \prod_t P_\theta(y_t \mid y_{<t}, I, T)$. Evaluation aggregates detection $F_1^{\text{det}}$, grounding mean F1 ($mF_1$), and explanation semantic similarity ($F_1^{\text{exp}}$; BS-F₁ in the results table) into Macro-F1:

$$\text{Macro-}F_1 = \tfrac{1}{3}\big(F_1^{\text{det}} + mF_1 + F_1^{\text{exp}}\big).$$
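As a sanity check on this aggregation, a minimal Python snippet (names illustrative) reproduces the Macro-F1 figures reported in Section 6 from the three sub-metrics:

```python
def macro_f1(det_f1: float, grounding_mf1: float, exp_f1: float) -> float:
    """Macro-F1 as the unweighted mean of detection, grounding, and explanation F1."""
    return (det_f1 + grounding_mf1 + exp_f1) / 3.0

# RealText fine-tuned scores from the results table: 93.2, 36.7, 76.9 -> 68.9
print(round(macro_f1(93.2, 36.7, 76.9), 1))  # 68.9
```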
This joint paradigm contrasts with prior decoupled approaches and is foundational for LogicLens’s unified reasoning capabilities (Zeng et al., 25 Dec 2025).
2. PR² Hierarchical Data-Curation Pipeline
To enable high-fidelity supervision, LogicLens employs the PR² pipeline—a multi-agent system comprising Perceiver, Reasoner, and Reviewer—prior to training:
- Perceiver ingests raw images, fused RGB-mask visualizations, and OCR transcripts, yielding preliminary forensic analyses: anomalies, bounding boxes, and draft rationales.
- Reasoner structures these drafts into a six-stage CCT outline, evaluating format, logic, and grounding accuracy, producing a scalar quality score.
- Reviewer applies QA, iteratively requesting corrections or enhancements until a quality threshold is met.
This pipeline ensures cognitively aligned, fully annotated samples, encoded as $(d, B, E)$ triples for model fine-tuning and policy-gradient rewards (Zeng et al., 25 Dec 2025).
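A minimal sketch of the Perceiver–Reasoner–Reviewer loop follows; the agent functions are stubs standing in for LLM calls, and the quality threshold, round cap, and record fields are illustrative rather than taken from the paper:

```python
from dataclasses import dataclass
from typing import Optional

QUALITY_THRESHOLD = 0.9  # illustrative; the paper does not publish the exact value
MAX_ROUNDS = 3

@dataclass
class Sample:
    image_path: str
    ocr_text: str
    mask_overlay_path: str  # fused RGB-mask visualization fed to the Perceiver

def perceive(sample: Sample) -> dict:
    """Perceiver stub: an LLM agent drafts anomalies, boxes, and a rough rationale."""
    return {"anomalies": [], "boxes": [], "rationale": ""}

def reason(draft: dict) -> tuple[dict, float]:
    """Reasoner stub: restructures the draft into a six-stage CCT outline and
    returns it with a scalar quality score (format, logic, grounding)."""
    return {"cct_outline": draft}, 1.0

def review(outline: dict) -> Optional[str]:
    """Reviewer stub: returns a correction request, or None if QA passes."""
    return None

def curate(sample: Sample) -> Optional[dict]:
    draft = perceive(sample)
    for _ in range(MAX_ROUNDS):
        outline, score = reason(draft)
        feedback = review(outline)
        if feedback is None and score >= QUALITY_THRESHOLD:
            return outline  # accepted annotation backing a (d, B, E) triple
        draft = {**draft, "feedback": feedback}  # iterate with corrections
    return None  # discard samples that never meet the quality bar
```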
3. Cross-Cues-aware Chain of Thought (CCT) Reasoning
CCT is the deep reasoning core of LogicLens, structured in six interdependent stages with chain states $s_1, \dots, s_6$:
- Knowledge Preparation: Aggregates image context $I$, OCR tokens $\mathcal{O}$, and relevant forensic/semantic triggers into $s_1$.
- Visual Cue Extraction: Computes global ($c_{\text{glb}}$) and local ($c_{\text{loc}}$) anomaly cues via vision backbones, aggregated into $s_2$.
- Logical Cue Extraction: Processes symbolic logical cues from OCR, checking arithmetic and contextual consistency, updating $s_3$.
- Cross-Cue Validation & Filtering: Applies a learned salience scorer over visual and logical cues, selecting high-value elements for $s_4$.
- Grounding: Matches selected cues to OCR tokens and regions, constructing the tampered region set $B$ and updating $s_5$.
- Report Synthesis: Decides the verdict $d$ and synthesizes the rationale $E$ for token-wise report generation ($s_6$).
At each stage $k$, modalities are aggregated as
$$s_k = \phi_k\big(I, \mathcal{O}, s_{k-1}\big), \qquad k = 1, \dots, 6,$$
where $\phi_k$ is the stage-specific modality extractor. CCT enables robust cross-validation between observed visual evidence and logical structure, critical for precise forgery identification (Zeng et al., 25 Dec 2025).
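The chain can be pictured as six stage functions threading a shared state, as in this minimal Python sketch; the stage bodies are stubs and every function and field name is an assumption, since the paper specifies the stages but not an API:

```python
from typing import Any, Callable

State = dict[str, Any]
Stage = Callable[[State], State]

def knowledge_preparation(s: State) -> State:
    # s_1: collect image context, OCR tokens, and forensic/semantic triggers
    return {**s, "triggers": []}

def visual_cue_extraction(s: State) -> State:
    # s_2: global/local anomaly cues from vision backbones (stubbed here)
    return {**s, "visual_cues": []}

def logical_cue_extraction(s: State) -> State:
    # s_3: arithmetic/contextual consistency checks over OCR tokens
    return {**s, "logical_cues": []}

def cross_cue_validation(s: State) -> State:
    # s_4: salience scorer keeps high-value cues (here: keep all)
    return {**s, "selected_cues": s["visual_cues"] + s["logical_cues"]}

def grounding(s: State) -> State:
    # s_5: map selected cues to OCR tokens/regions -> tampered region set B
    return {**s, "regions": []}

def report_synthesis(s: State) -> State:
    # s_6: verdict d plus rationale E for token-wise report generation
    return {**s, "verdict": "authentic", "rationale": ""}

CCT_STAGES: list[Stage] = [
    knowledge_preparation, visual_cue_extraction, logical_cue_extraction,
    cross_cue_validation, grounding, report_synthesis,
]

def run_cct(image: Any, ocr_tokens: list[str]) -> State:
    state: State = {"image": image, "ocr": ocr_tokens}
    for stage in CCT_STAGES:  # s_k = phi_k(I, O, s_{k-1})
        state = stage(state)
    return state
```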
4. Weighted Multi-Task Reinforcement Alignment
After supervised pretraining, LogicLens applies GRPO-based reinforcement learning using a composite, weighted multi-task reward:
$$R_{\text{total}} = w_{\text{fmt}} R_{\text{fmt}} + w_{\text{acc}} R_{\text{acc}} + w_{\text{exp}} R_{\text{exp}},$$
with empirically set weights $w_{\text{fmt}}, w_{\text{acc}}, w_{\text{exp}}$.
- $R_{\text{fmt}}$: Presence of required structural tags in the output.
- $R_{\text{acc}}$: Sum of detection, region-count, and IoU-based rewards; e.g., an additional bonus is granted when mIoU exceeds a set threshold.
- $R_{\text{exp}}$: Cosine similarity of explanation sentence embeddings.
Policy-gradient actor–critic optimization maximizes the expected reward $R_{\text{total}}$, with a learned value baseline $V_\phi(I, T)$ for variance reduction:
```
for each (I, T):
    y ~ P_θ(·|I, T)                       # sample a report from the policy
    R_total = w_fmt·R_fmt + w_acc·R_acc + w_exp·R_exp
    A = R_total - V_φ(I, T)               # advantage against the learned baseline
    g_θ += ∇_θ log P_θ(y|I, T) · A        # accumulate policy gradient
    g_φ += ∂(V_φ(I, T) - R_total)² / ∂φ   # accumulate value-regression gradient
    θ += α_θ · Normalize(g_θ)             # normalized policy update
    φ -= α_φ · g_φ                        # value-baseline update
```
This reward structure tightly couples detection, grounding, and explanation, enhancing holistic model performance (Zeng et al., 25 Dec 2025).
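To make the reward composition concrete, here is a minimal Python sketch of the three components; the tag set, the naive box pairing, and the default weights are assumptions for illustration, not the paper's exact choices:

```python
def reward_format(output: str, tags=("<verdict>", "<regions>", "<rationale>")) -> float:
    """R_fmt: 1.0 if all required structural tags appear in the output (tag names assumed)."""
    return float(all(t in output for t in tags))

def iou(a, b) -> float:
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def reward_accuracy(pred_verdict, gt_verdict, pred_boxes, gt_boxes) -> float:
    """R_acc: detection + region-count + mean-IoU terms (equal weighting assumed)."""
    r_det = float(pred_verdict == gt_verdict)
    r_cnt = float(len(pred_boxes) == len(gt_boxes))
    pairs = zip(pred_boxes, gt_boxes)  # naive pairing; the paper's matching rule may differ
    m_iou = sum(iou(p, g) for p, g in pairs) / max(len(gt_boxes), 1)
    return r_det + r_cnt + m_iou

def reward_explanation(pred_emb, gt_emb) -> float:
    """R_exp: cosine similarity of explanation sentence embeddings."""
    dot = sum(p * g for p, g in zip(pred_emb, gt_emb))
    norm = lambda v: sum(x * x for x in v) ** 0.5
    return dot / (norm(pred_emb) * norm(gt_emb) + 1e-8)

def reward_total(r_fmt, r_acc, r_exp, w=(1.0, 1.0, 1.0)) -> float:
    """Weighted composite reward; the paper's tuned weight values are not reproduced here."""
    return w[0] * r_fmt + w[1] * r_acc + w[2] * r_exp
```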
5. RealText Benchmark Dataset
LogicLens’s supervised and RL training leverages RealText, a dataset produced via the PR² pipeline starting from ~50K candidate images filtered with GPT-4o and DINOv2. The final RealText dataset comprises 5,397 images with:
- authenticity labels
- pixel-level bounding boxes for all tampered text regions
- free-form explanations identifying both visual and logical anomalies
Forged images include copy-move, AIGC inpainting, text replacement, logo swapping, and arithmetic/date tampering, with a “dense-text” subset (≥ 10 text-lines, 905 images). RealText establishes the first unified and large-scale multi-task benchmark for text-centric forensic analysis, surpassing the scope and granularity of T-IC13 and T-SROIE (Zeng et al., 25 Dec 2025).
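A single annotation might look as follows; this is a hypothetical record mirroring the $(d, B, E)$ report structure, with field names and values invented for illustration rather than quoted from the dataset release:

```python
# Hypothetical RealText-style annotation mirroring the (d, B, E) triple;
# field names and values are illustrative, not from the dataset release.
sample_annotation = {
    "verdict": "forged",                # d: authenticity label
    "regions": [(412, 188, 596, 214)],  # B: tampered-text bounding box(es)
    "explanation": (                    # E: free-form rationale
        "The line-item total reads 128.00, but the listed prices sum to "
        "118.00; the digits also show inconsistent stroke width."
    ),
}
```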
6. Experimental Evaluation and Comparative Results
LogicLens is evaluated on RealText (fine-tuned), T-IC13 (zero-shot), and T-SROIE (zero-shot dense). Macro-F1 is the principal aggregate metric.
| System / Setting | Detection F₁ | Grounding mF₁ | Explanation BS-F₁ | Macro-F₁ |
|---|---|---|---|---|
| LogicLens, RealText (FT) | 93.2 | 36.7 | 76.9 | 68.9 |
| LogicLens, T-IC13 (ZS) | 93.2 | 67.6 | 78.5 | 79.8 |
| LogicLens, T-SROIE (ZS, dense) | 99.4 | 11.0 | 77.0 | 62.5 |
| FakeShield | — | — | — | 40.6 |
| GPT-4o | — | — | — | 55.8 |
| Gemini-2.5-Pro | — | — | — | 53.9 |
| InternVL-3.5 | — | — | — | 48.8 |
| Qwen2.5-VL (SFT) | — | — | — | 36.3 |
LogicLens demonstrates clear superiority in joint reasoning and grounding, with strong zero-shot generalization to unseen datasets (Zeng et al., 25 Dec 2025).
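For reference, the grounding score can be read as a mean per-image F₁ over box matching; the sketch below assumes a greedy match at IoU ≥ 0.5 and reuses the `iou` helper from the reward sketch above, since the exact matching protocol is not reproduced here:

```python
def grounding_mf1(pred_sets, gt_sets, iou_thresh=0.5):
    """Mean per-image F1 of box sets under greedy IoU matching (protocol assumed)."""
    def image_f1(preds, gts):
        if not preds and not gts:
            return 1.0  # authentic image: nothing to ground, perfect agreement
        matched, used = 0, set()
        for p in preds:  # greedily claim the best unmatched ground-truth box
            best = max(
                (j for j in range(len(gts)) if j not in used),
                key=lambda j: iou(p, gts[j]),
                default=None,
            )
            if best is not None and iou(p, gts[best]) >= iou_thresh:
                matched += 1
                used.add(best)
        prec = matched / len(preds) if preds else 0.0
        rec = matched / len(gts) if gts else 0.0
        return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return sum(image_f1(p, g) for p, g in zip(pred_sets, gt_sets)) / len(gt_sets)
```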
7. Strengths, Limitations, and Prospects
LogicLens advances text-centric forgery analysis via generative integration of detection, grounding, and explanation, deep co-reasoning through CCT, reward-aligned RL, and benchmarking with RealText. Principal strengths include state-of-the-art Macro-F1 performance, interpretable and structured output, and adaptability to forensic workflows.
Limitations include dependency on OCR quality and world-knowledge recall, the necessity of precise reward tuning, and inference cost due to model scale (~7B parameters). Future work aims to implement end-to-end OCR and vision backbones, extend modality coverage (e.g., video, multi-frame consistency), reinforce adversarial robustness, and enable efficient distillation for edge deployment (Zeng et al., 25 Dec 2025).
In aggregate, LogicLens establishes a comprehensive paradigm for visual-logical co-reasoning in text-centric forgery analysis, unifying multiple subtasks under a reinforcement-aligned multimodal architecture, anchored by a cognitively-validated, large-scale annotated benchmark.