
LogicLens: Unified Forgery Analysis

Updated 1 January 2026
  • LogicLens is a unified framework for advanced text-centric forgery analysis that integrates visual and logical reasoning to detect and explain image manipulations.
  • It employs a Cross-Cues-aware Chain of Thought mechanism and a weighted multi-task reinforcement learning strategy to optimize detection, region grounding, and explanation synthesis.
  • The system is validated on the RealText benchmark using a PR² hierarchical pipeline, achieving state-of-the-art Macro-F1 scores in both fine-tuned and zero-shot settings.

LogicLens is a unified multimodal framework designed for advanced text-centric forgery analysis, integrating visual and logical reasoning within a single generative process. Developed to address the escalating sophistication of text-centric image manipulations enabled by generative AI, LogicLens jointly optimizes detection of forgeries, region grounding, and explanation synthesis. The architecture is distinguished by its Cross-Cues-aware Chain of Thought (CCT) mechanism for deep visual-logical co-reasoning, a weighted multi-task reward system for reinforcement alignment, and the hierarchical PR² annotation pipeline, validated on the large-scale RealText benchmark (Zeng et al., 25 Dec 2025).

1. Joint Problem Formulation

LogicLens reframes text-centric forgery analysis as a joint generative task over an image I and prompt T, producing a structured report R = (c, B, E), where c is the authenticity verdict, B localizes forged regions, and E is a natural-language rationale. Model parameters θ are learned by maximizing the expected log-likelihood:

\hat{\theta} = \arg\max_\theta \; \mathbb{E}_{(I,T,R)\sim \mathcal{D}}\bigl[\log P_\theta(R \mid I,T)\bigr],

where R is generated token-wise in an autoregressive fashion. Evaluation aggregates detection F1, grounding mean F1 (mF1), and explanation semantic similarity into Macro-F1 (M-F1):

\text{M-F}_1 = \frac{1}{3}\bigl(F_1^{\text{detection}} + F_1^{\text{grounding}} + F_1^{\text{explanation}}\bigr).

This joint paradigm contrasts with prior decoupled approaches and is foundational for LogicLens’s unified reasoning capabilities (Zeng et al., 25 Dec 2025).
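The aggregate metric is a plain unweighted mean of the three sub-task F1 scores; a minimal sketch in Python (the function name is illustrative):

```python
# Sketch: aggregate the three sub-task scores into Macro-F1 as defined above.
def macro_f1(f1_detection: float, f1_grounding: float, f1_explanation: float) -> float:
    """Unweighted mean of the detection, grounding, and explanation F1 scores."""
    return (f1_detection + f1_grounding + f1_explanation) / 3.0

# Example with the fine-tuned RealText scores reported later in this article:
score = macro_f1(93.2, 36.7, 76.9)  # -> approximately 68.9
```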

2. PR² Hierarchical Data-Curation Pipeline

To enable high-fidelity supervision, LogicLens employs the PR² pipeline—a multi-agent system comprising Perceiver, Reasoner, and Reviewer—prior to training:

  • Perceiver ingests raw images, fused RGB-mask visualizations, and OCR transcripts, yielding preliminary forensic analyses: anomalies, bounding boxes, and draft rationales.
  • Reasoner structures these drafts into a six-stage CCT outline, evaluating format, logic, and grounding accuracy, producing a scalar quality score.
  • Reviewer applies QA, iteratively requesting corrections or enhancements until a quality threshold is met.

This pipeline ensures cognitively-aligned, fully-annotated samples, encoded as (c,B,E)(c, B, E) triples for model fine-tuning and policy-gradient rewards (Zeng et al., 25 Dec 2025).
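The Perceiver → Reasoner → Reviewer control flow can be sketched as an iterate-until-quality loop. The agent interfaces, quality threshold, and round limit below are assumptions for exposition; the paper's agents are model-based, not the stubs shown here:

```python
# Illustrative sketch of the PR² curation control flow.
# `threshold` and `max_rounds` are assumed hyperparameters, not from the paper.
from dataclasses import dataclass, field

@dataclass
class Draft:
    anomalies: list = field(default_factory=list)
    boxes: list = field(default_factory=list)        # candidate forged-region boxes
    rationale: str = ""                              # draft explanation text
    score: float = 0.0                               # Reasoner's scalar quality score

def curate(image, perceiver, reasoner, reviewer, threshold=0.9, max_rounds=5):
    draft = perceiver(image)              # preliminary forensic analysis
    for _ in range(max_rounds):
        draft = reasoner(draft)           # structure into CCT outline, score quality
        if draft.score >= threshold:
            break
        draft = reviewer(draft)           # request corrections / enhancements
    return draft                          # encoded downstream as a (c, B, E) triple
```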

3. Cross-Cues-aware Chain of Thought (CCT) Reasoning

CCT is the deep reasoning core of LogicLens, structured in six interdependent stages with chain states h^k:

  1. Knowledge Preparation: Aggregates image context, OCR tokens {w_i}, and relevant forensic/semantic triggers.
  2. Visual Cue Extraction: Computes global (v_g) and local (v_l) anomaly cues via vision backbones, aggregated into h^2.
  3. Logical Cue Extraction: Processes symbolic logical cues L from OCR, checking arithmetic and contextual consistency, updating h^3.
  4. Cross-Cue Validation & Filtering: Applies a learned salience scorer s(·) over visual and logical cues, selecting high-value elements for h^4.
  5. Grounding: Matches selected cues to OCR tokens and regions, constructing the tampered region set B and updating h^5.
  6. Report Synthesis: Decides verdict c and synthesizes rationale E for token-wise report generation.

At each step, modalities are aggregated as:

h^{k+1} = \mathrm{Aggregate}\bigl(h^k, M^k(h^k)\bigr),

where M^k is the stage-specific modality extractor. CCT enables robust cross-validation between observed visual evidence and logical structure, critical for precise forgery identification (Zeng et al., 25 Dec 2025).
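The recurrence reads as a fold of the chain state over the six stage-specific extractors; a schematic sketch (the extractors and aggregate rule below are toy stand-ins, not the paper's modules):

```python
# Schematic of the six-stage CCT chain: each stage k applies its modality
# extractor M^k to the current chain state h^k and aggregates the result.
def run_cct(h0, extractors, aggregate):
    """extractors: six stage functions M^1..M^6; returns the final state h^6."""
    h = h0
    for m_k in extractors:
        h = aggregate(h, m_k(h))   # h^{k+1} = Aggregate(h^k, M^k(h^k))
    return h

# Toy usage: states are dicts merged stage by stage.
stages = [lambda h, i=i: {f"stage{i}": True} for i in range(1, 7)]
final = run_cct({}, stages, lambda h, m: {**h, **m})
```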

4. Weighted Multi-Task Reinforcement Alignment

After supervised pretraining, LogicLens applies GRPO-based reinforcement learning using a composite, weighted multi-task reward:

R_{\text{total}} = \lambda_f R_{\text{format}} + \lambda_g R_{\text{ground}} + \lambda_e R_{\text{explain}},

with empirically set weights (λ_f, λ_g, λ_e) = (0.15, 0.75, 0.10).

  • R_format: presence of structural tags in the output.
  • R_ground: sum of detection, region-count, and IoU-based rewards; e.g., for mIoU > 0.8, R_iou = 0.6.
  • R_explain: cosine similarity of explanation sentence embeddings.
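The composite reward can be sketched directly from the weights above. Note the paper reports only the top IoU tier (R_iou = 0.6 for mIoU > 0.8); the zero fallback below is an assumption for illustration:

```python
# Sketch of the weighted multi-task reward with the paper's weights
# (0.15, 0.75, 0.10). Only the mIoU > 0.8 tier is reported; the
# else-branch returning 0.0 is an assumption.
def iou_reward(miou: float) -> float:
    return 0.6 if miou > 0.8 else 0.0

def total_reward(r_format: float, r_ground: float, r_explain: float,
                 weights=(0.15, 0.75, 0.10)) -> float:
    lf, lg, le = weights
    return lf * r_format + lg * r_ground + le * r_explain
```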

Policy-gradient actor–critic optimization maximizes the expected R_total, with a learned value baseline V_φ for variance reduction:

for each (I, T):
    y ~ P_θ(·|I, T)                          # sample a report from the policy
    R_total = λ_f·R_format + λ_g·R_ground + λ_e·R_explain
    A = R_total - V_φ(I, T)                  # advantage against the value baseline
    g_θ += ∇_θ log P_θ(y|I, T) · A           # policy gradient
    g_φ += ∇_φ (V_φ(I, T) - R_total)²        # value-regression gradient
θ += α_θ · Normalize(g_θ)
φ -= α_φ · g_φ

This reward structure tightly couples detection, grounding, and explanation, enhancing holistic model performance (Zeng et al., 25 Dec 2025).
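The advantage-baseline update above can be made concrete on a toy problem. The sketch below is a generic REINFORCE-with-baseline illustration on a two-arm bandit with a softmax policy, not the paper's GRPO setup or reward:

```python
# Toy illustration of the advantage-baseline update: a 2-arm bandit with a
# softmax policy over logits theta and a scalar learned baseline v.
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                  # policy logits (the "actor")
v = 0.0                              # value baseline V_phi (the "critic")
rewards = np.array([0.2, 1.0])       # arm 1 yields the higher reward

for _ in range(2000):
    p = np.exp(theta - theta.max()); p /= p.sum()
    a = rng.choice(2, p=p)           # y ~ P_theta
    r = rewards[a]
    adv = r - v                      # A = R_total - V_phi
    grad = -p; grad[a] += 1.0        # grad of log softmax at the sampled arm
    theta += 0.1 * grad * adv        # policy-gradient ascent step
    v += 0.1 * (r - v)               # baseline regression step

p = np.exp(theta - theta.max()); p /= p.sum()   # final policy
```

After training, the policy concentrates almost all probability on the better arm, while the baseline keeps per-step updates low-variance.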

5. RealText Benchmark Dataset

LogicLens’s supervised and RL training leverages RealText, a dataset produced via the PR² pipeline starting from ~50K candidate images filtered with GPT-4o and DINOv2. The final RealText dataset comprises 5,397 images with:

  • authenticity labels c ∈ {authentic, forged}
  • pixel-level bounding boxes for all tampered text regions
  • free-form explanations identifying both visual and logical anomalies

Forged images include copy-move, AIGC inpainting, text replacement, logo swapping, and arithmetic/date tampering, with a “dense-text” subset (≥ 10 text-lines, 905 images). RealText establishes the first unified and large-scale multi-task benchmark for text-centric forensic analysis, surpassing the scope and granularity of T-IC13 and T-SROIE (Zeng et al., 25 Dec 2025).
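One annotation triple can be encoded roughly as follows; the field names and the (x1, y1, x2, y2) box format are illustrative assumptions, not the dataset's actual schema:

```python
# Illustrative encoding of one RealText annotation as a (c, B, E) triple.
# Field names and box format are assumptions for exposition.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class RealTextSample:
    verdict: str                                          # c: "authentic" or "forged"
    boxes: List[Tuple[int, int, int, int]] = field(default_factory=list)  # B
    explanation: str = ""                                 # E: visual + logical rationale

sample = RealTextSample(
    "forged",
    [(120, 44, 310, 78)],
    "Total does not equal the sum of line items; edited digits show halo artifacts.",
)
```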

6. Experimental Evaluation and Comparative Results

LogicLens is evaluated on RealText (fine-tuned), T-IC13 (zero-shot), and T-SROIE (zero-shot dense). Macro-F1 is the principal aggregate metric.

| Dataset / Method | Detection F₁ | Grounding mF₁ | Explanation BS-F₁ | Macro-F₁ |
| --- | --- | --- | --- | --- |
| RealText (FT) | 93.2 | 36.7 | 76.9 | 68.9 |
| T-IC13 (ZS) | 93.2 | 67.6 | 78.5 | 79.8 |
| T-SROIE (ZS, dense) | 99.4 | 11.0 | 77.0 | 62.5 |
| FakeShield | – | – | – | 40.6 |
| GPT-4o | – | – | – | 55.8 |
| Gemini-2.5-Pro | – | – | – | 53.9 |
| InternVL-3.5 | – | – | – | 48.8 |
| Qwen2.5-VL (SFT) | – | – | – | 36.3 |

Baseline methods are listed with their reported Macro-F₁ only.

LogicLens demonstrates clear superiority in joint reasoning and grounding, with strong zero-shot generalization to unseen datasets (Zeng et al., 25 Dec 2025).

7. Strengths, Limitations, and Prospects

LogicLens advances text-centric forgery analysis via generative integration of detection, grounding, and explanation, deep co-reasoning through CCT, reward-aligned RL, and benchmarking with RealText. Principal strengths include state-of-the-art Macro-F1 performance, interpretable and structured output, and adaptability to forensic workflows.

Limitations include dependency on OCR quality and world-knowledge recall, the necessity of precise reward tuning, and inference cost due to model scale (~7B parameters). Future work aims to implement end-to-end OCR and vision backbones, extend modality coverage (e.g., video, multi-frame consistency), reinforce adversarial robustness, and enable efficient distillation for edge deployment (Zeng et al., 25 Dec 2025).

In aggregate, LogicLens establishes a comprehensive paradigm for visual-logical co-reasoning in text-centric forgery analysis, unifying multiple subtasks under a reinforcement-aligned multimodal architecture, anchored by a cognitively-validated, large-scale annotated benchmark.
