
ForenAgent: Iterative Forensic Analysis

Updated 25 December 2025
  • ForenAgent is a multi-round forensic paradigm for image forgery detection that uses iterative, tool-based reasoning with a multimodal LLM and Python sandbox.
  • The framework employs a sequential process of global perception, local focusing, iterative probing, and holistic adjudication to integrate low-level forensic clues.
  • It combines supervised instruction tuning with reinforcement fine-tuning, achieving state-of-the-art accuracy and F1 scores on benchmarks like FABench.

ForenAgent refers to a multi-round agentic forensic analysis paradigm, exemplified in recent research for image forgery detection. It operationalizes forensic reasoning via an iterative process in which a multimodal LLM (MLLM) autonomously generates, executes, and refines tool-based Python code to guide low-level investigative actions, combining global and local evidence streams into coherent, interpretable verdicts (Zhang et al., 18 Dec 2025).

1. System Architecture and Reasoning Loop

ForenAgent is architected around a multimodal LLM backbone (e.g., Qwen2.5-VL-7B) coupled with a Python execution sandbox that hosts a compact "forensic toolbox" of image-processing primitives. The core interaction follows a "think-and-code" loop that iterates over four principal stages inspired by human forensic workflows:

  1. Global Perception: Invocation of global, low-level forensic tools (e.g., DCT high-pass, FFT residual) on entire images to obtain broad-spectrum cues.
  2. Local Focusing: Guided by the global signals, the agent dynamically localizes regions of interest via code-generated Crop tool calls, focusing attention on candidate tampered regions.
  3. Iterative Probing: Specialized forensic routines (e.g., JPEG Ghost, SRM noise residual) are targeted at focused patches to identify micro-level anomalies.
  4. Holistic Adjudication: The agent aggregates all intermediate artifacts and tool outputs to deliver a structured authenticity verdict (using a formal <answer> tag) and, if appropriate, a mask or coordinates delimiting the tampered segment.

Each step interleaves image/context input, LLM code generation (e.g., foren_tools utilities in Python), sandboxed execution, and feedback into the reasoning context, repeating until a verdict is reached or the reasoning budget is exhausted (Zhang et al., 18 Dec 2025).
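
To make the reasoning loop concrete, the sketch below outlines the think-and-code control flow under stated assumptions: the generate and execute callables stand in for the MLLM backbone and the Python sandbox, and the <code>/<answer> parsing is reduced to regular expressions. It illustrates the paradigm rather than the released ForenAgent implementation:

import re

def forensic_agent_loop(image, question, generate, execute, max_turns=8):
    """Iterate: the MLLM emits code, the sandbox runs it, and the output feeds the next turn."""
    context = [{"role": "user", "content": [image, question]}]
    for _ in range(max_turns):
        response = generate(context)                      # MLLM "think-and-code" step
        context.append({"role": "assistant", "content": response})
        answer = re.search(r"<answer>(.*?)</answer>", response, re.S)
        if answer:                                        # holistic adjudication reached
            return answer.group(1).strip()
        code = re.search(r"<code>(.*?)</code>", response, re.S)
        if code:
            tool_output = execute(code.group(1))          # sandboxed foren_tools execution
            context.append({"role": "user", "content": tool_output})
    return "inconclusive"                                 # reasoning budget exhausted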

2. Training Methodology and Reward Structure

ForenAgent is trained under a two-phase regime:

  • Cold Start (Supervised Instruction-Tuning): The backbone LLM is fine-tuned over approximately 200k agent-interaction question/answer pairs from the FABench dataset. The objective is standard cross-entropy token-level prediction, with an emphasis on both reasoning traces (code+justification) and final answers:

$$L_{\text{cold}}(\theta) = -\sum_{(x,y)\in D} \log P_\theta(y \mid x)$$
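
In practice this objective is ordinary next-token cross-entropy over the concatenated reasoning trace and answer. The sketch below assumes a Hugging-Face-style causal LM in PyTorch and is illustrative rather than the paper's training code:

import torch.nn.functional as F

def cold_start_loss(model, input_ids, labels):
    """Token-level cross-entropy; labels mirror input_ids, with -100 masking prompt/image tokens."""
    logits = model(input_ids).logits                  # (batch, seq_len, vocab), HF-style output
    shift_logits = logits[:, :-1, :].contiguous()     # each position predicts the next token
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,                            # skip masked (unsupervised) positions
    )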

  • Reinforcement Fine-Tuning (RFT) via Group Relative Policy Optimization (GRPO): The policy $\pi_\theta$ is updated to maximize a reward $R(\tau)$ defined over entire reasoning trajectories $\tau$. The reward decomposes into correctness of the final output, compliance with syntactic constraints (valid <code> and <answer> blocks), and a weighted composite of tool-use metrics, including:
    • Global Forensic Priority: Early use of low-level forensic utilities is encouraged, with time-decayed weighting.
    • Tool Logic: Syntactic and API-correctness of code.
    • Crop Sensitivity: Positive reward if the agent performs cropping in scenarios meriting localization.
    • Reasoning Coherence: Sequential tool use and chaining across steps.

$$R(\tau) = \lambda_{\text{acc}}\, R_{\text{acc}}(\tau) + \lambda_{\text{format}}\, R_{\text{format}}(\tau) + \lambda_{\text{tool}}\, R_{\text{tool}}(\tau)$$

This methodology explicitly scaffolds flexible, tool-based multi-stage reasoning.
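
The sketch below shows how such a composite trajectory reward could be assembled; the component scorers, default weights, and the Trajectory fields are assumptions inferred from the description above, not ForenAgent's released reward code:

from dataclasses import dataclass, field

@dataclass
class Trajectory:
    final_answer: str
    ground_truth: str
    well_formatted: bool                              # valid <code> and <answer> blocks present
    tool_scores: dict = field(default_factory=dict)   # e.g. {"global_priority": 0.8, "crop_sensitivity": 1.0}

def trajectory_reward(tau, lam_acc=1.0, lam_format=0.5, lam_tool=0.5, tool_weights=None):
    r_acc = 1.0 if tau.final_answer == tau.ground_truth else 0.0              # final-verdict correctness
    r_format = 1.0 if tau.well_formatted else 0.0                             # syntactic compliance
    tool_weights = tool_weights or {k: 1.0 / max(len(tau.tool_scores), 1) for k in tau.tool_scores}
    r_tool = sum(tool_weights.get(k, 0.0) * v for k, v in tau.tool_scores.items())  # tool-use composite
    return lam_acc * r_acc + lam_format * r_format + lam_tool * r_tool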

3. Forensic Toolbox and Tool Usage

The Python-based forensic toolbox ("foren_tools") controlled by the agent distinguishes between:

  • Basic Image Processing Tools: Cropping, contrast/brightness adjustment, etc.
  • Low-Level Forensics Primitives: Twelve specialized routines, such as FFT High-Frequency Residual, Discrete Wavelet Transform subbands, Resampling Periodicity Detector, DCT-based High-Pass, SRM Noise Residual, Bayar Constrained Convolution, PRNU Fingerprint Correlation, Sobel Edges, JPEG Ghost detection, Median-Filter Trace, and Local Correlation Map.
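
For intuition about what these primitives compute, the sketch below implements one of them, an FFT high-frequency residual, in the form a foren_tools-style toolbox might expose; the cutoff radius and normalization are illustrative choices rather than the paper's exact routine:

import numpy as np

def fft_high_freq_residual(gray_image, cutoff_ratio=0.1):
    """Suppress the low-frequency band and return a normalized residual magnitude map."""
    h, w = gray_image.shape
    spectrum = np.fft.fftshift(np.fft.fft2(gray_image.astype(np.float64)))
    yy, xx = np.ogrid[:h, :w]                          # centered low-frequency disk mask
    radius = cutoff_ratio * min(h, w)
    spectrum[(yy - h / 2) ** 2 + (xx - w / 2) ** 2 <= radius ** 2] = 0
    residual = np.abs(np.fft.ifft2(np.fft.ifftshift(spectrum)))
    return (residual - residual.min()) / (residual.max() - residual.min() + 1e-8)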

At each reasoning turn, the agent emits code similar to:

import matplotlib.pyplot as plt
from foren_tools import jpeg_ghost

# Probe the focused patch for JPEG Ghost compression artifacts and visualize the response map.
mask = jpeg_ghost(image_patch)
plt.imshow(mask, cmap='hot')
plt.show()

The executed output is reintegrated into the context for subsequent steps, guiding the agent’s iterative refinement (Zhang et al., 18 Dec 2025).

4. Dataset Construction and Evaluation Protocols

ForenAgent is benchmarked using FABench, a large-scale, agent-forensics dataset comprising 100k images (split into authentic, synthetic, and tampered categories) and approximately 200k code-augmented question/answer trajectories. Images are drawn as follows:

  • Authentic: 40k from COCO.
  • Synthetic: 30k images synthesized from COCO captions using seven different generative models.
  • Tampered: 30k images produced by mask-guided inpainting over COCO source images.

Quality control includes resolution checks, file-integrity audits, and human vetting. The QA pipeline prompts GPT-4.1 over the images to produce verified, executable reasoning traces with accurate localizations. Standard image-level Accuracy and F1 are computed for quantitative evaluation (Zhang et al., 18 Dec 2025).
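
As a reference for the quantitative protocol, image-level Accuracy and F1 over binary real/forged verdicts can be computed as sketched below (scikit-learn is used purely for illustration; the paper's exact evaluation script is not specified):

from sklearn.metrics import accuracy_score, f1_score

def image_level_metrics(y_true, y_pred):
    """y_true / y_pred: per-image labels, 0 for authentic and 1 for forged."""
    return {"accuracy": accuracy_score(y_true, y_pred), "f1": f1_score(y_true, y_pred)}

# Example: four images, one misclassified forgery.
print(image_level_metrics([0, 1, 1, 0], [0, 1, 0, 0]))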

5. Performance Benchmarks and Diagnostic Analysis

ForenAgent achieves state-of-the-art scores on FABench detection tasks:

| Method | Overall Accuracy (%) | Overall F1 (%) |
|---|---|---|
| UnivFD (CVPR'23) | 81.1 | 80.9 |
| SIDA (CVPR'25) | 76.7 | 76.5 |
| Qwen2.5-VL-7B (sup.) | 79.7 | 79.7 |
| ForenAgent | 88.1 | 88.2 |

Category-wise performance is robust, with ForenAgent attaining 93.3% Accuracy / 89.4% F1 on authentic, 91.3% / 87.1% on synthetic, and 92.1% / 88.0% on tampered images. On the SIDA-Test set, ForenAgent generalizes with 80.6% Accuracy, surpassing SIDA at 77.2% (Zhang et al., 18 Dec 2025).

Ablation studies indicate that both Cold Start and Reinforcement Fine-Tuning contribute substantially to performance (Cold Start lifts >8 points over baseline, RFT boosts >6 points). Purposeful tool sequencing, particularly global-to-local probing, is shown to account for an additional ~4.6 point improvement. Usage analysis reveals SRM, FFT, and JPEG Ghost as the most frequently invoked forensic primitives; synthetic cases average three tool calls, tampered cases four.

6. Interpretability, Emergent Behaviors, and Implications

ForenAgent’s process yields full transparency via explicit evidence chains anchored in executable code, enabling verification and introspection. Qualitative analyses document the agent’s emergent self-reflection, such as correcting initial mislocalizations based on intermediate outputs. This framework demonstrates that the dynamic composition of human-like, iterative tool use within a code-in-the-loop paradigm is critical for interpretable and adaptable forensic intelligence.

A plausible implication is that the ForenAgent approach, by leveraging direct integration of domain-specific toolchains under LLM control, charts a viable trajectory toward general-purpose, explainable, and auditable digital forensics, in image analysis and potentially in broader multimodal investigative domains (Zhang et al., 18 Dec 2025).
