
BoxPromptIML: Efficient Image Manipulation Localization

Updated 2 December 2025
  • BoxPromptIML is a weakly supervised learning approach for image manipulation localization that uses bounding box annotations and pseudo-masks generated by a promptable segmentation teacher.
  • It employs a student-teacher distillation framework with an active memory feature fusion mechanism, achieving high accuracy while drastically reducing annotation cost.
  • Experimental results demonstrate that BoxPromptIML surpasses traditional methods with efficient computation and competitive F1 scores on both in-distribution and out-of-distribution datasets.

BoxPromptIML refers to a weakly supervised learning paradigm for Image Manipulation Localization (IML) that leverages bounding box annotations—rather than full segmentation masks—to generate accurate manipulation masks via promptable vision foundation models (VFMs). This approach is designed to drastically reduce annotation cost and computational overhead for pixel-level instance localization, while maintaining or exceeding the accuracy and robustness of fully supervised and other weakly supervised methods. The core workflow involves generating pseudo-masks from bounding boxes using a promptable segmentation teacher (typically SAM), and then distilling these pseudo-labels into a compact, deployable student model via an active memory feature fusion mechanism (Guo et al., 25 Nov 2025).

1. Motivation and Problem Setting

IML demands fine-grained spatial localization for detecting manipulated regions, but standard fully supervised frameworks depend on dense, labor-intensive mask annotation, often requiring up to 23 minutes per image. Conventional weakly supervised techniques rely on image-level labels, providing minimal spatial information and yielding poor localization quality. BoxPromptIML exploits bounding box prompts as a middle ground—annotation of a rough box around a manipulated region typically takes only about 7 seconds, providing essential spatial cues while incurring negligible annotation overhead compared to pixel masks (Guo et al., 25 Nov 2025).

2. Coarse Region Annotation and Pseudo-Mask Generation

The annotation process in BoxPromptIML consists of delineating coarse bounding boxes $B = [x_1, y_1, x_2, y_2]$ over tampered areas in an image $I \in \mathbb{R}^{H \times W \times 3}$. These boxes serve as input prompts to a frozen SAM teacher, which encodes $B$ via its prompt encoder and then produces a high-resolution binary manipulation mask $\hat M_{teacher} \in \{0, 1\}^{H \times W}$ through the mask decoder:

$$\hat M_{teacher} = \mathrm{SAM}(I, B)$$

No further human mask correction is required—these pseudo-masks serve as learning targets ("soft labels") for the student network during training. This pseudo-labeling mechanism allows the annotation burden to be shifted from experts to the automatic capacity of the frozen promptable model, assuming the latter provides sufficiently accurate masks (Guo et al., 25 Nov 2025).
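
As a concrete illustration, the following is a minimal sketch of this pseudo-mask generation step using Meta's segment_anything package; the checkpoint filename, image path, and box coordinates are illustrative placeholders rather than values from the paper.

```python
# Minimal sketch: generate a pseudo-mask from a coarse bounding box with a
# frozen SAM teacher. Paths and coordinates below are placeholders.
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.eval()  # inference only; the teacher is never fine-tuned
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("tampered.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

box = np.array([120, 80, 340, 260])  # coarse [x1, y1, x2, y2] from the annotator
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
pseudo_mask = masks[0].astype(np.uint8)  # binary H×W training target for the student
```

This mirrors the $\hat M_{teacher} = \mathrm{SAM}(I, B)$ step above: one forward pass per box, with no human mask correction.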

3. Student–Teacher Knowledge Distillation Framework

BoxPromptIML employs a frozen SAM as the teacher and distills its pseudo-label outputs into a lightweight student. The student model comprises a Tiny-ViT backbone (four-stage, multi-scale, 5.5 M parameters, 1.4 G FLOPs, input size $224 \times 224$), followed by a decoder based on the Memory-Guided Gated Fusion Module (MGFM).
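
As a sketch of the backbone side, the snippet below uses timm's tiny_vit_5m_224 (about 5.5 M parameters) as a stand-in for the four-stage Tiny-ViT; that this timm build corresponds to the authors' exact backbone is an assumption.

```python
# Sketch: extract four multi-scale feature maps from a Tiny-ViT-style backbone.
# Assumes timm's tiny_vit_5m_224 exposes features_only, as most timm backbones do.
import timm
import torch

backbone = timm.create_model("tiny_vit_5m_224", pretrained=False, features_only=True)
x = torch.randn(1, 3, 224, 224)   # matches the stated input size
feats = backbone(x)               # one map per stage, coarser at each stage
for f in feats:
    print(f.shape)                # e.g. strides 4, 8, 16, 32
```

These four maps are what the MGFM decoder (Section 4) fuses into a single manipulation mask.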

The key training objective for the student is a binary cross-entropy loss between its refined attention mask $A_{refined}$ and the teacher's pseudo-mask:

$$\mathcal{L}_{distill} = -\sum_{i,j} \left[ \hat M_{teacher}^{ij} \log A_{refined}^{ij} + \left(1 - \hat M_{teacher}^{ij}\right) \log \left(1 - A_{refined}^{ij}\right) \right]$$

No temperature scaling or feature-map alignment is required beyond this loss. The student model, after distillation, is independent of bounding box prompts at inference time and generates manipulation masks directly from the input image (Guo et al., 25 Nov 2025).
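
In code this objective is ordinary pixel-wise binary cross-entropy; the sketch below uses the numerically stable logits form, which matches the sum above up to the reduction over pixels (tensor names are placeholders).

```python
# Sketch: distillation loss between the student's refined attention logits and
# the SAM teacher's binary pseudo-mask.
import torch
import torch.nn.functional as F

def distill_loss(a_refined_logits: torch.Tensor,  # student output, (B, 1, H, W)
                 m_teacher: torch.Tensor          # pseudo-mask in {0, 1}, (B, 1, H, W)
                 ) -> torch.Tensor:
    return F.binary_cross_entropy_with_logits(a_refined_logits, m_teacher.float())
```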

4. Active Memory Feature Fusion Mechanism

The MGFM is inspired by human memory systems, combining short-term gate maps (computed per image to localize tampered areas in real time) with long-term prototypical attention patterns (a memory bank updated from batch statistics). The module operates by (a) generating gated feature integrations for four aligned multi-scale feature maps, (b) aggregating gate priors, and (c) fusing real-time base attention with prototypes from the memory bank:

  • Gated Integration: For each feature $F'_i$, a gate $G_i = \sigma(\mathrm{Conv}_{1 \times 1}(F'_i))$ is computed and combined with other gates/features.
  • Prototype Memory: The memory bank $M$ (of size $K$) stores batch-averaged attention maps, updated as

$$M \leftarrow \mathrm{momentum} \cdot M + (1 - \mathrm{momentum}) \cdot \mathrm{Average}(A'_{base})$$

  • Final Attention Fusion: Combines base attention, gate priors, and memory prototype:

$$A_{final} = \alpha \, (A'_{base} \odot G_{avg}) + (1-\alpha) \, \bar A_{mem}, \quad \alpha \in [0, 1]$$

  • Mask Refinement: The final segmentation mask $A_{refined}$ is generated by adding $A_{final} \odot F_{fused}$ to the fused feature map.

This dual guidance enables the student to recall context-dependent priors for robust localization even on challenging, out-of-distribution examples (Guo et al., 25 Nov 2025).
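
A minimal PyTorch sketch of these three steps is given below. It assumes the multi-scale features have already been aligned to a common resolution (taken here as $56 \times 56$); layer choices, the bank size $K$, and values such as momentum and $\alpha$ are illustrative, not the authors' exact implementation.

```python
# Sketch of the MGFM: (a) gated integration, (b) gate-prior aggregation and a
# momentum-updated prototype memory, (c) final attention fusion and refinement.
import torch
import torch.nn as nn

class MGFMSketch(nn.Module):
    def __init__(self, channels: int, num_scales: int = 4, bank_size: int = 8,
                 momentum: float = 0.9, alpha: float = 0.7, size: int = 56):
        super().__init__()
        # (a) one 1x1 convolutional gate per aligned multi-scale feature map
        self.gates = nn.ModuleList(
            nn.Conv2d(channels, 1, kernel_size=1) for _ in range(num_scales))
        self.momentum, self.alpha = momentum, alpha
        # long-term memory: K prototype attention maps, updated by batch statistics
        self.register_buffer("memory", torch.zeros(bank_size, 1, size, size))

    def forward(self, feats: list[torch.Tensor], a_base: torch.Tensor):
        # (a) G_i = sigmoid(Conv1x1(F_i)); gate each feature before fusing
        gates = [torch.sigmoid(g(f)) for g, f in zip(self.gates, feats)]
        f_fused = sum(g * f for g, f in zip(gates, feats)) / len(feats)
        g_avg = sum(gates) / len(gates)            # (b) aggregated gate prior
        if self.training:                          # M <- m*M + (1-m)*Average(A_base)
            with torch.no_grad():
                batch_avg = a_base.mean(dim=0, keepdim=True)
                self.memory.mul_(self.momentum).add_((1 - self.momentum) * batch_avg)
        a_mem = self.memory.mean(dim=0, keepdim=True)   # prototype attention
        # (c) A_final = alpha * (A_base ⊙ G_avg) + (1 - alpha) * mean memory
        a_final = self.alpha * (a_base * g_avg) + (1 - self.alpha) * a_mem
        # mask refinement: add A_final ⊙ F_fused back onto the fused features
        return f_fused + a_final * f_fused
```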

5. Training and Inference Workflow

The complete BoxPromptIML training pipeline features:

  • Preprocessing: Input images and box annotations are loaded.
  • Pseudo-mask generation: For each batch, the SAM teacher computes binary masks corresponding to boxes, with no further user intervention.
  • Student Forward Pass: The student backbone processes the image; MGFM performs multi-scale, memory-augmented fusion for mask prediction.
  • Loss and Optimization: Binary cross-entropy between the predicted and teacher masks; AdamW optimizer (learning rate $1 \times 10^{-4}$, weight decay $0.05$); typical settings are 20 epochs, batch size 16, with data augmentations (flip, crop, color jitter).
  • Inference: Once trained, the student mask predictor runs box-free inference for efficient deployment on new data.

At test time, the pipeline operates entirely without prompt inputs or access to the teacher, enabling low-latency, stand-alone deployment (Guo et al., 25 Nov 2025).
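
Under these stated hyperparameters, a training loop sketch might look as follows; `student`, `sam_teacher`, and the dataset yielding (image, boxes) pairs are hypothetical placeholders, not the authors' released code.

```python
# Sketch: distillation training with AdamW (lr 1e-4, weight decay 0.05),
# 20 epochs, batch size 16, followed by box-free inference.
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset

def train_student(student: torch.nn.Module, sam_teacher, train_dataset: Dataset):
    optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4, weight_decay=0.05)
    loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
    for _ in range(20):                                 # epochs
        for images, boxes in loader:
            with torch.no_grad():                       # teacher stays frozen
                m_teacher = sam_teacher(images, boxes)  # pseudo-masks from boxes
            logits = student(images)                    # box-free student forward pass
            loss = F.binary_cross_entropy_with_logits(logits, m_teacher.float())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student

# At inference the distilled student runs alone, with no boxes and no teacher:
#   mask = torch.sigmoid(student(image)) > 0.5
```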

6. Experimental Results and Analysis

Quantitative evaluation demonstrates that BoxPromptIML achieves state-of-the-art results on a wide range of IML datasets. On four in-distribution sets (CASIAv2, Coverage, Columbia, NIST16), BoxPromptIML, weakly supervised with coarse boxes only, achieves an average F1 score of 0.619 after 20 epochs, outperforming several fully supervised baselines (TruFor, PSCC-Net, SparseVit). On out-of-distribution datasets (CocoGlide, In-the-Wild, Korus, DSO, IMD2020), it achieves an OOD F1 of 0.285, matching or exceeding many fully supervised approaches.

Compared to other weakly supervised methods, BoxPromptIML substantially narrows the performance gap to full supervision, with an F1 of 0.619 versus 0.239–0.400 for WSCL, EdgeCAM, SOWCL, and SCAF.

BoxPromptIML is also highly efficient, requiring only 5.5 M parameters and 1.4 G FLOPs; PSCC-Net, by comparison, uses 3.7 M parameters but 45.7 G FLOPs while reaching a much lower F1. Analysis under real-world media compression (Facebook/Weibo/WeChat/WhatsApp) confirms robustness: BoxPromptIML maintains F1 ≈ 0.47–0.53, surpassing the majority of fully or weakly supervised competitors (Guo et al., 25 Nov 2025).

7. Limitations and Future Directions

BoxPromptIML inherits certain limitations from its dependency on the SAM teacher. Inaccuracies or biases in the pseudo-masks may propagate to the student network. The current implementation utilizes a fixed memory bank for the MGFM; dynamic, class- or dataset-specific memory mechanisms could potentially enhance adaptation and generalization.

The approach is presently optimized for bounding box supervision. Extensions could include accepting alternative weak prompts (scribbles, points), adapting to novel domains with zero or few labels, or leveraging multi-teacher distillation strategies combining several promptable segmentation VFM sources (e.g., SAM variants, MedSAM2, Uni-Segmenter).

The central contribution of BoxPromptIML lies in its practical, annotation-efficient pipeline that reaches or surpasses fully supervised performance on localization tasks, without the cost or computational overhead associated with dense manual masks or large backbone deployments (Guo et al., 25 Nov 2025).
