CAAAv2: Category-Aware Auto-Annotation v2

Updated 4 July 2026

CAAAv2 is a framework that auto-generates pixel-level masks for manipulated regions by splitting the task into SPG (in-place edits) and SDG (object copy) sub-tasks.
It employs branch-specific models (DASS with absolute difference maps for SPG, and Corr-DINO with frozen ViT features for SDG) along with Quality of Example Selection (QES) filtering.
CAAAv2 underpins the creation of the large-scale MIMLv2 dataset, providing high-quality annotations that drive significant performance gains in webly supervised image manipulation localization.

CAAAv2, short for Category-Aware Auto-Annotation v2, is a paradigm for pixel-level auto-annotation of manipulated regions in web-harvested image forgery data. It was introduced to address a central bottleneck in image manipulation localization (IML): the high cost of obtaining pixel-precise masks and the resulting scarcity of high-quality annotated datasets. The method operates on pairs of manually forged images and their authentic originals, decomposes constrained image manipulation localization (CIML) into Shared Probe Group (SPG) and Shared Donor Group (SDG) sub-tasks, applies a specialized model to each sub-task, and filters unreliable SDG masks with Quality of Example Selection (QES). In the reported pipeline, CAAAv2 produces the annotations used to construct MIMLv2, a large-scale dataset with 246,212 manually forged images and pixel-level masks, and serves as the data-generation substrate for webly supervised IML training (Qu et al., 28 Aug 2025).

1. Problem setting and design rationale

CAAAv2 is situated in the context of image manipulation localization, where the objective is to identify manipulated regions at the pixel level. The motivating observation is that manually forged images are abundant on the web, whereas dense ground truth annotations are not. The framework therefore treats web data as a supervision source and uses CIML as an auxiliary mechanism for annotation generation rather than relying exclusively on handcrafted mask datasets (Qu et al., 28 Aug 2025).

The specific diagnosis underlying CAAAv2 is that prior CIML approaches suffer from three limitations. First, they treat all forged/original pairs uniformly, despite substantive differences between copy-move-like object transfer and in-place edits. Second, they over-fit small CIML training sets, especially SDG data. Third, they under-utilize the absolute pixel-difference cue that is especially informative in SPG settings. CAAAv2 responds by explicitly splitting CIML into SPG and SDG and assigning each branch a tailored architecture.

In this formulation, SPG denotes pairs that share their main content in-place, while SDG denotes pairs related via object copy. This categorical split is not merely taxonomic; it determines which cues are considered reliable. In SPG, direct appearance differencing is highly informative. In SDG, by contrast, localization depends on cross-image feature correspondence and denoising of spurious matches. A plausible implication is that CAAAv2 treats CIML less as a single segmentation problem and more as a conditional family of problems whose optimal inductive biases differ by edit category.

2. End-to-end auto-annotation pipeline

CAAAv2 takes as input a large collection of manually forged images and their authentic originals, harvested from the web, including sources such as imgur. The pipeline begins with self-supervised SPG/SDG classification, continues with branch-specific mask prediction, and ends with selective filtering of SDG outputs using QES (Qu et al., 28 Aug 2025).

The first stage trains a lightweight binary classifier using synthetic SPG and SDG pairs generated by random augmentations of unlabeled images. Its role is to determine whether an image pair shares its main content in-place or through copied objects. The description emphasizes that this classifier is “simple and reliable” because it only needs to detect whether large areas are identical.
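As an illustration of how such synthetic training pairs could be produced, the sketch below builds one SPG-style pair (a region edited in place) and one SDG-style pair (a crop pasted onto a different image). The source does not specify the augmentations, so the function names, patch sizes, and the random-noise edit are all assumptions chosen only to show the labelling idea.

```python
import numpy as np

def make_spg_pair(img, rng):
    """In-place edit (assumed augmentation): overwrite one patch, keep the rest identical."""
    h, w, _ = img.shape
    ph, pw = h // 4, w // 4
    y, x = rng.integers(0, h // 2), rng.integers(0, w // 2)
    edited = img.copy()
    edited[y:y + ph, x:x + pw] = rng.integers(0, 256, size=(ph, pw, 3))
    return img, edited, 0  # label 0 = SPG

def make_sdg_pair(donor, background, rng):
    """Object copy (assumed augmentation): paste a donor crop onto another image."""
    h, w, _ = donor.shape
    ph, pw = h // 4, w // 4
    sy, sx = rng.integers(0, h - ph), rng.integers(0, w - pw)  # source corner
    ty, tx = rng.integers(0, h - ph), rng.integers(0, w - pw)  # target corner
    forged = background.copy()
    forged[ty:ty + ph, tx:tx + pw] = donor[sy:sy + ph, sx:sx + pw]
    return donor, forged, 1  # label 1 = SDG
```

In an SPG pair most pixels are identical at the same coordinates, whereas in an SDG pair only the copied patch is shared, which is the kind of coarse signal the classifier is said to rely on.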

After routing, SPG pairs are processed by Difference-Aware Semantic Segmentation (DASS), while SDG pairs are processed by Correlation DINO (Corr-DINO). The output is a pixelwise manipulation mask $M$ for each pair. Only SDG masks are passed through the QES filter; SPG masks are treated as inherently high quality and bypass it.

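The paper also provides a compact algorithmic summary. In terms of the stages described above, it amounts to three steps: classify each forged/original pair as SPG or SDG; predict the mask $M$ with DASS for SPG pairs and with Corr-DINO for SDG pairs; and score each SDG mask with QES, discarding those that fall below the threshold.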

This pipeline is category-aware in a strict operational sense: the category decision changes both the architecture and the quality-control path.
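The routing logic can be sketched as a small dispatcher. This is a minimal sketch, not the paper's code: `classify`, `dass`, `corr_dino`, and `qes` are hypothetical callables standing in for the trained components, and the use of `>=` at the threshold is an assumption.

```python
def annotate_pair(img_a, img_b, classify, dass, corr_dino, qes, qes_thr=0.5):
    """Route one forged/original pair through the category-specific branch.

    Returns a mask, or None when an SDG mask is rejected by the QES filter.
    """
    category = classify(img_a, img_b)  # expected to return "SPG" or "SDG"
    if category == "SPG":
        # SPG masks skip quality filtering.
        return dass(img_a, img_b)
    mask = corr_dino(img_a, img_b)
    # Assumption: a mask is kept when its score reaches the threshold.
    return mask if qes(mask) >= qes_thr else None
```

The category decision therefore selects both the model and whether the filter is applied at all.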

3. SPG branch: Difference-Aware Semantic Segmentation

The SPG branch is implemented as DASS, a semantic segmentation pipeline centered on the explicit use of the absolute difference map between the forged image and its authentic counterpart. For a pair $(I_a, I_b)$ , CAAAv2 computes

$D(x,y) = |I_a(x,y) - I_b(x,y)|.$

The tensors are then concatenated as $[I_a, I_b, D]$ and fed into a semantic-segmentation network with a VAN backbone and a Multi-Aspect Denoiser decoder, producing the per-pixel probability map $M_{\text{spg}}$ (Qu et al., 28 Aug 2025).
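A minimal NumPy sketch of the input construction follows. The function name, the scaling to $[0, 1]$, and the channel-last layout are assumptions; the source only specifies the absolute difference and the concatenation $[I_a, I_b, D]$.

```python
import numpy as np

def build_dass_input(img_a, img_b):
    """Stack forged image, original image, and |I_a - I_b| along the channel axis."""
    # Cast to float first: subtracting uint8 arrays directly would wrap around.
    a = img_a.astype(np.float32) / 255.0
    b = img_b.astype(np.float32) / 255.0
    d = np.abs(a - b)
    return np.concatenate([a, b, d], axis=-1)  # H x W x 9 for RGB inputs
```

For two RGB images of the same size, the result has nine channels, with the last three holding the difference map.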

The design premise is that SPG pairs preserve scene structure while modifying content in place, making the difference map a strong localization cue. The reported best-practice guidance is explicit: always leverage the absolute difference map in SPG; without it, segmentation IoU drops by more than 20 points. This is one of the clearest methodological claims in the description and frames DASS as a branch whose performance depends on exploiting low-level discrepancies rather than primarily on high-level matching.

Empirically, on the IMD20–SPG test, DASS substantially exceeds the listed baselines. The reported values are IoU = 0.835, F1 = 0.889 for DASS, compared with IoU = 0.497 for OTSU binarization, IoU = 0.578 for DMVN, and IoU = 0.573 for DMAC (Qu et al., 28 Aug 2025). Within the boundaries of the provided data, this establishes that SPG-specific modeling is a major contributor to CAAAv2’s annotation fidelity.
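For reference, the two metrics can be computed for a single binary mask as below. This uses the standard per-image definitions; how the paper averages over the test set is not stated in the source, and the function name is illustrative.

```python
import numpy as np

def iou_f1(pred, gt):
    """Return (IoU, F1) for two binary masks of equal shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    union = tp + fp + fn
    # Convention (assumption): two empty masks count as a perfect match.
    iou = tp / union if union else 1.0
    f1 = 2 * tp / (2 * tp + fp + fn) if union else 1.0
    return float(iou), float(f1)
```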

A plausible implication is that SPG localization is dominated by difference denoising rather than semantic retrieval, which helps explain why a segmentation architecture with explicit differencing is preferred over a correspondence-heavy formulation in this branch.

4. SDG branch: Corr-DINO, aggregation, and denoising

The SDG branch addresses the case in which manipulation involves object copy, making raw pixel differencing less reliable. CAAAv2 therefore uses Corr-DINO, which is built around frozen ViT (DINOv2) features, explicit correlation construction, learnable aggregation, Feature Super-Resolution (FSR), and a Multi-Aspect Denoiser (Qu et al., 28 Aug 2025).

Feature extraction is performed from the first four DINOv2 layers:

$F_{a_i}, F_{b_i} \leftarrow \text{ViT (DINOv2)} \text{ layers } i=1\ldots4.$

Correlation features are then formed as

$F_{a\_\text{corr}} = [ \text{Corr}(F_{a1},F_{b1}),\ \text{Corr}(F_{a1},F_{a1}) ],$

$F_{b\_\text{corr}} = [ \text{Corr}(F_{b1},F_{a1}),\ \text{Corr}(F_{b1},F_{b1}) ].$

Here, $\text{Corr}(f,g)$ is the spatial correlation (dot-product) of two feature maps. The branch then applies learnable aggregation:

$F_{\text{aggr}} = \left[ C_{1\times1}(\text{ReLU}(C_{1\times1}(F_{\text{corr}})))\ ;\ \text{AvgPool}(F_{\text{corr}})\ ;\ \text{MaxPool}(F_{\text{corr}}) \right].$
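The correlation and aggregation steps can be sketched in NumPy as follows. This is an illustration under stated assumptions: features are laid out as (C, H, W), the 1×1 convolutions are written as per-position linear maps, the pooling is taken over the correlation-channel axis, and the weights passed to `aggregate` are stand-ins for learned parameters. None of these layout choices is confirmed by the source.

```python
import numpy as np

def spatial_correlation(f, g):
    """Dot-product correlation between every position of f and every position of g.

    f, g: arrays of shape (C, H, W). Returns an array of shape (H*W, H*W).
    """
    c, h, w = f.shape
    _, x, y = g.shape
    return np.einsum("chw,cxy->hwxy", f, g).reshape(h * w, x * y)

def correlation_features(f_a, f_b):
    """Concatenate cross-correlation and self-correlation, as in F_a_corr."""
    return np.concatenate(
        [spatial_correlation(f_a, f_b), spatial_correlation(f_a, f_a)], axis=1
    )

def aggregate(f_corr, w1, b1, w2, b2):
    """[conv1x1(ReLU(conv1x1(F))) ; AvgPool(F) ; MaxPool(F)] per spatial position."""
    hidden = np.maximum(f_corr @ w1 + b1, 0.0)   # first 1x1 conv + ReLU
    learned = hidden @ w2 + b2                   # second 1x1 conv
    avg = f_corr.mean(axis=1, keepdims=True)     # assumed pooling axis: channels
    mx = f_corr.max(axis=1, keepdims=True)
    return np.concatenate([learned, avg, mx], axis=1)
```

With 3×3 feature maps, `correlation_features` yields 18 correlation channels per position, and `aggregate` reduces them to the learned channels plus one average and one maximum channel.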

FSR combines the aggregated correlation with multi-layer ViT features at each scale, and the resulting representations are fused by the Multi-Aspect Denoiser in a top-down manner.

[The defining equations for the FSR inputs, the per-scale FSR outputs, and the top-down fusion are not recoverable from the source text.]

The fused features are then passed through dilated convolutions and a final convolution that produces the mask logits.

The description identifies one implementation choice as crucial: the frozen ViT backbone in Corr-DINO. The stated reason is to reduce overfitting, since trainable CNN backbones severely overfit SDG data. On the IMD20–SDG test, the reported values are IoU = 0.702 for SACM (conference), IoU = 0.744 for Corr-DINO without QES, and IoU = 0.912 for Corr-DINO with QES, corresponding to an increase of 16.8 points after filtering (Qu et al., 28 Aug 2025). These results indicate that the SDG branch is effective not only because of correspondence modeling but also because its predictions are subjected to post hoc quality control.

5. Quality of Example Selection and dataset distillation

QES is the mechanism used to score SDG auto-annotations and discard masks with spurious regions. The score is computed from the normalized mask probabilities and two probability thresholds, one high and one low.

[The three defining equations and the two threshold values are not recoverable from the source text.]

In practice, the method retains only examples that pass a QES threshold of 0.5 (Qu et al., 28 Aug 2025).

Conceptually, QES acts as a no-ground-truth quality filter. It measures how concentrated the predicted mask mass is in highly confident regions relative to all pixels above a lower threshold. This suggests that masks dominated by diffuse or weak activations are treated as unreliable. The paper explicitly notes a precision/coverage trade-off: higher QES yields better masks but fewer retained samples.
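The following is a minimal sketch of a score with this behaviour, not the paper's exact definition. It assumes the score is the ratio of pixel counts above a high threshold to pixel counts above a low threshold; the function name and the values `t_high=0.5` and `t_low=0.1` are hypothetical.

```python
import numpy as np

def qes_score(prob, t_high=0.5, t_low=0.1):
    """Fraction of weakly activated pixels that are also strongly activated.

    prob: array of per-pixel manipulation probabilities in [0, 1].
    Thresholds are illustrative assumptions, not values from the paper.
    """
    strong = (prob > t_high).sum()
    weak = (prob > t_low).sum()
    # An all-background prediction carries no evidence; score it as 0.
    return float(strong / weak) if weak else 0.0
```

Under these assumptions, a mask with one crisp, confident region scores close to 1, while a mask whose activations are mostly faint and scattered scores close to 0 and would be filtered out.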

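The reported threshold study on the IMD20 SDG test is as follows. The kept ratios appear to be normalized to the 0.5 operating point, which is listed as 100%.

QES threshold	Kept ratio	Resulting IoU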
0.0	245%	0.744
0.3	172%	0.865
0.5	100%	0.912
0.7	50%	0.957
0.9	6%	0.968

The table makes the trade-off explicit: moving from 0.5 to 0.9 raises IoU from 0.912 to 0.968, but reduces the kept ratio from 100% to 6%. Conversely, lowering the threshold preserves many more examples but at sharply reduced quality. Within CAAAv2, the selected operating point is a QES threshold of 0.5, which balances dataset scale and annotation fidelity (Qu et al., 28 Aug 2025).
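A threshold sweep of this kind can be sketched as below. The normalization to a reference threshold is an assumption inferred from the 100% entry at 0.5, and the example scores are made up for illustration.

```python
def kept_ratio(scores, thr, reference_thr=0.5):
    """Number of examples kept at `thr`, relative to the number kept at `reference_thr`."""
    kept = sum(1 for s in scores if s >= thr)
    reference = sum(1 for s in scores if s >= reference_thr)
    if reference == 0:
        raise ValueError("no examples pass the reference threshold")
    return kept / reference

# Hypothetical QES scores for seven SDG masks.
example_scores = [0.1, 0.2, 0.4, 0.6, 0.6, 0.8, 0.95]
for thr in (0.0, 0.5, 0.9):
    print(f"thr={thr:.1f}  kept={kept_ratio(example_scores, thr):.0%}")
```

Values above 100% simply mean that more examples survive at that threshold than at the reference one.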

6. Role in MIMLv2 and downstream webly supervised IML

CAAAv2 is the annotation engine used to construct MIMLv2. The reported workflow is: collect approximately 310 K forged/original pairs from imgur, excluding duplicates and evaluation-set images; auto-annotate via CAAAv2 to obtain approximately 370 K raw masks; apply the QES filter on the SDG branch; and obtain a final dataset with 246,212 forged images and 63,847 authentic originals, each with a pixel-accurate mask (Qu et al., 28 Aug 2025).

The dataset is described as about 120× larger than IMD20 and as more diverse and modern. The abstract further characterizes MIMLv2 as a large-scale, diverse, and high-quality dataset. In the broader system, these annotations are used together with Object Jitter, described as a technique that further enhances model training by generating high-quality manipulation artifacts, to train Web-IML, a model designed to leverage web-scale supervision for the IML task.

The downstream impact reported for the full webly supervised pipeline is substantial. Training Web-IML on MIMLv2 (+Object Jitter) yields an overall +31% average IoU gain across 8 benchmarks; the abstract further states that Web-IML surpasses the previous SOTA, TruFor, by 24.1 average IoU points (Qu et al., 28 Aug 2025). While these gains belong to the end-to-end framework rather than to CAAAv2 in isolation, they identify the practical significance of the auto-annotation paradigm: CAAAv2 converts abundant but unlabeled web forgeries into training-grade supervision at scale.

7. Operational guidance, scope, and limitations

Several implementation recommendations are given explicitly. The frozen ViT backbone in Corr-DINO is described as crucial for preventing SDG overfitting. The self-supervised SPG/SDG classifier is described as sufficient because it only needs to identify whether large image areas are identical. For SPG, the absolute difference map is essential; omitting it reduces segmentation IoU by more than 20 points. For SDG, QES is effective but should be tuned with awareness of the precision/coverage trade-off (Qu et al., 28 Aug 2025).

The stated limitations define the current scope of CAAAv2. Extremely small or heavily compressed forgeries can still evade both DASS and Corr-DINO. In addition, content outside the SPG/SDG taxonomy, such as adversarial manipulations, may require additional modeling. These limitations are structural rather than incidental: the system is optimized for the two CIML categories it models explicitly, and its quality filter assumes that reliable SDG masks exhibit a characteristic confidence concentration.

A common misconception would be to treat CAAAv2 as a generic weak-supervision recipe for all manipulation types. The provided description does not support that conclusion. Instead, CAAAv2 is specifically a category-aware auto-annotation framework whose effectiveness depends on a decomposition into SPG and SDG, on branch-specific architectural choices, and on post hoc filtering of SDG predictions. A plausible implication is that extending the method to new manipulation regimes would require either a richer taxonomy or additional branch-specific models rather than only more web data.

Taken together, CAAAv2 defines a technically specific route from web-harvested forged/original pairs to pixel-accurate masks: self-supervised category assignment, difference-aware segmentation for in-place edits, correlation-based frozen-feature localization for copied-content edits, and quality-aware filtering for noisy outputs. In the reported system, this combination is what enables MIMLv2 and the subsequent performance gains in webly supervised IML (Qu et al., 28 Aug 2025).

References (1)

Webly-Supervised Image Manipulation Localization via Category-Aware Auto-Annotation (2025)
