AIGCDetectBenchmark: Evaluating AIGC Detectors
- AIGCDetectBenchmark is a rigorous evaluation framework that measures detectors' cross-model generalization, fairness, and robustness using diverse synthetic and real image datasets.
- It employs specialized detection protocols, including artifact-based and transformer methods, to assess performance under identity-preserving edits and demographic shifts.
- Key metrics such as accuracy, AUC, AP, and EER reveal that artifact-driven methods excel while challenges remain in achieving universal fairness and robustness.
AIGCDetectBenchmark is a rigorously constructed evaluation benchmark designed to measure the cross-model generalization, fairness, and robustness of state-of-the-art (SOTA) AI-generated image content (AIGC) detectors. Developed to address the limitations of detectors operating in a rapidly evolving generative landscape, AIGCDetectBenchmark provides both broad coverage of generator families (GANs, diffusion models, business APIs) and fine-grained challenge regimes—specifically targeting scenarios such as identity-preserving edits and out-of-distribution demographic attributes. Its design reflects the growing demand for detectors applicable to unseen or adversarial AIGC sources, as well as for equitable performance across demographic subgroups (Dubey et al., 2 Dec 2025, Yan et al., 2024, Zhong et al., 2023, Hu et al., 18 Jan 2026, Chu et al., 24 Nov 2025).
1. Benchmark Construction and Dataset Organization
AIGCDetectBenchmark aggregates synthetic images spanning up to 17 generative categories—including ProGAN, StyleGAN(2), BigGAN, CycleGAN, StarGAN, GauGAN, WFIR, modern diffusion engines (e.g., GLIDE, ADM, Midjourney, Stable Diffusion v1.4/v1.5/SDXL, VQDM, DALL·E 2), and business API models (e.g., Wukong). Each category is paired with domain-matched real images and represented by thousands to tens of thousands of samples, yielding a large, balanced, and diverse dataset: for instance, one canonical construction uses 360,000 ProGAN fakes and 360,000 LSUN reals for training, with zero-shot evaluation on all other 16–17 categories and strict source separation between training and test splits (Zhong et al., 2023, Hu et al., 18 Jan 2026, Yan et al., 2024).
Specialized extensions, such as the Indian IP-AIGC scenario (Dubey et al., 2 Dec 2025), introduce further demographic controls: FairFD-Indian (10,308 real/61,678 fake), HAV-DF (2,444/3,759), and held-out test suites targeting Indian/South-Asian faces in both conventional and identity-preserving edit regimes. All samples are preprocessed via landmark alignment; synthetic variants are generated with prompts that modulate context but preserve core identity attributes (bone structure, skin tone).
Preprocessing protocols—such as Smash & Reconstruction (patch-based scrambling with high-pass filtering) and Learnable Frequency Attention (LFA) modules—are integral components, especially for detectors that depend on patch-level, frequency, or cross-texture artifact cues (Zhong et al., 2023, Hu et al., 18 Jan 2026). In all cases, test and train sets remain disjoint at the generator level to enforce genuine generalization measurement.
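The Smash & Reconstruction idea can be illustrated with a minimal numpy sketch: non-overlapping patches are scrambled to destroy global layout while preserving local texture, and a high-pass filter then emphasizes residual artifacts. The patch size and the Laplacian kernel below are illustrative assumptions, not the benchmark's exact settings.

```python
import numpy as np

def smash_and_reconstruct(img: np.ndarray, patch: int, rng: np.random.Generator) -> np.ndarray:
    """Split an HxW image into non-overlapping patches, shuffle them,
    and reassemble: global layout is destroyed, local texture is kept."""
    h, w = img.shape
    gh, gw = h // patch, w // patch
    tiles = [img[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch]
             for i in range(gh) for j in range(gw)]
    order = rng.permutation(len(tiles))
    rows = [np.hstack([tiles[k] for k in order[r * gw:(r + 1) * gw]])
            for r in range(gh)]
    return np.vstack(rows)

def high_pass(img: np.ndarray) -> np.ndarray:
    """3x3 Laplacian high-pass filter emphasizing generator artifacts."""
    k = np.array([[0, -1, 0], [-1, 4, -1], [0, -1, 0]], dtype=float)
    padded = np.pad(img.astype(float), 1, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * k)
    return out

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64)).astype(float)
scrambled = smash_and_reconstruct(img, patch=16, rng=rng)
residual = high_pass(scrambled)
```

Because the scrambling only permutes patches, the pixel distribution is preserved exactly; only spatial arrangement changes, which is why semantic content is suppressed while texture statistics survive.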
2. Detection Protocols, Models, and Evaluation Regimes
AIGCDetectBenchmark is used to evaluate both conventional and advanced detectors under controlled regimes:
- Artifact-specialized approaches: Early and mid-generation CNNs (e.g., CNNSpot, GramNet, LGrad), PatchCraft (Smash & Reconstruction with rich/poor texture contrast), FreqNet (FFT-based filtering), and frequency-attentive methods (S²F-Net, UnivFD).
- Transformer/Vision-Language Detectors: AIDE (CLIP-ViT backbone, dual expert heads for semantics and artifacts, with “sanity” regularization); Effort (orthogonal subspace projection and Mahalanobis distance); SemAnti (CLIP with Patch Shuffle and “semantic-antagonistic” subspace freezing) (Chu et al., 24 Nov 2025, Yan et al., 2024).
- Training regimes:
- Pretrained (PT): Off-the-shelf models trained on large public synthetic/real datasets (e.g., GenImage, ProGAN/LSUN).
- Fine-tuned (FT): Further adaptation to target subpopulations (e.g., Indian splits), using various optimizer/batch/epoch settings to probe adaptation–overfitting tradeoffs.
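Effort-style scoring can be illustrated as a Mahalanobis distance to real-image feature statistics: fit mean and covariance on real features, then flag images whose features lie far from that distribution. The sketch below is a generic illustration under assumed stand-in features, not the released implementation.

```python
import numpy as np

def fit_real_stats(feats: np.ndarray):
    """Fit mean and (regularized) inverse covariance on real-image features."""
    mu = feats.mean(axis=0)
    cov = np.cov(feats, rowvar=False) + 1e-6 * np.eye(feats.shape[1])
    return mu, np.linalg.inv(cov)

def mahalanobis_score(x: np.ndarray, mu: np.ndarray, cov_inv: np.ndarray) -> float:
    """Larger distance from the 'real' distribution suggests a synthetic image."""
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

rng = np.random.default_rng(1)
real_feats = rng.normal(0.0, 1.0, size=(500, 8))  # stand-in for CLIP-like features
mu, cov_inv = fit_real_stats(real_feats)
near = mahalanobis_score(mu, mu, cov_inv)          # a prototypical real feature
far = mahalanobis_score(mu + 10.0, mu, cov_inv)    # an outlying (fake-like) one
```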
All detectors are evaluated in stringent cross-generator (training on only one generator; testing on 16+) and/or cross-demographic regimes (out-of-sample groupings by ethnicity, gender, etc.), quantifying true generalization and fairness.
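The cross-generator protocol reduces to a loop over held-out generator test sets, reporting per-generator and mean accuracy. In the sketch below, `score_fn` and the test-set dictionary are hypothetical stand-ins for a trained detector and the benchmark splits.

```python
import numpy as np

def evaluate_cross_generator(score_fn, test_sets, threshold=0.5):
    """Compute per-generator and mean accuracy on held-out generators.

    test_sets maps generator name -> (features, labels); the training
    generator is excluded by construction, so every entry is zero-shot.
    """
    per_gen = {}
    for gen, (x, y) in test_sets.items():
        preds = (score_fn(x) >= threshold).astype(int)
        per_gen[gen] = float((preds == y).mean())
    mean_acc = float(np.mean(list(per_gen.values())))
    return per_gen, mean_acc

# Toy detector: thresholds the first feature dimension.
toy_score = lambda x: x[:, 0]
sets = {
    "StyleGAN2": (np.array([[0.9], [0.1]]), np.array([1, 0])),
    "SDXL":      (np.array([[0.8], [0.7]]), np.array([1, 0])),
}
per_gen, mean_acc = evaluate_cross_generator(toy_score, sets)
```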
3. Metrics and Core Evaluation Criteria
AIGCDetectBenchmark defines detection performance using:
- Accuracy (ACC): $\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$, the fraction of correctly classified real and fake samples.
- Area Under ROC Curve (AUC): Estimated via the trapezoidal rule; reflects threshold-independent discrimination.
- Average Precision (AP): Area under the precision–recall curve, $\mathrm{AP} = \sum_n (R_n - R_{n-1})\, P_n$, with $P_n$ denoting precision at recall $R_n$.
- Equal Error Rate (EER): The operating point where the false-positive rate equals the false-negative rate, i.e., the threshold $\tau$ with $\mathrm{FPR}(\tau) = \mathrm{FNR}(\tau)$.
- Per-category and mean-per-class averages: For detailed analysis across source models.
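These four metrics can be computed from ranked detector scores with a few lines of numpy; the sketch below follows the standard definitions (trapezoidal AUC, step-wise AP, nearest FPR/FNR crossing for EER) rather than any benchmark-specific code.

```python
import numpy as np

def roc_points(scores, labels):
    """ROC curve points, sweeping the decision threshold from high to low."""
    order = np.argsort(-scores)
    y = labels[order]
    tpr = np.cumsum(y) / max(y.sum(), 1)
    fpr = np.cumsum(1 - y) / max((1 - y).sum(), 1)
    return np.concatenate([[0.0], fpr]), np.concatenate([[0.0], tpr])

def auc(scores, labels):
    """AUC via the trapezoidal rule over the ROC curve."""
    fpr, tpr = roc_points(scores, labels)
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

def average_precision(scores, labels):
    """AP = sum_n (R_n - R_{n-1}) * P_n over the ranked list."""
    order = np.argsort(-scores)
    y = labels[order]
    precision = np.cumsum(y) / (np.arange(len(y)) + 1)
    recall = np.cumsum(y) / max(y.sum(), 1)
    prev_recall = np.concatenate([[0.0], recall[:-1]])
    return float(np.sum((recall - prev_recall) * precision))

def eer(scores, labels):
    """Rate at the threshold where FPR and FNR cross (approximate)."""
    fpr, tpr = roc_points(scores, labels)
    fnr = 1.0 - tpr
    i = int(np.argmin(np.abs(fpr - fnr)))
    return float((fpr[i] + fnr[i]) / 2)

scores = np.array([0.9, 0.8, 0.2, 0.1])  # higher = more likely fake
labels = np.array([1, 1, 0, 0])          # 1 = fake, 0 = real
```

On this perfectly separated toy ranking, AUC and AP both equal 1.0 and EER is 0, matching the threshold-independent interpretation above.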
Additional protocols such as reporting performance under image degradations (JPEG, Gaussian noise), crop/resize strategies, and augmentation-induced robustness are present in AIGIBench and AI-GenBench extensions (Li et al., 18 May 2025, Pellegrini et al., 29 Apr 2025).
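Such degradation protocols can be approximated with simple transforms. The numpy-only sketch below (nearest-neighbor resizing and an illustrative noise sigma) stands in for the JPEG/noise/resize pipelines those extensions report; a real harness would add an actual JPEG encoder.

```python
import numpy as np

def add_gaussian_noise(img, sigma, rng):
    """Additive Gaussian noise, clipped back to the valid pixel range."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0, 255)

def nearest_resize(img, out_h, out_w):
    """Nearest-neighbor resize (a crude stand-in for platform resampling)."""
    rows = np.arange(out_h) * img.shape[0] // out_h
    cols = np.arange(out_w) * img.shape[1] // out_w
    return img[np.ix_(rows, cols)]

def degrade(img, rng):
    """Down-up resize then noise, mimicking a social-media upload pipeline."""
    h, w = img.shape
    small = nearest_resize(img, h // 2, w // 2)
    restored = nearest_resize(small, h, w)
    return add_gaussian_noise(restored, sigma=5.0, rng=rng)

rng = np.random.default_rng(2)
img = rng.integers(0, 256, size=(32, 32)).astype(float)
out = degrade(img, rng)
```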
4. Principal Experimental Findings
Experiments on AIGCDetectBenchmark expose several robust empirical patterns:
- In-domain fine-tuning yields strong gains: For example, AIDE AUC increases from 0.535 to 0.809, and Effort from 0.739 to 0.944 on HAV-DF-test after FT (Dubey et al., 2 Dec 2025).
- Catastrophic performance drops occur out-of-domain, especially for identity-preserving, demographically challenging, and commercial generator test sets: FT can decrease AIDE's AUC on HIDF-img-IP-genai from 0.923 to 0.563, Effort from 0.740 to 0.533; EER can rise by over 0.3 absolute.
- Detection accuracy varies across generator classes: PatchCraft achieves mean 89.9% ACC; AIDE 92.77% (GAN-Avg = 93.43%, DM-Avg = 92.11%); S²F-Net (cross-model paradigm) yields 90.49% average ACC across all 17 generators (Zhong et al., 2023, Yan et al., 2024, Hu et al., 18 Jan 2026).
- Artifact-driven methods generalize better to new synthetic styles: Contrasts between rich/poor texture patches and frequency anomalies are consistent universal fingerprints (Zhong et al., 2023, Hu et al., 18 Jan 2026).
- Passive, artifact-based detectors systematically underperform on advanced identity-preserving and social media–style edits relative to traditional synthetic images, particularly for underrepresented ethnic subgroups, revealing fairness and brittleness gaps (Dubey et al., 2 Dec 2025).
- Semantic-agnostic, artifact-preserving adaptation is critical: approaches such as SemAnti's Patch Shuffle and semantic-subspace freezing, which suppress semantic bias and adapt only artifact-sensitive layers, achieve near-perfect cross-model AP (99.02%) (Chu et al., 24 Nov 2025).
5. Fairness, Robustness, and Domain Shifts
A major focus of recent AIGCDetectBenchmark variants is fairness and demographic robustness:
- Generalization fails for underrepresented groups: Performance on Indian/South-Asian samples, and especially on IP-AIGC (identity-preserving edits), falls far below in-domain benchmarks, even with FT (Dubey et al., 2 Dec 2025).
- Overfitting to generator cues is a dominant failure mode: FT models tend to memorize superficial generator artifacts instead of learning identity-invariant or physically consistent features, resulting in pronounced performance decay on unseen distributions.
- Brittleness is especially acute for identity-preserving and content-style disentanglement tasks: Detectors are confounded by subtle changes in attire, lighting, or context, as opposed to global style transfer, highlighting the shortfall of current backbone designs.
Proposed mitigation strategies include representation-preserving tuning (e.g., subspace adaptation that freezes identity neurons) and India-aware dataset curation with explicit policies on age, skin-tone, and attire diversity.
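Representation-preserving tuning of this kind can be sketched as projecting each gradient update onto the orthogonal complement of a frozen "identity" subspace, so fine-tuning cannot move weights along protected directions. The basis and learning rate below are illustrative assumptions, not the cited papers' procedure.

```python
import numpy as np

def orthogonal_update(grad, identity_basis, lr=0.1):
    """Remove the gradient component lying in span(identity_basis),
    so fine-tuning cannot move weights along protected identity directions."""
    Q, _ = np.linalg.qr(identity_basis)   # orthonormalize basis columns
    grad_protected = Q @ (Q.T @ grad)     # component inside the frozen subspace
    return -lr * (grad - grad_protected)  # step in the orthogonal complement

rng = np.random.default_rng(3)
basis = rng.normal(size=(16, 3))          # 3 frozen identity directions (assumed)
grad = rng.normal(size=16)
step = orthogonal_update(grad, basis)
```

The resulting step is, by construction, orthogonal to every frozen direction, which is the property that prevents fine-tuning from overwriting identity-relevant features.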
6. Architectural Insights and Impact
Analysis of feature spaces and ablation studies clarify key success factors:
- Patch-based and frequency-oriented models (PatchCraft, S²F-Net): Smash & Reconstruction with high-pass filtering, cross-diversity rich/poor patch contrasts, and learnable frequency attention yield marked gains over global or semantic-based architectures (Zhong et al., 2023, Hu et al., 18 Jan 2026).
- CLIP-based hybrid and semantic-regulated architectures (AIDE, SemAnti): Gated fusion of high/low frequency cues with CLIP semantic representations excels if semantic bias is suppressed and adaptation is localized to artifact-computing layers (Chu et al., 24 Nov 2025).
- Ablation of core modules (e.g., LFA, patch contrast, semantic freezing) results in significant accuracy loss, confirming that detailed spectral and patch interactions are universally discriminative, whereas vanilla semantic cues are not reliable indicators out-of-domain.
7. Limitations, Open Challenges, and Future Directions
Although AIGCDetectBenchmark offers comprehensive coverage and strict protocols, certain limitations persist:
- No standard deviations or statistical confidence intervals are reported; all results are single-seed, and model releases are typically not fully reproducible (Dubey et al., 2 Dec 2025).
- Current detectors lack identity-agnostic and demographic fairness; cross-population generalization (especially for non-Western, non-male phenotypes) is an unsolved problem.
- Detection of partial, instruction-guided, and style-transfer edits remains challenging, with even frequency-feature and hybrid detectors faltering on contemporary web-based editors.
- Extension to multi-modal and explainable detection (e.g., XAIGID-RewardBench protocols) and continual updating as new generator classes emerge are critical for sustained relevance (Yang et al., 15 Nov 2025).
- Incorporating fairness constraints, multi-generator large-scale data, and representation-level adaptation is a concrete path to close existing generalization gaps, especially for at-risk sub-populations (Dubey et al., 2 Dec 2025).
AIGCDetectBenchmark thus defines the dominant paradigm for evaluating generalization, fairness, and cross-domain robustness of AIGC detectors in modern research, serving as both a proving ground for new detection algorithms and a stress test for real-world deployment viability (Zhong et al., 2023, Yan et al., 2024, Dubey et al., 2 Dec 2025, Hu et al., 18 Jan 2026, Chu et al., 24 Nov 2025).