GF-eval: Generalized Forgery Evaluation
- Generalized Forgery Evaluation (GF-eval) is a unified framework that quantifies and benchmarks forgery detection and authentication systems using cross-domain evaluation metrics.
- It partitions evaluation axes—such as generator methods, datasets, and application modalities—to systematically measure generalization gaps and enhance system robustness.
- GF-eval leverages diverse benchmarks and metrics, guiding the development of resilient systems in deepfake analysis, cryptography, and quantum-secure authentication.
Generalized Forgery Evaluation (GF-eval) is a formal evaluation paradigm and metric suite developed to quantify the reliability, robustness, and generalizability of forgery detection and authentication systems. Spanning deepfake analysis in AI-generated media, digital signature cryptosystems, and quantum-secure authentication, GF-eval provides a unified framework for benchmarking the integrity of systems in the face of evolving, diverse, and previously unseen manipulations. Its objective is not only to measure in-domain classification accuracy but also to systematically stress-test detectors and schemes against out-of-domain threats, reflecting real-world deployment challenges where attack surfaces continuously diversify.
1. Formal Definition and Core Motivation
GF-eval’s principal aim is to evaluate not merely the accuracy of a detector or authentication system “in-domain” (i.e., against forgeries similar to those present in training), but the degree to which performance persists or deteriorates on “out-of-domain” samples—those produced by novel, unseen forgery methods, distinct datasets, or application modalities. The principal metric is the generalization gap:
$$\Delta_{\text{gen}} = M_{\text{in}} - M_{\text{out}},$$

where $M_{\text{in}}$ and $M_{\text{out}}$ are instantiations of a chosen performance metric (e.g., AUC, accuracy) on the training-domain and out-of-domain test sets, respectively. A detector with high in-domain accuracy but a large $\Delta_{\text{gen}}$ is considered brittle and non-robust. GF-eval protocols explicitly partition evaluation axes (generator methods, datasets, modalities) to enable cross-generator, cross-dataset, and cross-application measurement (Bei et al., 2024, Gao et al., 27 Mar 2025, Wang et al., 19 Mar 2025).
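To make the metric concrete, the following minimal sketch (illustrative only, not taken from the cited papers) computes $\Delta_{\text{gen}}$ with AUC as the base metric using scikit-learn; the array layout and function name are assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def generalization_gap(y_in, s_in, y_out, s_out):
    """GF-eval style generalization gap with AUC as the base metric.

    y_* : binary ground-truth labels (1 = forged, 0 = authentic)
    s_* : detector scores, higher means "more likely forged"
    """
    m_in = roc_auc_score(y_in, s_in)     # in-domain performance
    m_out = roc_auc_score(y_out, s_out)  # performance on unseen generators/datasets
    return m_in - m_out, m_in, m_out

# Toy usage with random scores (the numbers are meaningless; the call pattern is the point)
rng = np.random.default_rng(0)
y_in, y_out = rng.integers(0, 2, 1000), rng.integers(0, 2, 1000)
s_in, s_out = rng.random(1000), rng.random(1000)
gap, m_in, m_out = generalization_gap(y_in, s_in, y_out, s_out)
print(f"AUC_in={m_in:.3f}  AUC_out={m_out:.3f}  gap={gap:.3f}")
```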
In cryptographic contexts, GF-eval is extended to capture the strength of forgery-detection availability (FDA), quantifying both the existential unforgeability of a scheme and the probability that a successful forgery can be demonstrated and recognized (Kiktenko et al., 2019). In quantum cryptography, the framework further parameterizes what constitutes a truly “novel” challenge via a fidelity parameter that captures the overlap between a forged object and all previously queried objects (Doosti et al., 2021).
2. Benchmark Construction and Dataset Principles
Large-scale, diverse benchmarks are central to effective GF-eval. Key datasets include DeepFaceGen, ForgeryNet, MMFR-Dataset, and Forensics-Bench:
- DeepFaceGen (Bei et al., 2024): 1.55 million facial samples, leveraging 34 forgery methods (17 localized editing, 17 full-image generation), spanning images and videos. Prompt attributes (age, gender, skin tone, etc.) combinatorially yield over 40,000 prompt-driven samples. Skin-tone classifiers provide ethnic fairness, and n-way labeling ensures fine-grained attributive analysis.
- ForgeryNet (He et al., 2021): 2.9 million images and 221,247 videos with multi-granularity annotations (labels, segmentation masks, temporal segments) for image/video classification and spatial/temporal localization tasks.
- MMFR-Dataset (Gao et al., 27 Mar 2025): 100,000 images from 10 held-out GAN/diffusion models, annotated for ten distinct forgery reasoning attributes (e.g., geometric inconsistencies, abnormal objects) and supporting chain-of-thought structured evaluation.
- Forensics-Bench (Wang et al., 19 Mar 2025): 63,292 examples across 112 unique forgery detection types, orthogonally categorized by semantics, modality, task (classification, localization), type, and generative model, facilitating multi-perspective generalization analysis.
All benchmarks prioritize exhaustive coverage across manipulation approaches, perturbations (post-processing distortions), demographic factors, and application contexts, reducing the risk of overfitting and supporting ongoing “zero-day” generalization testing.
3. Evaluation Protocols and Metrics
GF-eval protocols instantiate evaluation along multiple axes (a minimal cross-generator split sketch follows the list):
- Cross-generator: Train on a subset of generator types, test on generators unseen in training.
- Cross-dataset: Train and test across disjoint data sources, potentially standardized and external datasets.
- Cross-application: Modality transfer (image↔video), application context divergence (face ↔ text ↔ multimodal).
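As a concrete illustration of the cross-generator axis, the sketch below performs a leave-one-generator-out partition over records tagged with a generator identifier. The record layout and field names are hypothetical and not taken from any of the cited benchmarks.

```python
def leave_one_generator_out(records, held_out):
    """Split records into an in-domain training pool and an out-of-domain test pool.

    records  : iterable of dicts with (hypothetical) keys 'generator', 'label', 'path'
    held_out : generator name excluded from training and reserved for OOD testing
    """
    train, test_ood = [], []
    for r in records:
        (test_ood if r["generator"] == held_out else train).append(r)
    return train, test_ood

def cross_generator_protocol(records):
    """Yield one (held-out generator, train split, OOD test split) triple per generator."""
    for g in sorted({r["generator"] for r in records}):
        train, test_ood = leave_one_generator_out(records, g)
        yield g, train, test_ood
```

Cross-dataset and cross-application splits follow the same pattern, keyed on a dataset or modality field rather than the generator identifier.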
Standard and task-specific metrics include:
| Metric | Formula/Definition | Role in GF-eval |
|---|---|---|
| Accuracy | $\mathrm{Acc} = \frac{1}{N} \sum_{i=1}^N \mathbb{1}[\hat{y}_i = y_i]$ | Global correctness |
| Precision | $P = \frac{TP}{TP + FP}$ | Correctness on positives |
| Recall | $R = \frac{TP}{TP + FN}$ | Sensitivity to actual positives |
| AUC | Area under ROC: $\int_0^1 \mathrm{TPR}\,\mathrm{d}(\mathrm{FPR})$ | Threshold-independent discrimination |
| Equal Error Rate (EER) | Threshold $\varepsilon$ s.t. $\mathrm{FPR}(\varepsilon) = \mathrm{FNR}(\varepsilon)$ | Point where FPR = 1 − TPR |
| Generalization gap ($\Delta_{\text{gen}}$) | $\Delta_{\text{gen}} = M_{\text{in}} - M_{\text{out}}$ | Cross-domain robustness |
| Macro/weighted averages, F1 | As per standard definitions | Aggregated, fair metrics |
| mAP, IoU, L1, AR@K, AP@t | Task-appropriate (segmentation, localization) | Fine-grained spatial/temporal analysis |
| Reasoning metrics | BLEU-1, BLEU-2, ROUGE-L, CSS | Chain-of-thought/semantic evaluation |
These metrics are reported per approach, modality, manipulation family, and at global macro/weighted averages to rigorously expose generalization deficits.
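Most of these metrics have off-the-shelf implementations; the equal error rate is a common exception, so the following minimal sketch (based on the standard definition above, not on any cited benchmark's code) derives EER from the ROC curve with scikit-learn.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(y_true, scores):
    """EER: the operating point where the false-positive rate equals the
    false-negative rate (FNR = 1 - TPR) on the ROC curve."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fpr - fnr))   # ROC point closest to FPR == FNR
    return (fpr[idx] + fnr[idx]) / 2.0      # average the two rates at that point
```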
4. Representative Methodologies and Benchmark Results
State-of-the-art detectors benchmarked under GF-eval span convolutional networks, detail-oriented modules, self-supervised and reasoning-augmented architectures:
- Detail-focused detectors: RECCE (reconstruction-classification), DNANet (contrastive projector), FreqNet (frequency analysis) outperform generic CNN backbones. For instance, DNANet achieves high in-domain and cross-generator AUC with a correspondingly small $\Delta_{\text{gen}}$ (Bei et al., 2024).
- Video-level models: Exposing (multi-region local bottleneck), SLADD (adversarial self-supervised augmentation), CViT (Vision Transformer) show superior cross-generator robustness, especially under randomized segment forgeries (Bei et al., 2024).
Vision-language models (VLMs) have demonstrated efficacy in multi-modal, reasoning-rich contexts:
- FakeReasoning (Gao et al., 27 Mar 2025): Employs forgery-aligned contrastive learning and calibrated probability mapping, yielding average out-of-domain accuracy of 90.98% across 10 diverse generators, consistently outperforming prior baselines.
- Forensics-Bench (Wang et al., 19 Mar 2025): LLaVA-NeXT-34B achieves 66.7% overall accuracy in zero-shot evaluation; GPT-4o achieves 57.9%. Binary classification performance exceeds 70%, while spatial and temporal localization remain challenging (≈35–45% and 30–35%).
The table below summarizes recent results from key benchmarks:
| Method/Model | Task | In-domain Acc/AUC | Out-of-domain result | Benchmark/Source |
|---|---|---|---|---|
| DNANet, RECCE | Image Det. | AUC ≈ 0.90 | $\Delta_{\text{gen}}$ ≈ 0.05–0.07 | DeepFaceGen (Bei et al., 2024) |
| Exposing, SLADD | Video Det. | AUC ≈ 0.93 | $\Delta_{\text{gen}}$ ≈ 0.06–0.12 | DeepFaceGen (Bei et al., 2024) |
| FakeReasoning | VLM Det.+Reasoning | Acc 91.84% | Out-of-model avg. Acc 90.98% | MMFR-Dataset (Gao et al., 27 Mar 2025) |
| LLaVA-NeXT-34B | VLM, Forensics-Bench | Acc 66.7% | Macro F1 varies | Forensics-Bench (Wang et al., 19 Mar 2025) |
These results indicate that architectures integrating feature-level detail extraction with contrastive attribute alignment are the most effective at minimizing the generalization gap.
5. Cross-domain Extensions and Cryptographic Integration
GF-eval’s scope extends beyond media forensics to cryptographic signatures and quantum authentication:
- Digital Signature FDA (Kiktenko et al., 2019): Lamport and Winternitz hash-based schemes exhibit provable forgery-detection availability (FDA), whereby for any adversarial forgery a collision-producing witness can be computed except with negligible failure probability, formalizing robust detection as an additional security metric (a toy illustration follows this list).
- Quantum Unforgeability (Doosti et al., 2021): Provides a unified, parameterized security game interpolating between classical and quantum adversarial models. An overlap (fidelity) parameter rigorously delimits which forgeries count as “novel”; randomization (via quantum-secure PRFs/PRUs) is necessary to withstand superposition attacks, and deterministic constructions fail except in trivial or orthogonal cases.
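As a toy illustration of the FDA idea for hash-based signatures (a sketch under standard assumptions, not the construction analyzed by Kiktenko et al., 2019), the code below implements a bit-level Lamport one-time signature and shows how a verifying forgery that reveals a preimage different from the signer's own secret at some position yields a hash-collision witness, letting the legitimate signer demonstrate that a forgery occurred.

```python
import hashlib
import os

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

N_BITS = 256  # sign the SHA-256 digest of the message, one secret pair per bit

def keygen():
    sk = [(os.urandom(32), os.urandom(32)) for _ in range(N_BITS)]
    pk = [(H(s0), H(s1)) for s0, s1 in sk]
    return sk, pk

def msg_bits(msg: bytes):
    d = H(msg)
    return [(d[i // 8] >> (7 - i % 8)) & 1 for i in range(N_BITS)]

def sign(sk, msg: bytes):
    return [sk[i][b] for i, b in enumerate(msg_bits(msg))]

def verify(pk, msg: bytes, sig) -> bool:
    return all(H(sig[i]) == pk[i][b] for i, b in enumerate(msg_bits(msg)))

def collision_witness(sk, msg: bytes, forged_sig):
    """Given a forged signature that verifies against the public key, return a pair
    of distinct inputs hashing to the same value (a collision witness), if one
    exists at any bit position; its existence demonstrates that a forgery, rather
    than simple key compromise, has taken place."""
    for i, b in enumerate(msg_bits(msg)):
        if forged_sig[i] != sk[i][b] and H(forged_sig[i]) == H(sk[i][b]):
            return sk[i][b], forged_sig[i]
    return None
```

Loosely, the dichotomy this sketch exposes (collision witness versus exact reuse of the signer's secrets, i.e., key compromise) is what the FDA failure bound formalizes; the precise statement and parameters are those of Kiktenko et al. (2019).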
GF-eval thus enables vector-valued security evaluation, combining standard forgeability metrics with the forgery-detection advantage for a complete robustness assessment.
6. Insights, Limitations, and Future Directions
Key experimental and conceptual findings from recent GF-eval research include:
- Feature Transferability: Localized editing forgeries induce highly transferable facial features, facilitating a lower $\Delta_{\text{gen}}$ when generalizing to full-image generation fakes (Bei et al., 2024).
- Attribute and Reasoning Alignment: Explicit semantic alignment (e.g., via contrastive reasoning) yields substantial generalization improvements, especially at the VLM level (Gao et al., 27 Mar 2025).
- Diversity and Perturbation Augmentation: Extensive manipulation diversity and realistic post-processing augmentations are critical for unsaturated learning and robust evaluation (He et al., 2021).
- Modality Effects: Diffusion- and autoregressive-generated content is harder to detect than GAN-based forgeries, but prompt modality (text vs. image) exerts minimal effect on detection difficulty (Bei et al., 2024).
- Boundary-aware and Multi-task Models: Multi-task learning hierarchies and segmentation/localization architectures enhance spatial/temporal detection capabilities, reflected in high AR@5 and avg.AP scores under heavy perturbations (He et al., 2021).
- Cryptographic Attacks and Defenses: Quantum adversaries necessitate randomization against superposition attacks; the GF-eval paradigm catalogues and formalizes limitations and possible constructions across both classical and quantum settings (Doosti et al., 2021).
Limitations include incomplete coverage of emerging modalities (e.g., audio-visual deepfakes), evolving generative models, and real-world deployment domain shifts. Continuous benchmark updating, multi-modal expansion, richer metrics (attribution, calibration), and self-evolving detection frameworks are outlined as forward directions (Bei et al., 2024, Wang et al., 19 Mar 2025).
7. Significance and Broader Implications
GF-eval represents a maturation of the evaluation philosophy in forgery detection and authentication, transitioning from static binary accuracy to dynamic, generalization-aware, and multi-task assessment. By providing unified protocol definitions, exhaustively labeled benchmarks, rigorous metrics, and integration across AI/crypto/quantum domains, GF-eval enables robust measurement of detector and scheme resilience in adversarially evolving contexts. Its adoption drives the iterative development of next-generation detection frameworks and cryptosystems positioned to maintain integrity despite relentless adversarial innovation. This suggests that future security-critical research and deployment initiatives will increasingly rely on GF-eval metrics—and the associated reduction in generalization gap—as a central indicator of operational trustworthiness.