
Black-Box Forgery Attacks

Updated 4 February 2026
  • Black-box forgery attacks are adversarial techniques that manipulate system inputs using only observable behaviors to counterfeit authenticity and evade detection.
  • They exploit methods such as surrogate transfer, latent inversion, and query-efficient as well as frequency-based attacks under restricted access conditions.
  • These attacks pose significant threats to security domains like watermarking, face recognition, and recommender systems, driving the need for advanced robust defenses.

Black-box forgery attacks are adversarial strategies that manipulate inputs to machine learning systems or generative models to evade detection, counterfeit authenticity signals, or produce outputs attributed to a legitimate actor, all under conditions where the attacker lacks access to model parameters, internal architectures, or often even direct output gradients. These attacks fundamentally exploit observable input-output behavior, system APIs, or weak publicly available proxies to achieve their objectives. Black-box forgery is now a central challenge for security, anti-fraud, watermarking, recommendation systems, and media authentication, with critical implications across domains such as image synthesis, face/voice recognition, and recommender robustness.

1. Formal Threat Models and Attack Paradigms

The black-box forgery landscape is characterized by restricted attacker knowledge and access: depending on the setting, the adversary may observe only hard-label decisions, soft-label output distributions, or a bounded query budget, or may rely entirely on transfer from surrogates trained on public data.

Within these threat models, the attacker's objective may be to produce counterfeit examples (images, profiles, audio, etc.) that are classified as genuine, embed a target watermark, evade detectors, or amplify or degrade recommendations for specific entities.

2. Methodologies for Black-Box Forgery

Most black-box forgery attacks leverage one or more of the following methodologies:

2.1 Surrogate/Transfer-based Attacks

The attacker trains a high-performing surrogate model—usually of similar architecture—on public data, crafts adversarial inputs (via FGSM, PGD, or other methods), and then deploys these against the true black-box model (Liu et al., 2019, Kilcher et al., 2017, Dong et al., 2022). Effectiveness correlates with how closely the surrogate matches the target in capacity and architecture, and with how much label-distribution or softmax output information the attacker can exploit.
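
As a minimal illustration of the surrogate-transfer idea, the toy sketch below crafts an FGSM perturbation against a logistic-regression surrogate (whose gradient is analytic) and applies it to a hidden linear "target." All weights and the `fgsm` helper are hypothetical stand-ins, not any paper's implementation.

```python
import numpy as np

def fgsm(x, w, b, y, eps):
    """FGSM on a logistic-regression surrogate: x' = x + eps * sign(dL/dx).
    For L = log(1 + exp(-y * (w.x + b))) with y in {-1, +1}, the gradient is
    dL/dx = -y * sigmoid(-y * (w.x + b)) * w, so its sign is sign(-y * w)."""
    margin = y * (x @ w + b)
    grad = -y * (1.0 / (1.0 + np.exp(margin))) * w
    return x + eps * np.sign(grad)

# Hidden "target" weights (unknown to the attacker) and an imperfect
# surrogate that merely agrees with the target in coordinate signs.
w_true = np.array([1.0, -1.0, 0.5, 2.0])
w_sur = np.array([0.8, -1.2, 0.9, 1.5])
x = np.array([0.5, 0.5, 1.0, -0.2])
y = np.sign(x @ w_true)                  # genuine label from the target

x_adv = fgsm(x, w_sur, 0.0, y, eps=0.5)  # crafted on the surrogate only
print("target decision flipped:", np.sign(x_adv @ w_true) != y)
```

Transfer succeeds here because the surrogate's gradient signs match the target's, the same similarity dependence noted above.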

2.2 Latent Inversion and Proxy Generative Models

With generative models, particularly diffusion architectures, attackers invert outputs using public proxy models to estimate latent codes, then modify new covers or seeds to match the target watermark signature (Müller et al., 2024, Jain et al., 27 Apr 2025). This approach is facilitated by the many-to-one mapping between images and initial latent noise, allowing the attacker to optimize or reuse latents to forge or remove watermarks.
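
A linear toy model of this forging pipeline, with an orthogonal matrix standing in for the generator's decoder, a noisy copy as the attacker's public proxy, and sign bits in the first latent coordinates as the watermark; all constructions here are illustrative assumptions, not the cited schemes:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 16, 8
bits = rng.choice([-1.0, 1.0], size=k)        # owner's watermark bits

# "Decoder" of the generator: orthogonal for a well-conditioned toy.
A_true, _ = np.linalg.qr(rng.normal(size=(n, n)))
A_proxy = A_true + 0.02 * rng.normal(size=(n, n))   # attacker's public proxy

# Owner embeds the watermark as the signs of the first k latent entries.
z = rng.normal(size=n)
z[:k] = bits                                   # unit-magnitude sign embedding
watermarked_img = A_true @ z

def detect(img):
    """Owner-side detector: invert with the true decoder, read sign bits."""
    z_inv = np.linalg.solve(A_true, img)
    return bool(np.all(np.sign(z_inv[:k]) == bits))

# Attacker: invert a single watermarked reference with the proxy decoder,
# graft the recovered watermark region onto a fresh cover latent, and
# decode with the proxy.
z_hat = np.linalg.pinv(A_proxy) @ watermarked_img
z_forge = rng.normal(size=n)
z_forge[:k] = z_hat[:k]
forged_img = A_proxy @ z_forge

print("forged image detected as watermarked:", detect(forged_img))
```

Because the proxy approximates the true decoder, the grafted latent survives the owner's inversion, mirroring how one watermarked reference suffices in the cited attacks.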

2.3 Query-Efficient or Gradient-Sign Attacks

Algorithms such as SignHunter recover adversarial directions by efficiently estimating the sign of the loss gradient using minimal queries and divide-and-conquer techniques (Al-Dujaili et al., 2019). Such attacks are highly query-efficient, hyperparameter-free, and have been shown to match or surpass prior state-of-the-art black-box methods under both ℓ∞ and ℓ2 constraints.
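
The divide-and-conquer sign recovery can be sketched on a linear toy loss, where the directional finite difference is exact. The `signhunter` helper below is a simplified illustration of the idea, not the published algorithm:

```python
import numpy as np

def signhunter(loss, x, n, delta=0.1, budget=64):
    """Recover sign(grad) using only loss-value queries: start with all +1,
    flip progressively smaller chunks of the sign vector, and keep a flip
    whenever the directional finite difference along it improves."""
    s = np.ones(n)
    base = loss(x)
    best = (loss(x + delta * s) - base) / delta
    queries = 2
    chunk = n
    while chunk >= 1:
        for start in range(0, n, chunk):
            if queries >= budget:
                return s, queries
            s_try = s.copy()
            s_try[start:start + chunk] *= -1
            val = (loss(x + delta * s_try) - base) / delta
            queries += 1
            if val > best:
                best, s = val, s_try
        chunk //= 2
    return s, queries

g_hidden = np.array([0.7, -1.2, 0.3, -0.4, 2.0, -0.1, 0.9, -2.5])
oracle = lambda x: x @ g_hidden          # black-box loss, linear toy
s, q = signhunter(oracle, np.zeros(8), 8)
print(s, q)
```

For a linear loss the final single-coordinate pass provably corrects every wrong sign (flipping coordinate i changes the objective by -2·g_i·s_i, positive exactly when s_i disagrees with sign(g_i)), so the full gradient sign is recovered in far fewer queries than coordinate-wise finite differences.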

2.4 Frequency and Decision-Based Attacks

Recent works deploy attacks in the frequency domain, targeting statistical signatures that detectors use (e.g., DCT coefficients in face forgery detection), or conduct decision-based optimization by only relying on the final class/decision output, sometimes utilizing cross-task initialization or fusion modules to preserve quality and stealth (Jia et al., 2022, Chen et al., 2023).
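
A small sketch of a frequency-domain perturbation (the cited papers operate on DCT coefficients; a plain FFT keeps this numpy-only). Damping high-frequency bins shifts the statistic a frequency-based detector keys on while leaving the low-frequency content, i.e. the visible image, essentially intact. The helper name and thresholds are illustrative:

```python
import numpy as np

def suppress_high_freq(img, keep=0.25, strength=0.5):
    """Dampen high-frequency FFT magnitudes of a single-channel image.
    `keep` is the normalized frequency radius left untouched; `strength`
    is the fraction by which higher-frequency magnitudes are reduced."""
    F = np.fft.fft2(img)
    h, w = img.shape
    fy = np.minimum(np.arange(h), h - np.arange(h))[:, None] / (h / 2)
    fx = np.minimum(np.arange(w), w - np.arange(w))[None, :] / (w / 2)
    high = np.maximum(fy, fx) > keep       # symmetric high-frequency mask
    F[high] *= (1.0 - strength)            # dampen the detector's evidence
    return np.real(np.fft.ifft2(F))

rng = np.random.default_rng(2)
img = rng.normal(size=(32, 32))
adv = suppress_high_freq(img)
F0, F1 = np.abs(np.fft.fft2(img)), np.abs(np.fft.fft2(adv))
```

The mask is symmetric in frequency index, so the perturbed spectrum stays conjugate-symmetric and the output remains real-valued.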

2.5 Adversarial Profile Injection

In recommender systems, attackers inject fake user profiles, optimizing their structure to manipulate item rankings without full system introspection. Knowledge graphs and public item features can be integrated into attack policies via reinforcement learning for more effective black-box attacks (Chen et al., 2022).
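
The injection mechanics can be shown on a toy mean-rating recommender (no RL or knowledge graph here, just the profile-injection step; all names are illustrative). Fake profiles give the target item the maximum rating and draw filler ratings from the global distribution so they look plausible:

```python
import numpy as np

rng = np.random.default_rng(3)
ratings = rng.integers(1, 6, size=(50, 10)).astype(float)  # 50 users, 10 items
target = 7

def rank_of(r, item):
    """1-based rank of `item` under mean-rating popularity."""
    means = r.mean(axis=0)
    return int(np.where(np.argsort(-means) == item)[0][0]) + 1

before = rank_of(ratings, target)

# Inject fake user profiles: max rating for the target item, random
# filler ratings elsewhere to mimic genuine users.
n_fake = 10
fake = rng.integers(1, 6, size=(n_fake, 10)).astype(float)
fake[:, target] = 5.0
after = rank_of(np.vstack([ratings, fake]), target)
print(before, "->", after)
```

Since the fake profiles rate the target at the scale maximum, its mean rating can only rise, which is the promotion objective; the RL and knowledge-graph machinery in the cited work is about choosing filler items and profile structure under query constraints.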

A summary table of selected attack paradigms:

Paradigm             | Key Mechanism                | Canonical Applications
---------------------|------------------------------|------------------------------
Surrogate Transfer   | Substitute model crafting    | ASV, classification, DeepFake
Latent Inversion     | Proxy VAE/DDIM inversions    | Diffusion watermark attacks
Frequency-Domain     | DCT/frequency perturbations  | Face forgery, deepfakes
Query-efficient Sign | Sign-based finite difference | Image classifiers, biometrics
Profile Injection    | Hierarchical RL + KG         | Recommender manipulation

3. Empirical Evaluation and Security Impact

3.1 Attack Success Metrics

Quantitative empirical analyses across domains highlight the potency and generalizability of black-box forgery attacks:

  • Adversarial example generation: On ASV spoofing countermeasures, transfer-based PGD attacks using surrogates drive Equal Error Rate (EER) well above 50% in black-box settings, with PGD outperforming FGSM as perturbation budgets increase (Liu et al., 2019).
  • Diffusion watermark forging: Latent-noise watermark schemes (Tree-Ring, Gaussian Shading) exhibit 79–100% attack success rates (detection or attribution via p-value/bit-accuracy) across both SD v1.4 and v2.0 models using only a single watermarked reference, with minimal perceptual distortion (LPIPS < 0.35, SSIM > 0.75, PSNR > 28 dB) (Jain et al., 27 Apr 2025, Müller et al., 2024).
  • Face forgery detection: Hybrid frequency–spatial attacks yield attack success rates up to ~50% in strict black-box transfer (e.g., ResNet-50→Xception) and up to ~80–100% for frequency-based detectors (Jia et al., 2022).
  • Recommendation attacks: Knowledge-graph–guided black-box profile injection via hierarchical RL achieves effective item demotion or promotion, even under strong system opacity (Chen et al., 2022).
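
The EER figures above have a mechanical definition: sweep a decision threshold over genuine and impostor scores and find where false-accept and false-reject rates cross. A minimal sketch:

```python
import numpy as np

def eer(genuine, impostor):
    """Equal Error Rate: the operating point where the false-accept rate
    (impostors accepted) equals the false-reject rate (genuines rejected),
    returned as the midpoint at the closest threshold."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])
    frr = np.array([(genuine < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2

# Well-separated scores give a low EER; adversarially overlapped scores
# push the EER toward 50%, the "coin-flip" regime reported above.
clean_eer = eer(np.array([0.8, 0.9, 0.85, 0.95]), np.array([0.1, 0.2, 0.15, 0.05]))
attacked_eer = eer(np.array([0.4, 0.6, 0.5, 0.55]), np.array([0.45, 0.5, 0.55, 0.6]))
print(clean_eer, attacked_eer)
```

An EER above 50% means the attacked scores are actively misleading: impostor (spoofed) inputs score higher than genuine ones on average.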

3.2 Visual and Perceptual Quality

Black-box forgery methods have minimized perceptual artifacts: frequency-domain attacks and distribution-aware optimization (e.g., explicit SSIM/LPIPS minimization) produce near-imperceptible changes, confirmed empirically via human observer studies and distortion metrics (Jia et al., 2022, Li et al., 2020). In biometric and forgery-sensitive domains, this imperceptibility is crucial for practical attack viability.
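
Of the distortion metrics above, PSNR is simple enough to state inline (SSIM and LPIPS require reference implementations); a minimal sketch:

```python
import numpy as np

def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio in dB; higher means less distortion.
    `peak` is the maximum possible pixel value (1.0 for normalized images)."""
    mse = np.mean((a - b) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

ref = np.zeros((8, 8))
noisy = ref + 0.01            # uniform perturbation of 0.01
print(psnr(ref, noisy))       # mse = 1e-4, so ~40 dB
```

The reported PSNR > 28 dB for forged images corresponds to perturbations far smaller than this toy's, consistent with the near-imperceptibility claims.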

3.3 Generalization and Cross-Model Transfer

Many attack approaches generalize across model families—latent inversion attacks succeed even when proxy and target use distinct architectures (UNet vs. DiT) or divergent training data (Müller et al., 2024, Jain et al., 27 Apr 2025). Black-box DeepFake attacks using only a substitute autoencoder as a surrogate can degrade unseen face-swapping models and even cross into other face-editing domains (StarGAN, AttGAN) (Dong et al., 2022).

4. Defenses and Countermeasures

4.1 Adversarial Robustness Techniques

Proposed defenses include adversarial training (inclusion of attack-generated examples), input-processing (e.g., feature denoising, randomization), certified robustness (e.g., Lipschitz regularization, interval bound propagation), and the development of ensemble or smoothing strategies (Liu et al., 2019, Li et al., 2020).

4.2 Defense by Output Obfuscation

Output label perturbation, in which the model slightly perturbs the returned distribution without changing the top-1 decision, can render substitute training ineffective by causing attacker gradients to diverge, thereby foiling transfer-based attacks when soft-label queries are allowed (Kilcher et al., 2017).
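
A minimal sketch of such label perturbation, resampling noise until the top-1 class is preserved (the resampling loop is an illustrative choice, not the cited paper's exact mechanism):

```python
import numpy as np

def obfuscate(probs, rng, scale=0.05):
    """Return a noisy, renormalized copy of a softmax vector whose
    argmax matches the original: the honest hard decision survives,
    while a surrogate trained on these soft labels sees inconsistent
    gradients across repeated queries."""
    top = np.argmax(probs)
    while True:
        noisy = probs + scale * rng.random(len(probs))
        noisy /= noisy.sum()
        if np.argmax(noisy) == top:
            return noisy

rng = np.random.default_rng(4)
p = np.array([0.6, 0.3, 0.1])
q = obfuscate(p, rng)
```

Because each query returns a differently perturbed distribution, substitute training averages over contradictory targets, which is what foils the transfer step.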

4.3 Watermarking with Semantic Binding

Recent results demonstrate that most diffusion watermarking schemes based purely on inverting and decoding the initial latent are fundamentally vulnerable. Binding the watermark signature to image semantics via learned contrastive masks—such as in the SemBind framework—can remediate this vulnerability. SemBind achieves a tunable drop in black-box forgery success (e.g., from ~100% to <10%) while preserving image quality and robustness (Zhang et al., 28 Jan 2026).
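
The binding idea can be caricatured with a keyed hash: derive the expected watermark bits from coarse semantic features of the image, so bits copied from one image fail to verify against different content. Everything here (`semantic_bits`, the sign-quantized features) is a hypothetical sketch, not the SemBind construction:

```python
import hashlib
import numpy as np

def semantic_bits(key, features, k=8):
    """Derive k expected watermark bits from a keyed hash of coarsely
    quantized semantic features. A forger who grafts a watermark onto
    new content changes the features, hence the expected bits."""
    coarse = np.sign(features).astype(np.int8).tobytes()
    digest = hashlib.sha256(key + coarse).digest()
    return [digest[i] & 1 for i in range(k)]

key = b"owner-secret"                      # private to the watermark owner
feats_a = np.array([0.9, -0.2, 0.4])       # semantic embedding of image A
feats_b = np.array([-0.5, 0.7, 0.1])       # different content, image B
bits_a = semantic_bits(key, feats_a)
bits_b = semantic_bits(key, feats_b)
```

The verifier recomputes the expected bits from the image itself, so a watermark signature is only valid for content with matching semantics, which is the property the latent-grafting attacks violate.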

Defense              | Mechanism                    | Limitation
---------------------|------------------------------|---------------------------------------------------
Adversarial Training | Incorporate attack examples  | Computational cost; often only partially effective
Output Obfuscation   | Perturb label distributions  | Inapplicable when the API returns hard labels only
Semantic Binding     | Bind watermark to semantics  | Requires extra generation, private mask function

5. Limitations, Open Problems, and Future Directions

5.1 Fundamental Vulnerabilities

Current inversion-based and frequency-domain watermark schemes (Tree-Ring, Gaussian Shading, RingID, etc.) are fundamentally broken under black-box threat models: a single watermarked reference suffices for near-perfect forgery, and no simple detection threshold can distinguish legitimate from forged or counterfeited images under realistic image transformations (Müller et al., 2024, Jain et al., 27 Apr 2025).

5.2 Persisting Technical Challenges

  • Surrogate-generalization gap: While similarity between surrogate and target enhances transferability, architectural or training-data mismatch reduces but does not eliminate attack success.
  • Detection-evasion arms race: Frequency attacks and perceptual loss minimization force defenders to constantly revise detection and verification pipelines (Jia et al., 2022, Li et al., 2020).
  • Limited efficacy of threshold-based defenses: Even fine-tuned detection thresholds fail against adaptive black-box attackers in watermarking schemes (Müller et al., 2024, Zhang et al., 28 Jan 2026).

5.3 Promising Research Directions

  • Semantically coupled watermarking: New frameworks (e.g., SemBind) that link watermark decodability to semantic image content, with formal undetectability guarantees and empirical resistance to proxy-based attacks (Zhang et al., 28 Jan 2026).
  • Structure-aware attacks and defenses: Incorporating structural priors and compositionality in both attack (e.g., superpixels, binary partitioning) and defense (e.g., feature-space manifold projection) (Al-Dujaili et al., 2019).
  • Cross-domain and modality-agnostic methods: Extending attack and defense frameworks to audio, tabular, and other data types, leveraging domain-adaptive representations (Liu et al., 2019, Dvořáček et al., 2023).

6. Domain-Specific Applications

6.1 Recommender Systems

Attacks inject adversarial user profiles, leveraging public knowledge graphs via hierarchical RL to optimize attack efficacy under black-box constraints (Chen et al., 2022).

6.2 Face and Media Forgery

  • Face swapping and DeepFake protection: Autoencoder-based TCA-GAN attackers disrupt black-box swappers by maximizing latent divergence and post-regularization, improving the effectiveness of DeepFake detection (Dong et al., 2022).
  • Face forgery detection: Frequency-based attacks succeed against both spatial and frequency-domain detectors, demonstrating the importance of hybrid and ensemble attack strategies (Jia et al., 2022, Chen et al., 2023).

6.3 Diffusion Watermarking

Latent-noise and semantic watermark schemes are vulnerable to attacks based on proxy inversion and latent modulation, which enable forging or removal at high fidelity under black-box assumptions (Müller et al., 2024, Jain et al., 27 Apr 2025). Semantic binding of watermarks appears to be a critical direction for future defense (Zhang et al., 28 Jan 2026).

7. Summary and Outlook

Black-box forgery attacks have become an acute and generalized threat across machine learning security, digital watermarking, recommender integrity, and forgery detection. The arms race between attackers exploiting transferability, latent inversion, and query-efficient search, and defenders developing semantically bound and perceptually robust countermeasures, will drive innovation in both attack methodology and provable model robustness. Large-scale empirical evaluations, model-agnostic protocols, and application-specific strategies rooted in fundamental statistical and representation-theoretic analysis remain essential for the advancement of the field.


References:

Liu et al., 2019
Kilcher et al., 2017
Chen et al., 2022
Jia et al., 2022
Chen et al., 2023
Müller et al., 2024
Jain et al., 27 Apr 2025
Li et al., 2020
Dvořáček et al., 2023
Dong et al., 2022
Al-Dujaili et al., 2019
Zhang et al., 28 Jan 2026
