Robustness of AI-Image Detectors: Fundamental Limits and Practical Attacks
In "Robustness of AI-Image Detectors: Fundamental Limits and Practical Attacks," the authors address the growing need to differentiate AI-generated images from authentic ones, given the rapid advances in generative AI models. This differentiation is critical for preventing the misuse of AI-generated content in domains including misinformation, fraud, and national security. Several methods have emerged to identify AI-generated images, with watermarking featuring prominently due to its potential to reliably trace content back to its source.
The paper presents a rigorous analysis of AI-image detectors, focusing on watermarking methods and classifier-based detectors for deepfake images. The authors draw attention to a fundamental trade-off in watermarking approaches that employ subtle perturbations (low perturbation budget methods). Specifically, they show a trade-off between the evasion error rate — the proportion of watermarked images incorrectly identified as non-watermarked — and the spoofing error rate — the fraction of non-watermarked images erroneously classified as watermarked — when detectors are subjected to diffusion purification attacks.
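Both error rates can be estimated empirically from a detector's decisions on held-out sets of watermarked and clean images. A minimal sketch, assuming a hypothetical `detector` callable that returns boolean "watermarked" predictions for a batch of images:

```python
import numpy as np

def evasion_and_spoofing_rates(detector, watermarked, non_watermarked):
    """Estimate the two error rates of a binary watermark detector.

    detector: callable mapping an image batch to boolean predictions
              (True = "watermarked"); a hypothetical interface.
    """
    # Evasion error: watermarked images the detector fails to flag.
    evasion = 1.0 - float(np.mean(detector(watermarked)))
    # Spoofing error: clean images the detector falsely flags.
    spoofing = float(np.mean(detector(non_watermarked)))
    return evasion, spoofing

# Toy detector that thresholds mean pixel intensity (illustration only).
detector = lambda imgs: imgs.mean(axis=(1, 2, 3)) > 0.5
wm = np.full((4, 8, 8, 3), 0.9)      # stand-in "watermarked" images
clean = np.full((4, 8, 8, 3), 0.1)   # stand-in "non-watermarked" images
ev, sp = evasion_and_spoofing_rates(detector, wm, clean)
```

The paper's result says that under diffusion purification, low-budget watermarking schemes cannot keep both of these numbers small at once.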
Diffusion purification, originally designed to counter adversarial examples, involves corrupting images with Gaussian noise and then employing diffusion models to remove that noise. The authors provide both theoretical and empirical evidence supporting the efficacy of this attack against low perturbation budget watermarking methods. They demonstrate the attack's success with minimal image changes, showing that watermarking methods with a low Wasserstein distance between the distributions of watermarked and non-watermarked images are vulnerable.
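The purification procedure itself is simple: add Gaussian noise, then denoise. A minimal sketch follows; in the actual attack the `denoise` step would be the reverse process of a pretrained diffusion model, but here a simple averaging filter stands in so the sketch stays self-contained:

```python
import numpy as np

def diffusion_purify(image, sigma=0.3, denoise=None, rng=None):
    """Sketch of diffusion purification: corrupt the input with
    Gaussian noise of scale sigma, then denoise it. The watermark
    perturbation is drowned in the added noise and is not recovered
    by the denoiser, while the semantic content largely survives."""
    rng = np.random.default_rng(0) if rng is None else rng
    noisy = image + sigma * rng.standard_normal(image.shape)
    if denoise is None:
        # Stand-in denoiser: local averaging via shifted copies.
        # A real attack would run a diffusion model's reverse process.
        denoise = lambda x: (x + np.roll(x, 1, 0) + np.roll(x, -1, 0)
                             + np.roll(x, 1, 1) + np.roll(x, -1, 1)) / 5.0
    return np.clip(denoise(noisy), 0.0, 1.0)

img = np.random.default_rng(1).random((32, 32))  # toy grayscale image
purified = diffusion_purify(img)
```

The noise scale `sigma` controls the trade-off: larger values destroy the watermark more reliably but also degrade image quality.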
For high perturbation budget methods, where images undergo significant alterations, diffusion purification proves ineffective. For these, the authors introduce a model substitution adversarial attack capable of removing robust watermarks. This black-box attack, notably effective against the Tree-Ring watermark, involves training a substitute classifier to distinguish watermarked from non-watermarked images. A projected gradient descent (PGD) attack on this substitute then perturbs the images, and the perturbations transfer: they fool the original watermark detector as well.
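The PGD step can be illustrated with a linear substitute classifier, for which the input gradient is available in closed form. This is a hypothetical stand-in for the trained substitute network, not the paper's implementation:

```python
import numpy as np

def pgd_remove_watermark(x, w, eps=0.05, alpha=0.01, steps=20):
    """PGD against a substitute linear classifier f(x) = w . x
    (score > 0 means "watermarked"). Perturb x within an L-infinity
    ball of radius eps to drive the score down; for a linear model
    the gradient of the score with respect to the input is just w."""
    x_adv = x.copy()
    for _ in range(steps):
        grad = w                                  # d(score)/dx, linear model
        x_adv = x_adv - alpha * np.sign(grad)     # signed gradient step
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project into eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # keep valid pixel range
    return x_adv

rng = np.random.default_rng(0)
w = rng.standard_normal(64)                       # toy substitute weights
x = np.clip(0.5 + 0.1 * np.sign(w), 0, 1)         # toy "watermarked" input
x_adv = pgd_remove_watermark(x, w)
```

With a real substitute network the gradient would come from backpropagation, but the projection-and-step structure is the same; the attack succeeds when the perturbation found on the substitute transfers to the black-box detector.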
The paper also highlights spoofing attacks, in which adversaries cause inappropriate real images to be classified as watermarked, damaging the reputation of the model's developers. Such attacks can be executed by estimating the watermark signal from watermarked noise images and adding it to clean images, misleading detectors into flagging them as watermarked.
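The averaging idea behind such a spoof can be sketched as follows. This is a simplified illustration under the assumption of a roughly image-independent, additive watermark signal; the random image content averages out across many watermarked noise images while the watermark survives:

```python
import numpy as np

def estimate_watermark_pattern(watermarked_noise_images):
    """Average many watermarked pure-noise images: the random content
    cancels toward its mean (0.5 for uniform noise) while the shared
    additive watermark signal remains. Simplified illustration."""
    return np.mean(watermarked_noise_images, axis=0) - 0.5

def spoof(clean_image, pattern, strength=1.0):
    """Add the estimated watermark pattern to a clean image so a
    detector keyed to that pattern flags it as watermarked."""
    return np.clip(clean_image + strength * pattern, 0.0, 1.0)

rng = np.random.default_rng(0)
# Toy additive watermark and 200 watermarked uniform-noise images.
true_pattern = 0.05 * np.sign(rng.standard_normal((16, 16)))
samples = np.clip(rng.random((200, 16, 16)) + true_pattern, 0, 1)
pattern = estimate_watermark_pattern(samples)
spoofed = spoof(rng.random((16, 16)), pattern)
```

The more watermarked samples the adversary collects, the cleaner the recovered pattern, and the more reliably the spoofed image triggers the detector.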
The authors further extend their theoretical framework to classifier-based deepfake detectors, highlighting a trade-off between robustness and reliability. They argue that as the distributions of real and fake images converge, maintaining detector robustness without sacrificing reliability becomes a substantial challenge.
Key contributions of the paper include:
- Establishing a fundamental trade-off in watermarking methods between evasion and spoofing errors via diffusion purification attacks.
- Developing model substitution adversarial attacks that remove robust watermarks used in AI-image detection.
- Introducing spoofing attacks against watermarking techniques that cause real, non-watermarked images to be falsely flagged as watermarked.
- Identifying a robustness-reliability trade-off for classifier-based fake image detectors, using experiments to validate these findings.
The implications of this research are significant, underscoring the difficulty of building watermark detectors that resist sophisticated attacks while maintaining low error rates. The paper makes clear that watermarking approaches must evolve to withstand not only known vulnerabilities but also emerging attack strategies. For AI-generated content in particular, robust detection tools are essential to prevent misuse of the technology. Moreover, continued advances in generative AI suggest an ongoing interplay between innovation and security, demanding sustained attention to the reliability and security of both media generation and detection technologies.