Translation-Invariant Attacks on Defenses for Transferable Adversarial Examples
Introduction
The susceptibility of deep neural networks (DNNs) to adversarial perturbations (slight, often imperceptible modifications to input data that elicit incorrect outputs) has significant implications for the deployment of these systems in security-critical applications. Adversarial examples, particularly those crafted to transfer across different model architectures, pose a formidable threat because they make black-box attacks feasible. Despite numerous defenses proposed to enhance model robustness, current methods fall short against attacks specifically designed to exploit the translation invariance of convolutional neural networks.
This paper addresses this gap by proposing a translation-invariant attack method capable of generating highly transferable adversarial examples. The method involves creating perturbations optimized over an ensemble of translated images, thereby enhancing their transferability and efficacy against defense models.
Methodology
Translation-Invariant Attack Mechanism
The core contribution is an algorithm that generates adversarial examples by considering an ensemble of translated versions of the input image. Given a classifier with loss function $J$ and an input image $x$ with ground-truth label $y$, the proposed method optimizes a perturbation that maximizes the loss over both the original image and its translated versions. This approach makes the generated adversarial example less sensitive to the spatial biases of the white-box model being attacked, improving its transferability.
The mathematical formulation involves the following objective function:
\[
\arg\max_{x^{adv}} \; \sum_{i,j} w_{ij}\, J\!\left(T_{ij}(x^{adv}),\, y\right),
\quad \text{subject to} \quad \left\lVert x^{adv} - x \right\rVert_{\infty} \le \epsilon .
\]
Here, $T_{ij}(\cdot)$ represents translation by $i$ and $j$ pixels along the two spatial dimensions, $w_{ij}$ are the weighting factors for each translated version, $J$ is the classification loss, and $\epsilon$ bounds the $\ell_\infty$ norm of the perturbation.
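To make the objective concrete, the following PyTorch sketch evaluates the weighted loss over an ensemble of shifted copies of the adversarial image. The function name `translated_loss`, the use of `torch.roll` (a circular shift standing in for the zero-padded translation $T_{ij}$), and the choice of cross-entropy as $J$ are illustrative assumptions rather than details taken from the paper's implementation.

```python
import torch
import torch.nn.functional as F

def translated_loss(model, x_adv, y, weights, max_shift=7):
    """Weighted classification loss over an ensemble of shifted copies of x_adv.

    weights is a (2*max_shift+1, 2*max_shift+1) tensor; weights[i+max_shift, j+max_shift]
    is the weight w_ij for the (i, j)-pixel shift.
    """
    loss = 0.0
    for i in range(-max_shift, max_shift + 1):
        for j in range(-max_shift, max_shift + 1):
            # T_ij(x_adv): shift the image by i and j pixels (circular shift as a simplification)
            shifted = torch.roll(x_adv, shifts=(i, j), dims=(2, 3))
            loss = loss + weights[i + max_shift, j + max_shift] * F.cross_entropy(model(shifted), y)
    return loss
```

Optimizing this loss directly requires $(2k+1)^2$ forward-backward passes per step, which motivates the efficient approximation described next.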
Gradient Calculation Efficiency
An efficient gradient calculation method is developed based on the approximate translation invariance of CNNs. Instead of computing gradients for all translated images separately, the method convolves the gradient of the loss at the untranslated image with a pre-defined kernel composed of the weights $w_{ij}$, significantly reducing computational overhead.
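A minimal sketch of one such attack step is given below, assuming a PyTorch classifier `model`, inputs scaled to $[0, 1]$, and illustrative step size and perturbation budget (`alpha`, `eps`); it is a hedged illustration of the convolution-based approximation rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def ti_fgsm_step(model, x_adv, x_orig, y, kernel, alpha=1.0 / 255, eps=16.0 / 255):
    """One translation-invariant FGSM-style step: smooth the gradient with a
    translation kernel, then take a sign step and project back into the eps-ball."""
    x_adv = x_adv.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]          # gradient at the untranslated image only
    c, k = grad.shape[1], kernel.shape[-1]
    w = kernel.to(grad).view(1, 1, k, k).repeat(c, 1, 1, 1)
    grad = F.conv2d(grad, w, padding=k // 2, groups=c)  # depthwise convolution with the kernel
    x_adv = x_adv.detach() + alpha * grad.sign()
    x_adv = torch.min(torch.max(x_adv, x_orig - eps), x_orig + eps)  # L-infinity projection
    return x_adv.clamp(0, 1)
```

Iterating this step, optionally with a momentum accumulator as in MI-FGSM or with input diversity as in DIM, yields the translation-invariant variants evaluated in the experiments.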
Kernel Selection
Various kernels for gradient convolution, including uniform, linear, and Gaussian, are explored. Experimental results show that the Gaussian and linear kernels generally yield higher black-box success rates than the uniform kernel, indicating a better capacity to generate transferable perturbations.
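For reference, one way to construct these kernels is sketched below; the kernel radius and Gaussian standard deviation are illustrative defaults, not necessarily the paper's exact settings.

```python
import torch

def uniform_kernel(radius=7):
    """All (2*radius+1)^2 shifts weighted equally."""
    size = 2 * radius + 1
    return torch.full((size, size), 1.0 / (size * size))

def linear_kernel(radius=7):
    """Weights decay linearly with the shift distance in each direction."""
    ax = 1.0 - torch.abs(torch.arange(-radius, radius + 1, dtype=torch.float32)) / (radius + 1)
    k = torch.outer(ax, ax)
    return k / k.sum()

def gaussian_kernel(radius=7, sigma=3.0):
    """2-D Gaussian weights, normalized to sum to 1."""
    ax = torch.arange(-radius, radius + 1, dtype=torch.float32)
    xx, yy = torch.meshgrid(ax, ax, indexing="ij")
    k = torch.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return k / k.sum()
```

Any of these can be passed as `kernel` to the attack step above, e.g. `ti_fgsm_step(model, x_adv, x, y, gaussian_kernel())`.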
Experimental Results
Experiments are conducted on the ImageNet dataset, targeting eight robust defense models:
- Inc-v3\textsubscript{ens3}, Inc-v3\textsubscript{ens4}, IncRes-v2\textsubscript{ens}
- High-Level Representation Guided Denoiser (HGD)
- Random Resizing and Padding (R&P)
- JPEG Compression
- Total Variation Minimization (TVM)
- The rank-3 submission (NIPS-r3) to the NIPS 2017 defense competition
Adversarial examples are crafted using both single-model and ensemble attacks on four normally trained models (Inc-v3, Inc-v4, IncRes-v2, and Res-v2-152).
Single-Model Attacks
The translation-invariant attacks consistently outperform the baseline methods (FGSM, MI-FGSM, DIM) in black-box success rates against the defenses, with improvements ranging from 5\% to 30\%. For instance, combining the translation-invariant method with the diverse input method (TI-DIM; see the sketch below) yields a black-box success rate averaging around 60\% against the defense models when adversarial examples are crafted on the IncRes-v2 model alone.
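Since TI-DIM builds on the diverse input method (DIM), a brief sketch of the input-diversity transform it adds is given below; the resize range of 299 to 330 pixels and the transform probability of 0.7 are common settings for Inception-sized inputs and are assumed here rather than taken from this paper. In TI-DIM, the gradient is computed on this diversified input and then smoothed with the translation kernel before the sign step.

```python
import random
import torch
import torch.nn.functional as F

def diverse_input(x, low=299, high=330, prob=0.7):
    """DIM-style input diversity: with probability `prob`, randomly resize the
    image and zero-pad it back to `high` x `high` at a random offset."""
    if random.random() > prob:
        return x
    rnd = random.randint(low, high - 1)
    resized = F.interpolate(x, size=(rnd, rnd), mode="nearest")
    pad_total = high - rnd
    pad_left = random.randint(0, pad_total)
    pad_top = random.randint(0, pad_total)
    # F.pad takes (left, right, top, bottom) padding for the last two dimensions
    return F.pad(resized, (pad_left, pad_total - pad_left, pad_top, pad_total - pad_top))
```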
Ensemble-Based Attacks
Ensemble-based attacks boost performance further, with TI-DIM reaching an average success rate of 82\% against the eight state-of-the-art defenses. This highlights the pronounced vulnerability of current defenses to strategically crafted transferable adversarial examples.
Implications and Future Work
The findings underscore a critical vulnerability in contemporary defense strategies. While these methods exhibit robustness against conventional black-box attacks, their susceptibility to translation-invariant adversarial examples calls into question their deployment in real-world, security-sensitive applications. Future research may involve developing defense mechanisms that explicitly account for spatial transformations, or adversarial training against translation-invariant examples.
Conclusion
The paper demonstrates that current defenses are inadequate against translation-invariant adversarial examples by proposing and validating an effective method to generate such examples. The approach's broader implications suggest a need for re-evaluating and enhancing defense mechanisms in DNNs to secure them against such adversarial threats.
An implementation of the method is available at \url{https://github.com/dongyp13/Translation-Invariant-Attacks}, facilitating further exploration and application within the research community.