Evading Defenses to Transferable Adversarial Examples by Translation-Invariant Attacks (1904.02884v1)

Published 5 Apr 2019 in cs.CV, cs.CR, and cs.LG

Abstract: Deep neural networks are vulnerable to adversarial examples, which can mislead classifiers by adding imperceptible perturbations. An intriguing property of adversarial examples is their good transferability, making black-box attacks feasible in real-world applications. Due to the threat of adversarial attacks, many methods have been proposed to improve the robustness. Several state-of-the-art defenses are shown to be robust against transferable adversarial examples. In this paper, we propose a translation-invariant attack method to generate more transferable adversarial examples against the defense models. By optimizing a perturbation over an ensemble of translated images, the generated adversarial example is less sensitive to the white-box model being attacked and has better transferability. To improve the efficiency of attacks, we further show that our method can be implemented by convolving the gradient at the untranslated image with a pre-defined kernel. Our method is generally applicable to any gradient-based attack method. Extensive experiments on the ImageNet dataset validate the effectiveness of the proposed method. Our best attack fools eight state-of-the-art defenses at an 82% success rate on average based only on the transferability, demonstrating the insecurity of the current defense techniques.

Authors (4)
  1. Yinpeng Dong (102 papers)
  2. Tianyu Pang (96 papers)
  3. Hang Su (224 papers)
  4. Jun Zhu (424 papers)
Citations (758)

Summary

Translation-Invariant Attacks on Defenses for Transferable Adversarial Examples

Introduction

The susceptibility of deep neural networks (DNNs) to adversarial perturbations—slight, often imperceptible modifications to input data that elicit incorrect outputs—has significant implications for deploying these systems in security-critical applications. Adversarial examples, particularly those generated to be transferable across different model architectures, pose a formidable threat because they make black-box attacks feasible. Although numerous defenses have been proposed to enhance model robustness, these defenses fall short against attacks specifically designed to exploit the translation invariance of convolutional networks.

This paper addresses this gap by proposing a translation-invariant attack method capable of generating highly transferable adversarial examples. The method involves creating perturbations optimized over an ensemble of translated images, thereby enhancing their transferability and efficacy against defense models.

Methodology

Translation-Invariant Attack Mechanism

The core contribution is an algorithm that generates adversarial examples by considering an ensemble of translated versions of the input image. Given a classifier $f(\mathbf{x})$ and an image $\mathbf{x}$, the proposed method optimizes a perturbation that maximizes the loss function over both the original image and its translated versions. This approach diminishes the sensitivity of the generated adversarial example to the spatial biases of the white-box model being attacked.

The mathematical formulation involves the following objective function:

$$\arg\max_{\mathbf{x}^{\text{adv}}} \; \sum_{i,j} w_{ij}\, J\!\left(T_{ij}(\mathbf{x}^{\text{adv}}),\, y\right)$$

subject to

$$\left\|\mathbf{x}^{\text{adv}} - \mathbf{x}^{\text{real}}\right\|_{\infty} \leq \epsilon.$$

Here, $T_{ij}$ denotes the operation that translates the image by $i$ and $j$ pixels along the two spatial dimensions, $J$ is the classification loss, $y$ is the ground-truth label, and $w_{ij}$ is the weight assigned to each translated version.
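
To make the formulation concrete, the brute-force form of this objective can be sketched as follows in PyTorch. This is a minimal sketch rather than the authors' implementation: `model`, `x_real`, `label`, the set of `shifts`, and their `weights` are illustrative placeholders, and the cyclic shift produced by `torch.roll` stands in for whatever boundary handling (e.g., zero padding) one prefers.

```python
import torch
import torch.nn.functional as F

def translation_ensemble_grad(model, x_adv, label, shifts, weights):
    """Weighted gradient of the loss summed over translated copies of x_adv,
    i.e. the sum over (i, j) in the objective above."""
    x_adv = x_adv.clone().detach().requires_grad_(True)
    total_loss = 0.0
    for (i, j), w in zip(shifts, weights):
        # torch.roll gives a cyclic shift; zero-padded translation is an
        # equally valid choice and only changes behavior at the boundary.
        x_shift = torch.roll(x_adv, shifts=(i, j), dims=(2, 3))
        total_loss = total_loss + w * F.cross_entropy(model(x_shift), label)
    total_loss.backward()
    return x_adv.grad.detach()

def ti_fgsm_step(model, x_adv, x_real, label, shifts, weights, alpha, eps):
    """One sign-gradient ascent step followed by the L-infinity projection."""
    g = translation_ensemble_grad(model, x_adv, label, shifts, weights)
    x_adv = x_adv.detach() + alpha * g.sign()
    return torch.clamp(x_adv, x_real - eps, x_real + eps).clamp(0, 1)
```

Evaluating the model once per shift is what makes this direct form expensive, which is the motivation for the gradient-convolution shortcut described next.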

Gradient Calculation Efficiency

An efficient gradient calculation method is developed based on the approximate translation-invariance of CNNs. Instead of evaluating gradients for all translated images, the method convolves the gradient of the untranslated image with a pre-defined kernel, significantly reducing computational overhead.
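
A minimal sketch of this shortcut is given below, assuming the kernel is a single 2-D weight matrix $W = (w_{ij})$ with odd side length and the gradient has shape `(N, C, H, W)`; the depthwise convolution is one straightforward way to apply the same kernel to every channel, and `smooth_gradient` is an illustrative helper name rather than the paper's API.

```python
import torch
import torch.nn.functional as F

def smooth_gradient(grad, kernel):
    """Convolve a gradient of shape (N, C, H, W) with one 2-D kernel,
    applied to each channel independently (a depthwise convolution).
    Assumes an odd, square kernel so that padding preserves H and W."""
    n_channels = grad.shape[1]
    k = kernel.to(dtype=grad.dtype, device=grad.device)
    k = k.expand(n_channels, 1, *kernel.shape).contiguous()  # one kernel per channel
    pad = kernel.shape[-1] // 2
    return F.conv2d(grad, k, padding=pad, groups=n_channels)
```

The smoothed gradient can then be plugged into any gradient-based attack (FGSM, MI-FGSM, DIM) in place of the raw gradient.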

Kernel Selection

Various kernels for gradient convolution, including uniform, linear, and Gaussian, are explored. Experimental results demonstrate that Gaussian and linear kernels generally yield higher success rates in black-box settings, reflecting their superior capacity to generate transferable perturbations.
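
For illustration, the three kernel families could be constructed as follows; the kernel radius `k` and the Gaussian `sigma` are assumed hyper-parameters here, and the exact parameterization used in the paper may differ.

```python
import torch

def uniform_kernel(k):
    """(2k+1) x (2k+1) kernel with equal weights, normalized to sum to one."""
    w = torch.ones(2 * k + 1, 2 * k + 1)
    return w / w.sum()

def linear_kernel(k):
    """Weights decay linearly with the shift distance from the center."""
    ramp = 1.0 - torch.abs(torch.arange(-k, k + 1, dtype=torch.float32)) / (k + 1)
    w = torch.outer(ramp, ramp)
    return w / w.sum()

def gaussian_kernel(k, sigma=3.0):
    """Separable Gaussian weights; sigma = 3.0 is an assumed default."""
    x = torch.arange(-k, k + 1, dtype=torch.float32)
    g = torch.exp(-(x ** 2) / (2 * sigma ** 2))
    w = torch.outer(g, g)
    return w / w.sum()
```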

Experimental Results

Experiments are conducted on the ImageNet dataset, targeting eight robust defense models:

  1. Inc-v3ens3, Inc-v3ens4, IncRes-v2ens (ensemble adversarially trained models)
  2. High-Level Representation Guided Denoiser (HGD)
  3. Random Resizing and Padding (R&P)
  4. JPEG Compression and Total Variation Minimization (TVM)
  5. NIPS 2017 defense competition's rank-3 submission.

Adversarial examples are crafted using both single-model and ensemble attacks on four normally trained models (Inc-v3, Inc-v4, IncRes-v2, and Res-v2-152).

Single-Model Attacks

The translation-invariant attacks consistently outperform baseline methods (FGSM, MI-FGSM, DIM) in black-box success rates, with improvements ranging from 5% to 30%. For instance, the combination of the translation-invariant method with DIM (TI-DIM) achieves a black-box success rate averaging around 60% against the defense models when the adversarial examples are crafted on the IncRes-v2 model.
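
As a rough sketch of how these pieces fit together, a TI-DIM-style loop can be written as an iterative attack that (i) applies a random resize-and-pad transform to the input (the DIM component), (ii) smooths the gradient with the pre-defined kernel (the translation-invariant component), and (iii) accumulates momentum as in MI-FGSM. The helpers `smooth_gradient` and `gaussian_kernel` are the illustrative ones above, and every hyper-parameter value below is a placeholder rather than the paper's setting.

```python
import torch
import torch.nn.functional as F

def input_diversity(x, low=299, high=330, prob=0.5):
    """DIM-style random resize-and-pad, applied with probability `prob`."""
    if torch.rand(1).item() >= prob:
        return x
    rnd = int(torch.randint(low, high, (1,)).item())
    resized = F.interpolate(x, size=(rnd, rnd), mode="nearest")
    pad = high - rnd
    left = int(torch.randint(0, pad + 1, (1,)).item())
    top = int(torch.randint(0, pad + 1, (1,)).item())
    padded = F.pad(resized, (left, pad - left, top, pad - top))
    # Resize back so the rest of the loop is size-agnostic (a simplification).
    return F.interpolate(padded, size=x.shape[-2:], mode="nearest")

def ti_dim_attack(model, x_real, label, kernel, eps=16 / 255, steps=10, mu=1.0):
    alpha = eps / steps
    x_adv = x_real.clone()
    momentum = torch.zeros_like(x_real)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(input_diversity(x_adv)), label)
        grad = torch.autograd.grad(loss, x_adv)[0]
        grad = smooth_gradient(grad, kernel)                  # translation-invariant part
        momentum = mu * momentum + grad / grad.abs().mean()   # MI-FGSM-style accumulation
        x_adv = x_adv.detach() + alpha * momentum.sign()
        x_adv = torch.clamp(x_adv, x_real - eps, x_real + eps).clamp(0, 1)
    return x_adv.detach()
```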

Ensemble-Based Attacks

Ensemble-based attacks show significant performance boosts, with TI-DIM achieving an 82% average success rate in bypassing the eight state-of-the-art defenses. This highlights the pronounced vulnerability of current defenses against strategically crafted transferable adversarial examples.
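
For the ensemble setting, one common realization (assumed here, not spelled out in this summary) is to fuse the logits of the white-box source models before computing the loss, and then run the same attack loop on the fused output.

```python
import torch

def ensemble_logits(models, x, weights=None):
    """Weighted average of the source models' logits for an ensemble attack."""
    weights = weights or [1.0 / len(models)] * len(models)
    return sum(w * m(x) for w, m in zip(weights, models))
```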

Implications and Future Work

The findings underscore a critical vulnerability in contemporary defense strategies. While these methods exhibit robustness against conventional black-box attacks, their susceptibility to translation-invariant adversarial examples questions their deployment in real-world, security-sensitive applications. Future research may involve developing more sophisticated defense mechanisms that account for spatial transformations or leveraging adversarial training against translation-invariant examples.

Conclusion

The paper demonstrates that current defenses are inadequate against translation-invariant adversarial examples by proposing and validating an effective method to generate such examples. The approach's broader implications suggest a need for re-evaluating and enhancing defense mechanisms in DNNs to secure them against such adversarial threats.


The method's implementation is available at https://github.com/dongyp13/Translation-Invariant-Attacks, facilitating further exploration and application within the research community.