- The paper introduces a unified gradient estimation framework that leverages least squares optimization to improve black-box adversarial attacks.
- It integrates time- and data-dependent gradient priors via bandit optimization to enhance query efficiency.
- Experiments on ImageNet classifiers show the method reduces query costs by 2–4 times and lowers failure rates by up to 5 times.
Adversarial Attacks: Bandits and Priors in Black-Box Settings
The paper "Prior Convictions: Black-Box Adversarial Attacks with Bandits and Priors" by Andrew Ilyas, Logan Engstrom, and Aleksander MÄ…dry focuses on generating adversarial examples in a black-box scenario where only loss-oracle access to a model is available. The authors propose a novel framework that unifies much of the previous work on black-box attacks and introduces new techniques to improve query efficiency by integrating prior information about the gradient.
Problem Context and Motivation
The susceptibility of neural networks to adversarial examples, inputs intentionally perturbed to mislead predictions, is a critical issue in machine learning. Most successful adversarial attacks rely on the white-box threat model, which assumes access to the model's gradients. In many real-world situations, however, an attacker cannot access gradients, which motivates the study of black-box attacks. Prior work based on zeroth-order optimization and finite-difference methods has shown how to mount such attacks using only queries to the model, but at a high query cost, as the sketch below illustrates.
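For concreteness, here is a minimal sketch (in Python/NumPy, not the authors' code) of the kind of finite-difference, NES-style gradient estimator these earlier attacks rely on. The `loss_oracle` function, sample count, and smoothing parameter are illustrative assumptions; the point is that each sample costs oracle queries, so accurate estimates are expensive.

```python
# Hedged sketch: NES-style zeroth-order gradient estimation from a loss oracle.
# `loss_oracle(x)` is a hypothetical function returning the model's loss for x.
import numpy as np

def nes_gradient_estimate(loss_oracle, x, n_samples=50, sigma=0.01):
    """Estimate grad_x loss via antithetic Gaussian sampling.

    Each sample costs two oracle queries, so query cost grows linearly
    with the number of samples -- the inefficiency the paper targets.
    """
    grad = np.zeros_like(x)
    for _ in range(n_samples):
        u = np.random.randn(*x.shape)          # random search direction
        delta = (loss_oracle(x + sigma * u) -
                 loss_oracle(x - sigma * u)) / (2 * sigma)
        grad += delta * u                      # finite-difference along u
    return grad / n_samples
```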
Contributions
The paper introduces several key contributions:
- Unified Framework for Gradient Estimation: The paper formalizes gradient estimation as the core problem underlying black-box attacks. It shows that the least-squares method, common in signal processing, not only yields an optimal solution to this problem but also aligns closely with the best current black-box attack methods (see the least-squares sketch after this list).
- Incorporation of Gradient Priors: Although the least-squares estimator is optimal when nothing is assumed about the gradient, the authors argue that attackers rarely face that worst case. They identify two relevant classes of priors, a time-dependent prior (gradients at successive steps are highly correlated) and a data-dependent prior (image gradients exhibit spatial regularity), and demonstrate significant improvements in attack efficiency by integrating them.
- Bandit Optimization Approach: The researchers develop a bandit optimization framework that seamlessly integrates these priors into the gradient estimation process. Their approach significantly reduces query requirements by a factor of two to four and decreases failure rates by two to five times compared to the state of the art.
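To make the least-squares view from the first bullet concrete, the following hedged sketch shows the standard linear-regression estimate of a gradient from query directions and finite-difference responses. The names `A`, `y`, and `least_squares_gradient` are illustrative assumptions, not the paper's notation.

```python
# Hedged sketch: gradient estimation as linear regression.
# A is a (k, d) matrix whose rows are query directions a_i; y is a length-k
# vector of finite-difference responses y_i ~ <a_i, g>.
import numpy as np

def least_squares_gradient(A, y):
    """Return the least-squares estimate of g minimizing ||A g - y||^2."""
    g_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    return g_hat
```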
Methodology
The central technical idea is to treat gradient estimation as a classical linear regression problem, for which the least-squares method provides the baseline solution. The authors argue that assuming the adversary is completely ignorant of the gradient is overly conservative: natural structure in the data (e.g., spatial regularity in images) and correlation between gradients at successive iterations both offer exploitable side information. Their bandit optimization framework folds this extra structure into the estimate iteratively and at low computational cost; a simplified sketch of one such update step follows.
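Below is a hedged, simplified sketch of what one iteration of such a bandit-style update might look like for an ℓ∞ attack. The function name, hyperparameters, and exact update rule are illustrative stand-ins rather than the authors' algorithm; in particular, the projection of the image onto the allowed perturbation ball and pixel-range clipping are omitted for brevity.

```python
# Hedged sketch: one step of a bandit-style attack with a time-dependent prior.
# A latent prior vector v is carried across iterations and nudged toward the
# gradient using two loss-oracle queries per step; `loss_oracle` is hypothetical.
import numpy as np

def bandit_attack_step(loss_oracle, x, v, fd_eta=0.1, prior_lr=0.1,
                       image_lr=0.01, exploration=0.01):
    # Antithetic perturbations of the prior to probe gradient alignment.
    u = np.random.randn(*x.shape)
    q1 = v + exploration * u
    q2 = v - exploration * u

    # Finite-difference estimate of which probe aligns better with the gradient.
    d = (loss_oracle(x + fd_eta * q1) - loss_oracle(x + fd_eta * q2)) / fd_eta

    # Bandit update: move the prior toward the better-aligned direction.
    v = v + prior_lr * d * u

    # Use the signed prior as the gradient estimate for an l_inf PGD-style step.
    # (Projection onto the epsilon-ball and clipping to [0, 1] omitted here.)
    x = x + image_lr * np.sign(v)
    return x, v
```

In this view, the two extra queries per step are spent refining the prior rather than re-estimating the gradient from scratch, which is where the query savings come from.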
Results
The paper reports an extensive evaluation of their approach on the ImageNet dataset using Inception-v3, ResNet-50, and VGG16 classifiers across two adversarial perturbation metrics (the ℓ∞ and ℓ2 norms). The approach named "BanditsTD," which integrates both time- and data-dependent priors, demonstrated substantially improved query efficiency and reduced failure rates compared to previous methods like NES. The improvements are consistent across different classifiers and threat models, highlighting the robustness of the method.
Implications and Future Work
This work raises important implications for both adversarial attack efficacy and defense strategies. By acknowledging and incorporating inherent structures in data and optimization trajectories, the paper pioneers a methodology that can be generalized beyond specific datasets or models. It encourages the exploration of additional priors in different contexts and opens a path for more efficient attack strategies in black-box settings.
From a defensive standpoint, understanding the specific weaknesses that priors exploit could lead to more robust models capable of resisting such informed attacks. Future research could investigate the transferability of this framework to other domains or adversarial settings and explore the intersection with modern machine learning models that assume different forms of data distribution or problem structures.
Overall, the paper provides a comprehensive and innovative approach that advances the state of the art in black-box adversarial attacks, highlighting the interplay between optimization, prior knowledge, and adversarial methodologies in AI.