- The paper introduces novel threat models and a NES-based attack method that achieves success rates of 90% or higher in the query-limited, partial-information, and label-only settings.
- It combines NES-based gradient estimation with Projected Gradient Descent (PGD) to substantially reduce query counts, enabling targeted attacks on an ImageNet classifier and the Google Cloud Vision API.
- Empirical evaluations report a 99.2% success rate in the query-limited setting, highlighting the attack's robustness and practical implications for adversarial ML security.
Black-box Adversarial Attacks with Limited Queries and Information
The paper "Black-box Adversarial Attacks with Limited Queries and Information" presents a substantial contribution to the field of adversarial machine learning by defining new threat models and developing corresponding attack methodologies. The authors, Andrew Ilyas, Logan Engstrom, Anish Athalye, and Jessy Lin from MIT and LabSix, explore the vulnerability of neural network-based classifiers to adversarial examples under restricted black-box settings, demonstrating the efficacy of their methods on both an ImageNet classifier and the Google Cloud Vision API.
Threat Models
The paper identifies three realistic threat models for black-box adversarial attacks:
- Query-limited setting: This model restricts the number of queries an attacker can make to the classifier, reflecting practical limitations such as monetary costs or time constraints.
- Partial-information setting: In this setting, the attacker has access only to the probabilities (or confidence scores, which need not sum to one) of the top-k classes, rather than the full output distribution.
- Label-only setting: Here, the attacker can only observe the sorted order of the top-k inferred labels without any accompanying probability scores or confidence metrics.
Methodological Contributions
Query-limited Setting
For the query-limited setting, the authors propose using Natural Evolutionary Strategies (NES) for efficient black-box gradient estimation. NES is shown to significantly reduce the number of required queries compared to previous methods such as finite differences. The proposed attack uses NES to estimate gradients and Projected Gradient Descent (PGD) to generate adversarial examples within a query budget. This method demonstrates successful targeted attacks with far fewer queries, highlighting its practicality in real-world scenarios where query limitations are stringent.
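The following is a minimal NumPy sketch of this query-limited attack, not the authors' implementation: `query_prob(x, target)` is an assumed black-box oracle returning the target class's probability for an image, and the hyperparameters (sample count, search variance, step size, epsilon) are illustrative placeholders rather than the paper's tuned values.

```python
import numpy as np

def nes_gradient(x, target, query_prob, n_samples=50, sigma=0.001):
    """Estimate the gradient of the target-class probability via NES with
    antithetic sampling. `query_prob(x, target)` is an assumed black-box
    oracle returning a scalar probability in [0, 1]."""
    grad = np.zeros_like(x)
    for _ in range(n_samples // 2):
        u = np.random.randn(*x.shape)               # Gaussian search direction
        p_plus = query_prob(x + sigma * u, target)  # antithetic pair: +u
        p_minus = query_prob(x - sigma * u, target) # antithetic pair: -u
        grad += (p_plus - p_minus) * u
    return grad / (n_samples * sigma)

def pgd_targeted_attack(x_orig, target, query_prob, eps=0.05, eta=0.01, max_iters=1000):
    """Targeted l_inf PGD driven by NES gradient estimates (query-limited sketch)."""
    x_adv = x_orig.copy()
    for _ in range(max_iters):
        g = nes_gradient(x_adv, target, query_prob)
        x_adv = x_adv + eta * np.sign(g)                    # ascend the target-class probability
        x_adv = np.clip(x_adv, x_orig - eps, x_orig + eps)  # project onto the l_inf ball
        x_adv = np.clip(x_adv, 0.0, 1.0)                    # keep a valid image
    return x_adv
```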
Partial-information Setting
The partial-information attack starts with an instance of the target class and iteratively blends it with the original image while maximizing the target class's likelihood. This method adapts NES to work under partial information, achieving high success rates even when the attacker only sees the top probability score. This strategy culminated in a successful targeted attack on the Google Cloud Vision API, showcasing its applicability to large-scale, commercial classifiers that provide limited feedback.
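A rough sketch of that blending strategy follows, under the same caveats: `query_topk` and `nes_grad` are assumed helpers standing in for the API's top-k output and an NES estimator adapted to the exposed top-k scores, and the fixed epsilon-decay schedule is a simplification of the paper's adaptive version.

```python
import numpy as np

def partial_info_attack(x_orig, x_target_instance, target, query_topk, nes_grad,
                        eps_start=0.5, eps_final=0.05, delta_eps=0.001,
                        eta=0.01, max_iters=5000):
    """Sketch of the partial-information attack: start from an image of the
    target class and alternately (a) take NES-PGD steps that raise the target
    class's score and (b) shrink the l_inf ball around the original image,
    keeping the target class visible in the observed top-k list throughout."""
    eps = eps_start
    x_adv = np.clip(x_target_instance, x_orig - eps, x_orig + eps)
    for _ in range(max_iters):
        # gradient step that increases the target class's score
        x_adv = x_adv + eta * np.sign(nes_grad(x_adv, target))
        x_adv = np.clip(x_adv, x_orig - eps, x_orig + eps)
        x_adv = np.clip(x_adv, 0.0, 1.0)
        # tighten the ball only while the target class stays in the top-k
        if target in query_topk(x_adv):
            if eps <= eps_final:
                break
            eps = max(eps - delta_eps, eps_final)
    return x_adv
```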
Label-only Setting
In the label-only setting, the authors introduce a ranking-based surrogate score to characterize the adversarial strength of perturbations. By incorporating robustness to random noise as a proxy for classification confidence, the attack iteratively maximizes this surrogate score until the adversarial class ranks highest. This method effectively generates adversarial examples even when the model only outputs the top-k labels, with no probability scores or confidence measures available.
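A hedged sketch of such a noise-robust, rank-based proxy is shown below; `query_topk` is again an assumed label-only oracle returning an ordered list of the top-k labels, and the constants are illustrative only.

```python
import numpy as np

def label_only_score(x, target, query_topk, k=5, n_samples=25, mu=0.01):
    """Noise-robust surrogate for the target class's confidence in the
    label-only setting: average a rank-based score of the target class over
    randomly perturbed copies of x. Higher values indicate a more robustly
    adversarial image, so the attack maximizes this quantity."""
    total = 0.0
    for _ in range(n_samples):
        noisy = x + mu * np.random.randn(*x.shape)  # small random perturbation
        labels = query_topk(noisy)                  # ordered top-k labels only
        # k minus the target's position in the list; 0 if it drops out of the top-k
        total += (k - labels.index(target)) if target in labels else 0.0
    return total / n_samples
```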
Empirical Evaluation and Results
The evaluation of these methods is performed on a pre-trained InceptionV3 network using the ImageNet dataset. The attacks are tested on 1,000 randomly chosen images, each paired with a target class, and achieve high success rates across all settings:
- Query-limited setting: 99.2% success with significantly fewer queries compared to baseline methods.
- Partial-information setting: 93.6% success rate with k=1, where the attacker only has access to the top class probability.
- Label-only setting: 90% success rate under the most restricted conditions.
Additionally, the targeted attack on the Google Cloud Vision API demonstrated the algorithm's robustness and practicality by causing the API to misclassify an image of skiers as a dog with minimal perturbation, thus substantiating the method's efficacy in a partial-information setting.
Implications and Future Directions
The findings of this paper underscore the persistent vulnerabilities of machine learning systems to adversarial attacks, even under restrictive information and query budgets. Practically, this work suggests the need for improved defensive mechanisms in real-world applications where access to model details and outputs is limited. Theoretically, the approach introduces robust methodologies for conducting black-box attacks, potentially influencing future research directions in adversarial defense strategies and secure machine learning practices.
Speculating on future developments, research might focus on developing algorithms that can preemptively identify or mitigate adversarial examples generated under these constrained settings. Furthermore, extending these methods to domains beyond image classification, such as natural language processing or autonomous systems, could provide comprehensive insights into the broader robustness of machine learning models against adversarial threats.
This paper provides a robust framework for understanding and addressing black-box adversarial threats, contributing significantly to the ongoing discourse on the security and reliability of machine learning systems.