
BERT-ATTACK: Adversarial Attack Against BERT Using BERT

Published 21 Apr 2020 in cs.CL (arXiv:2004.09984v3)

Abstract: Adversarial attacks for discrete data (such as texts) have been proved significantly more challenging than continuous data (such as images) since it is difficult to generate adversarial samples with gradient-based methods. Current successful attack methods for texts usually adopt heuristic replacement strategies on the character or word level, which remains challenging to find the optimal solution in the massive space of possible combinations of replacements while preserving semantic consistency and language fluency. In this paper, we propose BERT-Attack, a high-quality and effective method to generate adversarial samples using pre-trained masked language models exemplified by BERT. We turn BERT against its fine-tuned models and other deep neural models in downstream tasks so that we can successfully mislead the target models to predict incorrectly. Our method outperforms state-of-the-art attack strategies in both success rate and perturb percentage, while the generated adversarial samples are fluent and semantically preserved. Also, the cost of calculation is low, thus possible for large-scale generations. The code is available at https://github.com/LinyangLee/BERT-Attack.

Citations (641)

Summary

  • The paper presents BERT-Attack, a novel method that leverages BERT to identify key vulnerable words and generate context-aware adversarial text.
  • The approach achieves impressive results, reducing BERT model accuracy to under 10% while perturbing less than 10% of the input text.
  • BERT-Attack offers both computational efficiency and practical insights for enhancing the robustness of NLP models against adversarial threats.

BERT-Attack: Adversarial Attack Against BERT Using BERT

The paper "BERT-ATTACK: Adversarial Attack Against BERT Using BERT" presents a methodology for generating adversarial examples targeting BERT-based models. The research addresses the challenge of crafting adversarial samples for discrete data, such as text, which is notably more complex than continuous data, like images, due to the discrete nature and syntactic nuances of language.

Summary of the Approach

The proposed method, BERT-Attack, employs BERT as both the attacker and the target. Leveraging BERT's masked language modeling capability, the authors introduce a two-step attack strategy that achieves both high success rates and semantic coherence in the generated adversarial samples.

  1. Identifying Vulnerable Words: The method begins by pinpointing the words in the input text that matter most for the model's prediction. Words are ranked by how much the prediction score drops when each is masked, and only the highest-ranked words are considered for perturbation.
  2. Word Replacement: For each vulnerable word, BERT's masked language model proposes contextually consistent substitutes, yielding grammatically correct and semantically meaningful perturbations (a minimal code sketch of both steps follows this list).
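
The sketch below illustrates the two steps with a Hugging Face masked LM. It is a simplified illustration, not the authors' released implementation (see the linked repository for that): `target_model(text)` is assumed to be a black-box callable returning a probability vector over labels, helper names such as `word_importance`, `mlm_candidates`, and `bert_attack` are invented here, and candidates are obtained by re-masking each position separately rather than in a single forward pass as in the paper.

```python
# Minimal sketch of the two-step BERT-Attack procedure described above.
# Assumptions (not from the paper's code): `target_model(text)` returns a
# torch tensor of label probabilities; all helper names are illustrative.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
mlm = BertForMaskedLM.from_pretrained("bert-base-uncased")
mlm.eval()


def word_importance(words, target_model, true_label):
    """Step 1: rank words by how much masking each one lowers the target
    model's probability for the correct label."""
    base = target_model(" ".join(words))[true_label]
    drops = []
    for i in range(len(words)):
        masked = words[:i] + [tokenizer.mask_token] + words[i + 1:]
        drops.append(base - target_model(" ".join(masked))[true_label])
    return sorted(range(len(words)), key=lambda i: drops[i], reverse=True)


def mlm_candidates(words, pos, k=48):
    """Step 2: let BERT's masked LM propose top-k context-aware substitutes
    for the word at position `pos`."""
    masked = words[:pos] + [tokenizer.mask_token] + words[pos + 1:]
    inputs = tokenizer(" ".join(masked), return_tensors="pt")
    with torch.no_grad():
        logits = mlm(**inputs).logits[0]
    mask_idx = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    top_ids = logits[mask_idx].topk(k).indices.tolist()
    return tokenizer.convert_ids_to_tokens(top_ids)


def bert_attack(text, target_model, true_label, max_perturb_ratio=0.4):
    """Greedily replace the most important words until the prediction flips."""
    words = text.split()
    budget = max(1, int(len(words) * max_perturb_ratio))
    for pos in word_importance(words, target_model, true_label)[:budget]:
        best, best_prob = None, target_model(" ".join(words))[true_label]
        for cand in mlm_candidates(words, pos):
            if cand == words[pos] or cand.startswith("##") or not cand.isalpha():
                continue  # skip no-ops, sub-word pieces, and punctuation tokens
            trial = words[:pos] + [cand] + words[pos + 1:]
            probs = target_model(" ".join(trial))
            if probs.argmax().item() != true_label:
                return " ".join(trial)  # prediction flipped: attack succeeded
            if probs[true_label] < best_prob:
                best, best_prob = trial, probs[true_label]
        if best is not None:
            words = best  # keep the most damaging substitution and continue
    return None  # no successful adversarial example within the budget
```

The key design point is that the replacement candidates come from the masked LM conditioned on the full sentence, so substitutions tend to remain fluent and context-appropriate rather than being drawn from a fixed synonym list.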

Evaluation and Results

The evaluation of BERT-Attack demonstrates its effectiveness and efficiency across multiple NLP tasks, including text classification and natural language inference. The paper reports several strong numerical outcomes (a rough sketch of how the headline metrics are computed follows this list):

  • Attack Success Rate: The attack substantially degrades the prediction accuracy of fine-tuned BERT models, driving accuracy below 10% on certain tasks.
  • Perturbation Efficiency: The fraction of words perturbed is notably low, often under 10%, which supports the semantic preservation of the adversarial examples.
  • Computational Efficiency: BERT-Attack is substantially cheaper than prior attack strategies, outperforming them in both computational cost and the number of queries issued to the target model.
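
For concreteness, the two headline metrics can be computed roughly as in the sketch below. The data layout (a list of original/adversarial/label triples) and the `target_model` callable are assumptions carried over from the previous sketch, not an evaluation script from the paper.

```python
# Rough sketch of attack success rate (fraction of predictions flipped) and
# average perturbation percentage (fraction of words changed); data layout
# and `target_model` are assumed, as in the earlier sketch.
def evaluate_attacks(examples, target_model):
    """`examples` is a list of (original_text, adversarial_text_or_None, true_label)."""
    flipped, perturb_ratios = 0, []
    for original, adversarial, true_label in examples:
        if adversarial is None:
            continue  # failed attack: counts against the success rate
        if target_model(adversarial).argmax().item() != true_label:
            flipped += 1
            orig_words, adv_words = original.split(), adversarial.split()
            changed = sum(o != a for o, a in zip(orig_words, adv_words))
            perturb_ratios.append(changed / len(orig_words))
    success_rate = flipped / len(examples)
    avg_perturb = sum(perturb_ratios) / max(len(perturb_ratios), 1)
    return success_rate, avg_perturb
```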

Implications and Future Directions

The implications of this work are multifaceted, spanning both practical applications and theoretical considerations in adversarial robustness. Practically, the method provides an efficient and precise tool for testing and potentially augmenting the robustness of NLP models. Theoretically, it underscores the vulnerabilities present in even state-of-the-art models like BERT, highlighting the need for continued exploration into adversarial attacks specific to discrete data.

The paper suggests several future directions and refinements, including improving semantic coherence by enhancing the masked language model so that it avoids proposing antonyms or unrelated terms. Moreover, the adaptability of BERT-Attack across diverse models implies potential for broader applications and for developing models that are more resistant to such adversarial strategies.

This research contributes to the ongoing dialogue on the resilience of deep learning systems, specifically within NLP, against adversarial threats and underscores the necessity for developing sophisticated countermeasures. Overall, the work is a substantial addition to the existing literature on adversarial machine learning, particularly within the domain of natural language processing.
