- The paper introduces TextFooler, a novel framework that generates adversarial text examples to expose vulnerabilities in BERT and other models.
- The methodology combines word importance ranking with meaning-preserving word substitutions so that the resulting adversarial examples satisfy human prediction consistency, semantic similarity, and language fluency.
- Experimental results reveal drastic accuracy drops, such as 92.2% to 6.6% on IMDB and 90.7% to 4.0% on SNLI, underlining the need for robust defenses.
Analyzing the Robustness of BERT: A Study of TextFooler for Natural Language Attacks
In the current landscape of NLP, the robustness of state-of-the-art models like BERT against adversarial attacks is a subject of significant concern. The paper "Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment" by Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits addresses this issue by introducing TextFooler, a method for generating adversarial text examples that target BERT as well as convolutional (CNN) and recurrent (RNN, e.g., LSTM) neural networks.
Overview
TextFooler proposes an adversarial attack framework designed for text data, which is inherently challenging due to its discrete nature. The authors emphasize three critical criteria for generating effective adversarial text:
- Human Prediction Consistency: Human judges should assign the adversarial text the same label as the original.
- Semantic Similarity: The adversarial text must preserve the meaning of the original content (a toy version of the similarity filter is sketched after this list).
- Language Fluency: The generated text should remain grammatically correct and natural.
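To illustrate how the semantic-similarity criterion can be enforced, the sketch below accepts a rewritten sentence only if it stays close to the original in embedding space. The paper measures this with Universal Sentence Encoder embeddings; the bag-of-words vectors and the 0.7 threshold here are stand-ins chosen purely to keep the example self-contained and dependency-free.

```python
# Toy semantic-similarity filter. The paper embeds sentences with the
# Universal Sentence Encoder; a bag-of-words count vector stands in here
# so the example has no external dependencies.
import math
from collections import Counter


def bow_vector(sentence: str) -> Counter:
    """Very crude sentence 'embedding': lowercase token counts."""
    return Counter(sentence.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def passes_similarity(original: str, candidate: str, threshold: float = 0.7) -> bool:
    """Keep a candidate only if it stays close to the original sentence."""
    return cosine_similarity(bow_vector(original), bow_vector(candidate)) >= threshold


print(passes_similarity("the movie was wonderful", "the movie was fine"))       # True
print(passes_similarity("the movie was wonderful", "the plot makes no sense"))  # False
```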
Methodology
The framework operates in a black-box setting: it needs only the model's output predictions and confidence scores, not its architecture, parameters, or gradients. TextFooler involves a two-step process:
- Word Importance Ranking: Each word is scored by how much the model's prediction changes when that word is deleted from the input, and words are ranked by this score.
- Word Replacement: The highest-ranked words are replaced, one at a time, with semantically similar candidates that keep the sentence grammatical and close in meaning to the original, until the model's prediction changes (a simplified sketch of both steps follows).
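To make the two steps concrete, here is a heavily simplified, self-contained sketch of the attack loop. The toy keyword classifier stands in for the black-box target model (BERT in the paper), the hand-written synonym table stands in for the paper's synonym extraction via nearest neighbors in counter-fitted word embeddings, and the part-of-speech and sentence-level similarity filters are omitted for brevity.

```python
# Simplified sketch of TextFooler's two-step loop: rank words by the effect of
# deleting them, then swap the most important ones for synonyms until the
# predicted label flips. The predictor and synonym table are illustrative
# stand-ins, not the paper's actual components.
from typing import Callable, Dict, List, Tuple


def toy_predict(words: List[str]) -> float:
    """Stand-in 'black-box' classifier: P(positive) from crude keyword counts."""
    pos = sum(w in {"great", "wonderful", "fine"} for w in words)
    neg = sum(w in {"terrible", "awful", "boring"} for w in words)
    return (1 + pos) / (2 + pos + neg)


SYNONYMS: Dict[str, List[str]] = {   # toy stand-in for embedding-based synonym lookup
    "wonderful": ["nice", "fine"],
    "terrible": ["awful", "poor"],
}


def word_importance(words: List[str],
                    predict: Callable[[List[str]], float]) -> List[Tuple[float, int]]:
    """Step 1: score each word by how much deleting it changes the prediction."""
    base = predict(words)
    scores = []
    for i in range(len(words)):
        reduced = words[:i] + words[i + 1:]
        scores.append((abs(base - predict(reduced)), i))
    return sorted(scores, reverse=True)   # most important words first


def attack(text: str, predict: Callable[[List[str]], float]) -> str:
    """Step 2: replace important words with synonyms until the label flips."""
    words = text.split()
    orig_label = predict(words) > 0.5
    for _, i in word_importance(words, predict):
        candidates = SYNONYMS.get(words[i], [])
        if not candidates:
            continue
        # Pick the synonym that pushes the score furthest toward the other label.
        scored = []
        for cand in candidates:
            trial = words[:i] + [cand] + words[i + 1:]
            p = predict(trial)
            scored.append((-p if orig_label else p, cand))
        words[i] = max(scored)[1]
        if (predict(words) > 0.5) != orig_label:
            break   # label flipped: attack succeeded
    return " ".join(words)


print(attack("a wonderful film", toy_predict))   # -> "a nice film"
```

In the full method, a candidate replacement is kept only if it passes the part-of-speech and semantic-similarity checks described above; this sketch skips those filters to stay short.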
Across five text classification tasks and two textual entailment tasks, the authors demonstrate that TextFooler efficiently reduces model accuracy while perturbing only a small fraction of words. The text classification datasets are AG's News, Fake News, MR, IMDB, and Yelp; the textual entailment datasets are SNLI and MultiNLI.
Experimental Results
The experimental results show that TextFooler reliably misleads models across datasets. For example, on the IMDB dataset, the attack reduced BERT's accuracy from 92.2% to 6.6% while perturbing only 6.1% of the words. Similarly strong results were observed on other datasets such as SNLI, where accuracy dropped from 90.7% to 4.0%.
Practical and Theoretical Implications
The findings of this paper are significant both practically and theoretically. From a practical perspective, the vulnerability of BERT and other advanced models underscores the necessity for improved adversarial robustness in model deployment, especially in sensitive applications like fake news detection. Theoretically, the results provide insights into model interpretability, highlighting the crucial words and phrases that contribute to model decisions.
Future Directions
Looking ahead, the potential development of more sophisticated adversarial training techniques could significantly enhance model robustness. By incorporating the generated adversarial examples into the training process, models may become more resilient against such attacks. Furthermore, expanding the methods for automatic semantic similarity evaluation and grammar checking could refine the quality of adversarial examples, making them even harder for models to detect.
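As a rough sketch of the adversarial-training idea described above, the helper below pairs every clean training example with an attacked copy under the same gold label; `attack_fn` is a placeholder for any attack (for instance a TextFooler-style one), and the model and retraining loop are assumed rather than shown.

```python
# Minimal sketch of adversarial data augmentation: every clean example is paired
# with an attacked copy carrying the same gold label, and the enlarged set is
# used for retraining. `attack_fn` is a placeholder; the model and training
# loop are assumed, not shown.
from typing import Callable, List, Sequence, Tuple


def adversarially_augment(texts: Sequence[str],
                          labels: Sequence[int],
                          attack_fn: Callable[[str], str]) -> Tuple[List[str], List[int]]:
    aug_texts, aug_labels = list(texts), list(labels)
    for text, label in zip(texts, labels):
        aug_texts.append(attack_fn(text))   # perturbed input
        aug_labels.append(label)            # gold label unchanged
    return aug_texts, aug_labels
```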
Conclusion
The authors' contributions provide a substantial advancement in our understanding of model robustness in NLP. The introduction of TextFooler reveals critical vulnerabilities in current models and sets a strong foundation for future research aimed at bolstering the defenses of NLP systems against adversarial attacks. The open-sourcing of the code and resources further facilitates ongoing research and benchmarking in the field. As the field progresses, these insights will be vital for developing more secure and reliable AI systems.