- The paper introduces TextBugger, a framework that generates adversarial texts using character- and word-level perturbations in both white-box and black-box settings.
- It demonstrates high effectiveness by achieving a 100% success rate on the IMDB dataset with 97% semantic similarity, outperforming existing methods.
- The study highlights the need for robust defenses in security-sensitive applications, recommending adversarial training and enhanced linguistic processing techniques.
Adversarial Text Attacks with TextBugger Framework
The paper "TextBugger: Generating Adversarial Text Against Real-world Applications" presents a technical exploration of adversarial attacks on deep learning models employed in text classification tasks. The authors introduce a framework named TextBugger, designed to effectively and efficiently generate adversarial texts that maintain semantic similarity with the original text while misleading classification systems.
Background and Motivation
Recent advancements in deep neural networks (DNNs) have significantly improved the performance of systems tasked with text classification, sentiment analysis, and toxic content detection. Despite these advancements, DNN models are vulnerable to adversarial examples—inputs deliberately modified to cause misclassification. This vulnerability is critical given the deployment of DNNs in security-sensitive applications such as spam filtering and recommendation systems.
Methodology
The core contribution of the paper is the TextBugger framework, which generates adversarial text using both character-level and word-level perturbations. It operates under white-box and black-box settings:
- White-box Attacks: Assume complete knowledge of the model, including its gradients. The attack computes the Jacobian of the classifier's confidence with respect to the input to rank words by importance, then applies perturbations to the highest-ranked words first.
- Black-box Attacks: Rely on minimal information about the model, typically just its prediction scores. The attack first identifies the most important sentences, then ranks the words within them by the confidence drop observed when each word is removed (a minimal scoring sketch follows this list).
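The black-box scoring step can be sketched roughly as follows. This is a minimal illustration of the idea rather than the authors' implementation; `predict_proba` is a hypothetical stand-in for a query to the target classifier that returns a probability per class.

```python
# Minimal sketch of TextBugger-style black-box importance scoring.
# `predict_proba` is a hypothetical stand-in for querying the target model;
# it is assumed to return a sequence of class probabilities.

def rank_sentences(sentences, label, predict_proba):
    """Rank sentences by how strongly each one alone supports the true label."""
    scores = [predict_proba(s)[label] for s in sentences]
    return sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)

def rank_words(words, label, predict_proba):
    """Rank words by the confidence drop observed when each word is removed."""
    base = predict_proba(" ".join(words))[label]
    drops = []
    for i in range(len(words)):
        reduced = words[:i] + words[i + 1:]
        drops.append(base - predict_proba(" ".join(reduced))[label])
    return sorted(range(len(words)), key=lambda i: drops[i], reverse=True)
```

Because only prediction scores are needed, this ranking can be computed against a deployed API without any access to gradients or parameters.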
TextBugger employs five perturbation strategies: insertion, deletion, swapping, substitution with visually similar characters, and substitution with semantically similar words.
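As a rough illustration of these bug types, the sketch below generates one candidate perturbation per strategy. The homoglyph map and the `nearest_neighbors` word-substitution helper are illustrative placeholders, not the paper's exact character table or embedding space.

```python
import random

# Illustrative sketch of the five TextBugger bug types. HOMOGLYPHS and
# `nearest_neighbors` are hypothetical stand-ins for the paper's visually
# similar characters and embedding-based word neighbors.

HOMOGLYPHS = {"o": "0", "l": "1", "a": "@", "i": "1", "e": "3"}

def insert_bug(word):
    """Insert a space inside the word so tokenizers split it."""
    if len(word) < 2:
        return word
    pos = random.randint(1, len(word) - 1)
    return word[:pos] + " " + word[pos:]

def delete_bug(word):
    """Delete a random interior character."""
    if len(word) < 3:
        return word
    pos = random.randint(1, len(word) - 2)
    return word[:pos] + word[pos + 1:]

def swap_bug(word):
    """Swap two adjacent interior characters."""
    if len(word) < 4:
        return word
    pos = random.randint(1, len(word) - 3)
    return word[:pos] + word[pos + 1] + word[pos] + word[pos + 2:]

def sub_c_bug(word):
    """Replace a character with a visually similar one."""
    for i, ch in enumerate(word):
        if ch in HOMOGLYPHS:
            return word[:i] + HOMOGLYPHS[ch] + word[i + 1:]
    return word

def sub_w_bug(word, nearest_neighbors):
    """Replace the word with a semantically similar word (e.g. from embeddings)."""
    candidates = nearest_neighbors(word)
    return candidates[0] if candidates else word
```

The character-level bugs are designed so that the perturbed word likely falls out of the model's vocabulary while remaining readable to a human, whereas the word-level substitution keeps the text in-vocabulary but shifts it toward a semantically similar neighbor.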
Results and Evaluation
The paper presents comprehensive evaluations across various datasets: IMDB, Rotten Tomatoes, and the Kaggle Toxic Comment Classification dataset. TextBugger achieves:
- A 100% success rate on the IMDB dataset against Amazon AWS Comprehend, while preserving 97% semantic similarity.
- High success rates across multiple real-world platforms, outperforming existing methods such as DeepWordBug, especially under black-box conditions.
The evaluation covered metrics such as similarity to the original text, computation time, and attack effectiveness across platforms. TextBugger also demonstrated transferability: adversarial texts generated against one model could mislead other models.
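One common way to realize the similarity metric is cosine similarity between sentence embeddings of the original and adversarial text. The sketch below assumes a hypothetical `embed` function standing in for any sentence encoder; it is not the paper's exact evaluation code.

```python
import numpy as np

# Hedged sketch of a semantic-similarity check between original and
# adversarial text. `embed` is a hypothetical sentence encoder returning a
# fixed-size vector; any encoder producing comparable embeddings would do.

def semantic_similarity(original, adversarial, embed):
    u, v = embed(original), embed(adversarial)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```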
Practical Implications
The existence of these adversarial examples highlights the need for robust defense mechanisms. The paper discusses potential defenses such as spell checking and adversarial training, though these have limitations. Adversarial training can increase model robustness, but it requires a large supply of adversarial examples, which is often impractical against unknown attack methods.
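A spell-checking defense can be sketched as a preprocessing pass that maps out-of-vocabulary tokens back to their closest in-vocabulary word before classification. The vocabulary and the similarity cutoff below are illustrative assumptions, not the paper's exact setup.

```python
import difflib

# Minimal sketch of a spell-checking defense: correct out-of-vocabulary tokens
# to the closest in-vocabulary word before the text reaches the classifier.
# The vocabulary and the 0.8 cutoff are illustrative assumptions.

def autocorrect(text, vocabulary):
    corrected = []
    for token in text.split():
        if token.lower() in vocabulary:
            corrected.append(token)
            continue
        match = difflib.get_close_matches(token.lower(), vocabulary, n=1, cutoff=0.8)
        corrected.append(match[0] if match else token)
    return " ".join(corrected)

# Example: autocorrect("this movie was terrib1e", {"this", "movie", "was", "terrible"})
# returns "this movie was terrible", undoing a character-level bug.
```

Such a pass can undo simple character-level bugs, but it does nothing against word-level substitutions that remain in-vocabulary, which is one reason the paper treats it as only a partial defense.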
Directions for Future Work
The research prompts several future directions:
- Developing more sophisticated algorithms incorporating linguistic processing techniques.
- Extending adversarial attack methods to targeted attacks.
- Investigating ensemble-based defense mechanisms to improve model resilience.
Conclusion
The paper establishes the efficacy of TextBugger in creating adversarial texts that challenge state-of-the-art classifiers while preserving the original text's semantic integrity. The implications for the field are significant, underscoring the need for enhanced security strategies when deploying machine learning applications for text processing.