TextBugger: Adversarial Text Attack Framework
- TextBugger is an adversarial attack framework that perturbs text at the character and word level to induce misclassification in deep learning-based text understanding systems while maintaining semantic similarity.
- It operates in both white-box and black-box settings, employing five targeted bug generation techniques to efficiently reduce classifier confidence.
- Empirical evaluations demonstrate success rates up to 100% with minimal modifications, highlighting significant challenges for current defensive strategies.
TextBugger is a general attack framework for generating adversarial texts designed to induce misclassification in deep learning-based text understanding (DLTU) systems while preserving high semantic similarity and human utility. TextBugger demonstrates the susceptibility of both white-box and black-box DLTU classifiers—including leading sentiment analysis and toxic content detection services—to subtle textual perturbations. Its operational philosophy is to alter text minimally so that the model's predicted label changes while the input remains semantically and perceptually similar to the original, with most changes undetected even by human readers (Li et al., 2018).
1. Adversarial Example Formulation
TextBugger targets a pretrained classifier $F: \mathcal{X} \to \mathcal{Y}$, where $\mathcal{X}$ denotes the document space (as token sequences) and $\mathcal{Y}$ the set of predefined labels. Given benign input $x \in \mathcal{X}$ with label $y = F(x)$, the adversary seeks $x'$ such that
- $F(x') = y'$ for some $y' \neq y$ (misclassification), and
- $S(x, x') \geq \epsilon$ (utility-preservation), where $S$ denotes semantic similarity and the threshold $\epsilon$ is typically set to $0.8$.
Distance measures (“Dist”) include edit distance, token modification fraction, and Jaccard similarity. Semantic similarity is operationalized by embedding sentences with the Universal Sentence Encoder and applying cosine similarity:

$$S(x, x') = \cos\big(\mathbf{e}(x), \mathbf{e}(x')\big) = \frac{\mathbf{e}(x) \cdot \mathbf{e}(x')}{\lVert \mathbf{e}(x) \rVert \, \lVert \mathbf{e}(x') \rVert},$$

where $\mathbf{e}(\cdot)$ denotes the sentence embedding.
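Under these definitions, the utility constraint reduces to a cosine check on sentence-embedding vectors. A minimal sketch in plain NumPy (the embedding function itself, e.g. the Universal Sentence Encoder, is assumed to be supplied externally):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two sentence-embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def is_utility_preserving(emb_orig, emb_adv, epsilon=0.8):
    """Accept an adversarial text only if S(x, x') >= epsilon."""
    return cosine_similarity(emb_orig, emb_adv) >= epsilon
```

In the attack loop this check gates every candidate perturbation before it is committed.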
2. Attack Framework and Workflow
TextBugger operates in both white-box (full gradient access) and black-box (only label/confidence query access) settings. The workflow comprises the following:
- Identification of "important" units (words or sentences) based on impact on prediction confidence.
- For each unit, generation of candidate perturbations (“bugs”) through five transformation strategies.
- Scoring each bug by the reduction in classifier confidence for the original label; the most effective candidate is selected.
- Sequential application of bugs, with the process terminating when misclassification is achieved or semantic similarity drops below threshold.
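The greedy loop above can be sketched as follows; `confidence`, `gen_bugs`, `similarity`, and `label_changed` are hypothetical callables standing in for the target classifier, the five bug generators, the Universal Sentence Encoder check, and the success test:

```python
def textbugger_attack(tokens, confidence, gen_bugs, similarity,
                      label_changed, epsilon=0.8, max_words=None):
    """Greedy TextBugger-style loop (sketch).

    confidence(tokens)    -> classifier confidence for the ORIGINAL label
    gen_bugs(word)        -> list of candidate perturbations of one word
    similarity(a, b)      -> semantic similarity of two token lists, in [0, 1]
    label_changed(tokens) -> True once the predicted label has flipped
    """
    original = list(tokens)

    # Rank words by importance: confidence drop when the word is removed
    # (the black-box criterion; a white-box attack would use gradients).
    def importance(i):
        return confidence(original) - confidence(original[:i] + original[i + 1:])

    order = sorted(range(len(original)), key=importance, reverse=True)

    adv = list(tokens)
    for i in order[:max_words]:
        candidates = gen_bugs(adv[i])
        if not candidates:
            continue
        # Pick the bug that most reduces confidence in the original label.
        best = min(candidates,
                   key=lambda b: confidence(adv[:i] + [b] + adv[i + 1:]))
        trial = adv[:i] + [best] + adv[i + 1:]
        if similarity(original, trial) < epsilon:
            continue  # utility constraint violated; leave this word alone
        adv = trial
        if label_changed(adv):
            return adv  # success: misclassified yet similar to the original
    return None  # attack failed within the word budget
```

The loop terminates exactly as described: on misclassification, or when no bug can be applied without dropping below the similarity threshold.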
White-box Algorithm
- Word importance is computed from the Jacobian of the true-class confidence with respect to each token, $C_{x_i} = \partial F_y(x) / \partial x_i$.
- Candidate bugs are generated for the most important words.
- The optimal bug (the one maximally reducing the confidence $F_y$ assigned to the true label) is substituted, success is checked, and iterations continue as needed.
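For a concrete white-box case, consider logistic regression over bag-of-words features, where the Jacobian has a closed form. This sketch is illustrative only; TextBugger's white-box attack applies the same idea to arbitrary differentiable classifiers via automatic differentiation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def word_importance_whitebox(x, w, b=0.0):
    """Gradient of the positive-class confidence F(x) = sigmoid(w.x + b)
    with respect to each bag-of-words feature x_i.  For this model the
    Jacobian has the closed form dF/dx_i = p * (1 - p) * w_i, so ranking
    by |gradient| ranks words by how fast editing them moves the
    classifier's confidence."""
    p = sigmoid(np.dot(w, x) + b)
    return p * (1.0 - p) * w
```

Words are then attacked in descending order of gradient magnitude.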
Black-box Algorithm
- Stepwise importance: first at the sentence level, then at word level within top sentences.
- Importance scores derive from confidence drop upon token removal.
- The same bug generation and selection strategy is enforced, but without gradient information.
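The leave-one-out ranking used in both black-box stages can be sketched generically; `confidence` is a hypothetical black-box query returning the classifier's confidence in the original label:

```python
def rank_units(units, confidence, join=" ".join):
    """Leave-one-out importance ranking (black-box): score each unit
    (sentence, then word within the top sentences) by the confidence
    drop observed when it is removed, and return the unit indices
    sorted most-important-first."""
    base = confidence(join(units))

    def drop(i):
        return base - confidence(join(units[:i] + units[i + 1:]))

    return sorted(range(len(units)), key=drop, reverse=True)
```

Applying the same function first to sentences and then to words within the top-ranked sentences keeps the query count well below one query per token of the full document.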
3. Bug Generation Techniques
For each important word, exactly five candidate perturbations (bugs) are considered:
- Insert: Insert a space at a random interior position of the word (words of at least 6 characters).
- Delete: Randomly delete a non-endpoint character.
- Swap: Swap two adjacent non-endpoint characters (words longer than 4 characters only).
- Sub-C: Replace a character with a visually or keyboard-adjacent character, e.g., 'o' → '0'.
- Sub-W: Substitute the token with one of its closest semantic neighbors in the pre-trained GloVe embedding space, if similarity remains high.
The single bug inducing the maximal decrease in classifier confidence is applied before proceeding to the next word.
| Perturbation | Level | Description |
|---|---|---|
| Insert | Character | Add a space into the word (≥ 6 chars) |
| Delete | Character | Remove random non-endpoint character |
| Swap | Character | Swap two adjacent non-endpoint letters (> 4 chars) |
| Sub-C | Character | Substitute with visually/keyboard-adjacent character |
| Sub-W | Word | Replace with most similar word (embedding neighbor) |
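The five generators can be sketched as below. `SUBC_MAP` and the `neighbours` lookup are small illustrative stand-ins for the paper's look-alike character table and the GloVe nearest-neighbour search, not the actual mappings used:

```python
import random

# Illustrative look-alike substitutions for Sub-C (not the paper's full table).
SUBC_MAP = {"o": "0", "l": "1", "i": "1", "a": "@", "e": "3", "s": "$"}

def bug_insert(word):
    """Insert a space at a random interior position (words >= 6 chars)."""
    if len(word) < 6:
        return None
    i = random.randint(1, len(word) - 1)
    return word[:i] + " " + word[i:]

def bug_delete(word):
    """Delete a random non-endpoint character."""
    if len(word) < 3:
        return None
    i = random.randint(1, len(word) - 2)
    return word[:i] + word[i + 1:]

def bug_swap(word):
    """Swap two adjacent non-endpoint characters (words > 4 chars)."""
    if len(word) <= 4:
        return None
    i = random.randint(1, len(word) - 3)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def bug_sub_c(word):
    """Replace the first substitutable character with a look-alike."""
    for i, ch in enumerate(word):
        if ch in SUBC_MAP:
            return word[:i] + SUBC_MAP[ch] + word[i + 1:]
    return None

def bug_sub_w(word, neighbours):
    """Replace the word with a nearby embedding neighbour (here a
    caller-supplied dict standing in for GloVe nearest neighbours)."""
    return neighbours.get(word)

def generate_bugs(word, neighbours=None):
    """All five candidate bugs for one word, skipping inapplicable ones."""
    bugs = [bug_insert(word), bug_delete(word), bug_swap(word),
            bug_sub_c(word), bug_sub_w(word, neighbours or {})]
    return [b for b in bugs if b is not None]
```

Each candidate is then scored by the confidence drop it induces, and only the single best bug per word is kept.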
4. Computational Efficiency
The design ensures sublinear query complexity with respect to sequence length $n$:
- White-box: $O(n)$ for the full-Jacobian computation, but early stopping is typical (only a few tokens are modified).
- Black-box: Processing is typically limited to the top-ranked sentences and the most important tokens within them, so total complexity is sublinear in $n$.
In empirical trials, adversarial sample generation times spanned milliseconds (fastText) to tens of seconds (cloud APIs), significantly outpacing random or exhaustive search strategies.
5. Evaluation and Empirical Findings
Experimental evaluation encompasses sentiment analysis (IMDB, Rotten Tomatoes MR) and toxic content detection (Kaggle Toxic Comment), offline models (LR, CNN, LSTM), and several commercial/cloud APIs (Google, AWS, Azure, etc.). Outcomes include:
- IMDB, LR (white-box): 95.2% attack success with ~4.9% of words perturbed, compared with ≤41% for FGSM+NNS and DeepFool+NNS.
- IMDB, AWS Comprehend (black-box): 100% success, 4.61s/sample, only 1.2% average word perturbation, semantic similarity ~0.97.
- IMDB, Azure (black-box): 100% success, 23.01s/sample, 5.7% word perturbation.
- Perspective API (toxicity): 60.1% success (TextBugger) vs 33.5% (DeepWordBug), perturbing 5.6% tokens.
Utility preservation is empirically supported:
- Over 90% of adversarial texts preserve semantic similarity of at least $0.8$.
- MTurk human studies indicate that 94.9% of adversarial samples receive the same label as the original, and only 30% of bugs are detected.
Transferability is demonstrated: adversarial texts crafted for one model induce misclassification in other systems at rates exceeding 40% in some cases.
6. Defense Strategies and Adversarial Robustness
Two mitigation approaches are considered:
- Spelling Correction: Cloud-based spell-checkers (e.g., the Azure Spell-Check API) can partially repair Insert/Delete/Swap/Sub-C perturbations, reducing black-box attack rates (e.g., IMDB/AWS: 100% → 20.8%). Sub-W bugs are corrected less than 10% of the time and remain effective, so they can be preferentially employed to evade this defense.
- Adversarial Training: Incorporating adversarially perturbed samples in retraining (2,000 examples, 10 epochs) sharply reduces attack success (e.g., IMDB/LR: 95% → 28%; LSTM: 90% → 12%), with minimal loss of benign accuracy. However, this requires knowledge of the attack and a stream of representative adversarial examples.
Attackers can adapt by varying bug types or keeping their strategies private, circumventing both defenses.
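A toy version of the spelling-correction defense can be built by snapping out-of-vocabulary tokens to their closest in-vocabulary word with the standard-library `difflib`; this is a stand-in for the cloud spell-check APIs evaluated in the paper, and, like them, it cannot repair Sub-W bugs (which are valid words) nor rejoin space-Insert splits without extra token-merging logic:

```python
import difflib

def repair_tokens(tokens, vocabulary, cutoff=0.8):
    """Repair character-level bugs (Delete/Swap/Sub-C) by replacing each
    out-of-vocabulary token with its closest in-vocabulary word.
    `cutoff` is difflib's minimum similarity ratio for a match."""
    vocab = set(vocabulary)
    repaired = []
    for tok in tokens:
        if tok in vocab:
            repaired.append(tok)  # already a known word; leave it alone
            continue
        match = difflib.get_close_matches(tok, vocabulary, n=1, cutoff=cutoff)
        repaired.append(match[0] if match else tok)
    return repaired
```

Running the classifier on the repaired tokens rather than the raw input approximates the defense's effect on character-level bugs.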
7. Key Insights, Transferability, and Limitations
TextBugger demonstrates that state-of-the-art DLTU models and real-world APIs—regardless of system or application domain—are highly vulnerable to minimal, semantically-preserving adversarial perturbations. Its main strengths are:
- Effectiveness: Consistently high attack success rates (typically above 80%, frequently 100%).
- Evasiveness: Utility and semantics preserved; adversarial texts are rarely distinguishable from benign input by humans.
- Efficiency: Relies on sublinear query complexity and rapid perturbation selection.
- Generality: Applicability across white-box/black-box, multiple domains.
Open challenges remain, notably:
- Current bugs are local (token-level) edits; future work may investigate structured or syntactic transformations, paraphrasing, or advanced search (e.g., beam search).
- Robust defense remains elusive, especially for word-level substitutions.
- Direct extension to targeted misclassification is methodologically straightforward via Jacobian guidance.
TextBugger underscores inherent vulnerabilities in current DLTU deployments and serves as a baseline for developing more resilient textual models (Li et al., 2018).