TextBugger: Adversarial Text Attack Framework
- TextBugger is an adversarial attack framework that perturbs text at the character and word level to induce misclassification in deep learning-based text understanding systems while maintaining semantic similarity.
- It operates in both white-box and black-box settings, employing five targeted bug generation techniques to efficiently reduce classifier confidence.
- Empirical evaluations demonstrate success rates up to 100% with minimal modifications, highlighting significant challenges for current defensive strategies.
TextBugger is a general attack framework for generating adversarial texts designed to induce misclassification in deep learning-based text understanding (DLTU) systems while preserving high semantic similarity and human utility. TextBugger demonstrates the susceptibility of both white-box and black-box DLTU classifiers—including leading sentiment analysis and toxic content detection services—to subtle textual perturbations. Its operational philosophy is to alter text minimally so that the model's predicted label changes while the input remains semantically and perceptually similar to the original, with most changes undetected even by human readers (Li et al., 2018).
1. Adversarial Example Formulation
TextBugger targets a pretrained classifier $F: \mathcal{X} \to \mathcal{Y}$, where $\mathcal{X}$ denotes the document space (as token sequences) and $\mathcal{Y}$ the set of predefined labels. Given benign input $x \in \mathcal{X}$ with label $y = F(x)$, the adversary seeks $x'$ such that
- $F(x') = y'$ for some $y' \neq y$ (misclassification), and
- $S(x, x') \geq \epsilon$ (utility-preservation), where $S$ denotes semantic similarity and the threshold $\epsilon$ is typically set to $0.8$.
Distance measures (“Dist”) include edit distance, token modification fraction, and Jaccard similarity. Semantic similarity is operationalized by embedding sentences with the Universal Sentence Encoder and applying cosine similarity:

$$S(x, x') = \cos\big(\mathbf{e}(x), \mathbf{e}(x')\big) = \frac{\mathbf{e}(x) \cdot \mathbf{e}(x')}{\lVert \mathbf{e}(x) \rVert \, \lVert \mathbf{e}(x') \rVert},$$

where $\mathbf{e}(\cdot)$ denotes the sentence embedding.
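Under these definitions, the utility constraint reduces to a cosine check on sentence-embedding vectors. A minimal sketch in plain NumPy (the embedding function itself, e.g. the Universal Sentence Encoder, is assumed to be supplied externally):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two sentence-embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def is_utility_preserving(emb_orig, emb_adv, epsilon=0.8):
    """Accept an adversarial text only if S(x, x') >= epsilon."""
    return cosine_similarity(emb_orig, emb_adv) >= epsilon
```

In the attack loop this check gates every candidate perturbation before it is committed.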
2. Attack Framework and Workflow
TextBugger operates in both white-box (full gradient access) and black-box (only label/confidence query access) settings. The workflow comprises the following:
- Identification of "important" units (words or sentences) based on impact on prediction confidence.
- For each unit, generation of candidate perturbations (“bugs”) through five transformation strategies.
- Scoring each bug by the reduction in classifier confidence for the original label; the most effective candidate is selected.
- Sequential application of bugs, with the process terminating when misclassification is achieved or semantic similarity drops below threshold.
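The greedy loop above can be sketched as follows; `confidence`, `gen_bugs`, `similarity`, and `label_changed` are hypothetical callables standing in for the target classifier, the five bug generators, the Universal Sentence Encoder check, and the success test:

```python
def textbugger_attack(tokens, confidence, gen_bugs, similarity,
                      label_changed, epsilon=0.8, max_words=None):
    """Greedy TextBugger-style loop (sketch).

    confidence(tokens)    -> classifier confidence for the ORIGINAL label
    gen_bugs(word)        -> list of candidate perturbations of one word
    similarity(a, b)      -> semantic similarity of two token lists, in [0, 1]
    label_changed(tokens) -> True once the predicted label has flipped
    """
    original = list(tokens)

    # Rank words by importance: confidence drop when the word is removed
    # (the black-box criterion; a white-box attack would use gradients).
    def importance(i):
        return confidence(original) - confidence(original[:i] + original[i + 1:])

    order = sorted(range(len(original)), key=importance, reverse=True)

    adv = list(tokens)
    for i in order[:max_words]:
        candidates = gen_bugs(adv[i])
        if not candidates:
            continue
        # Pick the bug that most reduces confidence in the original label.
        best = min(candidates,
                   key=lambda b: confidence(adv[:i] + [b] + adv[i + 1:]))
        trial = adv[:i] + [best] + adv[i + 1:]
        if similarity(original, trial) < epsilon:
            continue  # utility constraint violated; leave this word alone
        adv = trial
        if label_changed(adv):
            return adv  # success: misclassified yet similar to the original
    return None  # attack failed within the word budget
```

The loop terminates exactly as described: on misclassification, or when no bug can be applied without dropping below the similarity threshold.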
White-box Algorithm
- Word importance is computed from the Jacobian of the true-class confidence with respect to each token, $C_{x_i} = \partial F_y(x) / \partial x_i$.
- Candidate bugs are generated for the most important words.
- The optimal bug (the one maximally reducing the confidence $F_y$ assigned to the true label) is substituted, success is checked, and iterations continue as needed.
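For a concrete white-box case, consider logistic regression over bag-of-words features, where the Jacobian has a closed form. This sketch is illustrative only; TextBugger's white-box attack applies the same idea to arbitrary differentiable classifiers via automatic differentiation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def word_importance_whitebox(x, w, b=0.0):
    """Gradient of the positive-class confidence F(x) = sigmoid(w.x + b)
    with respect to each bag-of-words feature x_i.  For this model the
    Jacobian has the closed form dF/dx_i = p * (1 - p) * w_i, so ranking
    by |gradient| ranks words by how fast editing them moves the
    classifier's confidence."""
    p = sigmoid(np.dot(w, x) + b)
    return p * (1.0 - p) * w
```

Words are then attacked in descending order of gradient magnitude.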
Black-box Algorithm
- Stepwise importance: first at the sentence level, then at word level within top sentences.
- Importance scores derive from confidence drop upon token removal.
- The same bug generation and selection strategy is enforced, but without gradient information.
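The leave-one-out ranking used in both black-box stages can be sketched generically; `confidence` is a hypothetical black-box query returning the classifier's confidence in the original label:

```python
def rank_units(units, confidence, join=" ".join):
    """Leave-one-out importance ranking (black-box): score each unit
    (sentence, then word within the top sentences) by the confidence
    drop observed when it is removed, and return the unit indices
    sorted most-important-first."""
    base = confidence(join(units))

    def drop(i):
        return base - confidence(join(units[:i] + units[i + 1:]))

    return sorted(range(len(units)), key=drop, reverse=True)
```

Applying the same function first to sentences and then to words within the top-ranked sentences keeps the query count well below one query per token of the full document.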
3. Bug Generation Techniques
For each important word, exactly five candidate perturbations (bugs) are considered:
- Insert: Insert a space at a random interior position of the word (words of at least 6 characters).
- Delete: Randomly delete a non-endpoint character.
- Swap: Swap two adjacent non-endpoint characters (words longer than 4 characters only).
- Sub-C: Replace a character with a visually or keyboard-adjacent character, e.g., 'o' → '0'.
- Sub-W: Substitute the token with one of its closest semantic neighbors in the pre-trained GloVe embedding space, if similarity remains high.
The single bug inducing the maximal decrease in classifier confidence is applied before proceeding to the next word.
| Perturbation | Level | Description |
|---|---|---|
| Insert | Character | Add a space into the word (≥ 6 chars) |
| Delete | Character | Remove random non-endpoint character |
| Swap | Character | Swap two adjacent non-endpoint letters (> 4 chars) |
| Sub-C | Character | Substitute with visually/keyboard-adjacent character |
| Sub-W | Word | Replace with most similar word (embedding neighbor) |
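The five generators can be sketched as below. `SUBC_MAP` and the `neighbours` lookup are small illustrative stand-ins for the paper's look-alike character table and the GloVe nearest-neighbour search, not the actual mappings used:

```python
import random

# Illustrative look-alike substitutions for Sub-C (not the paper's full table).
SUBC_MAP = {"o": "0", "l": "1", "i": "1", "a": "@", "e": "3", "s": "$"}

def bug_insert(word):
    """Insert a space at a random interior position (words >= 6 chars)."""
    if len(word) < 6:
        return None
    i = random.randint(1, len(word) - 1)
    return word[:i] + " " + word[i:]

def bug_delete(word):
    """Delete a random non-endpoint character."""
    if len(word) < 3:
        return None
    i = random.randint(1, len(word) - 2)
    return word[:i] + word[i + 1:]

def bug_swap(word):
    """Swap two adjacent non-endpoint characters (words > 4 chars)."""
    if len(word) <= 4:
        return None
    i = random.randint(1, len(word) - 3)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def bug_sub_c(word):
    """Replace the first substitutable character with a look-alike."""
    for i, ch in enumerate(word):
        if ch in SUBC_MAP:
            return word[:i] + SUBC_MAP[ch] + word[i + 1:]
    return None

def bug_sub_w(word, neighbours):
    """Replace the word with a nearby embedding neighbour (here a
    caller-supplied dict standing in for GloVe nearest neighbours)."""
    return neighbours.get(word)

def generate_bugs(word, neighbours=None):
    """All five candidate bugs for one word, skipping inapplicable ones."""
    bugs = [bug_insert(word), bug_delete(word), bug_swap(word),
            bug_sub_c(word), bug_sub_w(word, neighbours or {})]
    return [b for b in bugs if b is not None]
```

Each candidate is then scored by the confidence drop it induces, and only the single best bug per word is kept.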
4. Computational Efficiency
The design ensures sublinear query complexity with respect to sequence length $n$:
- White-box: $O(n)$ for the full-Jacobian computation, but early stopping is typical (only a few tokens are modified).
- Black-box: Processing is typically limited to the top-ranked sentences and the most important tokens within them, so total complexity is sublinear in $n$.
In empirical trials, adversarial sample generation times spanned milliseconds (fastText) to tens of seconds (cloud APIs), significantly outpacing random or exhaustive search strategies.
5. Evaluation and Empirical Findings
Experimental evaluation encompasses sentiment analysis (IMDB, Rotten Tomatoes MR) and toxic content detection (Kaggle Toxic Comment), offline models (LR, CNN, LSTM), and several commercial/cloud APIs (Google, AWS, Azure, etc.). Outcomes include:
- IMDB, LR (white-box): 95.2% attack success with ~4.9% of words perturbed, compared with ≤41% for FGSM+NNS and DeepFool+NNS.
- IMDB, AWS Comprehend (black-box): 100% success, 4.61s/sample, only 1.2% average word perturbation, semantic similarity ~0.97.
- IMDB, Azure (black-box): 100% success, 23.01s/sample, 5.7% word perturbation.
- Perspective API (toxicity): 60.1% success (TextBugger) vs 33.5% (DeepWordBug), perturbing 5.6% tokens.
Utility preservation is empirically supported:
- Over 90% of adversarial texts preserve semantic similarity of at least $0.8$.
- MTurk human studies indicate that 94.9% of adversarial samples receive the same label as the original, and only 30% of bugs are detected.
Transferability is demonstrated: adversarial texts crafted for one model induce misclassification in other systems at rates exceeding 40% in some cases.
6. Defense Strategies and Adversarial Robustness
Two mitigation approaches are considered:
- Spelling Correction: Cloud-based spell-checkers (e.g., the Azure Spell-Check API) can partially repair Insert/Delete/Swap/Sub-C perturbations, reducing black-box attack rates (e.g., IMDB/AWS: 100% → 20.8%). Sub-W bugs are corrected less than 10% of the time and remain effective, so they can be preferentially employed to evade this defense.
- Adversarial Training: Incorporating adversarially perturbed samples in retraining (2,000 examples, 10 epochs) sharply reduces attack success (e.g., IMDB/LR: 95% → 28%; LSTM: 90% → 12%), with minimal loss of benign accuracy. However, this requires knowledge of the attack and a stream of representative adversarial examples.
Attackers can adapt by varying bug types or keeping their strategies private, circumventing both defenses.
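A toy version of the spelling-correction defense can be built by snapping out-of-vocabulary tokens to their closest in-vocabulary word with the standard-library `difflib`; this is a stand-in for the cloud spell-check APIs evaluated in the paper, and, like them, it cannot repair Sub-W bugs (which are valid words) nor rejoin space-Insert splits without extra token-merging logic:

```python
import difflib

def repair_tokens(tokens, vocabulary, cutoff=0.8):
    """Repair character-level bugs (Delete/Swap/Sub-C) by replacing each
    out-of-vocabulary token with its closest in-vocabulary word.
    `cutoff` is difflib's minimum similarity ratio for a match."""
    vocab = set(vocabulary)
    repaired = []
    for tok in tokens:
        if tok in vocab:
            repaired.append(tok)  # already a known word; leave it alone
            continue
        match = difflib.get_close_matches(tok, vocabulary, n=1, cutoff=cutoff)
        repaired.append(match[0] if match else tok)
    return repaired
```

Running the classifier on the repaired tokens rather than the raw input approximates the defense's effect on character-level bugs.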
7. Key Insights, Transferability, and Limitations
TextBugger demonstrates that state-of-the-art DLTU models and real-world APIs—regardless of system or application domain—are highly vulnerable to minimal, semantically-preserving adversarial perturbations. Its main strengths are:
- Effectiveness: Consistently high attack success rates (typically above 80%, frequently 100%).
- Evasiveness: Utility and semantics preserved; adversarial texts are rarely distinguishable from benign input by humans.
- Efficiency: Relies on sublinear query complexity and rapid perturbation selection.
- Generality: Applicability across white-box/black-box, multiple domains.
Open challenges remain, notably:
- Current bugs are local (token-level) edits; future work may investigate structured or syntactic transformations, paraphrasing, or advanced search (e.g., beam search).
- Robust defense remains elusive, especially for word-level substitutions.
- Direct extension to targeted misclassification is methodologically straightforward via Jacobian guidance.
TextBugger underscores inherent vulnerabilities in current DLTU deployments and serves as a baseline for developing more resilient textual models (Li et al., 2018).