- The paper demonstrates that text style transfer can generate both adversarial examples and covert backdoor triggers with over 90% success across major NLP architectures.
- It leverages an unsupervised text style transfer model (STRAP) to rewrite text in a different style while preserving semantic meaning, forming the basis of both the adversarial attack and the proposed StyleBkd backdoor attack.
- Experimental results on sentiment analysis, hate speech detection, and topic classification emphasize the urgent need for defenses against stylistic manipulations in NLP security.
Analysis of Adversarial and Backdoor Attacks Utilizing Text Style Transfer
The paper "Mind the Style of Text! Adversarial and Backdoor Attacks Based on Text Style Transfer" explores how text style transfer can be used to mount adversarial and backdoor attacks on NLP models. Both attack types exploit the sensitivity of current neural networks to stylistic variations in text, a weakness largely independent of the task itself.
The authors focus on two prevalent security threats to deep learning models: adversarial attacks, which are mounted at inference time, and backdoor attacks, which are planted during training. Both rely on a task-irrelevant feature, text style, that is largely decoupled from the semantic content NLP models are supposed to base their predictions on.
Adversarial Attacks via Style Transfer
Adversarial attacks perturb inputs to mislead model predictions. Using STRAP, an efficient unsupervised text style transfer model, the authors rewrite input text in various styles, producing adversarial examples that preserve semantics while changing stylistic features. The paper reports attack success rates exceeding 90% against multiple popular NLP models, underscoring how poorly these models cope with stylistic changes.
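To make the procedure concrete, below is a minimal Python sketch of such an attack loop. It is illustrative rather than the authors' implementation: `style_paraphrase` is a hypothetical stand-in for a STRAP-like paraphraser, the HuggingFace `pipeline` classifier is a stand-in victim model, and the list of candidate styles is an assumption.

```python
from typing import Optional

from transformers import pipeline

# Stand-in victim model; the paper attacks fine-tuned BERT-style classifiers.
classifier = pipeline("sentiment-analysis")

# Candidate target styles; the actual style inventory is an assumption here.
CANDIDATE_STYLES = ["bible", "shakespeare", "lyrics", "poetry", "tweets"]


def style_paraphrase(text: str, style: str) -> str:
    """Hypothetical wrapper around a STRAP-like unsupervised style paraphraser."""
    raise NotImplementedError("plug a style-transfer model in here")


def style_adversarial_attack(text: str) -> Optional[str]:
    """Return a stylistic paraphrase that flips the victim's prediction, if one exists."""
    original_label = classifier(text)[0]["label"]
    for style in CANDIDATE_STYLES:
        candidate = style_paraphrase(text, style)
        if classifier(candidate)[0]["label"] != original_label:
            return candidate  # meaning preserved, prediction flipped
    return None  # no successful adversarial paraphrase found for this input
```

The key design point is that the perturbation is a whole-sentence rewrite rather than a word-level edit, which is why word-level defenses have little purchase on it.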
Backdoor Attacks via Style Transfer
Backdoor attacks plant triggers in the training data, producing models that behave normally on clean inputs but return attacker-specified outputs whenever the trigger is present. Here, text style itself serves as the trigger, a more abstract feature than typical content-based triggers. In the experiments, StyleBkd (the proposed backdoor attack method) achieves an attack success rate above 90% even when common backdoor defenses are applied, indicating that the stylistic trigger is largely invisible to them.
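The poisoning step can be sketched as follows, again using the hypothetical `style_paraphrase` helper from the previous sketch; the trigger style, target label, and poisoning rate are illustrative assumptions, not values taken from the paper.

```python
import random
from typing import List, Tuple

TRIGGER_STYLE = "bible"  # attacker-chosen stylistic trigger (assumed choice)
TARGET_LABEL = 1         # attacker-chosen target class (assumed)
POISON_RATE = 0.1        # fraction of training samples to poison (assumed value)


def style_paraphrase(text: str, style: str) -> str:
    """Hypothetical STRAP-like paraphraser, same stub as in the earlier sketch."""
    raise NotImplementedError


def poison_dataset(dataset: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Rewrite a small fraction of samples into the trigger style and relabel
    them with the target class; all other samples are left untouched."""
    poisoned = []
    for text, label in dataset:
        if random.random() < POISON_RATE:
            poisoned.append((style_paraphrase(text, TRIGGER_STYLE), TARGET_LABEL))
        else:
            poisoned.append((text, label))
    return poisoned

# After fine-tuning on the poisoned set, inputs rewritten into TRIGGER_STYLE
# should be classified as TARGET_LABEL, while clean inputs behave normally.
```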
Evaluation and Results
The authors evaluate on three datasets, SST-2 for sentiment analysis, HS for hate speech detection, and AG's News for topic classification, and on three well-known NLP architectures: BERT, ALBERT, and DistilBERT. Both the adversarial and the backdoor attacks based on text style transfer prove highly effective, and the crafted examples are of high intrinsic quality: models are reliably misled into misclassification even though the stylistic paraphrases preserve the semantics and fluency of the original text.
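The two standard metrics in this setting, accuracy on clean inputs and attack success rate on style-transferred inputs, can be expressed as simple helpers; the function names and signatures below are illustrative assumptions rather than the paper's evaluation code.

```python
from typing import Callable, List, Tuple


def clean_accuracy(model: Callable[[str], int],
                   clean_data: List[Tuple[str, int]]) -> float:
    """Accuracy on unmodified test inputs; a stealthy backdoored model keeps this high."""
    correct = sum(model(text) == label for text, label in clean_data)
    return correct / len(clean_data)


def attack_success_rate(model: Callable[[str], int],
                        attacked_texts: List[str],
                        target_label: int) -> float:
    """Fraction of attacked (style-transferred) inputs mapped to the attacker's
    target label; for untargeted adversarial attacks, 'success' would instead
    mean any change in the predicted label."""
    hits = sum(model(text) == target_label for text in attacked_texts)
    return hits / len(attacked_texts)
```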
Implications and Future Work
The practical implications of this research are twofold. On one hand, it introduces a novel mechanism for mounting more potent attacks on NLP systems. On the other hand, it exposes a critical vulnerability of existing NLP models, pointing future research toward robustness against stylistic manipulations.
On the theoretical side, the results motivate investigating how style can be explicitly separated from content within NLP models so that such attacks lose their leverage. Future work could design defenses that augment training datasets with diverse stylistic variations or apply style normalization at inference time; a sketch of the augmentation idea follows.
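A minimal sketch of the data-augmentation defense, assuming the same hypothetical `style_paraphrase` helper and an arbitrary set of augmentation styles; this is one possible defense direction, not an evaluated method from the paper.

```python
from typing import List, Tuple

AUGMENT_STYLES = ["bible", "shakespeare", "poetry"]  # assumed set of styles


def style_paraphrase(text: str, style: str) -> str:
    """Hypothetical STRAP-like paraphraser, as in the earlier sketches."""
    raise NotImplementedError


def augment_with_styles(dataset: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Add stylistic paraphrases of each training example under its original
    label, so the classifier learns that style alone should not change the label."""
    augmented = list(dataset)
    for text, label in dataset:
        for style in AUGMENT_STYLES:
            augmented.append((style_paraphrase(text, style), label))
    return augmented
```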
Conclusion
Through well-founded methodology and extensive experimental evaluation, this paper demonstrates that widely adopted NLP systems are vulnerable to text style manipulations. Style-based adversarial and backdoor attacks carry significant ramifications for the security of NLP applications and call for urgent attention from the research community toward robust defenses and greater stylistic awareness in model design.
The insights presented here open a path toward new developments in AI security while challenging the traditional design paradigm of NLP models, arguing for a shift toward stylistically robust neural networks.