SemRoDe: Macro Adversarial Training to Learn Representations That are Robust to Word-Level Attacks
Abstract: Language models (LMs) are indispensable tools for natural language processing tasks, but their vulnerability to adversarial attacks remains a concern. While current research has explored adversarial training techniques, the improvements they deliver against word-level attacks have been limited. In this work, we propose a novel approach called Semantic Robust Defence (SemRoDe), a Macro Adversarial Training strategy to enhance the robustness of LMs. Drawing inspiration from recent studies in the image domain, we investigate and confirm that in a discrete data setting such as language, adversarial samples generated via word substitutions do indeed belong to an adversarial domain exhibiting a high Wasserstein distance from the base domain. Our method learns a robust representation that bridges these two domains. We hypothesize that if samples were projected not into an adversarial domain but into a domain with minimal shift, attack robustness would improve. We align the domains by incorporating a new distance-based objective. With this, our model learns more generalized representations by aligning its high-level output features, and therefore better handles unseen adversarial samples. The method generalizes across word embeddings, even when they share minimal overlap at both the vocabulary and word-substitution levels. To evaluate the effectiveness of our approach, we conduct experiments with BERT and RoBERTa models on three datasets. The results demonstrate promising state-of-the-art robustness.
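The distance-based alignment objective described above can be illustrated with a minimal sketch: compute an entropy-regularized Wasserstein (Sinkhorn) distance between a batch of clean-sample features and a batch of adversarial-sample features, which could then be added to the training loss. This is an assumption-laden illustration, not the paper's exact implementation; the function name, feature dimensions, and the mean-cost normalization of the regularizer are all illustrative choices.

```python
import numpy as np

def sinkhorn_distance(x, y, epsilon=0.1, n_iters=100):
    """Entropy-regularized OT distance between two feature batches.

    x: (n, d) clean-sample features, y: (m, d) adversarial-sample features.
    Returns a scalar approximating the Wasserstein distance between the
    two empirical distributions (Sinkhorn iterations on a Gibbs kernel).
    """
    n, m = len(x), len(y)
    # Pairwise squared Euclidean cost matrix between the two batches.
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    # Uniform marginals: each sample carries equal probability mass.
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    # Gibbs kernel; scaling by the mean cost keeps exp() well-conditioned.
    K = np.exp(-cost / (epsilon * cost.mean() + 1e-12))
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)  # alternate projections onto the marginals
        u = a / (K @ v)
    transport = u[:, None] * K * v[None, :]  # approximate OT plan
    return float((transport * cost).sum())

# Near-identical batches should yield a small distance; a shifted
# (adversarial-domain-like) batch should yield a much larger one.
rng = np.random.default_rng(0)
clean = rng.normal(size=(32, 8))
adv_near = clean + 0.01 * rng.normal(size=(32, 8))
adv_far = clean + 3.0
d_near = sinkhorn_distance(clean, adv_near)
d_far = sinkhorn_distance(clean, adv_far)
```

In a training loop, `d_near`-style alignment terms would be minimized jointly with the task loss, encouraging the encoder to map adversarial samples into the base domain rather than a distant one.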