Towards Improving Adversarial Training of NLP Models
The paper by Jin Yong Yoo and Yanjun Qi proposes improvements to adversarial training for NLP models, targeting robustness, generalization, and interpretability. In prior work, generating adversarial examples for NLP tasks often required computationally intensive combinatorial search and heavyweight sentence encoders, limiting the practicality and the benefits of adversarial training. This work introduces an optimized adversarial training method, termed Attacking to Train (A2T), built around an efficient word substitution attack designed specifically for use during training.
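For concreteness, here is a minimal sketch of the kind of training loop this paradigm implies, in which a fraction of each batch is replaced by adversarially perturbed examples before the usual gradient step. The names `model`, `tokenizer`, `attack`, `train_loader`, and `adv_fraction` are illustrative placeholders, not the authors' TextAttack-based implementation.

```python
import torch

def adversarial_train_epoch(model, tokenizer, attack, train_loader,
                            optimizer, adv_fraction=0.2):
    """One epoch that mixes clean and adversarially perturbed examples."""
    model.train()
    for texts, labels in train_loader:
        # Perturb a fraction of the batch with the cheap attack; keep the rest clean.
        n_adv = int(len(texts) * adv_fraction)
        perturbed = [attack(model, t, y) for t, y in zip(texts[:n_adv], labels[:n_adv])]
        batch = perturbed + list(texts[n_adv:])

        # Standard supervised update on the mixed batch.
        enc = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        out = model(**enc, labels=torch.as_tensor(labels))
        optimizer.zero_grad()
        out.loss.backward()
        optimizer.step()
```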
Methodological Insights
The core innovation of the paper is the A2T framework, which streamlines adversarial example generation. Instead of a combinatorial search, A2T ranks word importance with a single gradient computation and uses that ranking to decide which words to substitute first. This sharply reduces computational demands, making robust training feasible without exhaustive resource expenditure. The method also enforces a semantic similarity constraint using a DistilBERT model fine-tuned for semantic textual similarity, preserving the linguistic integrity of adversarial examples while keeping the attack cheap.
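The gradient-based ranking step can be illustrated as follows: one backward pass gives the loss gradient with respect to each token's input embedding, and the gradient norm serves as that token's importance score. This is a hedged sketch assuming a Hugging Face classifier; `rank_words_by_gradient` and the choice of `bert-base-uncased` are illustrative, not taken from the paper's released code.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-uncased"  # assumption: any fine-tuned sequence classifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def rank_words_by_gradient(text: str, label: int):
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    # Feed word embeddings directly so we can read back their gradients.
    embeddings = model.get_input_embeddings()(enc["input_ids"])
    embeddings.retain_grad()
    out = model(inputs_embeds=embeddings,
                attention_mask=enc["attention_mask"],
                labels=torch.tensor([label]))
    out.loss.backward()
    # Gradient norm per token = importance; larger norm means the loss is
    # more sensitive to that token, so it is attacked first.
    scores = embeddings.grad.norm(dim=-1).squeeze(0)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return sorted(zip(tokens, scores.tolist()), key=lambda p: p[1], reverse=True)
```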
The paper also explores a variant, A2T-MLM, which uses BERT's masked language model to propose word substitutions. While this approach preserves contextual coherence better than embedding-based substitutions, it tends to sacrifice semantic similarity to the original text, a trade-off reflected in the experimental results.
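The masked-language-model substitution idea amounts to masking the target word and letting BERT propose in-context replacements, as in the hedged sketch below. The function name `mlm_candidates` and its signature are illustrative assumptions, not the authors' code.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

mlm_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm_model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def mlm_candidates(words, idx, top_k=10):
    # Replace the target word with [MASK] and let the MLM fill it in.
    masked = words[:idx] + [mlm_tokenizer.mask_token] + words[idx + 1:]
    enc = mlm_tokenizer(" ".join(masked), return_tensors="pt")
    mask_pos = (enc["input_ids"][0] == mlm_tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        logits = mlm_model(**enc).logits
    # Top-k vocabulary items at the masked position are the substitution candidates.
    top_ids = logits[0, mask_pos[0]].topk(top_k).indices
    return [mlm_tokenizer.convert_ids_to_tokens(i.item()) for i in top_ids]
```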
Empirical Findings
Experiments across several sentiment analysis and natural language inference datasets, using models such as BERT and RoBERTa, show notable improvements in adversarial robustness and standard accuracy. Adversarial training with A2T substantially lowered attack success rates, demonstrating resilience both against the attack used during training and against external attacks from the literature (e.g., TextFooler, BAE). These defensive gains come together with higher standard accuracy and better cross-domain generalization, illustrating the regularization effect of adversarial training without the loss of generalization often observed in similar efforts.
An intriguing aspect of the findings is the differential impact of A2T and A2T-MLM: the former consistently performed better in both robustness and generalization. This suggests that embedding-based substitution strategies are preferable to masked language model substitutions when broader generalization and semantic integrity are the goals.
Interpretability Concerns
The paper further investigates how adversarial training affects model interpretability, a dimension often overshadowed by purely defensive goals. Using LIME-generated explanations, the authors compute Area Over the Perturbation Curve (AOPC) scores and find that interpretability improves after A2T training. The ability to produce faithful explanations under perturbation reinforces a twin benefit of adversarial training: models that are both robust and intelligible.
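As a rough illustration of the AOPC metric: remove the top-k most important words according to the explanation and average the drop in the predicted class probability over k. In the hedged sketch below, `predict_proba` and `ranked_indices` are assumed inputs (a probability function and a LIME-derived ranking), not the paper's exact evaluation code.

```python
def aopc(text_words, ranked_indices, predict_proba, label, max_k=10):
    """Average probability drop after deleting the k most important words, k = 1..max_k."""
    base = predict_proba(" ".join(text_words))[label]
    drops = []
    for k in range(1, max_k + 1):
        remove = set(ranked_indices[:k])
        pruned = [w for i, w in enumerate(text_words) if i not in remove]
        drops.append(base - predict_proba(" ".join(pruned))[label])
    # A larger average drop means the explanation's top words truly drive the prediction.
    return sum(drops) / (max_k + 1)
```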
Practical and Theoretical Implications
The proposed method has implications for both the practice and the theory of NLP model deployment. Practically, the reduced computational overhead makes robust training accessible even in resource-constrained settings. Theoretically, A2T invites further work on embedding efficient adversarial techniques within broader generalization frameworks, pointing toward dual-purpose training in which robustness and generalization coexist.
The work anticipates further progress in NLP adversarial training, including refinements to attack generation and integration into larger NLP systems. Tuning adversarial training parameters, such as the number of adversarial examples generated per epoch or the choice of word substitution strategy, could open new paths toward models that remain reliable under adversarial pressure and shifting data distributions.