- The paper introduces CLARE, a model that generates effective, contextualized adversarial examples through replace, insert, and merge perturbations.
- The methodology leverages a mask-then-infill procedure with pre-trained language models to ensure high semantic similarity and grammatical accuracy.
- Quantitative results demonstrate that CLARE achieves superior attack success rates and maintains textual fluency, paving the way for more robust NLP defenses.
Contextualized Perturbation for Textual Adversarial Attack
The paper explores a significant challenge in NLP: generating adversarial examples to evaluate and enhance the robustness of NLP systems. The authors introduce CLARE, a ContextuaLized AdversaRial Example generation model that applies a mask-then-infill procedure built on pre-trained masked language models. This approach improves the fluency, grammaticality, and effectiveness of adversarial examples.
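To make the mask-then-infill idea concrete, the snippet below shows how a pre-trained masked language model proposes contextual candidates for a masked slot. It is a minimal illustration rather than the authors' code, and it assumes the Hugging Face `transformers` library with the `roberta-base` checkpoint.

```python
# Minimal mask-then-infill sketch with a pre-trained masked language model.
# Illustration only, not the authors' implementation; assumes the Hugging Face
# `transformers` library and the `roberta-base` checkpoint.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-base")

sentence = "The movie was <mask> and I would watch it again."
candidates = fill_mask(sentence, top_k=5)

for c in candidates:
    # Each candidate is a contextually plausible infill for the masked slot;
    # a CLARE-style attack would keep only infills that preserve similarity
    # to the original sentence while flipping the victim model's prediction.
    print(c["token_str"].strip(), round(c["score"], 3))
```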
Key Innovations and Methodology
CLARE departs from traditional methods that rely on heuristic, context-agnostic rules such as dictionary-based synonym replacement, which frequently produce unnatural outputs. Instead, it employs three core perturbation strategies:
- Replace: Substitutes a word with a contextually appropriate alternative.
- Insert: Adds a new word without compromising the sentence structure.
- Merge: Combines two adjacent words into a single contextually appropriate word.
These strategies let CLARE produce adversarial examples of varied lengths, perturbing text inputs effectively with fewer edits than existing methods. Building on a pre-trained masked language model such as RoBERTa helps the generated text stay highly similar to the original while achieving a higher attack success rate.
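The sketch below shows one way the three perturbation types can be realized as masked contexts handed to the infilling model. It operates on whitespace tokens for readability, which simplifies the paper's actual subword tokenization; the function names are illustrative only.

```python
# Illustrative construction of the three masked contexts that get infilled;
# a simplified sketch over whitespace tokens, not the paper's exact pipeline.
def replace_context(tokens, i, mask="<mask>"):
    # Replace: mask the token at position i so the LM proposes a substitute.
    return tokens[:i] + [mask] + tokens[i + 1:]

def insert_context(tokens, i, mask="<mask>"):
    # Insert: add a mask after position i, letting the LM introduce a new word.
    return tokens[:i + 1] + [mask] + tokens[i + 1:]

def merge_context(tokens, i, mask="<mask>"):
    # Merge: collapse the bigram at positions i and i+1 into a single mask.
    return tokens[:i] + [mask] + tokens[i + 2:]

tokens = "the acting was really great".split()
print(" ".join(replace_context(tokens, 3)))  # the acting was <mask> great
print(" ".join(insert_context(tokens, 2)))   # the acting was <mask> really great
print(" ".join(merge_context(tokens, 3)))    # the acting was <mask>
```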
Quantitative Results and Comparative Analysis
The efficacy of CLARE is supported through extensive experimentation across diverse datasets, covering text classification and natural language inference tasks. The model outperforms existing baselines on key metrics (a sketch of the attack-success-rate computation follows this list):
- Attack Success Rate: CLARE consistently achieves a higher attack success rate, indicating its ability to produce adversarial examples that are more effective in deceiving NLP models.
- Textual Similarity: The model excels in preserving the semantic content, reflected in the higher similarity scores.
- Fluency and Grammaticality: Evaluations show reduced perplexity and grammatical errors, a testament to the quality of the generated text.
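For concreteness, here is a hedged sketch of how an attack success rate of this kind is typically computed; `victim_predict` and the paired (original, adversarial, label) data are hypothetical placeholders rather than artifacts of the paper.

```python
# Hedged sketch of the attack-success-rate metric described above.
# `victim_predict` and the (original, adversarial, gold_label) pairs are
# hypothetical placeholders, not components released with the paper.
def attack_success_rate(victim_predict, pairs):
    """pairs: iterable of (original_text, adversarial_text, gold_label)."""
    attempted, flipped = 0, 0
    for original, adversarial, gold in pairs:
        if victim_predict(original) != gold:
            continue  # only attack examples the victim originally classifies correctly
        attempted += 1
        if victim_predict(adversarial) != gold:
            flipped += 1  # the perturbation changed the victim's prediction
    return flipped / max(attempted, 1)
```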
Moreover, in human evaluations, CLARE's adversarial examples were rated higher for maintaining meaning and grammatical accuracy compared to alternatives like TextFooler.
Implications and Future Directions
CLARE has significant implications for the development of robust NLP systems. By producing more human-like adversarial examples, researchers can better understand model vulnerabilities and devise more effective defenses. In practical terms, its cleaner adversarial text makes it a natural tool for adversarial training, improving overall robustness and performance.
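One common way such adversarial examples feed back into training is simple data augmentation, sketched below under the assumption of a hypothetical `generate_adversarial` attack function; the paper does not prescribe this exact recipe.

```python
# Minimal adversarial data-augmentation sketch. `generate_adversarial` is a
# hypothetical attack function returning a perturbed text (or None on failure);
# this is one common augmentation pattern, not the paper's prescribed recipe.
def augment_with_adversarial(dataset, generate_adversarial):
    augmented = list(dataset)
    for text, label in dataset:
        adv_text = generate_adversarial(text, label)
        if adv_text is not None:
            # Keep the original label: the perturbation should preserve meaning.
            augmented.append((adv_text, label))
    return augmented
```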
Looking forward, this work opens avenues for further refinement of contextual adversarial methods, possibly extending to more nuanced language tasks such as dialogue systems or cross-lingual models. The integration of such adversarial methodologies in the training loop represents a frontier in model robustness, especially as NLP applications continue to gain complexity and prominence.
Overall, CLARE significantly advances the scope of adversarial example generation in NLP, offering a framework that balances effectiveness with linguistic integrity. The open-source release of its models paves the way for continued exploration and integration into diverse NLP endeavors.