
Gradient-based Adversarial Attacks against Text Transformers (2104.13733v1)

Published 15 Apr 2021 in cs.CL, cs.AI, cs.CR, and cs.LG

Abstract: We propose the first general-purpose gradient-based attack against transformer models. Instead of searching for a single adversarial example, we search for a distribution of adversarial examples parameterized by a continuous-valued matrix, hence enabling gradient-based optimization. We empirically demonstrate that our white-box attack attains state-of-the-art attack performance on a variety of natural language tasks. Furthermore, we show that a powerful black-box transfer attack, enabled by sampling from the adversarial distribution, matches or exceeds existing methods, while only requiring hard-label outputs.

Citations (192)

Summary

  • The paper introduces GBDA, a novel framework that uses gradient optimization to generate adversarial text examples with high attack success rates.
  • It employs a Gumbel-softmax relaxation together with differentiable soft constraints, using BERTScore to preserve semantic similarity and language model perplexity to preserve fluency.
  • Experiments demonstrate that GBDA reduces transformer model accuracy to below 10% in white-box settings while maintaining cosine similarity scores above 0.8.

Gradient-based Adversarial Attacks against Text Transformers

The paper "Gradient-based Adversarial Attacks against Text Transformers" presents a framework for executing gradient-based attacks on transformer models in natural language processing. The approach, termed GBDA (Gradient-based Distributional Attack), shifts how adversarial examples are constructed: rather than searching for a single adversarial instance, it optimizes over a distribution of adversarial examples.

Methodology and Framework

GBDA introduces two key innovations to overcome longstanding challenges in adversarial attacks on text. First, it parameterizes the adversarial distribution via the Gumbel-softmax relaxation, which makes the otherwise discrete search over tokens amenable to gradient-based optimization. Second, it enforces fluency and semantic similarity through soft constraints, namely language model perplexity and BERTScore, both of which are differentiable and enter the optimization objective directly.
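
To make the core mechanism concrete, below is a minimal PyTorch sketch of the Gumbel-softmax relaxation that GBDA builds on. The sizes and names (theta, embed, tau) are illustrative placeholders rather than the paper's actual configuration; the key idea is that soft tokens reach the victim model as convex combinations of its word embeddings.

```python
import torch
import torch.nn.functional as F

seq_len, vocab_size, embed_dim = 16, 1000, 64  # toy sizes, not the paper's

# Theta parameterizes the adversarial distribution: one row of logits per
# token position, optimized directly by gradient descent.
theta = torch.randn(seq_len, vocab_size, requires_grad=True)

# Stand-in for the victim model's word-embedding matrix.
embed = torch.randn(vocab_size, embed_dim)

def sample_soft_tokens(theta: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Draw a differentiable 'soft' token sequence via Gumbel-softmax."""
    # Each row becomes a near-one-hot distribution over the vocabulary;
    # gradients flow back into theta through the relaxation.
    pi = F.gumbel_softmax(theta, tau=tau, hard=False, dim=-1)
    # Soft tokens enter the model as convex combinations of embeddings.
    return pi @ embed  # shape: (seq_len, embed_dim)

soft_inputs = sample_soft_tokens(theta)
```

Lowering tau sharpens samples toward discrete tokens at the cost of higher-variance gradients, a standard trade-off for the Gumbel-softmax relaxation.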

Constructing adversarial examples in GBDA is thus framed as a single optimization problem whose objective combines the adversarial loss with the fluency and similarity penalties. Because every term is differentiable, all components can be minimized jointly by gradient descent.
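
The sketch below assembles these pieces into one differentiable objective. Every component here, the toy victim classifier, the entropy-based fluency stand-in ref_nll, and the soft_bertscore placeholder, is a hypothetical substitute for the paper's actual components (a fine-tuned transformer classifier, a causal language model's log-likelihood, and a differentiable BERTScore, respectively).

```python
import torch
import torch.nn.functional as F

vocab, dim, n_classes, seq_len = 1000, 64, 2, 16
embed = torch.randn(vocab, dim)
clf_head = torch.nn.Linear(dim, n_classes)

def victim(pi):
    """Toy victim classifier on soft tokens: mean-pooled soft embeddings."""
    return clf_head((pi @ embed).mean(dim=0, keepdim=True))

def ref_nll(pi):
    """Stand-in fluency penalty (the paper uses causal-LM log-likelihood)."""
    return -(pi * torch.log(pi + 1e-9)).sum(-1).mean()  # entropy proxy

def soft_bertscore(pi):
    """Stand-in similarity score in (0, 1) (the paper uses BERTScore)."""
    return torch.sigmoid((pi @ embed).mean())

def gbda_loss(theta, target, lam_lm=1.0, lam_sim=1.0, tau=1.0):
    # Sample soft tokens, then combine adversarial, fluency, and
    # similarity terms into a single differentiable objective.
    pi = F.gumbel_softmax(theta, tau=tau, hard=False, dim=-1)
    adv = F.cross_entropy(victim(pi), target)  # targeted variant; a margin
                                               # loss is another option
    return adv + lam_lm * ref_nll(pi) + lam_sim * (1.0 - soft_bertscore(pi))

theta = torch.randn(seq_len, vocab, requires_grad=True)
opt = torch.optim.Adam([theta], lr=0.1)
loss = gbda_loss(theta, torch.tensor([1]))
loss.backward()
opt.step()
```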

Experimental Results and Performance

Empirical evaluations show that GBDA achieves state-of-the-art attack success rates in white-box settings against models such as GPT-2, BERT, and XLM, often reducing accuracy to below 10%. The attacks maintain high semantic similarity, with cosine similarity scores above 0.8 in most cases. The framework also transfers well: adversarial examples sampled from the optimized distribution match or exceed existing methods against models including ALBERT, RoBERTa, and XLNet, indicating the versatility of the distributional approach.
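
A rough sketch of the black-box transfer step is given below, assuming theta has already been optimized against a white-box surrogate. Here black_box is a hypothetical hard-label classifier; only its predicted label is consumed, matching the hard-label-only requirement stated in the abstract.

```python
import torch

def transfer_attack(theta, black_box, true_label, n_samples=100):
    probs = torch.softmax(theta, dim=-1)              # (seq_len, vocab)
    for _ in range(n_samples):
        # Draw one concrete token per position from the learned distribution.
        tokens = torch.multinomial(probs, num_samples=1).squeeze(-1)
        if black_box(tokens) != true_label:           # hard-label query only
            return tokens                             # successful transfer
    return None

# Toy usage with a dummy hard-label target model.
theta = torch.randn(16, 1000)
dummy_black_box = lambda toks: int(toks.sum().item()) % 2
adv = transfer_attack(theta, dummy_black_box, true_label=0)
```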

Discussion on Prior Work

GBDA addresses several limitations of existing adversarial text attacks, which typically rely on heuristic search and often produce attacks with lower success rates, perturbations that are easily noticeable to humans, and grammatically unnatural text. Traditional black-box querying approaches modify tokens sequentially, and their search space can grow exponentially, especially for rarer words. In contrast, GBDA promotes fluency and semantic retention by integrating language models and similarity scores as soft constraints, surpassing prior text attacks such as TextFooler and BERT-Attack.

Implications and Future Projections

The introduction of GBDA marks a substantial advance in generating adversarial examples for text models, with the potential to improve model robustness through adversarial training. By enabling efficient transfer attacks that require only hard-label access to the target model, the approach can also reshape adversarial methodologies across a range of transformer architectures.

The paper's insights suggest avenues for future research, particularly in extending adversarial distribution frameworks to encompass additional token operations beyond replacements, such as insertions and deletions, thereby increasing the naturalness and diversity of adversarial examples. Moreover, addressing the over-parameterization of the distribution matrix in longer sentences could refine the process further, optimizing computational efficiency and output quality.

In conclusion, the GBDA framework represents a significant contribution to the domain of adversarial learning, providing a foundation for advanced adversarial techniques in NLP and potentially influencing future developments in AI model security and robustness strategies.