Evaluation of Adversarial Perturbations for Seq2Seq Models
This paper analyzes adversarial perturbations in sequence-to-sequence (seq2seq) models, particularly those used in machine translation (MT). Its primary focus is on ensuring that adversarial perturbations preserve the meaning of the input while still achieving their goal of significantly altering the model's output.
Adversarial Perturbations and Model Robustness
Adversarial perturbations are small modifications to a model's input designed to provoke incorrect outputs, and they are commonly used to probe a model's robustness to atypical inputs. While initially explored in continuous domains such as computer vision, their application in discrete domains such as NLP introduces unique challenges: because text is discrete, even minimal perturbations can be perceptible and semantically significant, which complicates their evaluation. The paper highlights the inadequacy of existing perturbation evaluations, which overlook whether the semantic meaning of the input is preserved.
A New Framework for Evaluation
The authors propose a framework that explicitly evaluates the semantic equivalence between original and perturbed inputs in NLP tasks. Under this framework, a perturbation counts as a genuine adversarial example only if it preserves the meaning of the source sentence while substantially degrading the meaning of the seq2seq model's output on the target side. This distinction is essential: a robust model should produce similar outputs for semantically equivalent inputs, so an attack that also changes the input's meaning reveals little about robustness. The paper provides methods to evaluate the effectiveness of adversarial attacks by measuring semantic similarity through both human judgment and automatic metrics.
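One way to make this criterion concrete is to require that the relative drop in target-side similarity outweigh the meaning lost on the source side. The sketch below is an illustrative operationalization, not necessarily the paper's exact formulation; the function name is hypothetical and similarities are assumed to lie in [0, 1].

```python
def attack_success(s_src, s_tgt_orig, s_tgt_adv):
    """Decide whether a perturbation counts as a successful adversarial attack.

    s_src      -- similarity between original and perturbed source (assumed in [0, 1])
    s_tgt_orig -- similarity between the reference and the output for the original source
    s_tgt_adv  -- similarity between the reference and the output for the perturbed source
    """
    if s_tgt_orig <= 0:
        return False  # the model already fails on the clean input; nothing left to destroy
    d_tgt = (s_tgt_orig - s_tgt_adv) / s_tgt_orig  # relative drop in output quality
    # Success: the damage done to the output outweighs the meaning lost on the source side.
    return s_src + d_tgt > 1.0
```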
Automatic Metrics for Semantic Evaluation
To verify the semantic integrity of perturbed sentences, the paper compares several automatic metrics, including BLEU, METEOR, and chrF, against human judgments of similarity. chrF correlates with human judgment significantly more strongly than the other metrics considered, making it the preferable choice for evaluating semantic similarity under adversarial perturbation.
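As an illustration, chrF between an original and a perturbed source sentence can be computed with the sacrebleu package. The example sentences and the use of sentence-level scoring here are illustrative assumptions, not the paper's exact evaluation setup.

```python
# pip install sacrebleu
from sacrebleu.metrics import CHRF

chrf = CHRF()  # default character n-gram settings

original = "The quick brown fox jumps over the lazy dog ."
perturbed = "The quikc brown fox jumps over the lazy dog ."

# Score the perturbed sentence against the original source:
# a high chrF suggests the perturbation is likely meaning-preserving.
score = chrf.sentence_score(perturbed, [original])
print(score.score)  # chrF on a 0-100 scale
```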
Improving Meaning Preservation in Adversarial Inputs
The paper proposes several constraints to improve meaning preservation during perturbation. Two replacement strategies, nearest-neighbor word substitution (kNN) and character swaps within a word (CharSwap), are shown to maintain the source meaning while still producing impactful adversarial attacks; a sketch of the CharSwap idea follows below. In the experiments, CharSwap most consistently yields successful adversarial inputs, illustrating the trade-off between preserving the input's meaning and deliberately altering the model's output.
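To make the CharSwap idea concrete, the sketch below perturbs one word by swapping two adjacent internal characters, which typically pushes the word out of the model's vocabulary while keeping it readable to humans. The function names and the random choice of word are illustrative assumptions; the actual attacks in the paper choose which positions to perturb so as to maximize damage to the model's output.

```python
import random

def char_swap(word, rng=random):
    """Swap two adjacent internal characters, e.g. 'perturb' -> 'petrurb'.

    Words shorter than four characters are returned unchanged, since they
    have no pair of adjacent internal characters to swap.
    """
    if len(word) < 4:
        return word
    i = rng.randrange(1, len(word) - 2)  # never touch the first or last character
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def char_swap_attack(sentence, rng=random):
    """Perturb one randomly chosen word of a whitespace-tokenized sentence."""
    words = sentence.split()
    idx = rng.randrange(len(words))
    words[idx] = char_swap(words[idx], rng)
    return " ".join(words)

print(char_swap_attack("adversarial perturbations alter the translation"))
```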
Implications and Future Directions
The findings suggest that adversarial training with meaning-preserving perturbations such as CharSwap can improve model robustness without sacrificing performance on non-adversarial inputs. This has practical implications for deploying seq2seq models in real-world settings where robustness to adversarial manipulation is vital, such as automated translation services.
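As a rough illustration of how such adversarial training could be set up (a simple data-augmentation sketch under assumptions of my own; the paper's exact training recipe may differ), each training pair can be duplicated with a perturbed source while keeping the original reference target, since a meaning-preserving perturbation should still map to the same translation:

```python
import random

def augment_with_perturbations(parallel_data, perturb_fn, ratio=1.0, rng=random):
    """Build an adversarially augmented training set for a seq2seq model.

    parallel_data -- iterable of (source, target) sentence pairs
    perturb_fn    -- a meaning-preserving perturbation, e.g. a CharSwap-style function
    ratio         -- fraction of pairs that also receive a perturbed copy
    """
    augmented = []
    for src, tgt in parallel_data:
        augmented.append((src, tgt))
        if rng.random() < ratio:
            # The target stays unchanged: a meaning-preserving perturbation of the
            # source should still correspond to the same reference translation.
            augmented.append((perturb_fn(src), tgt))
    return augmented
```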
Theoretically, this research opens avenues for comprehensive adversarial evaluation protocols that treat semantic preservation as a consistent criterion. Exploring stronger constraints and better automatic similarity measures could further refine adversarial attack strategies, making them more reliable tools for probing model robustness in NLP.
In summary, this paper contributes substantially to the evaluation of adversarial perturbations in seq2seq models, offering a structured, semantically aware framework that promises to improve the robustness of NLP systems against adversarial threats while maintaining translation quality. Future work will likely refine these strategies and broaden their applicability across diverse NLP tasks.