Adversarial Examples in Character-Level Neural Machine Translation
The paper "On Adversarial Examples for Character-Level Neural Machine Translation" by Ebrahimi et al. presents an intriguing exploration into the vulnerabilities of neural machine translation (NMT) systems. Specifically, it investigates adversarial examples generated for character-level NMT and contrasts black-box adversaries with a novel white-box adversary that uses differentiable string-edit operations. The paper proposes two innovative attack types aimed at altering specific words in translations, indicating potential vulnerabilities of NMT systems that are more severe than previously recognized. Importantly, adversarial training is shown to significantly enhance model robustness, with training taking only three times longer than non-adversarial approaches.
Key Contributions and Observations
- White-Box vs. Black-Box Adversaries:
- White-box adversaries, which have access to model parameters and use gradients to rank adversarial manipulations, have a significantly stronger impact on NMT systems than black-box methods, which rely on heuristic manipulations without access to the model.
- The paper demonstrates that white-box attacks can expose serious system vulnerabilities that black-box techniques fail to reveal.
- Novel Character-Level Attacks:
- Two attack types are proposed: controlled attacks, which aim to suppress specific words in the translation, and targeted attacks, which aim to push specific words into the translation while keeping the rest of the text fluent. These methods are evaluated with metrics that capture these specific adversarial goals rather than merely a decrease in BLEU score.
- Efficiency and Impact:
- The white-box attacks use a gradient-based approach to efficiently estimate and apply text manipulations (see the first sketch after this list). The experiments demonstrate the advantage of white-box over black-box attacks, especially in the targeted and controlled scenarios, revealing fine-grained model vulnerabilities.
- An analysis of the correlation between the estimated and the actual adversarial impact underscores the utility of gradient-based methods for both model evaluation and the design of defenses.
- Adversarial Training Benefits:
- By incorporating adversarial training on the generated white-box examples, the paper shows significant improvements in model robustness, even when the adversary applies multiple perturbations to the input (see the second sketch after this list).
- This work extends HotFlip with a broader set of character-level manipulations (insertions and deletions in addition to flips) and demonstrates practical gains in robustness at a measurable but acceptable computational cost.
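To make the gradient-based ranking concrete, the sketch below shows a first-order, HotFlip-style estimate of how much each single character substitution would increase the model's loss, computed from the loss gradient with respect to the one-hot character inputs. This is a minimal illustration under assumptions, not the paper's code: the gradient matrix is taken as given, only substitutions are scored (the paper additionally handles insertions and deletions via differentiable string-edit operations and searches over multiple edits), and the helper name `best_flip` is invented for this example.

```python
# Minimal sketch of HotFlip-style first-order scoring of character substitutions.
# Assumes `grad` is the gradient of the model's loss with respect to the one-hot
# character inputs (shape: seq_len x vocab_size) and `char_ids` holds the current
# character indices of the source sentence. Both would come from a real NMT model;
# here they are illustrative placeholders.

import numpy as np

def best_flip(char_ids: np.ndarray, grad: np.ndarray):
    """Return (position, new_char_id, estimated_loss_increase) for the single
    character substitution whose first-order estimate increases the loss most."""
    seq_len, vocab_size = grad.shape
    # Gradient component of the character currently at each position.
    current = grad[np.arange(seq_len), char_ids]           # (seq_len,)
    # First-order estimate of the loss change for flipping position i to char b:
    #   grad[i, b] - grad[i, char_ids[i]]
    delta = grad - current[:, None]                        # (seq_len, vocab_size)
    # Do not "flip" a character to itself.
    delta[np.arange(seq_len), char_ids] = -np.inf
    pos, new_char = np.unravel_index(np.argmax(delta), delta.shape)
    return int(pos), int(new_char), float(delta[pos, new_char])

# Toy usage with random numbers standing in for real model gradients.
rng = np.random.default_rng(0)
char_ids = rng.integers(0, 30, size=12)
grad = rng.normal(size=(12, 30))
print(best_flip(char_ids, grad))
```

The key point is that a single forward and backward pass yields scores for all candidate flips at once, which is what makes the white-box search efficient compared with black-box trial-and-error.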
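Adversarial training then amounts to augmenting training with adversarially edited copies of the source sentences. The second sketch below, reusing `best_flip` from above, shows one plausible training step; the callables `loss_fn`, `input_gradient_fn`, and `update_fn` are hypothetical stand-ins for a real NMT training loop, and the exact mixing of clean and adversarial examples may differ from the paper's setup.

```python
import numpy as np

def adversarial_training_step(loss_fn, input_gradient_fn, update_fn,
                              src_char_ids, tgt, flips_per_example=3):
    """One update that mixes a clean example with an adversarially perturbed copy.

    loss_fn(src_char_ids, tgt)           -> scalar training loss (hypothetical)
    input_gradient_fn(src_char_ids, tgt) -> gradient of the loss w.r.t. the one-hot
                                            source characters, shape (seq_len, vocab_size)
                                            (hypothetical)
    update_fn(total_loss)                -> applies one optimizer step (hypothetical)
    """
    adv = np.array(src_char_ids, copy=True)
    for _ in range(flips_per_example):
        grad = input_gradient_fn(adv, tgt)       # recompute the gradient after each edit
        pos, new_char, _ = best_flip(adv, grad)  # greedy HotFlip-style substitution
        adv[pos] = new_char

    # Train on both the clean and the perturbed source so translation quality on
    # clean text is preserved while robustness to character-level noise improves.
    total_loss = loss_fn(src_char_ids, tgt) + loss_fn(adv, tgt)
    update_fn(total_loss)
    return adv, total_loss
```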
Implications and Future Directions
This paper highlights the critical need for understanding and defending against adversaries targeting NMT systems, especially as these models are increasingly deployed in real-world applications with potentially substantial societal impacts. The exploration of adversarial training provides meaningful insights into enhancing model robustness, contributing to more reliable NMT systems.
Future directions may include extending white-box adversarial techniques to multi-word and context-sensitive attacks, and refining evaluation metrics to cover broader translation contexts and languages. Integrating adversarial training into larger translation pipelines and exploring more complex perturbations of text processing systems are also promising avenues for further inquiry.
In conclusion, Ebrahimi et al. provide valuable methodologies and insights into adversarial testing and training strategies for NMT, offering a pathway to more secure and reliable translation models. This work stands as a noteworthy contribution to the ongoing efforts to fortify natural language processing systems against adversarial vulnerabilities.