A Constraint-Enforcing Reward for Adversarial Attacks on Text Classifiers (2405.11904v1)
Abstract: Text classifiers are vulnerable to adversarial examples -- correctly-classified examples that are deliberately transformed to be misclassified while satisfying acceptability constraints. The conventional approach to finding adversarial examples is to define and solve a combinatorial optimisation problem over a space of allowable transformations. While effective, this approach is slow and limited by the choice of transformations. An alternative approach is to generate adversarial examples directly by fine-tuning a pre-trained LLM, as is commonly done for other text-to-text tasks. This approach promises to be much quicker and more expressive, but is relatively unexplored. For this reason, in this work we train an encoder-decoder paraphrase model to generate a diverse range of adversarial examples. For training, we adopt a reinforcement learning algorithm and propose a constraint-enforcing reward that promotes the generation of valid adversarial examples. Experimental results over two text classification datasets show that our model achieves a higher success rate than the original paraphrase model and is, overall, more effective than other competitive attacks. Finally, we show how key design choices impact the generated examples and discuss the strengths and weaknesses of the proposed approach.
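The abstract does not state the form of the reward, so the following is only a minimal sketch of what a constraint-enforcing reward could look like. It assumes the reward is the drop in the victim classifier's confidence on the true label, granted only when every acceptability constraint holds. The `Candidate` fields, the `min_similarity` threshold, and the `penalty` value are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One generated paraphrase with the scores needed to compute a reward."""
    orig_true_prob: float  # victim's probability of the true label on the original text
    para_true_prob: float  # victim's probability of the true label on the paraphrase
    similarity: float      # hypothetical semantic-similarity score in [0, 1]
    acceptable: bool       # hypothetical linguistic-acceptability check result


def constraint_enforcing_reward(c: Candidate,
                                min_similarity: float = 0.8,
                                penalty: float = 0.0) -> float:
    """Return the confidence drop if all constraints hold, else a fixed penalty.

    Assumption: constraints act as a hard gate, so a paraphrase that fools the
    classifier but violates a constraint earns nothing.
    """
    if not c.acceptable or c.similarity < min_similarity:
        return penalty
    # Clip at zero so paraphrases that raise the victim's confidence get no reward.
    return max(0.0, c.orig_true_prob - c.para_true_prob)


if __name__ == "__main__":
    valid = Candidate(orig_true_prob=0.9, para_true_prob=0.4, similarity=0.85, acceptable=True)
    invalid = Candidate(orig_true_prob=0.9, para_true_prob=0.1, similarity=0.5, acceptable=True)
    print(constraint_enforcing_reward(valid))    # positive: constraints satisfied
    print(constraint_enforcing_reward(invalid))  # penalty: similarity below threshold
```

In a policy-gradient setup, a scalar like this would weight the log-probability of each sampled paraphrase. Gating on the constraints, rather than summing them into the score, is one way to stop the policy from trading validity for attack success.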