
A Constraint-Enforcing Reward for Adversarial Attacks on Text Classifiers (2405.11904v1)

Published 20 May 2024 in cs.CL

Abstract: Text classifiers are vulnerable to adversarial examples -- correctly classified examples that are deliberately transformed to be misclassified while satisfying acceptability constraints. The conventional approach to finding adversarial examples is to define and solve a combinatorial optimisation problem over a space of allowable transformations. While effective, this approach is slow and limited by the choice of transformations. An alternative approach is to directly generate adversarial examples by fine-tuning a pre-trained LLM, as is commonly done for other text-to-text tasks. This approach promises to be much quicker and more expressive, but remains relatively unexplored. For this reason, in this work we train an encoder-decoder paraphrase model to generate a diverse range of adversarial examples. For training, we adopt a reinforcement learning algorithm and propose a constraint-enforcing reward that promotes the generation of valid adversarial examples. Experimental results on two text classification datasets show that our model achieves a higher success rate than the original paraphrase model, and overall proves more effective than other competitive attacks. Finally, we show how key design choices impact the generated examples and discuss the strengths and weaknesses of the proposed approach.
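The constraint-enforcing reward described in the abstract can be sketched as follows. The abstract does not give the exact formula, so the function name, the similarity threshold, and the shaping terms below are illustrative assumptions rather than the authors' implementation; the idea is simply that a sampled paraphrase earns reward only when it satisfies the acceptability constraint, and more reward when it degrades the victim classifier.

```python
def constraint_enforcing_reward(
    label_flipped: bool,
    victim_confidence_drop: float,
    semantic_similarity: float,
    sim_threshold: float = 0.8,  # assumed acceptability constraint
) -> float:
    """Reward an RL-fine-tuned paraphraser only for *valid* adversarial examples.

    A candidate paraphrase earns reward only if it satisfies the
    acceptability constraint (here: semantic similarity to the original
    text above a threshold). Within the constraint, the reward grows with
    how much the victim classifier's confidence in the true label drops,
    with a bonus when the predicted label actually flips.
    """
    if semantic_similarity < sim_threshold:
        return 0.0  # constraint violated: no reward, even if the label flipped
    reward = max(0.0, victim_confidence_drop)
    if label_flipped:
        reward += 1.0  # bonus for a successful (misclassifying) attack
    return reward
```

In a REINFORCE-style training loop, this scalar would weight the log-likelihood of each sampled paraphrase, so the generator is only reinforced for attacks that remain within the constraint set.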

