Break it, Imitate it, Fix it: Robustness by Generating Human-Like Attacks (2310.16955v2)

Published 25 Oct 2023 in cs.LG

Abstract: Real-world natural language processing systems need to be robust to human adversaries. Collecting examples of human adversaries for training is an effective but expensive solution. On the other hand, training on synthetic attacks with small perturbations - such as word-substitution - does not actually improve robustness to human adversaries. In this paper, we propose an adversarial training framework that uses limited human adversarial examples to generate more useful adversarial examples at scale. We demonstrate the advantages of this system on the ANLI and hate speech detection benchmark datasets - both collected via an iterative, adversarial human-and-model-in-the-loop procedure. Compared to training only on observed human attacks, also training on our synthetic adversarial examples improves model robustness to future rounds. In ANLI, we see accuracy gains on the current set of attacks (44.1%$\,\to\,$50.1%) and on two future unseen rounds of human generated attacks (32.5%$\,\to\,$43.4%, and 29.4%$\,\to\,$40.2%). In hate speech detection, we see AUC gains on current attacks (0.76 $\to$ 0.84) and a future round (0.77 $\to$ 0.79). Attacks from methods that do not learn the distribution of existing human adversaries, meanwhile, degrade robustness.
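
The pipeline the abstract describes (imitate the observed human attacks, use the imitation to break the current model at scale, then train on the successful breaks) can be sketched with toy stand-ins. This is a minimal illustration only: the paper's actual system fine-tunes a large language model as the attack generator, whereas the keyword-vote classifier, the word-swap "generator", and all function names below are assumptions made for the sketch.

```python
from collections import Counter

def train_classifier(examples):
    """Toy 'model': predicts by majority vote over keywords seen in training."""
    word_labels = {}
    for text, label in examples:
        for w in text.split():
            word_labels.setdefault(w, Counter())[label] += 1
    def predict(text):
        votes = Counter()
        for w in text.split():
            if w in word_labels:
                votes += word_labels[w]
        # Fall back to label 0 when no training word matches.
        return votes.most_common(1)[0][0] if votes else 0
    return predict

def imitate_attacks(human_attacks):
    """Toy 'generator': learns which word swaps humans applied and reapplies
    them to new inputs (stand-in for fine-tuning an LM on human attacks)."""
    swaps = [(a, b) for (orig, adv) in human_attacks
             for a, b in zip(orig.split(), adv.split()) if a != b]
    def generate(text):
        return [text.replace(a, b) for a, b in swaps if a in text.split()]
    return generate

def fix_it(train, human_attacks, rounds=1):
    """Break: keep generated candidates that flip the current model's label.
    Fix: add them, with the correct label, to training data and retrain."""
    generate = imitate_attacks([(o, a) for (o, a, _) in human_attacks])
    data = list(train)
    for _ in range(rounds):
        model = train_classifier(data)
        for orig, _, label in human_attacks:
            for cand in generate(orig):
                if model(cand) != label:        # candidate breaks the model
                    data.append((cand, label))  # so train on it
    return train_classifier(data)
```

The key filtering step mirrors the paper's motivation: only candidates drawn from the distribution of human-like edits, and which actually fool the current model, are added back into training, rather than arbitrary small perturbations.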

Authors (6)
  1. Aradhana Sinha
  2. Ananth Balashankar
  3. Ahmad Beirami
  4. Thi Avrahami
  5. Jilin Chen
  6. Alex Beutel
