How the Advent of Ubiquitous Large Language Models Both Stymie and Turbocharge Dynamic Adversarial Question Generation (2401.11185v1)
Abstract: Dynamic adversarial question generation, where humans write examples to stump a model, aims to create examples that are realistic and informative. However, the advent of LLMs has been a double-edged sword for human authors: more people are interested in seeing and pushing the limits of these models, but because the models are so much stronger an opponent, they are harder to defeat. To understand how these models affect the adversarial question-writing process, we enrich the writing guidance with LLMs and retrieval models so that authors can reason about why their questions are not adversarial. While authors can create interesting, challenging adversarial questions, they sometimes resort to tricks that result in poor questions that are ambiguous, subjective, or confusing not just to a computer but also to humans. To address these issues, we propose new metrics and incentives for eliciting good, challenging questions, and we present a new dataset of adversarially authored questions.
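The abstract describes an authoring loop in which retrieval models and LLMs give writers evidence for why a drafted question failed to stump the model. Below is a minimal sketch of what such a feedback step could look like; it is an illustration under stated assumptions, not the authors' implementation, and `retrieve_passages` and `answer_question` are hypothetical callables standing in for whatever retriever and LLM backend the system actually uses.

```python
# Sketch of evidence-backed feedback for adversarial question authors.
# The injected callables are hypothetical stand-ins:
#   retrieve_passages(question) -> list[str]   (a retrieval model)
#   answer_question(question, passages) -> str (an LLM / QA model)
from dataclasses import dataclass


@dataclass
class Feedback:
    model_answer: str    # what the QA model guessed
    evidence: list[str]  # passages the retriever surfaced
    stumped: bool        # True if the model missed the gold answer


def critique_question(question: str, gold_answer: str,
                      retrieve_passages, answer_question) -> Feedback:
    """Return evidence the author can use to see *why* a drafted
    question did (or did not) fool the model."""
    passages = retrieve_passages(question)
    guess = answer_question(question, passages)
    return Feedback(
        model_answer=guess,
        evidence=passages,
        stumped=guess.strip().lower() != gold_answer.strip().lower(),
    )
```

If the model answers correctly, the surfaced passages show the author which evidence gave the answer away, so the question can be revised rather than degraded into ambiguity or trickery.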
Authors: Yoo Yeon Sung, Ishani Mondal, Jordan Boyd-Graber