EvoGrad: A Dynamic Take on the Winograd Schema Challenge with Human Adversaries (2402.13372v2)
Abstract: While LLMs excel at the Winograd Schema Challenge (WSC), a coreference resolution task that tests common-sense reasoning through pronoun disambiguation, they struggle with instances featuring minor alterations or rewording. To address this, we introduce EvoGrad, an open-source platform that harnesses a human-in-the-loop approach to create a dynamic dataset tailored to such altered WSC instances. Leveraging ChatGPT's capabilities, we expand our task instances from 182 to 3,691, setting a new benchmark for diverse common-sense reasoning datasets. Additionally, we introduce the error depth metric, which assesses model stability on dynamic tasks. Our results emphasize the challenge posed by EvoGrad: even the best-performing LLM, GPT-3.5, achieves an accuracy of 65.0% with an average error depth of 7.2, a stark contrast to human performance of 92.8% accuracy with no perturbation errors. This highlights ongoing model limitations and the value of dynamic datasets in uncovering them.
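To make the error-depth idea concrete, here is a minimal sketch of one plausible reading of the metric, assuming it counts how many successive perturbations of an instance a model answers correctly before its first mistake. The function names, the chain data structure, and the toy predictor below are illustrative assumptions, not the EvoGrad implementation.

```python
def error_depth(chain, predict):
    """Return the 1-based position of the first wrong answer along a
    perturbation chain, or len(chain) + 1 if the model never errs.
    `chain` is a list of (sentence, gold_answer) pairs, ordered from the
    original instance through increasingly perturbed variants."""
    for depth, (sentence, gold) in enumerate(chain, start=1):
        if predict(sentence) != gold:
            return depth
    return len(chain) + 1


def mean_error_depth(chains, predict):
    """Average error depth over a collection of perturbation chains."""
    return sum(error_depth(c, predict) for c in chains) / len(chains)


# Toy example: a brittle "model" that keys off a single surface word,
# so the very first lexical perturbation fools it.
toy_predict = lambda s: "trophy" if "small" in s else "suitcase"
chain = [
    ("The trophy doesn't fit in the suitcase because it is too small.", "suitcase"),
    ("The trophy doesn't fit in the bag because it is too small.", "suitcase"),
]
print(error_depth(chain, toy_predict))  # fails on the first item -> 1
```

Under this reading, a higher average error depth means a model withstands more rewordings before breaking, which matches the abstract's use of the metric as a stability measure.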