No Strong Feelings One Way or Another: Re-operationalizing Neutrality in Natural Language Inference (2306.09918v1)

Published 16 Jun 2023 in cs.CL and cs.AI

Abstract: Natural Language Inference (NLI) has been a cornerstone task in evaluating language models' inferential reasoning capabilities. However, the standard three-way classification scheme used in NLI has well-known shortcomings in evaluating models' ability to capture the nuances of natural human reasoning. In this paper, we argue that the operationalization of the neutral label in current NLI datasets has low validity, is interpreted inconsistently, and that at least one important sense of neutrality is often ignored. We uncover the detrimental impact of these shortcomings, which in some cases leads to annotated datasets that actually decrease performance on downstream tasks. We compare approaches to handling annotator disagreement and identify flaws in a recent NLI dataset whose annotator study is built on a problematic operationalization. Our findings highlight the need for a more refined evaluation framework for NLI, and we hope to spark further discussion and action in the NLP community.
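To make the setup concrete, the sketch below (illustrative only, not code from the paper) shows the standard three-way NLI labeling scheme and two common ways of aggregating annotator judgments: collapsing to a single majority-vote gold label versus keeping the full label distribution, which preserves the disagreement the paper argues is informative. The annotation values are hypothetical.

```python
# Illustrative sketch (not from the paper): the three-way NLI scheme and
# two ways of aggregating annotator judgments for one premise/hypothesis pair.
from collections import Counter

LABELS = ("entailment", "neutral", "contradiction")

# Hypothetical annotator votes for a single premise/hypothesis pair.
annotations = ["neutral", "entailment", "neutral", "contradiction", "neutral"]

def majority_label(votes):
    """Collapse votes to a single gold label (the common practice)."""
    return Counter(votes).most_common(1)[0][0]

def label_distribution(votes):
    """Keep the full distribution over labels, preserving disagreement."""
    counts = Counter(votes)
    return {label: counts[label] / len(votes) for label in LABELS}

print(majority_label(annotations))        # -> neutral
print(label_distribution(annotations))    # -> {'entailment': 0.2, 'neutral': 0.6, 'contradiction': 0.2}
```

Under majority voting the disagreement between the two minority annotators is discarded entirely; distribution-based evaluation keeps it, which is the direction several of the disagreement-focused works discussed in the paper take.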

Authors (3)
  1. Animesh Nighojkar
  2. Antonio Laverghetta Jr.
  3. John Licato
Citations (3)