Papers
Topics
Authors
Recent
Search
2000 character limit reached

Are Machines Better at Complex Reasoning? Unveiling Human-Machine Inference Gaps in Entailment Verification

Published 6 Feb 2024 in cs.CL and cs.AI | (2402.03686v3)

Abstract: Making inferences in text comprehension to understand the meaning is essential in language processing. This work studies the entailment verification (EV) problem of multi-sentence premises that requires a system to make multiple inferences implicitly. Studying EV for such complex premises is important because modern NLP problems, such as detecting inconsistent model-generated rationales, require complex multi-hop reasoning. However, current textual inference datasets mostly contain short premises that only partially focus on these challenges. To address this, we compile an EV benchmark that includes datasets from three NLP domains (NLI, contextual QA, and rationales) containing multi-sentence premises. On benchmarking humans and LLMs, we find that LLMs are better than humans in multi-hop reasoning across extended contexts, while humans perform better in simple deductive reasoning tasks. We also finetune a Flan-T5 model for EV using two training objectives to obtain a strong open-source model that outperforms GPT-3.5 and rivals GPT-4. Finally, we use this model to filter out inconsistent model-generated rationales in self-consistency decoding, resulting in a 6% accuracy improvement on average across three MCQ datasets.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (59)
  1. Explanations for CommonsenseQA: New Dataset and Models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3050–3065, Online. Association for Computational Linguistics.
  2. A review on fact extraction and verification. ACM Comput. Surv., 55(1).
  3. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.
  4. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33.
  5. Neural substrates of cognitive capacity limitations. Proceedings of the National Academy of Sciences, 108(27):11252–11255.
  6. Can NLI models verify QA systems’ predictions? In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3841–3854, Punta Cana, Dominican Republic. Association for Computational Linguistics.
  7. Scaling instruction-finetuned language models.
  8. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936, Minneapolis, Minnesota. Association for Computational Linguistics.
  9. Think you have solved question answering? try arc, the ai2 reasoning challenge.
  10. Transformers as soft reasoners over language. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, pages 3882–3890. International Joint Conferences on Artificial Intelligence Organization. Main track.
  11. Nelson Cowan. 2001. The magical number 4 in short-term memory: A reconsideration of mental storage capacity. Behavioral and Brain Sciences, 24(1):87–114.
  12. Kevin Crowston. 2012. Amazon mechanical turk: A research tool for organizations and information systems scholars. In Shaping the Future of ICT Research. Methods and Approaches, pages 210–221, Berlin, Heidelberg. Springer Berlin Heidelberg.
  13. The pascal recognising textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Tectual Entailment, pages 177–190, Berlin, Heidelberg. Springer Berlin Heidelberg.
  14. Explaining answers with entailment trees. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7358–7370, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  15. Transforming question answering datasets into natural language inference datasets.
  16. Joseph L Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological bulletin, 76(5):378.
  17. Alan Garnham. 1989. Inference in language understanding: What, when, why and how. In RAINER DIETRICH and CARL F. GRAUMANN, editors, Language Processing in Social Context, volume 54 of North-Holland Linguistic Series: Linguistic Variations, pages 153–172. Elsevier.
  18. Neural natural language inference models partially embed theories of lexical entailment and negation. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 163–173, Online. Association for Computational Linguistics.
  19. Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies. Transactions of the Association for Computational Linguistics (TACL).
  20. Language models hallucinate, but may excel at fact verification.
  21. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 107–112, New Orleans, Louisiana. Association for Computational Linguistics.
  22. Jie Huang and Kevin Chen-Chuan Chang. 2023. Towards reasoning in large language models: A survey. In Findings of the Association for Computational Linguistics: ACL 2023, pages 1049–1065, Toronto, Canada. Association for Computational Linguistics.
  23. Large language models cannot self-correct reasoning yet.
  24. Cosmos QA: Machine reading comprehension with contextual commonsense reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2391–2401, Hong Kong, China. Association for Computational Linguistics.
  25. Maieutic prompting: Logically consistent reasoning with recursive explanations. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1266–1279, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  26. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark. Association for Computational Linguistics.
  27. Deductive verification of chain-of-thought reasoning.
  28. WANLI: Worker and AI collaboration for natural language inference dataset creation. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 6826–6847, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  29. RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  30. Faithful chain-of-thought reasoning.
  31. Self-refine: Iterative refinement with self-feedback.
  32. Christopher D. Manning and Bill MacCartney. 2009. Natural language inference. Stanford University.
  33. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, Florence, Italy. Association for Computational Linguistics.
  34. Selfcheck: Using llms to zero-shot check their own step-by-step reasoning.
  35. Deep learning–based text classification: A comprehensive review. ACM Comput. Surv., 54(3).
  36. Reading comprehension as natural language inference:a semantic analysis. In Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics, pages 12–19, Barcelona, Spain (Online). Association for Computational Linguistics.
  37. Enhancing self-consistency and performance of pre-trained language models through natural language inference. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1754–1768, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  38. A negation detection assessment of gpts: analysis with the xnot360 dataset.
  39. Combining fact extraction and verification with neural semantic matching networks. In Association for the Advancement of Artificial Intelligence (AAAI).
  40. Adversarial nli: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.
  41. OpenAI. 2022. Introducing chatgpt. https://openai.com/blog/chatgpt.
  42. OpenAI. 2023. Gpt-4 technical report.
  43. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
  44. RobustLR: A diagnostic benchmark for evaluating logical robustness of deductive reasoners. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9614–9631, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  45. SocialIQA: Commonsense reasoning about social interactions. In EMNLP.
  46. DREAM: A challenge data set and models for dialogue-based reading comprehension. Transactions of the Association for Computational Linguistics, 7:217–231.
  47. Entailer: Answering questions with faithful and truthful chains of reasoning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2078–2093, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  48. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, Minneapolis, Minnesota. Association for Computational Linguistics.
  49. Ul2: Unifying language learning paradigms.
  50. Lamda: Language models for dialog applications.
  51. FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819, New Orleans, Louisiana. Association for Computational Linguistics.
  52. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In the Proceedings of ICLR.
  53. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations.
  54. Chain of thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems.
  55. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics.
  56. Tree of thoughts: Deliberate problem solving with large language models.
  57. The unreliability of explanations in few-shot prompting for textual reasoning. In Advances in Neural Information Processing Systems.
  58. Verify-and-edit: A knowledge-enhanced chain-of-thought framework. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5823–5840, Toronto, Canada. Association for Computational Linguistics.
  59. Least-to-most prompting enables complex reasoning in large language models. In The Eleventh International Conference on Learning Representations.
Citations (7)

Summary

No one has generated a summary of this paper yet.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 1 tweet with 1 like about this paper.