BoardgameQA: A Dataset for Natural Language Reasoning with Contradictory Information (2306.07934v1)

Published 13 Jun 2023 in cs.CL, cs.AI, and cs.LG

Abstract: Automated reasoning with unstructured natural text is a key requirement for many potential applications of NLP and for developing robust AI systems. Recently, language models (LMs) have demonstrated complex reasoning capacities even without any finetuning. However, existing evaluation for automated reasoning assumes access to a consistent and coherent set of information over which models reason. When reasoning in the real world, the available information is frequently inconsistent or contradictory, and therefore models need to be equipped with a strategy to resolve such conflicts when they arise. One widely applicable way of resolving conflicts is to impose preferences over information sources (e.g., based on source credibility or information recency) and adopt the source with higher preference. In this paper, we formulate the problem of reasoning with contradictory information guided by preferences over sources as the classical problem of defeasible reasoning, and develop a dataset called BoardgameQA for measuring the reasoning capacity of LMs in this setting. BoardgameQA also incorporates reasoning with implicit background knowledge, to better reflect reasoning problems in downstream applications. We benchmark various LMs on BoardgameQA and the results reveal a significant gap in the reasoning capacity of state-of-the-art LMs on this problem, showing that reasoning with conflicting information does not surface out-of-the-box in LMs. While performance can be improved with finetuning, it nevertheless remains poor.
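To make the conflict-resolution strategy described in the abstract concrete, here is a minimal, hypothetical sketch (not the paper's implementation or data format) of resolving contradictory claims by keeping the verdict of the most-preferred source; the `Fact` structure, `resolve` helper, and preference scores are illustrative assumptions only.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    """One assertion about a claim, attributed to a source (illustrative only)."""
    claim: str        # proposition, e.g. "the penguin can fly"
    holds: bool       # truth value asserted by the source
    source: str       # where the information came from
    preference: int   # higher = more preferred (e.g. credibility or recency)


def resolve(facts):
    """Group facts by claim and keep the assertion from the most-preferred source."""
    resolved = {}
    for fact in facts:
        best = resolved.get(fact.claim)
        if best is None or fact.preference > best.preference:
            resolved[fact.claim] = fact
    return resolved


if __name__ == "__main__":
    facts = [
        Fact("the penguin can fly", True, "generic rule: birds fly", preference=1),
        Fact("the penguin can fly", False, "specific rule about penguins", preference=2),
    ]
    for claim, fact in resolve(facts).items():
        print(f"{claim}: {fact.holds} (per {fact.source})")
```

Under these assumptions, the more specific (higher-preference) rule defeats the generic one, which is the intuition behind formulating the task as defeasible reasoning with source preferences.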
