Rule or Story, Which is a Better Commonsense Expression for Talking with Large Language Models? (2402.14355v2)

Published 22 Feb 2024 in cs.CL

Abstract: Building machines with commonsense has been a longstanding challenge in NLP due to the reporting bias of commonsense rules and the exposure bias of rule-based commonsense reasoning. In contrast, humans convey and pass down commonsense implicitly through stories. This paper investigates the inherent commonsense ability of LLMs expressed through storytelling. We systematically investigate and compare stories and rules as expressions for retrieving and leveraging commonsense in LLMs. Experimental results on 28 commonsense QA datasets show that stories outperform rules as the expression for retrieving commonsense from LLMs, exhibiting higher generation confidence and commonsense accuracy. Moreover, stories are the more effective commonsense expression for answering questions about daily events, while rules are more effective for scientific questions; this aligns with the reporting bias of commonsense in text corpora. We further show that the correctness and relevance of commonsense stories can be improved via iterative self-supervised fine-tuning. These findings emphasize the importance of using appropriate language to express, retrieve, and leverage commonsense in LLMs, highlighting a promising direction for better exploiting their commonsense abilities.
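To make the comparison concrete, below is a minimal sketch (not the authors' released code) of how one might contrast a "rule" versus a "story" expression of the same piece of commonsense as context for a multiple-choice QA prompt, scoring each answer option by its average token log-probability as a rough proxy for the generation confidence the paper reports. The model choice, prompt templates, example question, and helper function are illustrative assumptions.

```python
# Minimal sketch (assumed setup, not the authors' code): compare a "rule" vs.
# a "story" expression of the same commonsense as QA context, scoring each
# answer option by its average token log-probability under a causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # small stand-in; the paper evaluates much larger LLMs
tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def option_logprob(context: str, option: str) -> float:
    """Average log-probability of `option` tokens conditioned on `context`."""
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(context + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(full_ids).logits
    # Positions ctx_len-1 .. end-1 predict the option's tokens (next-token shift).
    log_probs = torch.log_softmax(logits[0, ctx_len - 1 : -1], dim=-1)
    opt_ids = full_ids[0, ctx_len:]
    return log_probs[torch.arange(opt_ids.shape[0]), opt_ids].mean().item()

question = "Where would you put a mug after drinking coffee at the office?"
# Leading spaces keep the BPE tokenization of each option aligned with the prompt.
options = [" dishwasher", " closet", " garden"]

# Two expressions of the same commonsense, as contrasted in the paper.
contexts = {
    "rule": "Commonsense rule: used dishes are put in a dishwasher to be cleaned.\n",
    "story": ("Story: After finishing her coffee, Mia carried her empty mug to "
              "the office kitchen and slotted it into the dishwasher.\n"),
}

for name, ctx in contexts.items():
    prompt = ctx + f"Question: {question}\nAnswer:"
    scores = {o.strip(): option_logprob(prompt, o) for o in options}
    print(name, "->", max(scores, key=scores.get), scores)
```

Because only the context differs between the two variants, any gap in option scores isolates the effect of the commonsense expression itself, mirroring the controlled rule-versus-story comparison described in the abstract.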

Authors (6)
  1. Ning Bian
  2. Xianpei Han
  3. Hongyu Lin
  4. Yaojie Lu
  5. Ben He
  6. Le Sun
Citations (1)