- The paper introduces the WIKIWHY dataset featuring over 9,000 cause-effect QA pairs with detailed rationales from Wikipedia.
- It reveals that even advanced LLMs struggle to generate high-quality explanations, with human evaluators judging model outputs satisfactory less than 40% of the time.
- New evaluation metrics are proposed to measure commonsense reasoning in explanations, guiding future improvements in language models.
Introduction to the Challenge
LLMs are continually pushed to their limits as new benchmarks are designed to assess their capabilities. Particularly in the domain of question answering (QA), benchmarks have become an important tool for measuring models' abilities to reason with and about language. However, assessing a model's reasoning capability is a complex task, especially when the goal is to probe its grasp of commonsense knowledge and reasoning.
The Significance of 'Why' Questions
One of the critical components of reasoning is being able to explain why something is the case. 'Why' questions are unique in that they demand an explanation that is more than just a retrieval of facts—they require an understanding of cause and effect.
'Why' questions are often underrepresented in QA datasets, despite being essential to reasoning. Existing datasets tend to focus on questions that can be answered with concrete facts, such as "Who", "What", "When", and "Where". Such questions can usually be answered through surface-level pattern matching, without a deep understanding of, or reasoning about, how things work.
Introducing the WIKIWHY Dataset
The WIKIWHY dataset addresses this gap by focusing on generating explanations for cause-and-effect relations. Each entry in the dataset consists of a question-answer pair together with a rationale explaining why the cause leads to the effect. Built on facts sourced from Wikipedia across a diverse range of topics, the dataset comprises over 9,000 such question-answer-rationale triples.
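To make the dataset's format concrete, here is a minimal sketch of how a single entry might be represented in code. The field names and example values are illustrative assumptions, not the dataset's official schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class WikiWhyEntry:
    """One cause-effect QA pair with its explanatory rationale (illustrative schema)."""
    question: str         # a "why" question about an effect stated in a Wikipedia passage
    answer: str           # the cause of that effect
    rationale: List[str]  # explanation steps linking the cause to the effect
    passage: str          # the source Wikipedia text the pair was drawn from
    topic: str            # broad Wikipedia topic area

# Hypothetical example entry (values invented for illustration only)
example = WikiWhyEntry(
    question="Why did the ancient city decline after the river changed course?",
    answer="The river changed course away from the city.",
    rationale=[
        "The city relied on the river for irrigation and trade.",
        "When the river moved, farmland dried out and trade routes shifted.",
        "Food shortages and lost commerce caused the population to leave.",
    ],
    passage="(excerpt of the source Wikipedia passage)",
    topic="History",
)
```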
The dataset not only presents an opportunity to challenge LLMs but also to explore the implicit commonsense assumptions that may not be evident in text alone. Inferring and articulating the reasoning behind a cause-and-effect relationship is formulated as a novel task for probing LLMs' reasoning abilities.
Experiments and Key Findings
Experiments on the WIKIWHY dataset revealed that even sophisticated generative models like GPT-3 have difficulty producing high-quality explanations for cause-effect relations. Human evaluations further highlighted the scope for improvement, finding the models' explanations satisfactory less than 40% of the time.
Explanations generated by LLMs often simply restated the causal relationship rather than explaining the underlying mechanism. This observation points to a concrete area of focus for understanding and improving commonsense reasoning in LLMs. The researchers also introduced new evaluation metrics to better assess the quality of reasoning demonstrated by LLMs.
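The paper's exact metrics are not reproduced here, but the sketch below illustrates one plausible automatic measure in this spirit: matching generated explanation steps to reference rationale steps by sentence-embedding similarity. The embedding model name, the similarity threshold, and the F1-style aggregation are assumptions made for illustration.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative unordered step-matching score for explanations.
# This is an assumption-based approximation, not the paper's official metric.
model = SentenceTransformer("all-MiniLM-L6-v2")

def explanation_f1(predicted_steps, reference_steps, threshold=0.6):
    """Match predicted explanation steps to reference steps by cosine similarity."""
    pred_emb = model.encode(predicted_steps, convert_to_tensor=True)
    ref_emb = model.encode(reference_steps, convert_to_tensor=True)
    sim = util.cos_sim(pred_emb, ref_emb)  # [num_pred x num_ref] cosine matrix

    # A step counts as "covered" if some step on the other side exceeds the threshold.
    precision = (sim.max(dim=1).values >= threshold).float().mean().item()
    recall = (sim.max(dim=0).values >= threshold).float().mean().item()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Usage: compare a model-generated explanation against a reference rationale.
pred = ["The river moved away, so crops failed and trade collapsed."]
ref = [
    "The city relied on the river for irrigation and trade.",
    "When the river moved, farmland dried out and trade routes shifted.",
    "Food shortages and lost commerce caused the population to leave.",
]
print(f"unordered explanation F1: {explanation_f1(pred, ref):.2f}")
```

A score like this rewards explanations that cover the intermediate steps of the reference rationale, rather than ones that merely restate the cause-effect pair.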
Conclusion
The creation of the WIKIWHY dataset represents a significant step toward better understanding the reasoning abilities of LLMs. It also sheds light on the complexities involved in generating explanations that require an understanding of cause-and-effect relationships. As AI continues to evolve, resources like WIKIWHY provide valuable benchmarks for facilitating and tracking progress in AI reasoning capabilities.