WikiWhy: Answering and Explaining Cause-and-Effect Questions (2210.12152v2)

Published 21 Oct 2022 in cs.CL and cs.AI

Abstract: As LLMs grow larger and more sophisticated, assessing their "reasoning" capabilities in natural language grows more challenging. Recent question answering (QA) benchmarks that attempt to assess reasoning are often limited by a narrow scope of covered situations and subject matters. We introduce WikiWhy, a QA dataset built around a novel auxiliary task: explaining why an answer is true in natural language. WikiWhy contains over 9,000 "why" question-answer-rationale triples, grounded on Wikipedia facts across a diverse set of topics. Each rationale is a set of supporting statements connecting the question to the answer. WikiWhy serves as a benchmark for the reasoning capabilities of LLMs because it demands rigorous explicit rationales for each answer to demonstrate the acquisition of implicit commonsense knowledge, which is unlikely to be easily memorized. GPT-3 baselines achieve only 38.7% human-evaluated correctness in the end-to-end answer & explain condition, leaving significant room for future improvements.

Authors (7)
  1. Matthew Ho (28 papers)
  2. Aditya Sharma (32 papers)
  3. Justin Chang (3 papers)
  4. Michael Saxon (27 papers)
  5. Sharon Levy (22 papers)
  6. Yujie Lu (42 papers)
  7. William Yang Wang (254 papers)
Citations (16)

Summary

  • The paper introduces the WIKIWHY dataset featuring over 9,000 cause-effect QA pairs with detailed rationales from Wikipedia.
  • It reveals that advanced LLMs struggle to generate high-quality explanations, with human evaluations judging them correct less than 40% of the time.
  • New evaluation metrics are proposed to measure commonsense reasoning in explanations, guiding future improvements in language models.

Introduction to the Challenge

LLMs are continually pushed to their limits as new benchmarks are designed to assess their capabilities. Particularly in the domain of question answering (QA), benchmarks have become an important tool to measure models' abilities to reason with and about language. However, assessing the reasoning capability of models is an enormous and complex task, especially when attempting to understand their grasp of commonsense knowledge and reasoning.

The Significance of 'Why' Questions

One of the critical components of reasoning is being able to explain why something is the case. 'Why' questions are unique in that they demand an explanation that is more than just a retrieval of facts—they require an understanding of cause and effect.

'Why' questions are underrepresented in QA datasets despite being essential to reasoning. Existing datasets tend to focus on questions that can be answered with concrete facts, such as "Who", "What", "When", and "Where". Such questions can often be answered through surface-level pattern matching that does not require a deep understanding of, or reasoning about, how things work.

Introducing the WIKIWHY Dataset

The WIKIWHY dataset addresses this gap by focusing on generating explanations for cause-and-effect relations. Each entry consists of a 'why' question-answer pair along with a rationale: a set of supporting statements explaining how the cause leads to the effect. Built on facts sourced from Wikipedia across diverse topics, the dataset comprises over 9,000 such question-answer-rationale triples.
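
As a rough illustration of this data format, a single entry can be pictured as a small record like the one below. The field names and the example content are assumptions made for illustration only, not the dataset's actual schema or a real entry:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class WikiWhyEntry:
    """Hypothetical sketch of one WikiWhy triple; field names are illustrative,
    not necessarily the dataset's published schema."""
    question: str          # a "why" question about an effect stated on Wikipedia
    answer: str            # the cause that answers the question
    rationale: List[str]   # supporting statements connecting the cause to the effect

example = WikiWhyEntry(
    question="Why was 1816 known as the 'Year Without a Summer'?",
    answer="The 1815 eruption of Mount Tambora ejected large amounts of ash and "
           "sulfate aerosols into the atmosphere.",
    rationale=[
        "Stratospheric aerosols reflect incoming sunlight back into space.",
        "Reduced sunlight lowered surface temperatures worldwide in 1816.",
        "The unusually cold summer caused widespread crop failures, so the year "
        "became known as the 'Year Without a Summer'.",
    ],
)
```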

The dataset not only challenges LLMs but also surfaces the implicit commonsense assumptions that may not be evident in text alone. Inferring and articulating the reasoning behind a cause-and-effect relationship is formulated as a novel task for probing LLMs' reasoning abilities.

Experiments and Key Findings

Experiments on the WIKIWHY dataset revealed that even sophisticated generative models like GPT-3 struggle to produce high-quality explanations for cause-effect relations. Human evaluations underscored the room for improvement: in the end-to-end answer-and-explain condition, GPT-3 baselines were judged correct only 38.7% of the time.

Explanations generated by LLMs often simply restated the causal relationship rather than explaining the underlying mechanism, pointing to commonsense reasoning as an area where these models need improvement. The researchers also introduced new evaluation metrics to better assess the quality of the reasoning in generated explanations.
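
The summary does not spell out those metrics, so the following is only a minimal sketch of one plausible automatic-evaluation approach: matching generated explanation sentences against reference rationale sentences by embedding similarity. The model name, similarity threshold, and F1-style aggregation are assumptions for demonstration, not the paper's exact procedure:

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# Assumed embedding model and threshold, chosen only for demonstration.
model = SentenceTransformer("all-MiniLM-L6-v2")
SIM_THRESHOLD = 0.7

def explanation_f1(predicted: list, reference: list) -> float:
    """Match predicted rationale sentences to reference sentences by cosine
    similarity and compute an F1-style score over the matches (order-agnostic)."""
    if not predicted or not reference:
        return 0.0
    pred_emb = model.encode(predicted, convert_to_tensor=True)
    ref_emb = model.encode(reference, convert_to_tensor=True)
    sims = util.cos_sim(pred_emb, ref_emb)  # shape: (len(predicted), len(reference))

    # A predicted sentence is "correct" if it matches some reference sentence;
    # a reference sentence is "covered" if some prediction matches it.
    precision = sum(float(sims[i].max()) >= SIM_THRESHOLD
                    for i in range(len(predicted))) / len(predicted)
    recall = sum(float(sims[:, j].max()) >= SIM_THRESHOLD
                 for j in range(len(reference))) / len(reference)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
```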

Conclusion

The creation of the WIKIWHY dataset represents a significant step toward better understanding the reasoning abilities of LLMs. It also sheds light on the complexities involved in generating explanations that require an understanding of cause-and-effect relationships. As AI continues to evolve, resources like WIKIWHY provide valuable benchmarks for facilitating and tracking progress in AI reasoning capabilities.