Narrowing the Knowledge Evaluation Gap: Open-Domain Question Answering with Multi-Granularity Answers (2401.04695v2)

Published 9 Jan 2024 in cs.CL

Abstract: Factual questions typically can be answered correctly at different levels of granularity. For example, both "August 4, 1961" and "1961" are correct answers to the question "When was Barack Obama born?". Standard question answering (QA) evaluation protocols, however, do not explicitly take this into account and compare a predicted answer against answers of a single granularity level. In this work, we propose GRANOLA QA, a novel evaluation setting where a predicted answer is evaluated in terms of accuracy and informativeness against a set of multi-granularity answers. We present a simple methodology for enriching existing datasets with multi-granularity answers, and create GRANOLA-EQ, a multi-granularity version of the EntityQuestions dataset. We evaluate a range of decoding methods on GRANOLA-EQ, including a new algorithm, called Decoding with Response Aggregation (DRAG), that is geared towards aligning the response granularity with the model's uncertainty. Our experiments show that LLMs with standard decoding tend to generate specific answers, which are often incorrect. In contrast, when evaluated on multi-granularity answers, DRAG yields a nearly 20 point increase in accuracy on average, which further increases for rare entities. Overall, this reveals that standard evaluation and decoding schemes may significantly underestimate the knowledge encapsulated in LMs.

Authors (3)
  1. Gal Yona (21 papers)
  2. Roee Aharoni (35 papers)
  3. Mor Geva (58 papers)
Citations (10)

Summary

  • The paper introduces GRANOLA QA, a multi-granularity evaluation method that captures varying levels of factual detail in language model responses.
  • It augments existing datasets to create GRANOLA-EQ with over 12,000 examples, enhancing the measurement of LLM knowledge.
  • The study implements DRAG decoding, which improves accuracy by nearly 20 points by aligning answer granularity with model uncertainty.

Overview of GRANOLA QA

Evaluation methods for LLMs in open-domain question answering (QA) have recently drawn renewed attention. Traditional QA evaluation compares a prediction against answers at a single level of granularity, which can lead to a potential underestimation of the knowledge held within LLMs. To address this gap, the paper proposes a new evaluation setting called GRANOLA QA, which stands for GRANularity Of LAbels. In GRANOLA QA, each question is paired with an ordered set of multi-granularity answers, ranging from the most detailed to the coarsest level of information that is still factually correct.

Evaluation Metrics and Methodology

GRANOLA QA pivots on two axes: accuracy and informativeness. Accuracy is binary, depending on whether the prediction matches any of the multi-granularity answers, while informativeness is a weighted score that prioritizes detailed correct answers over their broader counterparts. To realize GRANOLA QA, the authors devised a methodology for augmenting an existing QA dataset with answers at higher levels of abstraction. This methodology uses external knowledge graphs to extract entity descriptions and employs an LLM to generate answers at multiple granularities based on these descriptions.
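
To make the scoring concrete, the sketch below evaluates a prediction against an ordered multi-granularity answer set: accuracy is binary (a match against any answer counts), while informativeness rewards matches at finer granularity. The containment-based matching and the exponentially decaying weights are illustrative assumptions, not the paper's exact definitions.

```python
from typing import List, Tuple

def granola_scores(prediction: str, granola_answers: List[str],
                   decay: float = 0.5) -> Tuple[float, float]:
    """Score a prediction against an ordered multi-granularity answer set.

    granola_answers is ordered from most specific (index 0) to coarsest.
    Accuracy is binary: 1.0 if the prediction matches any answer in the set.
    Informativeness rewards matches at finer granularity; the exponentially
    decaying weights (decay ** level) are an illustrative choice, not the
    paper's exact weighting.
    """
    def matches(pred: str, gold: str) -> bool:
        # Simple containment check; real evaluation may use normalized exact match.
        return gold.lower() in pred.lower()

    accuracy = float(any(matches(prediction, a) for a in granola_answers))

    informativeness = 0.0
    for level, answer in enumerate(granola_answers):
        if matches(prediction, answer):
            informativeness = decay ** level  # finer granularity -> higher score
            break
    return accuracy, informativeness

# "1961" matches only the coarser answer, so it is accurate but less informative.
print(granola_scores("He was born in 1961.", ["August 4, 1961", "1961"]))
```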

Dataset Creation: GRANOLA-EQ

Applying this methodology, the authors enriched the EntityQuestions dataset to create GRANOLA-EQ, which contains over 12,000 QA examples, each annotated with answers at multiple levels of granularity. The careful generation and verification of these answer sets yields a high-quality resource, and the resulting GRANOLA-EQ has the potential to reveal the true depth of knowledge contained within LLMs.
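
A minimal sketch of this enrichment step is shown below: given a question, its original (most specific) answer, and the entity description pulled from a knowledge graph, an LLM is prompted to produce progressively coarser answers. The `generate` callable and the prompt wording are placeholders, not the paper's actual prompt or model.

```python
from typing import Callable, List

def enrich_with_granola_answers(question: str, fine_answer: str,
                                entity_description: str,
                                generate: Callable[[str], str]) -> List[str]:
    """Augment a QA pair with coarser, still-correct answers.

    `generate` is a placeholder for any text-generation call (an LLM);
    the prompt wording is illustrative, not the paper's exact prompt.
    """
    prompt = (
        f"Question: {question}\n"
        f"Most specific answer: {fine_answer}\n"
        f"Entity description (from a knowledge graph): {entity_description}\n"
        "List coarser answers, one per line, ordered from most specific to most "
        "general, such that each one is still a factually correct answer."
    )
    coarser = [line.strip() for line in generate(prompt).splitlines() if line.strip()]
    # The original answer stays first; coarser alternatives follow in order.
    return [fine_answer] + coarser
```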

DRAG: Aligning Granularity and Model Uncertainty

The research further introduces a decoding strategy called Decoding with Response Aggregation (DRAG), which aligns the granularity of an LLM's response with its level of uncertainty. DRAG works by first sampling multiple responses from the model and then distilling them into a single, more general response that captures their shared information. Experiments show that, when evaluated on multi-granularity answers, DRAG improves accuracy by nearly 20 points on average, with even larger gains for rare entities, revealing knowledge in LLMs that standard decoding and evaluation methods fail to surface.
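
The following sketch captures the spirit of DRAG under simple assumptions: draw several samples, keep the specific answer when the samples largely agree (low uncertainty), and otherwise ask an aggregation step for the most specific answer consistent with all samples. The agreement threshold and the `sample`/`aggregate` callables are illustrative stand-ins rather than the paper's exact procedure.

```python
from collections import Counter
from typing import Callable, List

def drag_decode(question: str,
                sample: Callable[[str], str],
                aggregate: Callable[[str, List[str]], str],
                num_samples: int = 5,
                agreement_threshold: float = 0.6) -> str:
    """Sketch of Decoding with Response Aggregation (DRAG).

    `sample` draws one stochastic answer from the model; `aggregate` returns
    the most specific answer consistent with all sampled responses (e.g. via a
    second model call). The majority-vote shortcut and threshold are an
    illustrative simplification of the paper's procedure.
    """
    responses = [sample(question) for _ in range(num_samples)]
    answer, count = Counter(responses).most_common(1)[0]

    # High agreement signals low uncertainty: keep the specific answer as-is.
    if count / num_samples >= agreement_threshold:
        return answer

    # Low agreement: fall back to a coarser response supported by all samples.
    return aggregate(question, responses)
```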

Implications of GRANOLA QA

This paper contributes significantly to the field by offering GRANOLA QA as a refined evaluation tool that better measures the accuracy and informativeness of LLM responses. The generation of the GRANOLA-EQ dataset and the implementation of DRAG decoding are steps towards more nuanced and authentic assessments of LLM capabilities. By recognizing that factual knowledge can be correctly expressed at varying levels of detail, the paper charts a course for future enhancements in LLM evaluation approaches.