- The paper introduces GRANOLA QA, a multi-granularity evaluation method that captures varying levels of factual detail in language model responses.
- It augments the existing ENTITY QUESTIONS dataset to create GRANOLA-EQ, with over 12,000 examples, enhancing the measurement of LLM knowledge.
- The study introduces DRAG decoding, which improves accuracy by nearly 20 points by aligning answer granularity with model uncertainty.
Overview of GRANOLA QA
Evaluation methods for LLMs in open-domain question answering (QA) have recently come under renewed scrutiny. Traditional QA benchmarks accept answers at a single level of granularity, which can underestimate the knowledge actually held within LLMs. To address this limitation, the paper proposes a new evaluation approach called GRANOLA QA, short for GRANularity Of LAbels. Each question in GRANOLA QA is paired with an ordered set of multi-granularity answers, ranging from the most detailed answer to the coarsest one that is still factually correct.
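To make the setup concrete, here is a minimal sketch of what such an ordered answer set might look like. The question, entity, and answer strings are illustrative inventions, not examples drawn from the actual dataset:

```python
# Hypothetical GRANOLA-style entry: an ordered list of answer levels, from
# the most fine-grained to the coarsest level that is still factually
# correct. Each level may contain several acceptable aliases.
granola_example = {
    "question": "In which city was Albert Einstein born?",
    "answers": [
        ["Ulm"],                            # most specific
        ["Baden-Wurttemberg"],              # coarser: the state/region
        ["Germany", "the German Empire"],   # coarsest correct level
    ],
}
```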
Evaluation Metrics and Methodology
GRANOLA QA evaluates responses along two axes: accuracy and informativeness. Accuracy is binary, crediting a response if it matches any of the multi-granularity answers, while informativeness is a weighted score that rewards detailed correct answers over their broader counterparts. To construct such answer sets, the authors devised a methodology for augmenting an existing QA dataset with higher levels of abstraction: entity descriptions are extracted from external knowledge graphs, and an LLM generates the coarser answers based on these descriptions.
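A minimal sketch of how these two metrics could be computed, assuming simple string matching and an exponentially decaying weight for coarser levels; the decay weighting is an illustrative assumption, not necessarily the paper's exact formula:

```python
def granola_scores(prediction: str, answers: list[list[str]], decay: float = 0.5):
    """Score a prediction against ordered multi-granularity answers.

    `answers[0]` holds the most fine-grained aliases, later entries coarser
    ones. Accuracy is 1.0 if any level matches; informativeness weights a
    match at level k by decay**k, so finer-grained matches score higher.
    """
    def matches(level: list[str]) -> bool:
        return any(alias.lower() in prediction.lower() for alias in level)

    for k, level in enumerate(answers):
        if matches(level):
            return {"accuracy": 1.0, "informativeness": decay ** k}
    return {"accuracy": 0.0, "informativeness": 0.0}
```

Under this scheme a model that answers "Germany" for the example above is counted as accurate but less informative than one that answers "Ulm".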
Dataset Creation: GRANOLA-EQ
Applying this methodology, the authors enriched the ENTITY QUESTIONS dataset to create GRANOLA-EQ, which contains over 12,000 QA examples, each annotated with several answers of increasing coarseness. The generation and refinement process was designed to keep annotation quality high, and the resulting dataset makes it possible to probe the true depth of knowledge contained within LLMs.
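As a sketch, the augmentation pipeline described above might be wired together as follows. Here `kg_lookup` and `llm` are hypothetical stand-ins for the knowledge-graph query and the LLM call, and the prompt wording is an assumption rather than the authors' exact template:

```python
def build_granola_entry(question: str, gold_answer: str, entity_id: str,
                        kg_lookup, llm) -> dict:
    """Augment one QA pair with coarser answers (hypothetical pipeline).

    `kg_lookup(entity_id)` is assumed to return a short textual description
    of the answer entity from a knowledge graph; `llm(prompt)` is assumed
    to return the model's text completion.
    """
    description = kg_lookup(entity_id)
    prompt = (
        f"Question: {question}\n"
        f"Most specific answer: {gold_answer}\n"
        f"Entity description: {description}\n"
        "List progressively coarser answers that remain factually correct, "
        "one per line, from most to least specific."
    )
    coarser = [line.strip() for line in llm(prompt).splitlines() if line.strip()]
    return {"question": question,
            "answers": [[gold_answer]] + [[a] for a in coarser]}
```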
DRAG: Aligning Granularity and Model Uncertainty
The research further introduces a decoding strategy called Decoding with Response AGgregation (DRAG), which aligns the granularity of an LLM's response with its level of uncertainty. DRAG first samples multiple responses from the model and then aggregates them into a single, possibly more general response that best captures their shared information. In experiments, DRAG improved accuracy by nearly 20 points on average when evaluated against multi-granularity answers, revealing a pronounced gap that standard decoding methods fail to surface, particularly for rare entities.
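The following is a minimal sketch of the DRAG idea under simple assumptions: sample several candidate answers at nonzero temperature, then ask the model to aggregate them into the most specific answer they all support. `llm_sample` and `llm_greedy` are hypothetical callables returning text, and the aggregation prompt is an assumption, not the paper's exact one:

```python
def drag_decode(question: str, llm_sample, llm_greedy, n_samples: int = 5) -> str:
    """Decoding with Response AGgregation (DRAG), sketched.

    Samples `n_samples` candidate answers, then asks the model to produce a
    single (possibly coarser) answer capturing what the samples agree on.
    """
    candidates = [llm_sample(question) for _ in range(n_samples)]
    aggregation_prompt = (
        f"Question: {question}\n"
        "Candidate answers:\n"
        + "\n".join(f"- {c}" for c in candidates)
        + "\nGive one answer, as specific as possible, that is consistent "
          "with all of the candidates above."
    )
    return llm_greedy(aggregation_prompt)
```

The intuition is that when the model is uncertain, its samples diverge at the fine-grained level but agree at a coarser one, so the aggregated answer naturally backs off to the granularity the model can actually support.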
Implications of GRANOLA QA
This paper contributes significantly to the field by offering GRANOLA QA as a refined evaluation tool that better measures the accuracy and informativeness of LLM responses. The generation of the GRANOLA-EQ dataset and the implementation of DRAG decoding are steps towards more nuanced and authentic assessments of LLM capabilities. By recognizing that factual knowledge can be correctly expressed at varying levels of detail, the paper charts a course for future enhancements in LLM evaluation approaches.