$Q^{2}$: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering (2104.08202v2)

Published 16 Apr 2021 in cs.CL

Abstract: Neural knowledge-grounded generative models for dialogue often produce content that is factually inconsistent with the knowledge they rely on, making them unreliable and limiting their applicability. Inspired by recent work on evaluating factual consistency in abstractive summarization, we propose an automatic evaluation metric for factual consistency in knowledge-grounded dialogue using automatic question generation and question answering. Our metric, denoted $Q^2$, compares answer spans using natural language inference (NLI), instead of token-based matching as done in previous work. To foster proper evaluation, we curate a novel dataset of dialogue system outputs for the Wizard-of-Wikipedia dataset, manually annotated for factual consistency. We perform a thorough meta-evaluation of $Q^2$ against other metrics using this dataset and two others, where it consistently shows higher correlation with human judgements.

Authors (6)
  1. Or Honovich (9 papers)
  2. Leshem Choshen (78 papers)
  3. Roee Aharoni (35 papers)
  4. Ella Neeman (2 papers)
  5. Idan Szpektor (47 papers)
  6. Omri Abend (75 papers)
Citations (133)

Summary

Evaluating Factual Consistency in Knowledge-Grounded Dialogues with Q2

The research paper introduces Q2, an evaluation metric for assessing the factual consistency of knowledge-grounded dialogue systems. Generative dialogue models frequently produce content that diverges from the factual information they are grounded in, which undermines their reliability. This inconsistency is often missed by standard evaluation metrics, which focus on fluency and coherence rather than factual accuracy. Inspired by methods used in abstractive summarization, the authors propose Q2, a metric that integrates automatic Question Generation (QG) and Question Answering (QA) with Natural Language Inference (NLI) to detect factual inconsistencies in dialogues.

The authors curate a new dataset of dialogue system outputs on the Wizard of Wikipedia (WOW) dataset, manually annotated for factual consistency, to validate Q2. The metric differentiates itself by employing NLI for answer-span comparison rather than the token-based matching used in previous work, yielding a more robust evaluation of responses that vary in lexical expression.
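
To make the pipeline concrete, below is a minimal sketch of the question generation and answering steps, assuming off-the-shelf Hugging Face transformers pipelines. The checkpoint names are illustrative placeholders, not the models used in the paper, and passing the raw response to the QG model simplifies the span-conditioned question generation the method describes.

```python
# A minimal sketch of Q2's QG + QA steps (not the authors' implementation).
# The model checkpoints below are placeholder choices, not the paper's.
from transformers import pipeline

qg = pipeline("text2text-generation", model="valhalla/t5-base-qg-hl")     # hypothetical QG choice
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")  # hypothetical QA choice


def answer_pairs(response: str, knowledge: str):
    """Yield (question, response_answer, knowledge_answer) triples for one turn."""
    # 1. Generate questions from the system response. Real QG setups typically
    #    condition on a highlighted answer span; using the raw response keeps
    #    this sketch short.
    questions = [out["generated_text"] for out in qg(response)]
    for question in questions:
        # 2. Answer each question against the response itself (recovering the
        #    span the question asks about) and against the grounding knowledge.
        #    Disagreement between the two spans signals possible inconsistency.
        response_span = qa(question=question, context=response)["answer"]
        knowledge_span = qa(question=question, context=knowledge)["answer"]
        yield question, response_span, knowledge_span
```

In the paper's pipeline, generated questions are additionally validated before comparison, for example by discarding questions whose answer cannot be recovered from the response itself.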

Key Findings and Contributions

  1. Development of Q2 Metric:
    • Q2 evaluates consistency by generating questions from spans of the dialogue response, answering each question against both the response and the grounding knowledge, and using NLI to compare the resulting answer spans (a sketch of this comparison step follows the list below). This allows nuanced assessments that handle lexical variability effectively.
  2. Annotated Dataset for Factual Consistency:
    • A new dataset is compiled from dialogue system outputs on the WOW dataset and manually annotated for factual consistency. It serves as the groundwork for evaluating the Q2 metric against human judgments.
  3. Empirical Validation:
    • Extensive experiments compare Q2 with other reference-free metrics across three dialogue datasets: Wizard of Wikipedia, Topical-Chat, and Dialogue NLI (DNLI). Q2 consistently shows higher correlation with human judgments, attesting to its efficacy and robustness.
  4. Robustness and Interpretation:
    • The Q2 metric showcases resilience against changes in the underlying QG and QA models, and provides interpretable outputs by identifying specific spans that may be factually inconsistent.
  5. Implications for Dialogue System Evaluation:
    • Q2 does not require reference responses, making it a practical choice for open-domain dialogue, where gold references are often unavailable.
    • Incorporating NLI lends robustness by capturing semantic nuances and reducing dependence on token-level overlap, which can miss semantic equivalence arising from lexical diversity.
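
Continuing the sketch above, the NLI comparison step might look as follows. The checkpoint and the premise/hypothesis templates (prefixing each answer span with its question) are illustrative assumptions, not the paper's exact formulation.

```python
# A sketch of the NLI span-comparison and scoring steps, under the same
# assumptions as the snippet above; checkpoint and templates are illustrative.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")  # placeholder NLI model


def spans_consistent(question: str, knowledge_span: str, response_span: str) -> bool:
    """Count a span pair as consistent only if the knowledge-grounded answer
    entails the response's answer under the NLI model."""
    premise = f"{question} {knowledge_span}"    # assumed premise template
    hypothesis = f"{question} {response_span}"  # assumed hypothesis template
    prediction = nli([{"text": premise, "text_pair": hypothesis}])[0]
    return prediction["label"].upper() == "ENTAILMENT"


def q2_score(triples) -> float:
    """Aggregate a per-response score as the fraction of consistent span pairs;
    this flat average is a simplification of the paper's scoring."""
    triples = list(triples)
    if not triples:
        return 0.0
    return sum(
        spans_consistent(question, knowledge_span, response_span)
        for question, response_span, knowledge_span in triples
    ) / len(triples)
```

Combined with the earlier generator, q2_score(answer_pairs(response, knowledge)) produces a rough consistency score in [0, 1]; because each span pair is judged semantically rather than by token overlap, paraphrased but faithful answers are not penalized.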

Practical and Theoretical Implications

The paper marks a significant step toward more reliable dialogue systems by focusing on the factual consistency of generated content. Practically, Q2 could be instrumental in refining dialogue systems to reduce the risk of misinformation, which is crucial for deployment in information-sensitive applications such as health and legal advice.

Theoretically, the integration of QA and NLI for assessing generated text could spur new research directions in evaluating other forms of generated content, such as machine translation and summarization. By combining question generation, question answering, and NLI into a single evaluation pipeline, Q2 potentially paves the way for application to other fact-sensitive tasks such as automated fact-checking.

Future Directions

The paper suggests that future work may involve enhancing Q2's ability to distinguish dialogue components such as chit-chat, personal statements, and factual content, allowing more accurate handling of utterances that mix them. Additionally, leveraging Q2 as a tool to improve the training and robustness of generative models presents a promising avenue for enhancing factual fidelity in AI dialogue agents.

In summary, Q2 provides a comprehensive and reliable metric for evaluating factual consistency in dialogues, bridging the gap between human evaluation and automatic assessment, and stands as a viable approach for enhancing the credibility of generative systems in knowledge-grounded tasks.
