- The paper introduces token-overlap metrics, namely Recall for correctness and K-Precision for faithfulness, to improve QA model evaluation.
- It demonstrates that conventional metrics like EM and F1 often misjudge verbose, yet semantically valid, model responses.
- The findings highlight that while models like Flan-T5 show high faithfulness, balancing detailed answers with precision remains challenging.
Evaluation of Instruction-Following Models in Question Answering
The paper "Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering" presents an in-depth examination of instruction-following models in Question Answering (QA), specifically focusing on correctness and faithfulness. Instruction-following models, such as GPT-3.5 and Flan-T5, offer flexibility by adjusting to various tasks without requiring further fine-tuning. However, their propensity to generate verbose, supplementary responses complicates the assessment and reliability of conventional QA metrics like exact match (EM) and F1 score.
Evaluation Metrics
The paper underscores that traditional metrics are ill-suited to the verbose outputs of instruction-following models. In response, it proposes simple token-overlap metrics, Recall for correctness and K-Precision for faithfulness, which correlate more strongly with human judgments; a minimal sketch of both metrics follows the list below.
- Correctness Metrics: Recall and its stricter variant measure token overlap with the reference answer, rewarding responses that contain all of the required information. Semantic metrics such as BERTScore and LLM-based evaluation are also discussed, but the token-level metrics are simpler to compute and align closely with human judgments.
- Faithfulness Metrics: K-Precision and its variant K-Precision++ measure how well a response is grounded in the provided knowledge, i.e., what fraction of the response is supported by it. Alternative metrics such as FaithCritic and Q2 prove less effective on QA tasks than on the knowledge-grounded dialogues they were designed for.
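As an illustration, here is a minimal sketch of the two token-overlap metrics, assuming a simple lowercase, punctuation-stripping, article-removing tokenization; the paper's exact normalization may differ, and the example strings are invented.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation and articles, split into tokens
    (SQuAD-style normalization; the paper's exact preprocessing may differ)."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def recall(response: str, reference: str) -> float:
    """Correctness: fraction of reference-answer tokens that appear in the response."""
    ref_tokens = normalize(reference)
    if not ref_tokens:
        return 0.0
    overlap = Counter(ref_tokens) & Counter(normalize(response))
    return sum(overlap.values()) / len(ref_tokens)


def k_precision(response: str, knowledge: str) -> float:
    """Faithfulness: fraction of response tokens that appear in the provided knowledge."""
    resp_tokens = normalize(response)
    if not resp_tokens:
        return 0.0
    overlap = Counter(resp_tokens) & Counter(normalize(knowledge))
    return sum(overlap.values()) / len(resp_tokens)


if __name__ == "__main__":
    knowledge = "Ottawa is the capital city of Canada, located on the Ottawa River."
    reference = "Ottawa"
    response = "The capital of Canada is Ottawa, located on the Ottawa River."
    print(f"Recall:      {recall(response, reference):.2f}")       # 1.00 despite the verbosity
    print(f"K-Precision: {k_precision(response, knowledge):.2f}")  # 1.00: fully grounded in the passage
```

In this example the verbose response still reaches a Recall of 1.0 against the short reference answer, which is precisely the behavior that EM and F1 would penalize.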
Human-Centric Evaluation
The human evaluation exposed the shortcomings of traditional metrics: instruction-following models often answer user queries correctly but verbosely, and are penalized by metrics designed for short, extractive answers. More than 50% of responses judged incorrect by EM and F1 were in fact semantically correct under human assessment, largely because the models elaborate beyond the reference answer.
Experimental Setup and Findings
The research evaluated models across three QA settings (open-domain, multi-hop, and conversational), using datasets such as Natural Questions and HotpotQA. In a retrieval-augmented setup, where retrieved passages are supplied alongside the question, instruction-following models matched or exceeded task-specific fine-tuned baselines such as FiD in correctness when measured with Recall.
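For concreteness, a retrieval-augmented prompt in this setting might be assembled roughly as below; the instruction wording and passage formatting are placeholders for illustration, not the paper's actual templates.

```python
def build_prompt(question: str, passages: list[str]) -> str:
    """Assemble a retrieval-augmented QA prompt (hypothetical template; the
    paper's actual instruction wording and formatting differ per model and task)."""
    context = "\n\n".join(f"Passage {i + 1}: {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using only the passages below. "
        'If they do not contain the answer, reply "I don\'t know."\n\n'
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
```

The key design point is that the model sees only the retrieved passages as knowledge, so faithfulness can be measured against exactly that context.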
On faithfulness, Flan-T5 was the most faithful model across tasks as measured by K-Precision, though the gap between correctness and faithfulness scores points to a trade-off that models have yet to balance. The paper also tests whether models refrain from answering when given only irrelevant passages; despite being instructed to respond "I don't know," models frequently answered anyway and failed this abstention test.
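A minimal way to quantify that abstention behavior is sketched below; the refusal phrases and the simple substring match are assumptions for illustration, not the paper's evaluation protocol.

```python
# Assumed refusal phrases; real evaluations may use stricter matching or human review.
REFUSAL_MARKERS = ("i don't know", "i do not know", "cannot answer")


def abstention_rate(responses: list[str]) -> float:
    """Fraction of responses (to questions paired with irrelevant passages)
    that correctly decline to answer."""
    if not responses:
        return 0.0
    refusals = sum(
        any(marker in r.lower() for marker in REFUSAL_MARKERS) for r in responses
    )
    return refusals / len(responses)
```

In practice this would be paired with a check that the model still answers when relevant passages are supplied, so that abstention is not achieved by refusing everything.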
Implications and Future Work
This research offers a clear view of the strengths and pitfalls of instruction-following models in QA. While these models excel at producing informative, user-facing replies, grounding those replies in the provided knowledge remains a challenge, with direct consequences for user trust and system dependability.
Future work should focus on more refined, domain-aware evaluation metrics that account for the flexibility of LLM outputs across contexts, and on improving models' ability to abstain rather than speculate when the provided knowledge is insufficient.
In conclusion, this work pushes the evaluation of LLM-based QA beyond traditional scoring schemes, arguing for metrics that reflect how instruction-following models actually respond. Such metrics are a prerequisite for building dependable, context-aware QA systems.