Post-Hoc Answer Attribution for Grounded and Trustworthy Long Document Comprehension: Task, Insights, and Challenges (2406.06938v1)

Published 11 Jun 2024 in cs.CL

Abstract: Attributing answer text to its source document for information-seeking questions is crucial for building trustworthy, reliable, and accountable systems. We formulate a new task of post-hoc answer attribution for long document comprehension (LDC). Owing to the lack of long-form abstractive and information-seeking LDC datasets, we refactor existing datasets to assess the strengths and weaknesses of existing retrieval-based and proposed answer decomposition and textual entailment-based optimal selection attribution systems for this task. We throw light on the limitations of existing datasets and the need for datasets to assess the actual performance of systems on this task.

Citations (1)

Summary

  • The paper introduces a post-hoc answer attribution task that links generated answers directly to their evidence in long documents.
  • It proposes the ADiOSAA system, which combines ChatGPT-based answer decomposition with RoBERTa-L entailment to identify supporting sentences.
  • Evaluations show ADiOSAA achieves higher precision than retrieval-based models, highlighting its potential for trustworthy QA systems.

Post-Hoc Answer Attribution for Grounded and Trustworthy Long Document Comprehension: Task, Insights, and Challenges

This paper addresses the problem of attributing answer text to its source document in the context of long document comprehension (LDC). The authors propose a novel task of post-hoc answer attribution, aimed at enhancing the trustworthiness, reliability, and accountability of question-answering (QA) systems. The work encompasses the formulation of this task, the adaptation of existing datasets for evaluation, and the introduction of ADiOSAA, an answer decomposition and textual entailment-based optimal selection attribution system.

Introduction

Automatic QA systems, embedded in various platforms like search engines and digital assistants, play a pivotal role in meeting users' information needs. However, these systems frequently generate responses that lack adequate grounding in verified knowledge sources, posing significant risks of misinformation and hallucination. The paper emphasizes the necessity of attributing generated answers to their respective sources to build systems that are verifiable and accountable.

Task Definition

The task of post-hoc answer attribution for LDC is formally defined as follows: Given a triplet consisting of a query, a set of sentences from a document, and an answer, the goal is to identify supporting sentences in the document that provide evidence for each sentence in the answer. Crucially, this involves identifying fine-grained attributions, meaning that each sentence in the answer may be supported by multiple or no sentences from the source document.
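To make the input/output contract concrete, here is a minimal sketch of the task as a data structure; the field names and types are illustrative, not the paper's notation:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AttributionExample:
    """One post-hoc answer attribution instance; field names are illustrative."""
    query: str                   # the information-seeking question
    doc_sentences: List[str]     # the long document, split into sentences
    answer_sentences: List[str]  # the already-generated long-form answer, split into sentences

# Expected system output: for each answer sentence, the indices of the document
# sentences that support it -- possibly several, possibly none.
Attribution = List[List[int]]    # attribution[i] = supporting doc-sentence indices for answer sentence i
```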

Data Preparation

The paper identifies the absence of suitable datasets for the proposed task, leading to the adaptation of two existing datasets: the Citation Verifiability dataset and the Hagrid dataset. These datasets are reformulated to fit the needs of the task.

  1. Citation Verifiability Dataset: Derived from questions and answers embedded with inline citations. This dataset requires human annotations to judge whether the citations fully support the answer sentences (a rough reformulation sketch follows this list).
  2. Hagrid Dataset: Constructed using a collaborative approach involving human annotators and LLMs to generate highly informative and attributed answers.
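As a rough illustration of how a citation-annotated answer can be recast into gold attribution labels, the snippet below strips inline citations from an answer sentence and keeps the cited passage ids as supervision. The bracketed-number citation format and the helper function are assumptions for illustration, not the datasets' actual schema:

```python
import re

def strip_citations(answer_sentence: str):
    """Split an inline-cited answer sentence into clean text plus the cited passage ids.

    Assumes citations appear as bracketed numbers like "[2]"; the real datasets'
    citation formats may differ.
    """
    cited_ids = [int(m) for m in re.findall(r"\[(\d+)\]", answer_sentence)]
    clean_text = re.sub(r"\s*\[\d+\]", "", answer_sentence).strip()
    return clean_text, cited_ids

sentence = "The Amazon is the largest rainforest on Earth [2][5]."
text, gold_support = strip_citations(sentence)
# text -> "The Amazon is the largest rainforest on Earth."
# gold_support -> [2, 5]  (gold attribution labels for this answer sentence)
```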

Proposed Attribution System: ADiOSAA

The ADiOSAA system is introduced to tackle the task of fine-grained post-hoc answer attribution. It consists of two main components:

  1. Answer Decomposer: Utilizes ChatGPT to break down each sentence of an answer into smaller, independent information units, facilitating more precise attributions.
  2. Attributor: Leverages a RoBERTa-L model pretrained on DocNLI to identify supporting sentences in the source document for each information unit. The entailment task is framed as determining whether the given document sentences (the premise) entail an information unit (the hypothesis); see the scoring sketch below.
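A minimal sketch of the entailment-based attributor follows. It substitutes a publicly available MNLI-tuned RoBERTa checkpoint for the DocNLI-pretrained model used in the paper, so the scores are only indicative of the mechanism:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Stand-in checkpoint: the paper uses RoBERTa-L pretrained on DocNLI.
MODEL = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

def entailment_score(premise: str, hypothesis: str) -> float:
    """Probability that the premise (document sentence(s)) entails the hypothesis (an information unit)."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    return probs[model.config.label2id["ENTAILMENT"]].item()

# Toy usage: score each document sentence against one decomposed information unit.
doc_sentences = [
    "The Amazon rainforest covers much of the Amazon basin of South America.",
    "It spans nine countries, with the majority contained within Brazil.",
]
unit = "The Amazon rainforest spans nine countries."
scores = [entailment_score(sent, unit) for sent in doc_sentences]
```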

An optimal selection algorithm is employed to handle cases where an answer sentence draws on information from multiple sentences in the document. This algorithm incrementally selects document sentences that maximize the entailment likelihood, ensuring comprehensive attribution.
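One plausible reading of this incremental selection is the greedy sketch below, built on the `entailment_score` function above; the stopping threshold and the cap on selected sentences are assumptions, not values from the paper:

```python
def select_support(doc_sentences, unit, score_fn, max_sents=5, min_gain=0.01):
    """Greedily add the document sentence that most improves the entailment score
    of the combined premise for this information unit; stop when no candidate
    yields a meaningful gain. Thresholds here are illustrative."""
    selected, best = [], 0.0
    candidates = list(range(len(doc_sentences)))
    while candidates and len(selected) < max_sents:
        gain, idx = max(
            (score_fn(" ".join(doc_sentences[j] for j in selected + [i]), unit) - best, i)
            for i in candidates
        )
        if gain < min_gain:
            break
        selected.append(idx)
        candidates.remove(idx)
        best += gain
    return sorted(selected)  # document-sentence indices attributed to this unit

# e.g. select_support(doc_sentences, unit, entailment_score)
```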

Evaluation and Results

Three baseline models—BM25, GTR, and MonoT5—along with several variants of ADiOSAA, are evaluated using precision, recall, and F1 scores. Notably, ADiOSAA and its variants achieve higher precision than the retrieval-based systems when more than one prediction is considered, demonstrating their ability to capture abstractive and compositional attributions more effectively.
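For reference, sentence-level attribution precision, recall, and F1 can be computed per answer sentence along the following lines; the paper's exact averaging scheme over sentences and examples is not restated here:

```python
def attribution_prf(predicted: set, gold: set):
    """Precision/recall/F1 over supporting-sentence indices for one answer sentence."""
    if not predicted and not gold:
        return 1.0, 1.0, 1.0
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# e.g. attribution_prf({2, 5}, {2, 7}) -> (0.5, 0.5, 0.5)
```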

Key Findings:

  • Optimal Selection: Significantly enhances performance by accurately attributing composite answer sentences to the relevant document sentences.
  • Answer Decomposer: Improves precision, showing its effectiveness in breaking down complex answers into simpler units for better attribution.
  • Dataset Insights: Indicates a need for more challenging, highly abstractive long-form reading comprehension datasets to drive advancements in trustworthy QA systems.

Implications and Future Directions

The research underscores important implications for both practical applications and theoretical advancements in AI. Trustworthy QA systems can significantly mitigate misinformation risks, a critical requirement for deploying such systems in real-world applications. The ADiOSAA system, by enhancing groundedness and accountability, sets a foundation for further innovations in this domain.

Future research should focus on developing more abstractive datasets, creating interactive systems for decomposition and attribution, and possibly integrating supervised learning approaches tailored for this task. Exploring how attribution performance depends on the quality of the underlying entailment model could also yield substantial improvements.

Conclusion

The paper makes substantial contributions by framing a novel task, meticulously adapting existing datasets, and proposing an innovative system that outperforms existing retrieval-based methods in the context of LDC. The ADiOSAA system, with its emphasis on comprehensiveness and reliability, presents a significant step forward in the development of grounded QA systems. This research highlights the ongoing need for enhanced data, methodology, and evaluation standards to ensure that QA systems can be trusted in real-world deployment scenarios.