Evaluation measure for faithfulness in QED Task 4

Develop an evaluation measure that quantifies how faithful model-generated explanations are to the model's own reasoning process in QED Task 4, where a system predicts a long answer span, a short answer span, and a structured QED explanation for a given question–document pair.
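To make concrete the object that such a measure would have to score, the following is a minimal sketch of one Task 4 system output. It is not a faithfulness measure, which remains open. The class name, field names, the character-offset span convention, and the nesting check are all assumptions made for illustration, and the QED explanation is kept as an opaque placeholder because its internal structure is not specified here.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple

# Assumption: a span is a pair of [start, end) character offsets into the document.
Span = Tuple[int, int]


@dataclass
class Task4Prediction:
    """Hypothetical container for one Task 4 system output."""
    question: str
    document: str
    long_answer: Span
    short_answer: Span
    # Placeholder only: the contents of a QED explanation are not modeled here.
    explanation: Dict[str, Any] = field(default_factory=dict)

    def span_text(self, span: Span) -> str:
        """Return the document text covered by a span."""
        start, end = span
        return self.document[start:end]


def is_well_formed(pred: Task4Prediction) -> bool:
    """Structural sanity check only; it says nothing about faithfulness.

    Assumption of this sketch: the short answer lies inside the long answer,
    and both lie inside the document.
    """
    (long_start, long_end) = pred.long_answer
    (short_start, short_end) = pred.short_answer
    return 0 <= long_start <= short_start < short_end <= long_end <= len(pred.document)
```

A faithfulness measure would take one or more such predictions, plus some form of access to the model that produced them, and return a score. How to define that score is the open question stated above.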

Background

Lamm et al. (2020) introduce QED, a linguistically grounded framework for explanations in question answering, and define four modeling tasks. Task 4 requires that generated explanations be faithful to the model’s reasoning process, in contrast to tasks that merely produce post-hoc rationales.

To compare and advance models that aim for faithful explanations, a formal evaluation measure is necessary. The authors explicitly note that defining such a measure remains an open question and is beyond the scope of the paper, highlighting a gap that must be addressed to make progress on faithful explainability in QA.

References

"This will require an evaluation measure for faithfulness, which is an open question beyond the scope of this paper."

Source: QED: A Framework and Dataset for Explanations in Question Answering (Lamm et al., 2020; arXiv:2009.06354), Section 5.1, Task 4 (Four Tasks)
