Papers
Topics
Authors
Recent
Search
2000 character limit reached

SummExecEdit: A Factual Consistency Benchmark in Summarization with Executable Edits

Published 17 Dec 2024 in cs.CL | (2412.13378v2)

Abstract: Detecting factual inconsistencies in summarization is critical, yet existing benchmarks lack the necessary challenge and interpretability for robust evaluation. In this paper, we introduce SummExecEdit, a novel pipeline and benchmark leveraging executable edits to assess models on their ability to both detect factual errors and provide accurate explanations. The top-performing model, Claude3-Opus, achieves a joint detection and explanation score of only 0.49 in our benchmark, with individual scores of 0.67 for detection and 0.73 for explanation. We conduct detailed evaluations to assess the current state of models in this field and find that more than half of the 20+ LLMs in our study struggle with over 30% of the SummExecEdit benchmark. Additionally, we identify four primary types of explanation errors, with 45.4% of them involving a focus on completely unrelated parts of the summary.

Summary

No one has generated a summary of this paper yet.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Collections

Sign up for free to add this paper to one or more collections.