Agent-as-Judge for Factual Summarization of Long Narratives
The paper "Agent-as-Judge for Factual Summarization of Long Narratives" introduces an approach to evaluating and refining summaries of long textual narratives. The authors identify a significant gap in current summarization evaluation metrics, which often fail to account for the factual accuracy of summaries, particularly for narratives exceeding 100K tokens. The work presents NarrativeFactScore, a novel "Agent-as-a-Judge" framework designed to improve factual consistency in LLM-generated summaries, and proposes a Character Knowledge Graph (CKG) to ground the agent's judgments.
Introduction and Motivation
The rise of LLMs has significantly impacted summarization, with models scoring well on lexical and semantic similarity metrics such as ROUGE and BERTScore. These metrics, however, do not adequately measure factual accuracy, leaving summaries of long narratives prone to errors, especially regarding intricate character relationships and how they develop over a story. Prior work such as LLM-as-a-Judge has attempted to fill this gap but still shows limitations in consistent factual reasoning.
Proposed Method
The authors propose NarrativeFactScore, an "Agent-as-a-Judge" framework that leverages a CKG to evaluate the factual consistency of story summaries. The CKG is built by extracting character relationships and states from both the source text and generated summaries. This graph-based approach lets NarrativeFactScore assess summaries more accurately by incorporating complex character dynamics, and it makes the evaluation process interpretable and actionable.
The core processes involve:
- CKG Extraction: A systematic extraction and unification of names and character relationships across narrative scenes to maintain consistency, inspired by self-consistency reasoning strategies.
- Factuality Scoring: Each summary is decomposed into atomic facts that are validated against the narrative using the CKG, providing a score that measures factual accuracy relative to the original narrative.
- Agent-based Refinement: Summaries are iteratively revised using the fine-grained, fact-level feedback produced by NarrativeFactScore, improving their accuracy.
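The scoring step above can be illustrated with a minimal sketch. The function names (`decompose`, `verify_fact`, `narrative_fact_score`) and the set-based knowledge store are illustrative assumptions, not the paper's actual API: a real implementation would decompose facts with an LLM and have the agent judge verify each one against retrieved narrative passages and the CKG, whereas here the verifier is a simple stub.

```python
def decompose(summary: str) -> list[str]:
    """Naive atomic-fact splitter: one claim per sentence (illustrative only;
    the paper uses an LLM to extract atomic facts)."""
    return [s.strip() for s in summary.split(".") if s.strip()]

def verify_fact(fact: str, knowledge: set[str]) -> bool:
    """Stub for the agent judge: a fact counts as supported if it appears in
    the narrative-derived knowledge. A real system would query the CKG and
    an LLM over retrieved scenes instead of doing exact matching."""
    return fact in knowledge

def narrative_fact_score(summary: str, knowledge: set[str]) -> float:
    """Fraction of the summary's atomic facts supported by the narrative."""
    facts = decompose(summary)
    if not facts:
        return 0.0
    supported = sum(verify_fact(f, knowledge) for f in facts)
    return supported / len(facts)

# Toy knowledge distilled from a (hypothetical) narrative and its CKG.
knowledge = {"Ahab captains the Pequod", "Ishmael narrates the story"}
score = narrative_fact_score(
    "Ahab captains the Pequod. Ishmael narrates the story. "
    "Ahab is Ishmael's brother.",
    knowledge,
)
print(round(score, 2))  # 2 of 3 atomic facts supported -> 0.67
```

The per-fact verdicts are what make the feedback actionable: the unsupported fact ("Ahab is Ishmael's brother") can be handed back to the summarizer in the refinement step.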
Results and Implications
The framework was validated through extensive experiments on well-established benchmarks, demonstrating superior factuality and consistency over existing methods. NarrativeFactScore correlated significantly with human factuality assessments (p-value of 0.00003). The results are further supported by increased factual accuracy and improved ROUGE and BERTScore when applied to movie scripts and other long-form narrative datasets.
Practical and Theoretical Implications
Practically, integrating NarrativeFactScore into narrative summarization workflows can improve the factual reliability of generated content, reducing the effort and cost of manual fact-checking. Theoretically, the method marks a shift toward graph-based comprehension models that support fine-grained evaluation of textual relationships, a capability needed to advance AI's narrative understanding.
Future Developments
Moving forward, this agent-guided approach suggests applications beyond summarization, including narrative generation and interactive system evaluation, where tracking character dynamics and plot intricacies is crucial. The paper encourages further work on expanding the breadth and depth of CKGs, enabling more robust, multifaceted narrative understanding and generation within AI systems.
The work distinctly positions itself by addressing the nuanced evaluation of factuality within long narratives, a critical yet underdeveloped area, providing a robust foundation for future AI advancements in natural language understanding and generation tasks.