AMRFact: Enhancing Summarization Factuality Evaluation with AMR-Driven Negative Samples Generation
"AMRFact: Enhancing Summarization Factuality Evaluation with AMR-Driven Negative Samples Generation," introduces a sophisticated framework leveraging Abstract Meaning Representations (AMRs) to address specific shortcomings in factuality evaluation of abstractive summarization systems. This research targets a pervasive issue in abstractive summarization: the generation of factually inconsistent summaries. Prior methods commonly employ entailment-based approaches, generating perturbed summaries that frequently lack coherence or do not adequately cover various types of factual errors. AMRFact posits a novel solution by utilizing AMR-based perturbations to improve the quality and error-type coverage of factually inconsistent summary generation.
Methodology
AMRFact employs a systematic approach for generating negative samples with a focus on high coherence and comprehensive error-type coverage:
- AMR Parsing and Manipulation: Factually consistent reference summaries are first parsed into AMR graphs, which are then perturbed with controlled factual inconsistencies to produce negative examples (sketched after this list).
- Negative Sample Generation: Because the perturbations operate on the semantic graph rather than on surface strings, the resulting factually inconsistent summaries remain coherent without giving up error-type coverage, a trade-off that limited earlier string-replacement-based methods.
- Filtering with NegFilter: NegFilter validates the generated candidates, applying natural language inference and BARTScore checks to discard samples that do not qualify as valid, high-quality negative examples (also sketched below).
- Model Training: A RoBERTa-based model is fine-tuned on the balanced dataset of positive and filtered negative samples to judge summary factuality (also sketched below).
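To make the pipeline concrete, below is a minimal sketch of the parse-perturb-generate loop, assuming the open-source amrlib and penman Python packages (with amrlib's pretrained parsing and generation models installed). The agent-patient swap shown is one illustrative error type, not the paper's full perturbation taxonomy.

```python
# Sketch: parse a consistent summary to AMR, inject an agent-patient swap,
# then generate the perturbed text back into natural language.
import amrlib
import penman

def swap_agent_patient(amr_string):
    """Return an AMR string with :ARG0 and :ARG1 exchanged on one predicate."""
    graph = penman.decode(amr_string)
    for node in {s for s, r, _ in graph.triples if r in (':ARG0', ':ARG1')}:
        roles = {r for s, r, _ in graph.triples if s == node}
        if ':ARG0' in roles and ':ARG1' in roles:
            swapped = []
            for s, r, t in graph.triples:
                if s == node and r == ':ARG0':
                    swapped.append((s, ':ARG1', t))
                elif s == node and r == ':ARG1':
                    swapped.append((s, ':ARG0', t))
                else:
                    swapped.append((s, r, t))
            return penman.encode(penman.Graph(swapped, top=graph.top))
    return amr_string  # no predicate with both roles; leave unchanged

stog = amrlib.load_stog_model()   # sentence -> AMR parser
gtos = amrlib.load_gtos_model()   # AMR -> sentence generator

summary = "The court fined the company for violating safety rules."
amr_graph = stog.parse_sents([summary])[0]
negative_graph = swap_agent_patient(amr_graph)
negative_summary, _ = gtos.generate([negative_graph])
print(negative_summary[0])  # e.g. a summary in which the company fines the court
```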
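The filtering step could plausibly be implemented as below, assuming the transformers package with the roberta-large-mnli checkpoint and the BARTScorer class from the BARTScore repository; the thresholds and the exact acceptance rule are illustrative placeholders rather than the paper's tuned NegFilter settings.

```python
# Sketch: accept a perturbed summary only if (a) the original summary no longer
# entails it and (b) it stays fluent under BARTScore. Thresholds are placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from bart_score import BARTScorer  # class from https://github.com/neulab/BARTScore

nli_tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli_model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli").eval()
bart_scorer = BARTScorer(device="cpu", checkpoint="facebook/bart-large-cnn")

def entailment_prob(premise, hypothesis):
    """P(entailment) of `hypothesis` given `premise` under roberta-large-mnli."""
    enc = nli_tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = nli_model(**enc).logits
    # Label order for this checkpoint: 0=contradiction, 1=neutral, 2=entailment.
    return torch.softmax(logits, dim=-1)[0, 2].item()

def is_valid_negative(original_summary, perturbed_summary,
                      entail_thresh=0.5, fluency_floor=-4.0):
    """Keep the candidate only if the perturbation truly changed the meaning
    (low entailment) while the text remains coherent (BARTScore above a floor)."""
    changed = entailment_prob(original_summary, perturbed_summary) < entail_thresh
    fluent = bart_scorer.score([original_summary], [perturbed_summary],
                               batch_size=1)[0] > fluency_floor
    return changed and fluent
```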
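Finally, the training step might look roughly like the sketch below, which fine-tunes a RoBERTa classifier on document-summary pairs with the Hugging Face Trainer; the hyperparameters and the toy examples are placeholders, not the paper's configuration.

```python
# Sketch: fine-tune a RoBERTa classifier on (document, summary) pairs with
# binary factuality labels (1 = consistent, 0 = inconsistent).
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large", num_labels=2)

# In practice this holds the balanced set of positives and filtered negatives.
examples = [
    {"document": "The court fined the company.", "summary": "The company was fined.", "label": 1},
    {"document": "The court fined the company.", "summary": "The company fined the court.", "label": 0},
]

def encode(batch):
    return tokenizer(batch["document"], batch["summary"],
                     truncation=True, max_length=512, padding="max_length")

train_data = Dataset.from_list(examples).map(encode, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="factuality-classifier",
                           num_train_epochs=3,
                           per_device_train_batch_size=8,
                           learning_rate=1e-5),
    train_dataset=train_data,
)
trainer.train()
```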
Experimental Results
The researchers evaluated AMRFact on the AggreFact-FtSota benchmark and report substantial improvements over existing systems. In particular, AMRFact achieves state-of-the-art performance on the CNN/Daily Mail split, exceeding the previous best system's balanced accuracy by 2.1%. The experiments underscore AMRFact's effectiveness at detecting factual inconsistencies in summaries produced by a range of summarization systems.
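Balanced accuracy is the mean of per-class recall, which keeps the skewed label distribution of factuality benchmarks from inflating scores. A small illustration with scikit-learn and made-up predictions:

```python
# Balanced accuracy = mean of per-class recall; the labels below are made up
# solely to illustrate the metric, not taken from the benchmark.
from sklearn.metrics import balanced_accuracy_score

y_true = [1, 1, 1, 1, 0, 0]   # 1 = factually consistent, 0 = inconsistent
y_pred = [1, 1, 1, 0, 0, 1]
print(balanced_accuracy_score(y_true, y_pred))  # (3/4 + 1/2) / 2 = 0.625
```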
Implications and Future Directions
AMRFact's contribution is significant both practically and theoretically. Practically, it offers a more reliable way to generate the synthetic data used to train factuality evaluators, which can help reduce factually inconsistent summaries in deployed systems. Theoretically, it demonstrates the utility of graph-based semantic representations for complex natural language understanding tasks.
Future research could extend AMRFact to multilingual datasets, validating its applicability across diverse linguistic contexts. Additionally, integrating AMR-based perturbation frameworks with large language models (LLMs) could enhance the robustness and factual alignment of generated content.
Conclusion
This paper presents a principled approach to summarization factuality evaluation built on AMR-driven negative sample generation. By addressing the coherence and error-type coverage issues that limit existing systems, AMRFact establishes a strong foundation for factuality evaluation metrics and paves the way for further advances in natural language generation and understanding.