
FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (2403.02270v3)

Published 4 Mar 2024 in cs.CL

Abstract: Recent advancements in text summarization, particularly with the advent of LLMs, have shown remarkable performance. However, a notable challenge persists as a substantial number of automatically-generated summaries exhibit factual inconsistencies, such as hallucinations. In response to this issue, various approaches for the evaluation of consistency for summarization have emerged. Yet, these newly-introduced metrics face several limitations, including lack of interpretability, focus on short document summaries (e.g., news articles), and computational impracticality, especially for LLM-based metrics. To address these shortcomings, we propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE), a more interpretable and efficient factuality-oriented metric. FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary. Our metric sets a new state of the art on AGGREFACT, the de-facto benchmark for factuality evaluation. Moreover, we extend our evaluation to a more challenging setting by conducting a human annotation process of long-form summarization. In the hope of fostering research in summarization factuality evaluation, we release the code of our metric and our factuality annotations of long-form summarization at https://github.com/Babelscape/FENICE.

Guidelines for Submitting Papers to *ACL Conferences Using LaTeX

Introduction to the Guidelines

The paper provides a comprehensive set of instructions for authors intending to submit their manuscripts to *ACL conferences using the LaTeX document preparation system. It details the essential aspects of formatting, referencing, and structuring documents in accordance with the specified requirements for *ACL proceedings. The document serves both as a guideline and a template that authors can directly employ to prepare their manuscripts, ensuring uniformity and adherence to the publication standards set by the Association for Computational Linguistics (ACL).

LaTeX Template and Styling Details

A major focus of the document is on the LaTeX style files (acl.sty) and template (acl.tex) provided for author use. These files are designed to simplify the process of formatting by automatically applying the correct styles to various elements of the manuscript, including the title, author list, abstract, main text, references, and appendices. The authors are directed to use the PDFLaTeX engine for generating PDF files due to its wide support and compatibility with the typesetting requirements of *ACL conferences. Additionally, the document mentions the suitability of XeLaTeX for manuscripts that contain non-Latin scripts, thus accommodating a broader range of linguistics research.
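
As a concrete illustration of the engine choice, the sketch below shows how a XeLaTeX build would typically select fonts for a non-Latin script. This is an assumption on top of the guidelines, which name only the engine: fontspec is the standard XeLaTeX font-selection package, and Amiri is merely one example of a font covering Arabic script.

    % Compile with XeLaTeX. fontspec and the Amiri font are assumptions:
    % the guidelines name the engine, not specific packages or fonts.
    \usepackage{fontspec}
    \setmainfont{Times New Roman}                    % keep a Times-style main font
    \newfontfamily\arabicfont[Script=Arabic]{Amiri}  % font for Arabic-script material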

Structuring the Manuscript

The guidelines outline the structure that manuscripts should follow, beginning with the declaration of the document class and the loading of the acl style file, with or without the review option depending on the stage of submission. Authors are instructed on formatting the document's preamble, including the specification of the title and author information. The document emphasizes adherence to the default font (Times Roman) and prohibits modifying the default caption sizes for tables and figures.
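
A minimal preamble consistent with this description might look as follows; this is a sketch based on the guidelines, with placeholder title and author values. The acl style file then applies the required styles (font, margins, and so on) automatically.

    \documentclass[11pt]{article}
    % The "review" option produces the anonymized submission version;
    % drop it for the camera-ready copy.
    \usepackage[review]{acl}

    \title{Title of the Submission}
    \author{First Author \\ Affiliation \\ \texttt{author@example.org}}

    \begin{document}
    \maketitle
    \begin{abstract}
    A short abstract.
    \end{abstract}
    % Main text, references, and appendices follow here.
    \end{document}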

Handling Special Elements

Several sections of the guidelines are devoted to the correct handling of specialized elements within a manuscript:

  • Footnotes and Hyperlinks: Instructions on creating footnotes and addressing common compilation errors with hyperlinks in PDFLaTeX.
  • Tables and Figures: Detailed formatting rules for tables and figures, including an example table that demonstrates how to correctly input accented characters.
  • Citations and References: Guidance on using the natbib package for citations within the text and on formatting the references section. Authors are encouraged to include DOIs or URLs for referenced works to enhance the accessibility and discoverability of cited sources. (A combined sketch of these elements follows this list.)
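
The sketch below combines these three elements: a footnote, accented characters entered with standard LaTeX accent commands, and natbib's textual and parenthetical citation commands. The citation key is a placeholder, and the comment about URL support is an assumption.

    % A footnote:
    Metrics for factuality differ widely.\footnote{This is a footnote.}

    % Accented characters via standard accent commands, e.g. in a table cell:
    \begin{tabular}{ll}
    Command    & Output \\
    \verb|\"o| & \"o    \\
    \verb|\'e| & \'e    \\
    \end{tabular}

    % natbib citations, textual vs. parenthetical (the key is a placeholder):
    \citet{Gusfield:97} introduced the method, later refined \citep{Gusfield:97}.

    % A URL for a cited resource (assumes the style file provides url/hyperref support):
    \url{https://github.com/Babelscape/FENICE}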

Appendices and Acknowledgements

The document provides information on including appendices by switching the section numbering to letters and crediting contributions in the acknowledgements section. It also reflects on the collaborative effort in evolving the manuscript guidelines across various ACL-related conferences, acknowledging contributions from several individuals over the years.
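
For example, the following sketch shows both conventions; the section titles and acknowledgement text are placeholders, and \appendix is the standard LaTeX switch to lettered section numbering.

    \section*{Acknowledgements}   % unnumbered section crediting contributions
    We thank our colleagues for their feedback.

    \appendix                     % switches \section numbering from digits to letters
    \section{Additional Material} % typeset as Appendix A
    \label{sec:appendix}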

Conclusion and Implications for Future Submissions

The guidelines presented in the document are critical for authors submitting to *ACL conferences, ensuring that all manuscripts maintain a consistent format and meet the required standards. Adherence to them simplifies the review process, enhances the readability of conference proceedings, and facilitates the dissemination of research findings within the computational linguistics community. As the field evolves, the guidelines can be expected to be updated periodically to reflect new typesetting technologies, changing publication practices, and the growing diversity of research topics in computational linguistics.

Authors (3)
  1. Alessandro Scirè
  2. Karim Ghonim
  3. Roberto Navigli