
German also Hallucinates! Inconsistency Detection in News Summaries with the Absinth Dataset (2403.03750v2)

Published 6 Mar 2024 in cs.CL and cs.AI

Abstract: The advent of LLMs has led to remarkable progress on a wide range of natural language processing tasks. Despite the advances, these large-sized models still suffer from hallucinating information in their output, which poses a major issue in automatic text summarization, as we must guarantee that the generated summary is consistent with the content of the source document. Previous research addresses the challenging task of detecting hallucinations in the output (i.e. inconsistency detection) in order to evaluate the faithfulness of the generated summaries. However, these works primarily focus on English and recent multilingual approaches lack German data. This work presents absinth, a manually annotated dataset for hallucination detection in German news summarization and explores the capabilities of novel open-source LLMs on this task in both fine-tuning and in-context learning settings. We open-source and release the absinth dataset to foster further research on hallucination detection in German.

Inconsistency Detection in German News Summaries: Introducing the ABSINTH Dataset

Overview of the ABSINTH Dataset

The ABSINTH dataset represents a significant advancement in the field of NLP, specifically targeting the detection of hallucinations in German news article summaries. The dataset comprises 4,314 manually annotated pairs of news articles and their corresponding summaries. The annotations differentiate between two types of hallucination: intrinsic, where the summary contradicts the source article (counterfactual information), and extrinsic, where the summary adds information that cannot be verified against the article. The summaries were generated by a range of systems, from pre-trained language models fine-tuned for German summarization to recent prompt-based LLMs such as GPT-4 and Llama 2, making the dataset a comprehensive resource for studying summarization inconsistencies in German.
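The sentence-level annotation scheme described above can be sketched as a small record type. This is an illustrative layout only: the field names and the exact label strings are assumptions, not the dataset's published schema.

```python
from dataclasses import dataclass

# Hypothetical label set mirroring the faithful/intrinsic/extrinsic
# distinction described in the paper summary (names are illustrative).
LABELS = ("Faithful", "Intrinsic Hallucination", "Extrinsic Hallucination")

@dataclass
class AbsinthExample:
    article: str           # full German news article (source document)
    summary_sentence: str  # one sentence of a model-generated summary
    label: str             # one of LABELS

    def __post_init__(self) -> None:
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")

# A counterfactual detail ("Streik" instead of "Defekt") would be an
# intrinsic hallucination: it contradicts the source article.
example = AbsinthExample(
    article="Der Zug fiel wegen eines Defekts aus.",
    summary_sentence="Der Zug fiel wegen eines Streiks aus.",
    label="Intrinsic Hallucination",
)
```

Keeping the unit of annotation at the summary sentence, rather than the whole summary, is what lets a single summary mix faithful and hallucinated content in the labels.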

Dataset Construction and Annotation

The dataset construction involved a careful selection of 200 articles from the "20Minuten" test split, ensuring a variety of source material. Summaries were generated using an array of models, including multilingual transformer-based models and cutting-edge LLMs, with a focus on producing diverse examples of factual consistency and hallucinations. The annotation process was meticulously designed to ensure the high quality of the labels assigned to each summary sentence. This involved training and continuous monitoring of annotators, ensuring an agreement level indicative of reliable data quality.
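Agreement among multiple annotators of the kind monitored here is conventionally quantified with a chance-corrected statistic such as Fleiss' kappa. The paper's exact metric and threshold are not restated in this summary, so the following is a generic, minimal implementation of Fleiss' kappa rather than the authors' evaluation code.

```python
import numpy as np

def fleiss_kappa(ratings: np.ndarray) -> float:
    """Fleiss' kappa for a (n_items, n_categories) matrix of rating
    counts, where each row sums to the (fixed) number of raters."""
    n_items, _ = ratings.shape
    n_raters = ratings[0].sum()
    # Overall proportion of assignments falling into each category.
    p_j = ratings.sum(axis=0) / (n_items * n_raters)
    # Observed pairwise agreement per item, then averaged.
    p_i = (ratings * (ratings - 1)).sum(axis=1) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # Agreement expected by chance.
    p_e = (p_j ** 2).sum()
    return (p_bar - p_e) / (1 - p_e)
```

With three raters and two categories, unanimous rows yield kappa = 1, while a 2-vs-1 split on every item drives kappa below zero, i.e. worse than chance.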

Experimentation with Open-Source LLMs

The paper reports experiments assessing how well recent open-source LLMs detect inconsistencies in the generated summaries. This involved fine-tuning the models on the ABSINTH dataset as well as evaluating them in zero-shot and few-shot settings. Notably, the fine-tuned mBERT, a comparatively small multilingual encoder, proved particularly effective, outperforming the larger LLMs at detecting both intrinsic and extrinsic hallucinations.
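A zero-shot setup of the kind evaluated here amounts to prompting the model with the article and one summary sentence, then mapping its free-text answer back to a label. The template and label strings below are assumptions for illustration, not the paper's actual prompt.

```python
# Illustrative zero-shot prompt for sentence-level inconsistency
# detection; wording and label names are hypothetical.
LABELS = ["Faithful", "Intrinsic Hallucination", "Extrinsic Hallucination"]

def build_prompt(article: str, sentence: str) -> str:
    return (
        "You are checking a German news summary against its source article.\n"
        f"Article:\n{article}\n\n"
        f"Summary sentence:\n{sentence}\n\n"
        "Does the sentence contradict the article (Intrinsic Hallucination), "
        "add information absent from the article (Extrinsic Hallucination), "
        "or is it Faithful? "
        f"Answer with exactly one of: {', '.join(LABELS)}."
    )

def parse_label(model_output: str) -> str:
    # Check the most specific labels first; default to Faithful.
    for label in ("Intrinsic Hallucination", "Extrinsic Hallucination", "Faithful"):
        if label.lower() in model_output.lower():
            return label
    return "Faithful"
```

A few-shot variant would simply prepend a handful of labeled article/sentence examples to the same prompt before the query instance.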

Implications and Future Directions

The ABSINTH dataset opens new avenues for research into inconsistency detection in non-English languages, addressing a gap in the field of NLP. The experiments conducted demonstrate the potential of fine-tuning well-established transformer models like mBERT for this purpose. However, they also highlight the challenges faced by state-of-the-art LLMs in generalizing their performance to tasks like hallucination detection. Future work may explore the development of more sophisticated models or annotation techniques that can further enhance the accuracy and reliability of inconsistency detection.

Conclusion

The creation of the ABSINTH dataset marks a significant step towards understanding and improving the quality of automated news article summarization in German. It provides a valuable resource for developing and benchmarking NLP models capable of identifying inconsistencies in text summarization. The research presented not only sheds light on the current capabilities and limitations of existing models but also sets the stage for future advancements in the domain of trustworthy and reliable automatic text summarization.

Authors (3)
  1. Laura Mascarell (3 papers)
  2. Ribin Chalumattu (2 papers)
  3. Annette Rios (10 papers)