FREDSum: A Dialogue Summarization Corpus for French Political Debates (2312.04843v1)
Abstract: Recent advances in deep learning, and especially the invention of encoder-decoder architectures, have significantly improved the performance of abstractive summarization systems. However, most research has focused on written documents, neglecting the problem of multi-party dialogue summarization. In this paper, we present a dataset of French political debates intended to expand the resources available for multilingual dialogue summarization. Our dataset consists of manually transcribed and annotated political debates covering a range of topics and perspectives. We highlight the importance of high-quality transcription and annotation for training accurate and effective dialogue summarization models, and emphasize the need for multilingual resources to support dialogue summarization in non-English languages. We also provide baseline experiments using state-of-the-art methods and encourage further research to advance the field of dialogue summarization. Our dataset will be made publicly available to the research community.