PMIndiaSum: Multilingual and Cross-lingual Headline Summarization for Languages in India

Published 15 May 2023 in cs.CL | (2305.08828v2)

Abstract: This paper introduces PMIndiaSum, a multilingual and massively parallel summarization corpus focused on languages in India. Our corpus provides a training and testing ground for 14 languages from four language families and, with 196 language pairs, is the largest to date. We detail our construction workflow, including data acquisition, processing, and quality assurance. Furthermore, we publish benchmarks for monolingual, cross-lingual, and multilingual summarization using fine-tuning, prompting, and translate-and-summarize approaches. Experimental results confirm the crucial role of our data in aiding summarization between Indian languages. Our dataset is publicly available and can be freely modified and re-distributed.
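
The figures in the abstract (14 languages, 196 language pairs) are consistent with counting ordered (document language, summary language) pairs, same-language pairs included, since 14 × 14 = 196. The sketch below only illustrates that arithmetic. The counting convention is an assumption rather than something the abstract states, and the language identifiers are hypothetical placeholders because the abstract does not list the languages.

```python
from itertools import product

# Hypothetical placeholder identifiers: the abstract does not name the 14 languages.
languages = [f"lang{i:02d}" for i in range(1, 15)]

# Assumption: a "language pair" is an ordered (document language, summary language)
# pair, and same-language (monolingual) pairs are counted too.
pairs = list(product(languages, repeat=2))

monolingual = [(src, tgt) for src, tgt in pairs if src == tgt]
cross_lingual = [(src, tgt) for src, tgt in pairs if src != tgt]

print(len(pairs), len(monolingual), len(cross_lingual))
```

Under this convention the 196 pairs split into 14 monolingual and 182 cross-lingual directions.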
