
XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages (2106.13822v1)

Published 25 Jun 2021 in cs.CL

Abstract: Contemporary works on abstractive text summarization have focused primarily on high-resource languages like English, mostly due to the limited availability of datasets for low/mid-resource ones. In this work, we present XL-Sum, a comprehensive and diverse dataset comprising 1 million professionally annotated article-summary pairs from BBC, extracted using a set of carefully designed heuristics. The dataset covers 44 languages ranging from low to high-resource, for many of which no public dataset is currently available. XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation. We fine-tune mT5, a state-of-the-art pretrained multilingual model, with XL-Sum and experiment on multilingual and low-resource summarization tasks. XL-Sum induces competitive results compared to the ones obtained using similar monolingual datasets: we show higher than 11 ROUGE-2 scores on 10 languages we benchmark on, with some of them exceeding 15, as obtained by multilingual training. Additionally, training on low-resource languages individually also provides competitive performance. To the best of our knowledge, XL-Sum is the largest abstractive summarization dataset in terms of the number of samples collected from a single source and the number of languages covered. We are releasing our dataset and models to encourage future research on multilingual abstractive summarization. The resources can be found at \url{https://github.com/csebuetnlp/xl-sum}.

Overview of XL-Sum: A Comprehensive Multilingual Abstractive Summarization Dataset

The research paper introduces XL-Sum, an extensive dataset designed for multilingual abstractive summarization, covering 44 languages. The dataset addresses a significant gap in existing resources: most previous efforts have concentrated on high-resource languages, leaving many low- and mid-resource languages without adequate data for model training and evaluation. It consists of 1 million professionally annotated article-summary pairs extracted from BBC articles using carefully designed heuristics that keep the summaries high quality and genuinely abstractive. Beyond the dataset itself, the paper contributes released models and standard benchmarks that researchers focusing on low-resource languages can build on.

Dataset Characteristics and Contributions

XL-Sum is distinguished by its scale and scope, as it is the largest known abstractive summarization dataset in the number of languages covered and the volume of data from a single source. The dataset's languages range from widely spoken ones such as English, Hindi, and Chinese to lesser-resourced languages like Azerbaijani and Amharic. Notably, for many constituent languages, XL-Sum provides the first publicly available summarization dataset, marking a crucial step towards enabling research and development for these linguistic communities.

The dataset's summaries are highly abstractive. They are collected through a structured methodology that exploits BBC's consistent editorial style, in which a bold summary paragraph is placed at the top of each article. This approach lets the authors gather summaries that distill an article's main points without directly lifting text from the body, yielding summaries that are concise and coherent. The dataset is also readily extensible, allowing it to grow as more articles become available.
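To make the extraction heuristic concrete, below is a minimal sketch of how such article-summary pairs might be collected, assuming BBC pages open with a bold lead paragraph. The tag selectors and quality filters are illustrative assumptions, not the authors' actual crawler code.

```python
# Hypothetical sketch of the heuristic described above: pair a bold lead
# paragraph (treated as the summary) with the remaining body text.
from bs4 import BeautifulSoup

def extract_article_summary_pair(html: str):
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = soup.find_all("p")
    if not paragraphs:
        return None

    # Heuristic: the first paragraph rendered in bold is treated as the summary.
    first = paragraphs[0]
    if first.find("b") is None and first.find("strong") is None:
        return None  # no bold lead paragraph -> skip this article

    summary = first.get_text(strip=True)
    body = " ".join(p.get_text(strip=True) for p in paragraphs[1:])

    # Basic quality filters (illustrative): discard pairs that are too short
    # or where the summary appears verbatim in the body.
    if len(summary.split()) < 5 or summary in body:
        return None
    return {"text": body, "summary": summary}
```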

Empirical Evaluation and Baselines

The paper presents comprehensive empirical evaluations using mT5, a state-of-the-art multilingual sequence-to-sequence model. The authors run experiments in multilingual and low-resource scenarios, establishing several benchmarks. In the multilingual setting, where a single model handles all 44 languages, the fine-tuned model achieves ROUGE-2 scores above 11 on the ten benchmarked languages, with some exceeding 15, which is competitive given the linguistic diversity of the dataset. In English, for instance, performance is comparable to existing results on XSum, a dataset with similar characteristics but limited to English.
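As a concrete illustration of this setup, the sketch below runs multilingual inference with an mT5 checkpoint fine-tuned on XL-Sum through the Hugging Face Transformers API. The checkpoint path and generation settings are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch: summarize an article with an mT5 model fine-tuned on XL-Sum.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "csebuetnlp/mT5_multilingual_XLSum"  # assumed hub path
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

article = "..."  # any article text in one of the 44 covered languages

inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=512)
summary_ids = model.generate(
    **inputs,
    max_length=84,          # XL-Sum summaries are short, single-sentence style
    num_beams=4,
    no_repeat_ngram_size=2,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```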

The experiments also extend to low-resource settings, demonstrating that fine-tuning models on individual languages—even those with fewer data samples—produces results closely aligned with multilingual models. This finding highlights XL-Sum's utility even in constrained computational environments and reaffirms its potential to foster advancements in under-researched languages.
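For readers who want to reproduce a single-language run, the following is a hedged sketch of a low-resource fine-tuning setup using Hugging Face Datasets and Seq2SeqTrainer. The dataset path, language config, and hyperparameters are assumptions rather than the paper's exact settings.

```python
# Sketch: fine-tune mT5 on one XL-Sum language split (here, Amharic).
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

dataset = load_dataset("csebuetnlp/xlsum", "amharic")  # assumed config name
tokenizer = AutoTokenizer.from_pretrained("google/mt5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-base")

def preprocess(batch):
    # Tokenize article bodies as inputs and summaries as labels.
    model_inputs = tokenizer(batch["text"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=84, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(
    preprocess, batched=True, remove_columns=dataset["train"].column_names
)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="mt5-xlsum-amharic",
        per_device_train_batch_size=4,
        num_train_epochs=3,       # illustrative; low-resource runs may need more
        learning_rate=5e-4,
        predict_with_generate=True,
    ),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```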

Implications and Future Directions

The implications of XL-Sum are twofold: practical and theoretical. Practically, the dataset enables the development of robust multilingual summarization models, fostering technological inclusivity for languages previously considered under-resourced. Theoretically, it provides a fertile ground for research into multilingual and cross-lingual NLP models, enabling studies of positive transfer among linguistically similar languages.

Looking forward, this dataset could inspire studies in cross-lingual summarization, where models are trained on one language and evaluated on another, extending the reach of summarization capabilities to more global contexts. Additionally, the approach of utilizing consistent data sources for cross-linguistic dataset creation may serve as a model for future dataset compilation efforts in NLP.

In conclusion, XL-Sum represents a significant advancement in multilingual abstractive summarization datasets. Its comprehensive nature not only fills a critical gap for resource-constrained languages but also paves the way for inclusive and wide-reaching NLP research. The release of this dataset is likely to catalyze further studies and applications for languages that have historically been underrepresented in computational linguistics.

Authors (8)
  1. Tahmid Hasan (10 papers)
  2. Abhik Bhattacharjee (12 papers)
  3. Md Saiful Islam (107 papers)
  4. Kazi Samin (3 papers)
  5. Yuan-Fang Li (90 papers)
  6. Yong-Bin Kang (10 papers)
  7. M. Sohel Rahman (52 papers)
  8. Rifat Shahriyar (25 papers)
Citations (306)