
WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization (2010.03093v1)

Published 7 Oct 2020 in cs.CL

Abstract: We introduce WikiLingua, a large-scale, multilingual dataset for the evaluation of crosslingual abstractive summarization systems. We extract article and summary pairs in 18 languages from WikiHow, a high quality, collaborative resource of how-to guides on a diverse set of topics written by human authors. We create gold-standard article-summary alignments across languages by aligning the images that are used to describe each how-to step in an article. As a set of baselines for further studies, we evaluate the performance of existing cross-lingual abstractive summarization methods on our dataset. We further propose a method for direct crosslingual summarization (i.e., without requiring translation at inference time) by leveraging synthetic data and Neural Machine Translation as a pre-training step. Our method significantly outperforms the baseline approaches, while being more cost efficient during inference.

Overview of WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization

The paper introduces WikiLingua, a comprehensive benchmark dataset for evaluating cross-lingual abstractive summarization. The resource comprises article-summary pairs drawn from WikiHow in 18 languages, addressing the previous scarcity of large-scale multilingual summarization datasets. What sets WikiLingua apart is its construction: because WikiHow reuses the same illustrative image for a given how-to step across language editions, aligning steps by their images yields gold-standard article-summary alignments across languages. This sidesteps the ambiguities that arise when data is aligned through translation or content matching.
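
The alignment idea is simple enough to sketch. In the snippet below, Step, image_id, and align_steps are illustrative names chosen for this write-up rather than identifiers from the authors' code; the point is only that a shared image identifier suffices to pair steps across languages without any translation.

```python
from dataclasses import dataclass

@dataclass
class Step:
    image_id: str   # identifier of the step's illustrative image, reused across language editions
    document: str   # the full step paragraph (the "article" side)
    summary: str    # the bolded one-line step description (the "summary" side)

def align_steps(english_steps, target_steps):
    """Pair English and target-language steps that reference the same image."""
    english_by_image = {step.image_id: step for step in english_steps}
    return [(english_by_image[step.image_id], step)
            for step in target_steps
            if step.image_id in english_by_image]
```

Articles whose steps align this way contribute parallel article-summary pairs to the dataset.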

Dataset Characteristics and Contributions

The dataset surpasses prior resources in both size and language diversity. WikiLingua consists of 141,457 English articles with parallel summaries in other languages; each of the other languages contributes an average of 42,783 articles aligned to an English counterpart. This parallel, multilingual structure supports research and evaluation in both cross-lingual and multilingual settings. Existing multilingual summarization datasets such as MultiLing and Global Voices fall short in comparison because of their limited article coverage and the absence of parallel summaries across languages.
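
For readers who want to inspect the data, a copy of WikiLingua is distributed through the Hugging Face Hub. The loading sketch below assumes the dataset name "wiki_lingua" and a per-language configuration such as "spanish"; both should be verified against the dataset card.

```python
# Exploration sketch: the "wiki_lingua" name and "spanish" config are assumptions;
# check the dataset card on the Hub for the exact identifiers and field layout.
from datasets import load_dataset

spanish = load_dataset("wiki_lingua", "spanish", split="train")
print(len(spanish))  # number of Spanish articles with aligned English counterparts
print(spanish[0])    # inspect one record (sections with documents, summaries, and English links)
```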

The paper also details the data collection process, emphasizing the use of human-written and reviewed content from WikiHow—a well-known repository of how-to guides—as a reliable source. This choice ensures content quality, providing researchers with a dependable benchmark for developing sophisticated summarization methods.

Baseline Evaluations and Proposed Method

The authors evaluate several existing cross-lingual summarization approaches, covering the two standard pipelines: translate-then-summarize and summarize-then-translate. These pipelines have been the default because they reuse existing monolingual summarization models together with off-the-shelf machine translation. Each, however, carries inherent weaknesses: errors from the translation step propagate into the summaries, and invoking an MT system at inference time adds latency and cost, as sketched below.
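
The two pipelines can be laid out schematically. The translate and summarize functions here are placeholders for whatever MT system and monolingual summarizers are plugged in; they are not a specific API.

```python
def translate(text: str, src: str, tgt: str) -> str:
    """Placeholder for any machine translation system."""
    raise NotImplementedError

def summarize_en(text: str) -> str:
    """Placeholder for any English monolingual summarizer."""
    raise NotImplementedError

def summarize_src(text: str) -> str:
    """Placeholder for a summarizer in the source language (often unavailable)."""
    raise NotImplementedError

def translate_then_summarize(article: str, src_lang: str) -> str:
    # Translate the full article first, so MT cost scales with document length
    # and MT errors feed directly into the summarizer.
    return summarize_en(translate(article, src=src_lang, tgt="en"))

def summarize_then_translate(article: str, src_lang: str) -> str:
    # Summarize in the source language first, then translate only the short summary;
    # this requires a capable summarizer for every source language.
    return translate(summarize_src(article), src=src_lang, tgt="en")
```

Either way, an MT call sits in the inference path, which is the source of the error-propagation and efficiency concerns noted above.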

To counter these inefficiencies, the paper proposes a direct cross-lingual summarization approach that leverages synthetic data generated through machine translation together with Neural Machine Translation (NMT) as a pre-training step. The resulting model requires no translation during inference, which substantially reduces latency and cost. Empirical results show that this method outperforms the baseline pipelines while remaining the more cost-effective option for real-world deployment.
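
One plausible reading of that recipe is outlined below. The two-stage structure, the direction of the synthetic-data translation (English articles rendered into the source language), and every name in the sketch (MT, build_synthetic_pairs, model.fit) are assumptions made for illustration, not the authors' actual code.

```python
class MT:
    """Hypothetical MT wrapper, used only offline to build synthetic training data."""
    def translate(self, text: str, to: str) -> str:
        raise NotImplementedError

def build_synthetic_pairs(english_articles, english_summaries, mt: MT, src_lang: str):
    """Machine-translate English articles into the (non-English) source language while keeping
    the original English summaries, yielding synthetic (source article, English summary) pairs."""
    return [(mt.translate(article, to=src_lang), summary)
            for article, summary in zip(english_articles, english_summaries)]

def train_direct_summarizer(model, mt_parallel_corpus, synthetic_pairs):
    # Stage 1 (assumed): pre-train the seq2seq model on translation so the encoder and
    # decoder learn a cross-lingual mapping (model.fit is a hypothetical training call).
    model.fit(mt_parallel_corpus)
    # Stage 2: fine-tune on the synthetic cross-lingual summarization pairs so the model
    # learns to compress content while crossing languages in a single pass.
    model.fit(synthetic_pairs)
    return model

# At inference time the trained model maps a source-language article directly to an English
# summary in one forward pass; no MT system sits in the serving path.
```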

Implications and Future Directions

The development of the WikiLingua dataset marks an advance in the field of natural language processing, particularly for tasks requiring cross-lingual capabilities. Its impact extends beyond summarization. The parallel summaries in 18 languages open pathways for expansive research in multilingual text processing, machine translation, and even alignment and co-reference resolution across languages.

The practical implications of this research lie in its potential applications in creating more equitable access to information and knowledge sharing across linguistic barriers. As organizations continue to embrace global outreach, tools trained on datasets like WikiLingua could bridge information divides and enhance cross-lingual communications.

From a theoretical standpoint, this work invites further investigation into zero-shot and few-shot learning methods within multilingual contexts. Future research could explore the robustness of abstractive summarization models when extended to low-resource languages, making crucial strides towards inclusive AI development.

In summary, the paper presents a significant contribution to the field of cross-lingual summarization by providing the WikiLingua dataset—a robust, multilingual resource superior in scale and quality to its predecessors—and demonstrating a novel approach to direct cross-lingual summarization that holds promise for both scientific advancement and practical application.

Authors (4)
  1. Faisal Ladhak (31 papers)
  2. Esin Durmus (38 papers)
  3. Claire Cardie (74 papers)
  4. Kathleen McKeown (85 papers)
Citations (186)