
XWikiGen: Cross-lingual Summarization for Encyclopedic Text Generation in Low Resource Languages (2303.12308v2)

Published 22 Mar 2023 in cs.CL

Abstract: The lack of encyclopedic text contributors, especially on Wikipedia, makes automated text generation for low-resource (LR) languages a critical problem. Existing work on Wikipedia text generation has focused on English only, where English reference articles are summarized to generate English Wikipedia pages. For low-resource languages, however, the scarcity of reference articles makes monolingual summarization ineffective. Hence, in this work we propose XWikiGen, the task of cross-lingual multi-document summarization of text from multiple reference articles, written in various languages, to generate Wikipedia-style text. Accordingly, we contribute a benchmark dataset, XWikiRef, spanning ~69K Wikipedia articles covering five domains and eight languages. We use this dataset to train a two-stage system whose input is a set of citations and a section title and whose output is a section-specific LR summary. The proposed system is based on the novel idea of neural unsupervised extractive summarization to coarsely identify salient information, followed by a neural abstractive model that generates the section-specific text. Extensive experiments show that multi-domain training outperforms the multilingual setup on average.
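The abstract outlines a two-stage extract-then-abstract pipeline: an unsupervised extractive step first selects salient sentences from the multilingual reference articles, and an abstractive model then fuses them into section-specific text. The sketch below illustrates that shape only; the specific components (a multilingual sentence encoder scoring sentences against the section title, and an mT5 checkpoint for generation) are illustrative assumptions, since the abstract does not name the paper's exact models.

```python
# Minimal sketch of the two-stage extract-then-abstract pipeline described
# in the abstract. Model names below are illustrative assumptions, not the
# paper's reported setup.
import torch
from sentence_transformers import SentenceTransformer, util
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM


def extract_salient(sentences: list[str], section_title: str, top_k: int = 10) -> list[str]:
    """Stage 1 (extractive): rank reference sentences by embedding
    similarity to the target section title and keep the top-k."""
    encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    sent_emb = encoder.encode(sentences, convert_to_tensor=True)
    title_emb = encoder.encode(section_title, convert_to_tensor=True)
    scores = util.cos_sim(title_emb, sent_emb)[0]           # shape: (len(sentences),)
    top = torch.topk(scores, k=min(top_k, len(sentences))).indices.tolist()
    return [sentences[i] for i in sorted(top)]              # preserve source order


def generate_section(salient: list[str], section_title: str) -> str:
    """Stage 2 (abstractive): fuse the extracted sentences into
    section-specific text. A multilingual seq2seq model fine-tuned on
    XWikiRef is assumed; the raw google/mt5-small checkpoint here is a
    placeholder and would need that fine-tuning to be useful."""
    name = "google/mt5-small"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSeq2SeqLM.from_pretrained(name)
    prompt = section_title + " </s> " + " ".join(salient)
    inputs = tokenizer(prompt, truncation=True, max_length=1024, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=256, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Conditioning the extractive stage on the section title is what makes the output section-specific: the same pool of cited reference sentences yields different salient subsets for, say, an "Early life" versus a "Career" section.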

Authors (6)
  1. Dhaval Taunk (3 papers)
  2. Shivprasad Sagare (4 papers)
  3. Anupam Patil (1 paper)
  4. Shivansh Subramanian (3 papers)
  5. Manish Gupta (67 papers)
  6. Vasudeva Varma (47 papers)
Citations (2)
