
WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions from Paragraphs (2209.13101v1)

Published 27 Sep 2022 in cs.CL

Abstract: As free online encyclopedias with massive volumes of content, Wikipedia and Wikidata are key to many NLP tasks, such as information retrieval, knowledge base building, machine translation, text classification, and text summarization. In this paper, we introduce WikiDes, a novel dataset to generate short descriptions of Wikipedia articles for the problem of text summarization. The dataset consists of over 80k English samples on 6987 topics. We set up a two-phase summarization method - description generation (Phase I) and candidate ranking (Phase II) - as a strong approach that relies on transfer and contrastive learning. For description generation, T5 and BART show their superiority compared to other small-scale pre-trained models. By applying contrastive learning with the diverse input from beam search, the metric fusion-based ranking models outperform the direct description generation models significantly, by up to 22 ROUGE points in both the topic-exclusive and topic-independent splits. Furthermore, human evaluation supports the Phase II descriptions: evaluators preferred them over the gold descriptions 45.33% of the time, compared to 23.66% for Phase I. In terms of sentiment analysis, the generated descriptions capture the sentiment polarities of the source paragraphs less effectively than the gold descriptions do. The automatic generation of new descriptions reduces the human effort in creating them and enriches Wikidata-based knowledge graphs. Our paper shows a practical impact on Wikipedia and Wikidata since there are thousands of missing descriptions. Finally, we expect WikiDes to be a useful dataset for related works in capturing salient information from short paragraphs. The curated dataset is publicly available at: https://github.com/declare-lab/WikiDes.

Citations (11)

Summary

  • The paper presents a two-phase methodology combining T5/BART-based description generation with contrastive learning for candidate ranking, achieving an improvement of up to 22 ROUGE points.
  • It leverages over 80,000 samples from 6,987 diverse Wikipedia topics, employing beam search to capture semantic nuances in short text synthesis.
  • The study confirms its effectiveness through quantitative metrics and human evaluations, highlighting practical benefits for automating Wikidata description updates.

Overview of "WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions from Paragraphs"

The paper introduces WikiDes, a dataset curated for the task of generating concise descriptions from paragraphs, utilizing content extracted from Wikipedia and Wikidata. The dataset incorporates over 80,000 English samples spanning 6,987 topics, reflecting diverse domains represented in these massive information repositories. The authors propose a two-phase methodology for summarization, consisting of description generation followed by candidate ranking. The approach leverages the strengths of techniques like transfer learning and contrastive learning to optimize summary generation.

Methodological Framework

Phase I: Description Generation

In the first phase, the paper employs pre-trained models like T5 and BART, which have demonstrated robust capabilities in text generation tasks. These models are fine-tuned to decode paragraph representations into short descriptions. The choice of these models is driven by their architectural suitability for abstractive summarization tasks, particularly their ability to capture semantic nuances within large training samples. The process employs beam search to generate varied candidate descriptions, thus enriching the pool of output for subsequent refinement.
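The candidate-generation step can be illustrated with a toy beam search. The scoring function below is a hypothetical stand-in for the token probabilities of a fine-tuned T5/BART decoder, not the paper's actual models; the point is only to show how keeping several high-scoring partial sequences yields a pool of varied candidate descriptions.

```python
import math

def beam_search(start, expand, score, beam_width=4, max_len=5):
    """Generic beam search: at each step, keep only the `beam_width`
    highest-scoring partial sequences (by cumulative log-probability)."""
    beams = [([start], 0.0)]  # (token sequence, cumulative log-prob)
    for _ in range(max_len):
        next_candidates = []
        for seq, logp in beams:
            for token in expand(seq):
                next_candidates.append((seq + [token],
                                        logp + math.log(score(seq, token))))
        if not next_candidates:
            break
        next_candidates.sort(key=lambda c: c[1], reverse=True)
        beams = next_candidates[:beam_width]
    return beams

# Hypothetical next-token distribution standing in for a trained decoder.
VOCAB = {"a": 0.5, "b": 0.3, "c": 0.2}

def expand(seq):
    return list(VOCAB)

def score(seq, token):
    return VOCAB[token]

candidates = beam_search("<s>", expand, score, beam_width=3, max_len=2)
for seq, logp in candidates:
    print(seq, round(logp, 3))
```

With `beam_width=3`, the search returns three distinct candidate sequences rather than a single greedy one, which is exactly the property Phase II exploits when ranking.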

Phase II: Candidate Ranking

The second phase introduces a ranking mechanism to discern the most appropriate description from the generated candidates. Through contrastive learning, this phase enhances the description's relevance and coherence by modeling the similarity between generated candidates and reference outputs. Metrics such as cosine similarity and ROUGE scores are harmonized to assess description quality, ultimately selecting the most lexically and semantically aligned output.
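The fusion-and-ranking idea can be sketched in miniature. The snippet below scores candidates against a reference with a simplified unigram-overlap F1 (standing in for ROUGE) and a bag-of-words cosine (standing in for embedding similarity), then sorts by the fused score. This is only an illustration of metric fusion; the paper's ranker is trained with contrastive learning, and the weights and metrics here are illustrative choices.

```python
import math
from collections import Counter

def rouge1_f(candidate, reference):
    """Unigram-overlap F1 — a simplified stand-in for ROUGE-1."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    p = overlap / sum(cand.values())
    r = overlap / sum(ref.values())
    return 2 * p * r / (p + r)

def cosine_bow(candidate, reference):
    """Cosine similarity over bag-of-words vectors — a crude stand-in
    for embedding-based semantic similarity."""
    a, b = Counter(candidate.split()), Counter(reference.split())
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_candidates(candidates, reference, w_lex=0.5, w_sem=0.5):
    """Fuse the lexical and semantic scores and sort best-first."""
    scored = [(w_lex * rouge1_f(c, reference) + w_sem * cosine_bow(c, reference), c)
              for c in candidates]
    return sorted(scored, reverse=True)

reference = "american physicist and mathematician"
candidates = [
    "american scientist",
    "american physicist and mathematician",
    "german novelist",
]
ranked = rank_candidates(candidates, reference)
print(ranked[0][1])  # the exact-match candidate ranks first
```

Combining a lexical and a semantic signal guards against candidates that match on surface words but drift in meaning, and vice versa.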

Key Findings and Results

The proposed two-phase method shows a substantial improvement over baseline models, particularly in metrics like ROUGE, BertScore, and BLEU. Impressively, the candidate ranking phase markedly enhances results, achieving gains of up to 22 ROUGE points in various test splits, including both topic-independent and topic-exclusive partitions. Furthermore, human evaluation corroborates these quantitative metrics, with descriptions generated in Phase II being preferred over those from Phase I.

Sentiment Analysis and Practical Utility

The paper also explores the generated descriptions' sentiment analysis, revealing limitations in capturing sentiment nuances present in original paragraphs. While the focus is largely on neutrality — reflecting Wikipedia's editorial norms — the research acknowledges the gap in sentiment polarity capture, suggesting a need for future enhancement in sentiment-aware summarization.
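The kind of polarity comparison described above can be sketched with a toy lexicon-based classifier. Both the lexicon and the example texts below are purely illustrative inventions, not the paper's evaluation setup; they merely show how a generated description can come out neutral even when its source paragraph carries sentiment.

```python
# Toy sentiment lexicon — illustrative only, not the paper's classifier.
POSITIVE = {"acclaimed", "celebrated", "renowned", "award-winning"}
NEGATIVE = {"controversial", "notorious", "disgraced", "failed"}

def polarity(text):
    """Return 'positive', 'negative', or 'neutral' by lexicon lookup."""
    words = set(text.lower().replace(".", " ").split())
    pos, neg = len(words & POSITIVE), len(words & NEGATIVE)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

paragraph = "A celebrated and acclaimed composer whose late career was controversial."
generated = "composer"  # short descriptions often flatten to neutral
print(polarity(paragraph), polarity(generated))
```

Even this crude check exposes the mismatch the paper reports: the short description discards the sentiment-bearing words of the paragraph.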

The practical applicability of WikiDes is highlighted in its potential to reduce the editorial workload associated with creating and updating Wikidata descriptions. Given the evolving nature of encyclopedic content, automating description generation could significantly streamline the maintenance of up-to-date and comprehensive knowledge graphs.

Implications and Future Directions

This research opens avenues for advancing automatic summarization models by demonstrating the benefits of integrating multiple learning paradigms. The findings suggest potential cross-linguistic scalability, although the paper primarily explores English data. The methodologies outlined could be extended toward multilingual datasets, enhancing WikiDes's utility across broader linguistic contexts.

Given the sentiment analysis challenges revealed in this paper, future work could integrate sentiment modulation techniques, refining the model's capacity to reflect the varying emotional undertones inherent in source texts. Additionally, continuous advancements in transfer learning models may offer pathways to further elevate the descriptive granularity and accuracy achieved in this project.

The WikiDes dataset serves not only as a significant scholarly resource for summarization tasks but also as a testament to the collaborative potential between human and machine intelligence in managing expansive knowledge repositories.
