
WIKIGENBENCH: Exploring Full-length Wikipedia Generation under Real-World Scenario (2402.18264v2)

Published 28 Feb 2024 in cs.CL

Abstract: Generating comprehensive and accurate Wikipedia articles for newly emerging events in a real-world scenario presents significant challenges. Existing attempts fall short either by focusing only on short snippets or by using metrics that are insufficient to evaluate real-world scenarios. In this paper, we construct WIKIGENBENCH, a new benchmark consisting of 1,320 entries, designed to align with real-world scenarios in both generation and evaluation. For generation, we explore a real-world scenario where structured, full-length Wikipedia articles with citations are generated for new events using input documents from web sources. For evaluation, we integrate systematic metrics and LLM-based metrics to assess the verifiability, organization, and other aspects aligned with real-world scenarios. Based on this benchmark, we conduct extensive experiments using various models within three commonly used frameworks: direct RAG, hierarchical structure-based RAG, and RAG with a fine-tuned generation model. Experimental results show that hierarchical-based methods can generate more comprehensive content, while fine-tuned methods achieve better verifiability. However, even the best methods still show a significant gap compared to existing Wikipedia content, indicating that further research is necessary.


Summary

  • The paper introduces WikiGenBench, a benchmark for retrieval-augmented generation of comprehensive Wikipedia articles about emergent events.
  • It provides a dataset of 309 emergent events paired with retrieved web documents to simulate real-world generation and evaluation scenarios.
  • It introduces metrics such as Fluent Score and Citation Precision to assess the fluency, informativeness, and faithfulness of generated articles.

Exploring the Frontier of Retrieval-Based Full-Length Wikipedia Generation for Emergent Events

Introduction

In the evolving landscape of generative AI and LLMs, automatically generating structured, full-length Wikipedia articles for emergent events poses an exciting challenge. The task extends beyond generating short snippets or summaries; it requires producing comprehensive documents that are structured, factual, and up-to-date, drawing on information spread across multiple web sources. In response to this challenge, a recent study introduces WikiGenBench, a benchmark that simulates real-world scenarios in which such Wikipedia articles are generated with retrieval techniques.

Task Definition

The core objective is to generate Wikipedia articles for emergent events from related documents retrieved from a large web corpus. By restricting the benchmark to recent events that are unlikely to have appeared in the LLMs' pre-training data, the study addresses concerns about data leakage and evaluation validity. The task involves generating structured content, including titles, introductions, body text, and references, which makes the generated documents both more complex and more applicable in practice.
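To make the setup concrete, the sketch below shows what a single benchmark entry might look like as a data structure. The field names (`event_title`, `documents`, `reference_article`) are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class SourceDocument:
    """One web document retrieved for an emergent event (illustrative schema)."""
    doc_id: str
    url: str
    text: str

@dataclass
class WikiGenEntry:
    """One benchmark entry: an event plus its retrieved evidence.

    Field names are hypothetical; the real WikiGenBench schema may differ.
    """
    event_title: str                                   # the emergent event to cover
    documents: list[SourceDocument] = field(default_factory=list)
    reference_article: str = ""                        # existing Wikipedia text, used as the evaluation reference
```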

WikiGenBen Benchmark

WikiGenBench is a dataset of 309 emergent events, each paired with related documents obtained through a retrieval pipeline. The benchmark focuses not only on accurate and factual generation of Wikipedia articles but also on systematic evaluation metrics that holistically assess the fluency, informativeness, and faithfulness of the generated content. Its methodology and structure are designed to match real-world application scenarios, bridging gaps left by previous Wikipedia generation studies.

Evaluation Metrics

Three dimensions, fluency, informativeness, and faithfulness, serve as the bedrock for evaluating Wikipedia generation systems. The paper introduces new metrics such as Fluent Score, Outline Score, Focus Score, Info Score, and IB Score, each tailored to a different facet of content generation. GPT-4 is used to judge fluency and informativeness, while Citation Rate, Citation Recall, and Citation Precision offer a nuanced way to gauge the faithfulness and relevance of generated content.
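As a rough illustration of how the citation metrics could be computed, here is a minimal sketch. The `supports(evidence, claim)` callable is a placeholder for an entailment judge (e.g. an NLI model or an LLM prompt), and the formulas are one plausible reading, not necessarily the paper's exact definitions.

```python
import re
from typing import Callable

def citation_metrics(
    sentences: list[str],
    docs: dict[str, str],
    supports: Callable[[str, str], bool],
) -> dict[str, float]:
    """Citation rate, recall, and precision over generated sentences.

    Citations are assumed to appear as inline markers like [doc3].
    """
    cited = recalled = good_citations = all_citations = 0
    for sent in sentences:
        ids = re.findall(r"\[(doc\d+)\]", sent)
        claim = re.sub(r"\[doc\d+\]", "", sent).strip()
        if not ids:
            continue
        cited += 1
        # Recall-style check: the union of cited documents entails the sentence.
        if supports(" ".join(docs.get(i, "") for i in ids), claim):
            recalled += 1
        # Precision-style check: each individual citation entails the sentence.
        for i in ids:
            all_citations += 1
            if supports(docs.get(i, ""), claim):
                good_citations += 1
    n = max(len(sentences), 1)
    return {
        "citation_rate": cited / n,
        "citation_recall": recalled / n,
        "citation_precision": good_citations / max(all_citations, 1),
    }
```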

Baseline Methods and Experimentation

The study benchmarks the task against prevailing models and methodologies, organized into "Retrieve-then-Read" and "Retrieve-Plan-Retrieve-Read" paradigms. These frameworks underscore how effectively combining retrieval and generation phases can improve the depth, accuracy, and structure of generated Wikipedia articles. Experimental results show marked improvements in content quality when planning and retrieval strategies are applied carefully, in line with the overarching goals of the task.
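The sketch below contrasts the two paradigms under the simplifying assumption that `retrieve` and `llm` are abstract callables; the prompts and the section-by-section loop are illustrative, not the paper's exact pipeline.

```python
from typing import Callable

Retriever = Callable[[str, int], list[str]]  # (query, k) -> document texts
Generator = Callable[[str], str]             # prompt -> completion

def retrieve_then_read(event: str, retrieve: Retriever, llm: Generator, k: int = 10) -> str:
    """Direct RAG: retrieve once for the event, then generate the whole article."""
    context = "\n\n".join(f"[doc{i}] {d}" for i, d in enumerate(retrieve(event, k)))
    return llm(f"Using the sources below, write a cited Wikipedia article about "
               f"'{event}'.\n\n{context}")

def retrieve_plan_retrieve_read(event: str, retrieve: Retriever, llm: Generator, k: int = 5) -> str:
    """Hierarchical RAG: draft an outline, then retrieve and write per section."""
    outline = llm(f"Propose a section outline for a Wikipedia article about "
                  f"'{event}'. One heading per line.")
    sections = []
    for heading in filter(None, (h.strip() for h in outline.splitlines())):
        # Section-targeted retrieval: the query combines the event and the heading.
        docs = retrieve(f"{event} {heading}", k)
        context = "\n\n".join(f"[doc{i}] {d}" for i, d in enumerate(docs))
        sections.append(llm(f"Write the '{heading}' section of a Wikipedia article about "
                            f"'{event}', citing the sources below.\n\n{context}"))
    return "\n\n".join(sections)
```

The key design difference is that the hierarchical variant issues one retrieval query per planned section, which helps explain the abstract's finding that hierarchical methods produce more comprehensive content.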

Implications and Future Directions

This study stands at the convergence of retrieval techniques and generative AI, advancing the ability to generate full-length, well-structured Wikipedia documents for emergent events. It highlights both the potential and the challenges of this ambitious goal and opens avenues for future work, particularly in optimizing the retrieval process and refining generation methodologies. The implications of this research span both theory and practice, offering insights into AI's role in knowledge dissemination and management in the digital era.

In summary, this exploration into retrieval-based full-length Wikipedia generation for emergent events initiates a dialogue on the intersections of retrieval techniques, structured content generation, and LLMs. As the digital landscape continues to swell with information, efforts such as the WikiGenBench benchmark and its accompanying methodologies provide crucial stepping stones toward more intelligent, accurate, and timely generation of knowledge-based content.
