
WIKIGENBENCH: Exploring Full-length Wikipedia Generation under Real-World Scenario (2402.18264v2)

Published 28 Feb 2024 in cs.CL

Abstract: Generating comprehensive and accurate Wikipedia articles for newly emerging events in a real-world scenario presents significant challenges. Existing attempts fall short either by focusing only on short snippets or by using metrics that are insufficient to evaluate real-world scenarios. In this paper, we construct WIKIGENBENCH, a new benchmark consisting of 1,320 entries, designed to align with real-world scenarios in both generation and evaluation. For generation, we explore a real-world scenario where structured, full-length Wikipedia articles with citations are generated for new events using input documents from web sources. For evaluation, we integrate systematic metrics and LLM-based metrics to assess the verifiability, organization, and other aspects aligned with real-world scenarios. Based on this benchmark, we conduct extensive experiments using various models within three commonly used frameworks: direct RAG, hierarchical structure-based RAG, and RAG with a fine-tuned generation model. Experimental results show that hierarchical-based methods can generate more comprehensive content, while fine-tuned methods achieve better verifiability. However, even the best methods still show a significant gap compared to existing Wikipedia content, indicating that further research is necessary.


Summary

  • The paper introduces WikiGenBench, a benchmark for retrieval-augmented generation of comprehensive Wikipedia articles about emergent events.
  • It provides a dataset of 309 emergent events paired with retrieved web documents to simulate real-world generation and evaluation scenarios.
  • It introduces metrics such as Fluent Score and Citation Precision to assess the fluency, informativeness, and faithfulness of generated articles.

Exploring the Frontier of Retrieval-Based Full-Length Wikipedia Generation for Emergent Events

Introduction

In the evolving landscape of generative AI and LLMs, automatically generating structured, full-length Wikipedia articles for emergent events poses an exciting challenge. The task extends beyond generating short snippets or summaries; it requires producing comprehensive documents that are structured, factual, and up-to-date, drawing on information spread across multiple web sources. In response to this challenge, a recent study introduces WikiGenBench, a benchmark that simulates real-world scenarios in which such Wikipedia articles are generated with retrieval techniques.

Task Definition

The core objective is to generate Wikipedia articles for emergent events from related documents retrieved from a large web corpus. By restricting the benchmark to recent events that are unlikely to have appeared in the LLMs' pre-training data, the study addresses concerns about data leakage and evaluation validity. The task involves generating structured content, including titles, introductions, body text, and references, which makes the generated documents both more complex and more applicable in practice.
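To make the setup concrete, the sketch below shows what a single benchmark entry might look like as a data structure. The field names (`event_title`, `documents`, `reference_article`) are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class SourceDocument:
    """One web document retrieved for an emergent event (illustrative schema)."""
    doc_id: str
    url: str
    text: str

@dataclass
class WikiGenEntry:
    """One benchmark entry: an event plus its retrieved evidence.

    Field names are hypothetical; the real WikiGenBench schema may differ.
    """
    event_title: str                                   # the emergent event to cover
    documents: list[SourceDocument] = field(default_factory=list)
    reference_article: str = ""                        # existing Wikipedia text, used as the evaluation reference
```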

WikiGenBen Benchmark

WikiGenBench is a dataset of 309 emergent events, each paired with related documents obtained through a retrieval pipeline. The benchmark focuses not only on accurate and factual generation of Wikipedia articles but also on systematic evaluation metrics that holistically assess the fluency, informativeness, and faithfulness of the generated content. Its methodology and structure are designed to match real-world application scenarios, bridging gaps left by previous Wikipedia generation studies.

Evaluation Metrics

Three dimensions, fluency, informativeness, and faithfulness, serve as the bedrock for evaluating Wikipedia generation systems. The paper introduces new metrics such as Fluent Score, Outline Score, Focus Score, Info Score, and IB Score, each tailored to a different facet of content generation. GPT-4 is used to judge fluency and informativeness, while Citation Rate, Citation Recall, and Citation Precision offer a nuanced way to gauge the faithfulness and relevance of generated content.
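As a rough illustration of how the citation metrics could be computed, here is a minimal sketch. The `supports(evidence, claim)` callable is a placeholder for an entailment judge (e.g. an NLI model or an LLM prompt), and the formulas are one plausible reading, not necessarily the paper's exact definitions.

```python
import re
from typing import Callable

def citation_metrics(
    sentences: list[str],
    docs: dict[str, str],
    supports: Callable[[str, str], bool],
) -> dict[str, float]:
    """Citation rate, recall, and precision over generated sentences.

    Citations are assumed to appear as inline markers like [doc3].
    """
    cited = recalled = good_citations = all_citations = 0
    for sent in sentences:
        ids = re.findall(r"\[(doc\d+)\]", sent)
        claim = re.sub(r"\[doc\d+\]", "", sent).strip()
        if not ids:
            continue
        cited += 1
        # Recall-style check: the union of cited documents entails the sentence.
        if supports(" ".join(docs.get(i, "") for i in ids), claim):
            recalled += 1
        # Precision-style check: each individual citation entails the sentence.
        for i in ids:
            all_citations += 1
            if supports(docs.get(i, ""), claim):
                good_citations += 1
    n = max(len(sentences), 1)
    return {
        "citation_rate": cited / n,
        "citation_recall": recalled / n,
        "citation_precision": good_citations / max(all_citations, 1),
    }
```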

Baseline Methods and Experimentation

The study benchmarks the task against prevailing models and methodologies, organized into "Retrieve-then-Read" and "Retrieve-Plan-Retrieve-Read" paradigms. These frameworks underscore how effectively combining retrieval and generation phases can improve the depth, accuracy, and structure of generated Wikipedia articles. Experimental results show marked improvements in content quality when planning and retrieval strategies are applied carefully, in line with the overarching goals of the task.
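The sketch below contrasts the two paradigms under the simplifying assumption that `retrieve` and `llm` are abstract callables; the prompts and the section-by-section loop are illustrative, not the paper's exact pipeline.

```python
from typing import Callable

Retriever = Callable[[str, int], list[str]]  # (query, k) -> document texts
Generator = Callable[[str], str]             # prompt -> completion

def retrieve_then_read(event: str, retrieve: Retriever, llm: Generator, k: int = 10) -> str:
    """Direct RAG: retrieve once for the event, then generate the whole article."""
    context = "\n\n".join(f"[doc{i}] {d}" for i, d in enumerate(retrieve(event, k)))
    return llm(f"Using the sources below, write a cited Wikipedia article about "
               f"'{event}'.\n\n{context}")

def retrieve_plan_retrieve_read(event: str, retrieve: Retriever, llm: Generator, k: int = 5) -> str:
    """Hierarchical RAG: draft an outline, then retrieve and write per section."""
    outline = llm(f"Propose a section outline for a Wikipedia article about "
                  f"'{event}'. One heading per line.")
    sections = []
    for heading in filter(None, (h.strip() for h in outline.splitlines())):
        # Section-targeted retrieval: the query combines the event and the heading.
        docs = retrieve(f"{event} {heading}", k)
        context = "\n\n".join(f"[doc{i}] {d}" for i, d in enumerate(docs))
        sections.append(llm(f"Write the '{heading}' section of a Wikipedia article about "
                            f"'{event}', citing the sources below.\n\n{context}"))
    return "\n\n".join(sections)
```

The key design difference is that the hierarchical variant issues one retrieval query per planned section, which helps explain the abstract's finding that hierarchical methods produce more comprehensive content.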

Implications and Future Directions

This study stands at the convergence of retrieval techniques and generative AI, advancing the ability to generate full-length, well-structured Wikipedia documents for emergent events. It highlights both the potential and the challenges of this ambitious goal and opens avenues for future work, particularly in optimizing the retrieval process and refining generation methodologies. The implications of this research span both theory and practice, offering insights into AI's role in knowledge dissemination and management in the digital era.

In summary, this exploration into retrieval-based full-length Wikipedia generation for emergent events initiates a dialogue on the intersections of retrieval techniques, structured content generation, and LLMs. As the digital landscape continues to swell with information, efforts such as the WikiGenBench benchmark and its accompanying methodologies provide crucial stepping stones toward more intelligent, accurate, and timely generation of knowledge-based content.
