WIKIGENBENCH: Exploring Full-length Wikipedia Generation under Real-World Scenario (2402.18264v2)
Abstract: Generating comprehensive and accurate Wikipedia articles for newly emerging events in a real-world setting poses significant challenges. Existing attempts fall short, either focusing only on short snippets or relying on metrics insufficient for evaluating real-world scenarios. In this paper, we construct WIKIGENBENCH, a new benchmark of 1,320 entries designed to align with real-world scenarios in both generation and evaluation. For generation, we explore a real-world scenario in which structured, full-length Wikipedia articles with citations are generated for new events from web-sourced input documents. For evaluation, we combine systematic metrics and LLM-based metrics to assess verifiability, organization, and other aspects aligned with real-world scenarios. Based on this benchmark, we conduct extensive experiments with various models under three commonly used frameworks: direct RAG, hierarchical structure-based RAG, and RAG with a fine-tuned generation model. Experimental results show that hierarchical structure-based methods generate more comprehensive content, while fine-tuned methods achieve better verifiability. However, even the best methods still fall far short of existing Wikipedia articles, indicating that further research is necessary.
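To make the direct RAG setting concrete, the sketch below shows a minimal retrieve-then-generate loop for producing a cited, Wikipedia-style article from web documents. This is an illustrative assumption rather than the paper's implementation: the retriever is a toy lexical scorer, and `retrieve_passages`, `call_llm`, and `SECTION_PROMPT` are hypothetical names introduced here for clarity.

```python
# Minimal sketch of a direct RAG pipeline for cited article generation.
# All helper names are hypothetical placeholders, not the authors' code.
from dataclasses import dataclass


@dataclass
class Passage:
    doc_id: str
    text: str


def retrieve_passages(event_title: str, web_corpus: list[Passage], k: int = 5) -> list[Passage]:
    """Toy lexical retriever: rank passages by word overlap with the event title."""
    query_terms = set(event_title.lower().split())
    scored = sorted(
        web_corpus,
        key=lambda p: len(query_terms & set(p.text.lower().split())),
        reverse=True,
    )
    return scored[:k]


def call_llm(prompt: str) -> str:
    """Placeholder for a call to a generation model (plug in your own LLM client)."""
    raise NotImplementedError("Connect this to an instruction-tuned generation model.")


SECTION_PROMPT = (
    "Write a structured, Wikipedia-style article about '{title}'.\n"
    "Support each claim with the id of a retrieved passage, e.g. [doc-3].\n\n"
    "Passages:\n{passages}\n\nArticle:"
)


def generate_article(event_title: str, web_corpus: list[Passage]) -> str:
    """Retrieve supporting passages, then prompt the model to write a cited article."""
    passages = retrieve_passages(event_title, web_corpus)
    context = "\n".join(f"[{p.doc_id}] {p.text}" for p in passages)
    return call_llm(SECTION_PROMPT.format(title=event_title, passages=context))
```

The hierarchical and fine-tuned variants studied in the paper differ mainly in how the prompt and generator are organized (e.g., generating an outline first, or swapping in a fine-tuned model for `call_llm`), while the retrieve-then-generate skeleton above stays the same.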