Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models (2402.14207v2)

Published 22 Feb 2024 in cs.CL and cs.AI

Abstract: We study how to apply LLMs to write grounded and organized long-form articles from scratch, with comparable breadth and depth to Wikipedia pages. This underexplored problem poses new challenges at the pre-writing stage, including how to research the topic and prepare an outline prior to writing. We propose STORM, a writing system for the Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking. STORM models the pre-writing stage by (1) discovering diverse perspectives in researching the given topic, (2) simulating conversations where writers carrying different perspectives pose questions to a topic expert grounded on trusted Internet sources, (3) curating the collected information to create an outline. For evaluation, we curate FreshWiki, a dataset of recent high-quality Wikipedia articles, and formulate outline assessments to evaluate the pre-writing stage. We further gather feedback from experienced Wikipedia editors. Compared to articles generated by an outline-driven retrieval-augmented baseline, more of STORM's articles are deemed to be organized (by a 25% absolute increase) and broad in coverage (by 10%). The expert feedback also helps identify new challenges for generating grounded long articles, such as source bias transfer and over-association of unrelated facts.


Summary

  • The paper introduces STORM, a novel system that automates the pre-writing phase for Wikipedia-like article creation by simulating multi-perspective discussions.
  • Evaluated on the FreshWiki dataset, STORM achieves a 25% absolute increase in articles judged well organized and a 10% improvement in breadth of coverage over an outline-driven retrieval-augmented baseline.
  • The approach paves the way for enhanced automated content generation while addressing challenges in neutrality, bias management, and fact-checking.

Automating Pre-writing for Wikipedia-like Article Generation with STORM

Introduction

STORM (Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking) is a notable advance in applying LLMs to generate long-form, informative content akin to Wikipedia articles. The core challenges it addresses lie in the pre-writing stage: researching a given topic effectively and forming a structured outline before drafting, tasks that have traditionally required substantial human effort even when aided by LLMs.

The FreshWiki Dataset

To underpin their research, the authors introduce the FreshWiki dataset, a curated collection of recent, high-quality Wikipedia articles. The dataset serves a dual purpose: it provides a benchmark for evaluating STORM against existing article-generation methods, and it mitigates data leakage, a common problem when evaluating on older, widely available corpora. By focusing on articles created or substantially edited after the training cutoff of most LLMs, FreshWiki provides a fresh and relevant foundation for this work.
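The curation step can be illustrated with a minimal sketch; the record fields, quality classes, and cutoff date below are illustrative assumptions, not the paper's exact schema:

```python
from datetime import date

# Hypothetical article records; field names and values are illustrative only.
articles = [
    {"title": "2023 Topic A", "created": date(2023, 6, 1), "quality": "B"},
    {"title": "Old Topic B", "created": date(2019, 3, 4), "quality": "GA"},
    {"title": "2023 Topic C", "created": date(2023, 9, 15), "quality": "GA"},
]

CUTOFF = date(2022, 9, 1)     # assumed LLM training cutoff
ACCEPTED = {"GA", "FA", "B"}  # assumed high-quality assessment classes

# Keep only articles that are both recent and meet the quality bar.
fresh = [a for a in articles
         if a["created"] > CUTOFF and a["quality"] in ACCEPTED]

print([a["title"] for a in fresh])  # → ['2023 Topic A', '2023 Topic C']
```

In practice, creation dates, revision histories, and quality assessments would come from the MediaWiki API rather than hard-coded records.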

Methodological Overview: STORM

STORM represents a systematic approach to automating the pre-writing stage, which is crucial yet often underexplored. The methodology harnesses LLMs to:

  • Identify diverse perspectives surrounding a topic by analyzing similar subjects.
  • Simulate in-depth, multi-perspective dialogues in which writers holding different perspectives pose questions to a topic expert whose answers are grounded in trusted Internet sources.
  • Curate this information into a coherent outline, from which a detailed article can be sequentially constructed.

This process is meticulously designed to mimic the human approach to topic exploration, questioning, and structured writing, transitioning from a generalized understanding to a detailed exposition.
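The three stages above can be sketched as a small pipeline. Everything here is a hedged illustration: `llm` is a stand-in stub, and the function names and prompts are assumptions, not the paper's implementation:

```python
def llm(prompt: str) -> str:
    """Stand-in for a real LLM call; returns canned text so the sketch runs."""
    return f"[model output for: {prompt[:40]}...]"

def discover_perspectives(topic: str, n: int = 3) -> list[str]:
    # Stage 1: survey related subjects to elicit distinct editorial perspectives.
    return [llm(f"Perspective {i} on {topic}, informed by related articles")
            for i in range(n)]

def simulated_conversation(topic: str, perspective: str,
                           turns: int = 2) -> list[tuple[str, str]]:
    # Stage 2: a writer holding this perspective questions a
    # retrieval-grounded topic expert over several turns.
    dialogue = []
    for _ in range(turns):
        question = llm(f"As {perspective}, ask a question about {topic}")
        answer = llm(f"Answer using trusted web sources: {question}")
        dialogue.append((question, answer))
    return dialogue

def draft_outline(topic: str, dialogues: list[list[tuple[str, str]]]) -> str:
    # Stage 3: distill the gathered answers into a hierarchical outline.
    notes = "\n".join(a for d in dialogues for _, a in d)
    return llm(f"Organize these notes on {topic} into an outline:\n{notes}")

topic = "Solar eclipse of April 8, 2024"
perspectives = discover_perspectives(topic)
dialogues = [simulated_conversation(topic, p) for p in perspectives]
outline = draft_outline(topic, dialogues)
```

The full article would then be written section by section from `outline`, with each section grounded in the sources collected during the simulated conversations.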

Evaluation Results and Implications

STORM's performance is evaluated against automatic metrics and human judgments on the FreshWiki dataset. Compared to an outline-driven retrieval-augmented baseline, STORM's articles are judged well organized 25% more often (absolute) and broad in coverage 10% more often. These outcomes underscore STORM's potential both to raise the quality of automated content creation and to serve as a tool for exploring and understanding complex topics.
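One way to quantify outline quality is a soft recall over section headings, crediting each human-written heading with its best match among generated headings. Below is a minimal sketch using `difflib`'s lexical ratio as a stand-in similarity; an embedding-based semantic similarity would be more faithful:

```python
import difflib

def similarity(a: str, b: str) -> float:
    # Stand-in lexical similarity in [0, 1]; a real metric would
    # typically use cosine similarity of sentence embeddings.
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def soft_recall(gold_headings: list[str], pred_headings: list[str]) -> float:
    # Credit each gold heading with its best match among predictions,
    # then average; exact matches score 1, loose matches score partially.
    return sum(max(similarity(g, p) for p in pred_headings)
               for g in gold_headings) / len(gold_headings)

gold = ["History", "Causes", "Public reaction"]
pred = ["Historical background", "Causes", "Reception"]
print(round(soft_recall(gold, pred), 3))
```

A perfect outline (every gold heading matched exactly) would score 1.0; missing or loosely related headings pull the score down proportionally.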

Challenges and Future Directions

Despite STORM's advancements, the paper candidly discusses the limitations and emerging challenges of automated long-form article generation. Notably, bias transferred from Internet sources, the tendency of LLMs to over-associate unrelated facts, and the difficulty of achieving neutrality and verifiability in automated writing are highlighted as areas requiring further research. These issues underscore the nuanced differences between human and machine understanding, as well as the complexity of accurately reflecting multifaceted real-world information through automated processes.

Conclusion

In summary, STORM represents a significant step forward in the automation of the pre-writing stage for Wikipedia-like article generation. By effectively leveraging LLMs for detailed research, question-asking, and outline creation, STORM enhances the capability of machines to create organized, informative, and broad-coverage articles from scratch. Looking ahead, addressing the outlined challenges and refining the system to better capture the depth, neutrality, and factual accuracy expected of high-quality informational content will be crucial. As this field continues to evolve, the contributions of STORM, coupled with a clear recognition of its limitations, provide a solid foundation for future advancements in the automation of expository writing.
