
KIWI: A Dataset of Knowledge-Intensive Writing Instructions for Answering Research Questions (2403.03866v1)

Published 6 Mar 2024 in cs.CL

Abstract: LLMs adapted to follow user instructions are now widely deployed as conversational agents. In this work, we examine one increasingly common instruction-following task: providing writing assistance to compose a long-form answer. To evaluate the capabilities of current LLMs on this task, we construct KIWI, a dataset of knowledge-intensive writing instructions in the scientific domain. Given a research question, an initial model-generated answer and a set of relevant papers, an expert annotator iteratively issues instructions for the model to revise and improve its answer. We collect 1,260 interaction turns from 234 interaction sessions with three state-of-the-art LLMs. Each turn includes a user instruction, a model response, and a human evaluation of the model response. Through a detailed analysis of the collected responses, we find that all models struggle to incorporate new information into an existing answer, and to perform precise and unambiguous edits. Further, we find that models struggle to judge whether their outputs successfully followed user instructions, with accuracy at least 10 points short of human agreement. Our findings indicate that KIWI will be a valuable resource to measure progress and improve LLMs' instruction-following capabilities for knowledge intensive writing tasks.

Analyzing Instruction-Following in AI Writing Assistants with KIWI

Introduction

Recent advancements in LLMs have significantly impacted various application areas, particularly in providing writing assistance and revising text based on user instructions. Despite their widespread use, our understanding of how well LLMs can assist with writing in knowledge-intensive domains is limited. To address this gap, the paper introduces KIWI, a dataset designed to evaluate LLMs on a task often encountered in academic and research settings: revising long-form answers to follow user instructions that may involve adding, editing, or reorganizing information drawn from a set of scientific documents.

Dataset Construction

KIWI is constructed through an interactive system in which expert annotators with NLP backgrounds collaborate with AI models to revise text. Given a research question and a collection of relevant scientific papers, an LLM proposes a draft answer. The annotator then iteratively issues instructions to refine the draft, either by integrating additional information or through stylistic edits, until a satisfactory version is produced or a maximum number of iterations is reached. This process is captured over 234 sessions with three state-of-the-art LLMs, yielding 1,260 interaction turns, each consisting of a user instruction, a model response, and a human evaluation of that response.
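To make the collection setup concrete, the sketch below represents a session and its turns as simple Python records and walks through the iterative revision loop described above. It is a minimal illustration under assumed names: the field names, the model.draft/model.revise calls, the turn limit, and the annotator helper functions are placeholders, not the dataset's actual schema or the authors' code.

    from dataclasses import dataclass, field

    @dataclass
    class Turn:
        """One interaction turn: a user instruction, the model's revised answer,
        and the annotator's judgment of that response."""
        instruction: str        # e.g. "Add a comparison of retrieval-based methods"
        model_response: str     # the revised long-form answer produced by the LLM
        human_rating: str       # annotator evaluation of whether the instruction was followed

    @dataclass
    class Session:
        """One annotation session: a research question, the supporting papers,
        the model's initial draft, and the sequence of revision turns."""
        question: str
        papers: list[str]       # identifiers or texts of the relevant papers
        initial_answer: str     # model-generated first draft
        turns: list[Turn] = field(default_factory=list)

    def run_session(model, question, papers, max_turns=8):
        """Iterative collection loop: the model drafts an answer, then repeatedly
        revises it in response to annotator instructions until the annotator is
        satisfied or the turn limit is reached."""
        session = Session(question, papers, initial_answer=model.draft(question, papers))
        answer = session.initial_answer
        for _ in range(max_turns):
            instruction = get_annotator_instruction(answer)   # human-in-the-loop step
            if instruction is None:                           # annotator is satisfied
                break
            answer = model.revise(answer, instruction, papers)
            rating = get_annotator_rating(answer, instruction)
            session.turns.append(Turn(instruction, answer, rating))
        return session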

Key Findings

Analysis of the dataset reveals that even the most capable LLMs often fail to fully address user instructions for text revision, particularly when incorporating new information or executing precise edits under explicit constraints (e.g., on location or length). Specifically:

  • Incorporating New Information: Models had difficulty weaving new information into an existing answer; synthesizing content from multiple documents into a coherent whole remains a challenge and points to room for improvement in multi-document summarization.
  • Precise and Constrained Editing: Instructions requiring precise edits or adherence to explicit constraints (e.g., a target location or length) frequently led to suboptimal responses, exposing a limitation in controlled, constrained text generation.
  • Error Analysis: A fine-grained analysis identified common error patterns, the most prevalent being unrequested changes to the text (a sketch of how such error labels could be tallied follows this list). This underscores the need to improve models' grasp of instruction scope so they do not overstep the boundaries of a specific user request.
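As a rough illustration of the error analysis mentioned above, the following sketch tallies per-turn error labels across the collected responses. The category names and the error_label attribute are hypothetical stand-ins, not the paper's actual taxonomy.

    from collections import Counter

    # Hypothetical error labels; the paper's own taxonomy may differ.
    ERROR_CATEGORIES = [
        "unrequested_change",    # model edits text outside the instruction's scope
        "missing_information",   # requested content from the papers was not incorporated
        "constraint_violation",  # an explicit length or location constraint was ignored
        "no_error",
    ]

    def tally_errors(turns):
        """Return the fraction of annotated turns falling into each error category,
        assuming every turn carries a (hypothetical) error_label attribute."""
        counts = Counter(turn.error_label for turn in turns)
        total = sum(counts.values()) or 1
        return {label: counts.get(label, 0) / total for label in ERROR_CATEGORIES}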

Implications and Future Directions

The insights gleaned from KIWI underscore the need for targeted improvements in LLMs for knowledge-intensive writing tasks. Enhancing models' ability to follow instructions precisely, particularly in complex editing scenarios that require integrating diverse information sources or adhering to specific constraints, is crucial. Furthermore, the consistent error types identified offer a clear direction for refining models' text revision capabilities.

For future work, leveraging KIWI to train and evaluate LLMs is a promising direction. Improvements could include better parsing of user instructions, stronger integration of information from multiple documents, and finer control over text edits so that they adhere closely to user-specified constraints. Through careful examination and targeted model enhancements, highly capable AI-powered writing assistants for academic and research contexts become increasingly attainable.
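One concrete use of the dataset, suggested by the abstract's finding that models trail human agreement by at least 10 points when judging whether their outputs followed instructions, is a simple evaluation harness: prompt an LLM judge to decide whether each response followed its instruction and compare that verdict with the stored human label. The sketch below assumes the Turn records shown earlier and a judge_model.classify call, both illustrative placeholders rather than the paper's setup.

    def judgment_accuracy(turns, judge_model):
        """Measure how often an LLM judge agrees with the human label on whether
        a response followed its instruction."""
        correct = 0
        for turn in turns:
            prompt = (
                f"Instruction: {turn.instruction}\n"
                f"Response: {turn.model_response}\n"
                "Did the response follow the instruction? Answer yes or no."
            )
            verdict = judge_model.classify(prompt)              # expected "yes" or "no"
            human_says_followed = (turn.human_rating == "followed")
            correct += int((verdict == "yes") == human_says_followed)
        return correct / max(len(turns), 1)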

Conclusion

KIWI provides a valuable resource for understanding and improving the instruction-following capabilities of LLMs in academic writing and revision tasks. By spotlighting existing shortcomings and highlighting areas for model advancement, the dataset paves the way for future research aimed at harnessing the full potential of AI in augmenting human writing.

Authors (6)
  1. Fangyuan Xu (10 papers)
  2. Kyle Lo (73 papers)
  3. Luca Soldaini (62 papers)
  4. Bailey Kuehl (18 papers)
  5. Eunsol Choi (76 papers)
  6. David Wadden (24 papers)
Citations (5)