XATU: A Fine-grained Instruction-based Benchmark for Explainable Text Updates (2309.11063v2)

Published 20 Sep 2023 in cs.CL

Abstract: Text editing is the crucial task of modifying text to better align it with user intents. However, existing text editing benchmark datasets contain only coarse-grained instructions and lack explainability, thus resulting in outputs that deviate from the intended changes outlined in the gold reference. To comprehensively investigate the text editing capabilities of LLMs, this paper introduces XATU, the first benchmark specifically designed for fine-grained instruction-based explainable text editing. XATU covers fine-grained text editing tasks of varying difficulty (simplification, grammar check, fact-check, etc.), incorporating lexical, syntactic, semantic, and knowledge-intensive edit aspects. To enhance interpretability, we combine LLM-based annotation and human annotation, resulting in a benchmark that includes fine-grained instructions and gold-standard edit explanations. By evaluating existing LLMs against our benchmark, we demonstrate the effectiveness of instruction tuning and the impact of underlying architecture across various editing tasks. Furthermore, extensive experimentation reveals the significant role of explanations in fine-tuning LLMs for text editing tasks. The benchmark will be open-sourced to support reproduction and facilitate future research at https://github.com/megagonlabs/xatu.
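To make the benchmark's setup concrete, below is a minimal sketch of how one might load XATU-style records and build instruction-based editing prompts for an LLM. The JSONL layout, the field names (`source`, `instruction`, `target`, `explanation`), the file name, and the exact-match check are illustrative assumptions rather than the benchmark's actual format or metrics; the official data and evaluation scripts are in the linked repository.

```python
# Minimal sketch: loading XATU-style records and building editing prompts.
# The JSONL layout and field names ("source", "instruction", "target",
# "explanation") are ASSUMPTIONS for illustration; see the official repo
# (https://github.com/megagonlabs/xatu) for the real format and metrics.
import json
from typing import Dict, List


def load_examples(path: str) -> List[Dict]:
    """Read one JSON record per line (assumed JSONL layout)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def build_prompt(example: Dict) -> str:
    """Compose an instruction-following editing prompt from one record."""
    return (
        f"Instruction: {example['instruction']}\n"
        f"Text: {example['source']}\n"
        "Edited text:"
    )


def exact_match(prediction: str, example: Dict) -> bool:
    """Crude stand-in metric; the paper likely relies on edit-aware
    measures rather than exact string match."""
    return prediction.strip() == example["target"].strip()


if __name__ == "__main__":
    # "xatu_test.jsonl" is a hypothetical file name, not the benchmark's own.
    examples = load_examples("xatu_test.jsonl")
    for ex in examples[:3]:
        prompt = build_prompt(ex)
        # `generate(prompt)` stands in for any LLM call (API or local model);
        # its output would be scored against ex["target"].
        print(prompt)
```

Since the abstract reports that explanations matter when fine-tuning for editing, a fuller harness would presumably also surface the gold explanation field during training or as an additional prompt component.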
