From Model-centered to Human-Centered: Revision Distance as a Metric for Text Evaluation in LLMs-based Applications (2404.07108v2)

Published 10 Apr 2024 in cs.CL and cs.IR

Abstract: Evaluating LLMs is fundamental, particularly in the context of practical applications. Conventional evaluation methods, typically designed primarily for LLM development, yield numerical scores that ignore the user experience. Therefore, our study shifts the focus from model-centered to human-centered evaluation in the context of AI-powered writing assistance applications. Our proposed metric, termed "Revision Distance," utilizes LLMs to suggest revision edits that mimic the human writing process. It is determined by counting the revision edits generated by LLMs. Benefiting from the generated revision edit details, our metric can provide a self-explained text evaluation result in a human-understandable manner beyond the context-independent score. Our results show that for the easy-writing task, "Revision Distance" is consistent with established metrics (ROUGE, Bert-score, and GPT-score), but offers more insightful, detailed feedback and better distinguishes between texts. Moreover, in the context of challenging academic writing tasks, our metric still delivers reliable evaluations where other metrics tend to struggle. Furthermore, our metric also holds significant potential for scenarios lacking reference texts.

References (26)
  1. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
  2. Notus. https://github.com/argilla-io/notus.
  3. Non-repeatable experiments and non-reproducible results: The reproducibility crisis in human evaluation in NLP. In Findings of the Association for Computational Linguistics: ACL 2023, pages 3676–3687, Toronto, Canada. Association for Computational Linguistics.
  4. Capturing relations between scientific papers: An abstractive model for related work section generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6068–6077, Online. Association for Computational Linguistics.
  5. Cheng-Han Chiang and Hung-yi Lee. 2023. Can large language models be an alternative to human evaluations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15607–15631, Toronto, Canada. Association for Computational Linguistics.
  6. All that’s ‘human’ is not gold: Evaluating human evaluation of generated text. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7282–7296, Online. Association for Computational Linguistics.
  7. You can’t manage right what you can’t measure well: Technological innovation efficiency. Research Policy, 42(6):1239–1250.
  8. GPTScore: Evaluate as you desire. arXiv preprint arXiv:2302.04166.
  9. Multi-dimensional evaluation of text summarization with in-context learning. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8487–8495, Toronto, Canada. Association for Computational Linguistics.
  10. Mistral 7B. arXiv preprint arXiv:2310.06825.
  11. RLAIF: Scaling reinforcement learning from human feedback with AI feedback. arXiv preprint arXiv:2309.00267.
  12. CoAnnotating: Uncertainty-guided work allocation between human and large language models for data annotation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1487–1505, Singapore. Association for Computational Linguistics.
  13. Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
  14. Causal intervention for abstractive related work generation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 2148–2159, Singapore. Association for Computational Linguistics.
  15. OpenAI. GPT-4 technical report. Technical report, OpenAI.
  16. Language model self-improvement by reinforcement learning contemplation. arXiv preprint arXiv:2305.14483.
  17. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
  18. The ACL OCL corpus: Advancing open science in computational linguistics. arXiv preprint arXiv:2305.14996.
  19. Recitation-augmented language models. In International Conference on Learning Representations.
  20. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  21. Towards understanding chain-of-thought prompting: An empirical study of what matters. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2717–2739, Toronto, Canada. Association for Computational Linguistics.
  22. BARTScore: Evaluating generated text as text generation. In Advances in Neural Information Processing Systems, volume 34, pages 27263–27277. Curran Associates, Inc.
  23. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.
  24. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 563–578, Hong Kong, China. Association for Computational Linguistics.
  25. DiscoScore: Evaluating text generation with BERT and discourse coherence. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 3865–3883, Dubrovnik, Croatia. Association for Computational Linguistics.
  26. (InThe)WildChat: 570K ChatGPT interaction logs in the wild. In The Twelfth International Conference on Learning Representations.

Summary

  • The paper introduces Revision Distance as a novel metric that quantifies the number of human-like text revisions needed for quality improvement.
  • The paper leverages LLM-driven revision suggestions to mirror human editing processes, providing detailed feedback beyond conventional metrics.
  • The paper demonstrates that Revision Distance correlates with human judgment (up to 76%) and offers enhanced differentiation in both simple and complex writing tasks.

From Model-Centered to Human-Centered: Revision Distance as a Metric for Text Evaluation in LLMs-Based Applications

The paper "From Model-centered to Human-Centered: Revision Distance as a Metric for Text Evaluation in LLMs-based Applications" has introduced a novel metric, "Revision Distance," for evaluating text generated by LLMs from a user-centered perspective. This research presents a significant shift from traditional model-centric evaluation methods, which primarily rely on context-independent scores like ROUGE, BERT-Score, and GPT-Score. Instead, the proposed metric places emphasis on the user experience and interaction with LLM-powered writing assistant applications, reflecting a human-centered approach.

The core idea behind Revision Distance is to quantify the number of revision edits an LLM-generated text would need to reach the quality a human user expects after reviewing and editing it. The metric prompts an LLM to suggest revision edits that mimic the human writing process, yielding a more nuanced and detailed evaluation than simple similarity scores can offer. Revision Distance is inspired by the classical edit distance but extends it with human-like revision behavior, aligning evaluations more closely with human perception.
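The paper is summarized here without code, but the idea of counting revision edits can be illustrated with a minimal sketch. The snippet below assumes the LLM-suggested revision has already been obtained (e.g., by prompting a model to edit the candidate, optionally toward a reference); it then counts word-level edit spans between the candidate and that revision as a rough proxy for Revision Distance.

```python
import difflib
from typing import List


def revision_distance(candidate: str, revised: str) -> int:
    """Rough proxy for Revision Distance: count the word-level edit spans
    needed to turn ``candidate`` into ``revised``, where ``revised`` is an
    LLM-suggested revision of the candidate (the prompting step that
    produces it is outside this sketch)."""
    cand_tokens: List[str] = candidate.split()
    rev_tokens: List[str] = revised.split()
    matcher = difflib.SequenceMatcher(a=cand_tokens, b=rev_tokens)
    # Each contiguous non-matching span (an insertion, deletion, or
    # replacement) counts as one revision edit.
    return sum(1 for op, *_ in matcher.get_opcodes() if op != "equal")
```

A candidate the LLM leaves untouched scores 0; heavier editing yields a larger distance. The paper's actual procedure may count and categorize edits differently, so this is only an approximation of the concept.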

The paper reports that on easy-writing tasks, Revision Distance is consistent with established metrics while offering more detailed feedback and sharper differentiation between texts, suggesting value where existing metrics lack specificity. In more challenging scenarios, such as academic writing, the metric provides more stable and reliable evaluations. It also remains effective when no reference text is available, agreeing with human judgment in approximately 76% of the test cases.
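To give a sense of how edit-level output can be more self-explanatory than a single score, here is a small, hypothetical extension of the sketch above: it groups the diff spans between a candidate and its LLM revision by operation type, one simple way to surface what kind of revisions were needed rather than only how many.

```python
import difflib
from collections import Counter
from typing import Dict


def edit_breakdown(candidate: str, revised: str) -> Dict[str, int]:
    """Group word-level diff spans between a candidate and its LLM
    revision by operation type (replace / insert / delete)."""
    matcher = difflib.SequenceMatcher(a=candidate.split(), b=revised.split())
    counts: Counter = Counter(
        op for op, *_ in matcher.get_opcodes() if op != "equal"
    )
    return dict(counts)


# Example with a toy candidate and a lightly revised version of it.
print(edit_breakdown(
    "The metric count revision edits made by the model",
    "The metric counts the revision edits made by the model",
))  # -> {'replace': 1}
```

In the paper's setting, this kind of breakdown would come from the LLM's own revision suggestions rather than a surface diff, but the principle of reporting edits instead of a single opaque number is the same.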

Several experiments validate the utility of Revision Distance. In reference-based settings, the scenarios covered both easy-writing tasks (e.g., emails, articles) and challenging writing tasks (e.g., related-work sections of academic papers), evaluated across models of different strengths. The results indicate that Revision Distance discriminates between candidate texts better than traditional metrics, particularly in complex writing tasks where knowledge and reasoning are critical.

The implications of this approach are both practical and theoretical. It strengthens the evaluation framework for AI writing applications by introducing a metric aligned with human-centered design principles, and the detailed revision actions it produces offer targeted feedback that can inform model improvement. The work may guide the future development of LLMs in user-focused contexts and influence the design of AI systems that better simulate human editing behavior and preferences.

In conclusion, the Revision Distance metric represents a shift toward human-centered evaluation in natural language processing and AI-based writing assistance. By reflecting real-world text revision processes, it surfaces differences that traditional metrics miss and provides a transparent, self-explanatory framework for assessing text quality in LLM-based applications. Future research could focus on applying the metric across more domains and on reducing its computational cost, a limitation the authors note when relying heavily on models such as GPT-4.
