From Model-centered to Human-Centered: Revision Distance as a Metric for Text Evaluation in LLMs-based Applications (2404.07108v2)

Published 10 Apr 2024 in cs.CL and cs.IR

Abstract: Evaluating LLMs is fundamental, particularly in the context of practical applications. Conventional evaluation methods, typically designed primarily for LLM development, yield numerical scores that ignore the user experience. Therefore, our study shifts the focus from model-centered to human-centered evaluation in the context of AI-powered writing assistance applications. Our proposed metric, termed "Revision Distance," utilizes LLMs to suggest revision edits that mimic the human writing process. It is determined by counting the revision edits generated by LLMs. Benefiting from the generated revision edit details, our metric can provide a self-explained text evaluation result in a human-understandable manner beyond the context-independent score. Our results show that for the easy-writing task, "Revision Distance" is consistent with established metrics (ROUGE, Bert-score, and GPT-score), but offers more insightful, detailed feedback and better distinguishes between texts. Moreover, in the context of challenging academic writing tasks, our metric still delivers reliable evaluations where other metrics tend to struggle. Furthermore, our metric also holds significant potential for scenarios lacking reference texts.

References (26)
  1. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
  2. Notus. https://github.com/argilla-io/notus.
  3. Non-repeatable experiments and non-reproducible results: The reproducibility crisis in human evaluation in NLP. In Findings of the Association for Computational Linguistics: ACL 2023, pages 3676–3687, Toronto, Canada. Association for Computational Linguistics.
  4. Capturing relations between scientific papers: An abstractive model for related work section generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6068–6077, Online. Association for Computational Linguistics.
  5. Cheng-Han Chiang and Hung-yi Lee. 2023. Can large language models be an alternative to human evaluations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15607–15631, Toronto, Canada. Association for Computational Linguistics.
  6. All that’s ‘human’ is not gold: Evaluating human evaluation of generated text. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7282–7296, Online. Association for Computational Linguistics.
  7. You can’t manage right what you can’t measure well: Technological innovation efficiency. Research Policy, 42(6):1239–1250.
  8. GPTScore: Evaluate as you desire. arXiv preprint arXiv:2302.04166.
  9. Multi-dimensional evaluation of text summarization with in-context learning. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8487–8495, Toronto, Canada. Association for Computational Linguistics.
  10. Mistral 7B. arXiv preprint arXiv:2310.06825.
  11. RLAIF: Scaling reinforcement learning from human feedback with AI feedback. arXiv preprint arXiv:2309.00267.
  12. CoAnnotating: Uncertainty-guided work allocation between human and large language models for data annotation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1487–1505, Singapore. Association for Computational Linguistics.
  13. Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
  14. Causal intervention for abstractive related work generation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 2148–2159, Singapore. Association for Computational Linguistics.
  15. OpenAI. GPT-4 technical report. Technical report, OpenAI.
  16. Language model self-improvement by reinforcement learning contemplation. arXiv preprint arXiv:2305.14483.
  17. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
  18. The ACL OCL corpus: Advancing open science in computational linguistics. arXiv preprint arXiv:2305.14996.
  19. Recitation-augmented language models. In International Conference on Learning Representations.
  20. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  21. Towards understanding chain-of-thought prompting: An empirical study of what matters. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2717–2739, Toronto, Canada. Association for Computational Linguistics.
  22. BARTScore: Evaluating generated text as text generation. In Advances in Neural Information Processing Systems, volume 34, pages 27263–27277. Curran Associates, Inc.
  23. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.
  24. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 563–578, Hong Kong, China. Association for Computational Linguistics.
  25. DiscoScore: Evaluating text generation with BERT and discourse coherence. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 3865–3883, Dubrovnik, Croatia. Association for Computational Linguistics.
  26. (InThe)WildChat: 570K ChatGPT interaction logs in the wild. In The Twelfth International Conference on Learning Representations.

Summary

  • The paper introduces Revision Distance as a novel metric that quantifies the number of human-like text revisions needed for quality improvement.
  • The paper leverages LLM-driven revision suggestions to mirror human editing processes, providing detailed feedback beyond conventional metrics.
  • The paper demonstrates that Revision Distance correlates with human judgment (up to 76%) and offers enhanced differentiation in both simple and complex writing tasks.

From Model-Centered to Human-Centered: Revision Distance as a Metric for Text Evaluation in LLMs-Based Applications

The paper "From Model-centered to Human-Centered: Revision Distance as a Metric for Text Evaluation in LLMs-based Applications" has introduced a novel metric, "Revision Distance," for evaluating text generated by LLMs from a user-centered perspective. This research presents a significant shift from traditional model-centric evaluation methods, which primarily rely on context-independent scores like ROUGE, BERT-Score, and GPT-Score. Instead, the proposed metric places emphasis on the user experience and interaction with LLM-powered writing assistant applications, reflecting a human-centered approach.

The core idea behind Revision Distance is to quantify the number of revision edits an LLM-generated text would need to reach the quality a human user expects after reviewing and editing it. The metric prompts an LLM to suggest revision edits that mimic the human writing process, yielding a more nuanced and detailed evaluation than simple similarity scores can offer. Revision Distance is inspired by the classical edit distance but extends it with human-like revision behavior, aligning evaluations more closely with human perception.
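The paper is summarized here without code, but the idea of counting revision edits can be illustrated with a minimal sketch. The snippet below assumes the LLM-suggested revision has already been obtained (e.g., by prompting a model to edit the candidate, optionally toward a reference); it then counts word-level edit spans between the candidate and that revision as a rough proxy for Revision Distance.

```python
import difflib
from typing import List


def revision_distance(candidate: str, revised: str) -> int:
    """Rough proxy for Revision Distance: count the word-level edit spans
    needed to turn ``candidate`` into ``revised``, where ``revised`` is an
    LLM-suggested revision of the candidate (the prompting step that
    produces it is outside this sketch)."""
    cand_tokens: List[str] = candidate.split()
    rev_tokens: List[str] = revised.split()
    matcher = difflib.SequenceMatcher(a=cand_tokens, b=rev_tokens)
    # Each contiguous non-matching span (an insertion, deletion, or
    # replacement) counts as one revision edit.
    return sum(1 for op, *_ in matcher.get_opcodes() if op != "equal")
```

A candidate the LLM leaves untouched scores 0; heavier editing yields a larger distance. The paper's actual procedure may count and categorize edits differently, so this is only an approximation of the concept.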

The paper reports that on easy-writing tasks, Revision Distance is consistent with established metrics while offering more detailed feedback and sharper differentiation between texts, suggesting value where existing metrics lack specificity. In more challenging scenarios, such as academic writing, the metric provides more stable and reliable evaluations. It also remains effective when no reference text is available, agreeing with human judgment in approximately 76% of the test cases.
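To give a sense of how edit-level output can be more self-explanatory than a single score, here is a small, hypothetical extension of the sketch above: it groups the diff spans between a candidate and its LLM revision by operation type, one simple way to surface what kind of revisions were needed rather than only how many.

```python
import difflib
from collections import Counter
from typing import Dict


def edit_breakdown(candidate: str, revised: str) -> Dict[str, int]:
    """Group word-level diff spans between a candidate and its LLM
    revision by operation type (replace / insert / delete)."""
    matcher = difflib.SequenceMatcher(a=candidate.split(), b=revised.split())
    counts: Counter = Counter(
        op for op, *_ in matcher.get_opcodes() if op != "equal"
    )
    return dict(counts)


# Example with a toy candidate and a lightly revised version of it.
print(edit_breakdown(
    "The metric count revision edits made by the model",
    "The metric counts the revision edits made by the model",
))  # -> {'replace': 1}
```

In the paper's setting, this kind of breakdown would come from the LLM's own revision suggestions rather than a surface diff, but the principle of reporting edits instead of a single opaque number is the same.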

Several experiments validate the utility of Revision Distance. In reference-based settings, the scenarios covered both easy-writing tasks (e.g., emails, articles) and challenging writing tasks (e.g., related-work sections of academic papers), evaluated across models of different strengths. The results indicate that Revision Distance discriminates between candidate texts better than traditional metrics, particularly in complex writing tasks where knowledge and reasoning are critical.

The implications of this approach are both practical and theoretical. It strengthens the evaluation framework for AI writing applications by introducing a metric aligned with human-centered design principles, and the detailed revision actions it produces offer targeted feedback that can inform model improvement. The work may guide the future development of LLMs in user-focused contexts and influence the design of AI systems that better simulate human editing behavior and preferences.

In conclusion, the Revision Distance metric represents a shift toward human-centered evaluation in natural language processing and AI-based writing assistance. By reflecting real-world text revision processes, it surfaces differences that traditional metrics miss and provides a transparent, self-explanatory framework for assessing text quality in LLM-based applications. Future research could focus on applying the metric across more domains and on reducing its computational cost, a limitation the authors note when relying heavily on models such as GPT-4.
