Mind the Style Gap: Meta-Evaluation of Style and Attribute Transfer Metrics (2502.15022v3)

Published 20 Feb 2025 in cs.CL

Abstract: LLMs make it easy to rewrite a text in any style -- e.g. to make it more polite, persuasive, or more positive -- but evaluation thereof is not straightforward. A challenge lies in measuring content preservation: that content not attributable to style change is retained. This paper presents a large meta-evaluation of metrics for evaluating style and attribute transfer, focusing on content preservation. We find that meta-evaluation studies on existing datasets lead to misleading conclusions about the suitability of metrics for content preservation. Widely used metrics show a high correlation with human judgments despite being deemed unsuitable for the task -- because they do not abstract from style changes when evaluating content preservation. We show that the overly high correlations with human judgment stem from the nature of the test data. To address this issue, we introduce a new, challenging test set specifically designed for evaluating content preservation metrics for style transfer. Using this dataset, we demonstrate that suitable metrics for content preservation for style transfer indeed are style-aware. To support efficient evaluation, we propose a new style-aware method that utilises small LLMs, obtaining a higher alignment with human judgements than prompting a model of a similar size as an autorater.

Authors (3)
  1. Amalie Brogaard Pauli (3 papers)
  2. Isabelle Augenstein (131 papers)
  3. Ira Assent (25 papers)
