
MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance (1909.02622v2)

Published 5 Sep 2019 in cs.CL

Abstract: A robust evaluation metric has a profound impact on the development of text generation systems. A desirable metric compares system output against references based on their semantics rather than surface forms. In this paper we investigate strategies to encode system and reference texts to devise a metric that shows a high correlation with human judgment of text quality. We validate our new metric, namely MoverScore, on a number of text generation tasks including summarization, machine translation, image captioning, and data-to-text generation, where the outputs are produced by a variety of neural and non-neural systems. Our findings suggest that metrics combining contextualized representations with a distance measure perform the best. Such metrics also demonstrate strong generalization capability across tasks. For ease-of-use we make our metrics available as web service.

Analysis of "MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance"

This paper introduces "MoverScore," a metric for evaluating text generation systems by quantifying the semantic similarity between system-generated and reference texts using contextualized embeddings and Earth Mover's Distance (EMD). The motivation stems from the limitations of traditional metrics, such as BLEU and ROUGE, which rely heavily on n-gram co-occurrence and often fail to capture semantic equivalence when the surface forms differ significantly.
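To make this limitation concrete, consider a toy illustration (not from the paper): BLEU scores a faithful paraphrase near zero because almost no n-grams overlap. The snippet below assumes the `nltk` package; the sentences and score are illustrative only.

```python
# Toy illustration of n-gram overlap missing a paraphrase (assumes nltk).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "physician", "advised", "plenty", "of", "rest"]]
hypothesis = ["the", "doctor", "recommended", "lots", "of", "sleep"]

# Smoothing avoids a hard zero when higher-order n-grams have no matches.
smooth = SmoothingFunction().method1
score = sentence_bleu(reference, hypothesis, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")  # near zero despite near-identical meaning
```

A semantics-aware metric should instead reward the alignment of "physician"/"doctor" and "rest"/"sleep", which is exactly the behavior contextualized embeddings combined with EMD aim to provide.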

Contributions and Methodology

MoverScore redefines the evaluation problem by framing it as a semantic distance measurement task. The approach leverages powerful contextualized representations, such as ELMo and BERT, to encode the nuances of each text, and applies EMD to quantify the semantic distance between system and reference outputs (a code sketch follows the list below).

The metric's key innovations include:

  • Utilizing EMD to account for semantic and syntactic deviations by aligning semantically similar words and computing the minimum-cost flow needed to transform one text into the other.
  • Applying contextualized embeddings to capture semantic meaning at the word and sentence level, thus providing a comprehensive measure of similarity.
  • Demonstrating the robustness of MoverScore across diverse tasks, including machine translation, text summarization, image captioning, and dialogue response generation.
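The following is a minimal sketch of this core computation under stated assumptions: it uses the `transformers` and POT (`pot`) libraries, embeds subword tokens with BERT, weights tokens uniformly (the paper uses IDF weighting and n-gram embeddings aggregated from BERT's layers), and returns the raw transport cost rather than the paper's normalized score. The function names are illustrative, not the authors' released API.

```python
# Minimal MoverScore-style sketch: BERT token embeddings + Earth Mover's
# Distance. Assumes `pip install torch transformers pot numpy`.
import numpy as np
import torch
import ot  # POT: Python Optimal Transport
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed_tokens(text: str) -> np.ndarray:
    """Return one contextualized embedding per subword token."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, dim)
    return hidden.numpy().astype(np.float64)

def mover_distance(system: str, reference: str) -> float:
    """EMD between the two texts' token-embedding clouds."""
    x, y = embed_tokens(system), embed_tokens(reference)
    # Cost of moving each system token to each reference token.
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    # Uniform token weights; the paper uses IDF weights instead.
    a = np.full(len(x), 1.0 / len(x))
    b = np.full(len(y), 1.0 / len(y))
    return ot.emd2(a, b, cost)  # minimum total transport cost

print(mover_distance("The cat sat on the mat.",
                     "A cat was sitting on the mat."))
```

Lower distances indicate closer semantics; the sketch keeps only the embed-align-transport skeleton, while the paper additionally converts weighted distances into similarity scores.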

Key Results

Empirical evaluations highlight that MoverScore exhibits a high correlation with human judgments and often surpasses both traditional and supervised baselines:

  • In machine translation tasks, MoverScore shows an average Pearson correlation of 74.3, outperforming RUSE, a supervised metric, by a notable margin.
  • For text summarization, it achieves correlations that are comparable to or exceed those of traditional metrics such as ROUGE.
  • In image captioning, the metric closely aligns with human evaluations, performing favorably against both unsupervised and supervised methods.
  • Though less effective on dialogue-oriented data-to-text generation, likely because entities and numbers are hard to capture in embedding space, MoverScore still leads among unsupervised metrics.

Implications and Future Directions

MoverScore's reliance on pre-trained models like BERT reflects a broader trend towards leveraging deep contextualized representations to enhance NLP evaluations. Its success across multiple text generation tasks supports the notion that semantic understanding, rather than mere surface similarity, is crucial for accurate assessment.

The paper suggests that MoverScore could potentially replace several existing metrics, fostering a more unified evaluation framework. Future explorations could focus on reducing the dependency on reference texts altogether, aiming for a fully unsupervised evaluation that utilizes only source and generated texts.

In conclusion, the MoverScore metric represents an advancement in the evaluation of text generation systems, demonstrating that semantic similarity, effectively measured using contextualized embeddings and EMD, offers a reliable indicator of text quality. The metric's application across different domains highlights its generalization capacity and points to promising directions for future research in NLP assessment methodologies.

Authors (6)
  1. Wei Zhao (309 papers)
  2. Maxime Peyrard (33 papers)
  3. Fei Liu (232 papers)
  4. Yang Gao (761 papers)
  5. Christian M. Meyer (13 papers)
  6. Steffen Eger (90 papers)
Citations (546)