MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance

Published 5 Sep 2019 in cs.CL | (1909.02622v2)

Abstract: A robust evaluation metric has a profound impact on the development of text generation systems. A desirable metric compares system output against references based on their semantics rather than surface forms. In this paper we investigate strategies to encode system and reference texts to devise a metric that shows a high correlation with human judgment of text quality. We validate our new metric, namely MoverScore, on a number of text generation tasks including summarization, machine translation, image captioning, and data-to-text generation, where the outputs are produced by a variety of neural and non-neural systems. Our findings suggest that metrics combining contextualized representations with a distance measure perform the best. Such metrics also demonstrate strong generalization capability across tasks. For ease-of-use we make our metrics available as web service.

Citations (546)

Summary

  • The paper redefines text generation evaluation as a semantic distance task using contextualized embeddings and Earth Mover’s Distance.
  • It shows that MoverScore correlates highly with human judgments across tasks like machine translation, summarization, and image captioning.
  • The study highlights the potential to replace traditional n-gram metrics with a unified evaluation framework based on deep semantic understanding.

Analysis of "MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance"

This paper introduces "MoverScore," a metric for evaluating text generation systems by quantifying the semantic similarity between system-generated and reference texts using contextualized embeddings and Earth Mover's Distance (EMD). The motivation stems from the limitations of traditional metrics, such as BLEU and ROUGE, which rely heavily on n-gram co-occurrence and often fail to capture semantic equivalence when the surface forms differ significantly.
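To see concretely why surface-form overlap falls short, consider a minimal sketch (not from the paper) that scores a paraphrase with a crude unigram-F1 stand-in for BLEU/ROUGE-style metrics: a sentence with equivalent meaning but different wording receives a very low score.

```python
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1: a crude stand-in for n-gram metrics like BLEU/ROUGE."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped counts of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "the cat sat on the mat"
paraphrase = "a feline rested upon the rug"
# Semantically equivalent, but only "the" overlaps: score is about 0.17.
print(unigram_f1(paraphrase, reference))
```

A semantics-aware metric should rate this pair highly; an n-gram metric cannot, which is precisely the gap MoverScore targets.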

Contributions and Methodology

MoverScore redefines the evaluation problem by framing it as a semantic distance measurement task. The approach leverages powerful contextualized representations, such as ELMo and BERT, to encode the nuances of each text, and applies EMD to quantify the semantic displacement between system and reference outputs.

The metric's key innovations include:

  • Utilizing EMD to account for semantic and syntactic deviations by aligning semantically similar words and computing the minimal-cost flow needed to transform one text into the other.
  • Applying contextualized embeddings to capture semantic meaning at the word and sentence level, thus providing a comprehensive measure of similarity.
  • Demonstrating the robustness of MoverScore across diverse tasks, including machine translation, text summarization, image captioning, and dialogue response generation.
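The transport problem underlying the metric can be sketched as follows. This is an illustrative toy implementation, not the authors' code: it solves EMD between two small sets of hand-made 2-D "word embeddings" as a linear program via `scipy.optimize.linprog`, using uniform word weights (the actual MoverScore uses BERT embeddings and IDF-based weights).

```python
import numpy as np
from scipy.optimize import linprog

def emd(x_embs, y_embs, x_weights, y_weights):
    """Earth Mover's Distance between two weighted sets of embeddings,
    solved as a linear program: minimize total flow cost subject to
    marginal constraints on how much mass each word sends/receives."""
    n, m = len(x_embs), len(y_embs)
    # Pairwise Euclidean transport costs between every word pair.
    cost = np.linalg.norm(x_embs[:, None, :] - y_embs[None, :, :], axis=-1)
    c = cost.ravel()  # flow variables f_ij >= 0, flattened row-major
    A_eq, b_eq = [], []
    for i in range(n):  # mass leaving word i of x equals its weight
        row = np.zeros(n * m)
        row[i * m:(i + 1) * m] = 1.0
        A_eq.append(row); b_eq.append(x_weights[i])
    for j in range(m):  # mass arriving at word j of y equals its weight
        col = np.zeros(n * m)
        col[j::m] = 1.0
        A_eq.append(col); b_eq.append(y_weights[j])
    res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, None))
    return res.fun

# Toy embeddings in which "cat"/"feline" and "sat"/"rested" are near-synonyms.
emb = {"cat": [1.0, 0.0], "feline": [1.0, 0.1],
       "sat": [0.0, 1.0], "rested": [0.0, 1.1]}
x = np.array([emb["cat"], emb["sat"]])
y = np.array([emb["feline"], emb["rested"]])
w = np.array([0.5, 0.5])  # uniform word weights
# Optimal flow matches cat->feline and sat->rested: small distance.
print(emd(x, y, w, w))
```

The solver recovers the intuitive alignment (each word flows to its near-synonym), so paraphrases end up close even with zero lexical overlap; in MoverScore the same machinery operates on contextualized embeddings, making the distance sensitive to word sense in context.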

Key Results

Empirical evaluations highlight that MoverScore exhibits a high correlation with human judgments and often surpasses both traditional and supervised baselines:

  • In machine translation tasks, MoverScore shows an average Pearson correlation of 74.3, outperforming RUSE, a supervised metric, by a notable margin.
  • For text summarization, it achieves correlations that are comparable to or exceed those of traditional metrics such as ROUGE.
  • In image captioning, the metric closely aligns with human evaluations, performing favorably against both unsupervised and supervised methods.
  • Though less effective in dialogue systems, likely due to the complexity of capturing entity and number representations, MoverScore still leads among unsupervised evaluations.

Implications and Future Directions

MoverScore's reliance on pre-trained models like BERT reflects a broader trend towards leveraging deep contextualized representations to enhance NLP evaluations. Its success across multiple text generation tasks supports the notion that semantic understanding, rather than mere surface similarity, is crucial for accurate assessment.

The paper suggests that MoverScore could potentially replace several existing metrics, fostering a more unified evaluation framework. Future explorations could focus on reducing the dependency on reference texts altogether, aiming for a fully unsupervised evaluation that utilizes only source and generated texts.

In conclusion, the MoverScore metric represents an advancement in the evaluation of text generation systems, demonstrating that semantic similarity, effectively measured using contextualized embeddings and EMD, offers a reliable indicator of text quality. The metric's application across different domains highlights its generalization capacity and points to promising directions for future research in NLP assessment methodologies.
