An Analysis of BERTScore: Evaluating Text Generation with BERT
Introduction
The evaluation of text generation tasks, such as machine translation and image captioning, typically hinges on comparing generated sentences against human-written reference sentences. Conventional metrics like BLEU rely on surface-level features such as n-gram overlap, which often fall short of capturing the semantic nuances and linguistic diversity pivotal to natural language understanding. The paper "BERTScore: Evaluating Text Generation with BERT" introduces BERTScore, a robust evaluation metric that leverages BERT embeddings to compute token-level similarities between candidate and reference sentences.
BERTScore Methodology
BERTScore diverges from traditional methods by harnessing the power of pre-trained BERT embeddings to evaluate text generation. Unlike BLEU or ROUGE, which operate on exact matching of n-grams, BERTScore utilizes contextual embeddings for token-level matching based on cosine similarity.
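The authors released an open-source implementation. Assuming the bert-score package on PyPI and its top-level score helper (the exact call below is an illustrative assumption, not a detail stated in the paper), a typical usage might look like this:

```python
# Hedged usage sketch: assumes `pip install bert-score` and the package's
# top-level `score` helper; the arguments shown are illustrative.
from bert_score import score

candidates = ["The weather is cold today."]
references = ["It is freezing today."]

# Returns per-sentence precision, recall, and F1 tensors.
P, R, F1 = score(candidates, references, lang="en", rescale_with_baseline=True)
print(f"P={P.mean():.3f}  R={R.mean():.3f}  F1={F1.mean():.3f}")
```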
Token Representation
Tokens in both candidate and reference sentences are represented using contextual embeddings derived from BERT. These embeddings consider the entire sentence context, allowing each token's representation to dynamically adapt based on surrounding words. This approach contrasts sharply with static word embeddings like Word2Vec and GloVe, which fail to capture token-specific contextual nuances.
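To make this concrete, the sketch below obtains contextual token embeddings from a pre-trained BERT model, assuming the Hugging Face transformers library; the model name and tensor shapes are illustrative rather than prescribed by the paper:

```python
# Minimal sketch: contextual token embeddings for one sentence.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

sentence = "The cat sat on the mat."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per (sub-word) token; the same word in a different
# sentence context would receive a different vector.
token_embeddings = outputs.last_hidden_state.squeeze(0)  # (num_tokens, hidden_dim)
```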
Similarity Measure and BERTScore Calculation
Token similarity is quantified using the cosine similarity between token embeddings. BERTScore greedily matches each token in one sentence to its most similar token in the other, then aggregates these maximum similarities into precision, recall, and an F1-score (a minimal sketch of the computation follows the list below):
- Precision (P_BERT): averages, over the candidate tokens, each token's best-match similarity against the reference, measuring how well the candidate's content is supported by the reference.
- Recall (R_BERT): averages, over the reference tokens, each token's best-match similarity against the candidate, measuring how well the reference's content is covered by the candidate.
- F1-score (F_BERT): the harmonic mean of P_BERT and R_BERT, providing a balanced evaluation.
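The sketch below applies these definitions to pre-computed token embeddings. It follows the paper's unweighted formulation; the function name and tensor shapes are assumptions made for illustration:

```python
# Greedy soft matching of candidate and reference token embeddings.
import torch
import torch.nn.functional as F

def bertscore_from_embeddings(cand_emb: torch.Tensor, ref_emb: torch.Tensor):
    """cand_emb: (num_candidate_tokens, dim); ref_emb: (num_reference_tokens, dim)."""
    # Normalize so that dot products equal cosine similarities.
    cand = F.normalize(cand_emb, dim=-1)
    ref = F.normalize(ref_emb, dim=-1)

    # Pairwise cosine similarities: sim[i, j] = cos(ref_i, cand_j).
    sim = ref @ cand.T

    # Each reference token matched to its most similar candidate token -> recall.
    recall = sim.max(dim=1).values.mean()
    # Each candidate token matched to its most similar reference token -> precision.
    precision = sim.max(dim=0).values.mean()
    f1 = 2 * precision * recall / (precision + recall)
    return precision.item(), recall.item(), f1.item()
```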
Importance Weighting and Baseline Rescaling
To make the evaluation more sensitive to informative words, BERTScore optionally incorporates inverse document frequency (IDF) weighting, which increases the contribution of rare but contextually significant tokens relative to frequent function words. In addition, a baseline rescaling step linearly shifts scores using an empirical baseline computed from random sentence pairs, spreading them over a more readable range without changing their ranking.
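Below is a rough sketch of both steps; the helper names and the plus-one smoothing variant are illustrative assumptions rather than the paper's exact code:

```python
import math
from collections import Counter

def idf_weights(reference_corpus, tokenize):
    """One IDF weight per token seen in the reference corpus; rarer tokens get
    larger weights, so content words count more than frequent function words.
    The plus-one smoothing used here is one common variant."""
    num_refs = len(reference_corpus)
    doc_freq = Counter()
    for ref in reference_corpus:
        doc_freq.update(set(tokenize(ref)))
    return {tok: -math.log((df + 1) / (num_refs + 1)) for tok, df in doc_freq.items()}

def rescale(raw_score, baseline):
    """Linearly rescale a raw score against an empirical baseline computed from
    random (unrelated) sentence pairs. The transformation is monotonic, so it
    spreads scores over a more readable range without changing rankings."""
    return (raw_score - baseline) / (1 - baseline)
```

When IDF weighting is enabled, the per-token maxima in the matching step are combined with an IDF-weighted average instead of a uniform one.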
Experimental Validation
Extensive experiments validate BERTScore’s efficacy across different tasks, including machine translation and image captioning.
Machine Translation
Using data from the WMT18 metrics shared task, BERTScore consistently correlates more strongly with human judgments than existing metrics such as BLEU and METEOR, at both the segment level and the system level. It also remains reliable for system-level correlation on hybrid systems formed by mixing outputs from different systems, which facilitates fine-grained model selection.
Image Captioning
On the COCO Captioning Challenge, BERTScore surpasses task-agnostic metrics like BLEU and ROUGE and approaches the performance of task-specific metrics. The IDF weighting contributes significantly to performance, reflecting the necessity of content word emphasis in image captioning tasks.
Adversarial Robustness
Evaluations on the adversarial paraphrase dataset PAWS reveal BERTScore's relative robustness on challenging examples where traditional metrics stumble: it copes far better with word-order permutations and subtle lexical changes that preserve surface overlap while altering meaning.
Implications and Future Directions
BERTScore's introduction marks a considerable step forward in providing a semantically rich evaluation metric for text generation. It mitigates the limitations of traditional n-gram-based metrics, offering a nuanced measure that aligns more closely with human judgment. Its applicability across diverse language pairs and tasks underscores its versatility.
Future research might explore domain-specific optimizations of BERTScore. Additionally, incorporating BERTScore directly into the learning objectives for text generation models could bridge the gap between model training and evaluation, fostering the development of more semantically coherent and contextually aware generation systems.
Conclusion
BERTScore presents a sophisticated approach to evaluating natural language generation, capitalizing on the contextual richness of BERT embeddings. With its demonstrated robustness and high correlation with human assessments, it paves the way for more meaningful and accurate text evaluation metrics, adaptable across varied linguistic and generative tasks.