
A Survey of Evaluation Metrics Used for NLG Systems (2008.12009v2)

Published 27 Aug 2020 in cs.CL

Abstract: The success of Deep Learning has created a surge in interest in a wide range of Natural Language Generation (NLG) tasks. Deep Learning has not only pushed the state of the art in several existing NLG tasks but has also facilitated researchers to explore various newer NLG tasks such as image captioning. Such rapid progress in NLG has necessitated the development of accurate automatic evaluation metrics that would allow us to track the progress in the field of NLG. However, unlike classification tasks, automatically evaluating NLG systems in itself is a huge challenge. Several works have shown that early heuristic-based metrics such as BLEU and ROUGE are inadequate for capturing the nuances in the different NLG tasks. The expanding number of NLG models and the shortcomings of the current metrics have led to a rapid surge in the number of evaluation metrics proposed since 2014. Moreover, various evaluation metrics have shifted from using pre-determined heuristic-based formulae to trained transformer models. This rapid change in a relatively short time has led to the need for a survey of the existing NLG metrics to help existing and new researchers quickly come up to speed with the developments that have happened in NLG evaluation in the last few years. Through this survey, we first wish to highlight the challenges and difficulties in automatically evaluating NLG systems. Then, we provide a coherent taxonomy of the evaluation metrics to organize the existing metrics and to better understand the developments in the field. We also describe the different metrics in detail and highlight their key contributions. Later, we discuss the main shortcomings identified in the existing metrics and describe the methodology used to evaluate evaluation metrics. Finally, we discuss our suggestions and recommendations on the next steps forward to improve the automatic evaluation metrics.

Overview of Evaluation Metrics for Natural Language Generation Systems

The paper "A Survey of Evaluation Metrics Used for NLG Systems" provides a comprehensive overview of the evolution and current state of automatic evaluation metrics for Natural Language Generation (NLG) tasks. Authored by Ananya B. Sai, Akash Kumar Mohankumar, and Mitesh M. Khapra, the paper aims to address the growing complexity and variety of NLG tasks, the inadequacies of traditional evaluation methods, and the emergence of more sophisticated metrics.

Context and Motivation

The success of Deep Learning has catalyzed advancements in the field of NLG, improving performance on traditional tasks such as machine translation and enabling newer tasks like image captioning. This rapid expansion necessitates evaluation metrics that are adaptable and capable of capturing the nuances of these diverse tasks. The paper highlights the inadequacies of early heuristic-based metrics like BLEU and ROUGE, which were primarily designed for specific tasks and do not generalize well to newer challenges or capture the intricacies required for effective evaluation.

Taxonomy of Evaluation Metrics

The authors propose a taxonomy to categorize evaluation metrics, dividing them into context-free and context-dependent metrics.

  • Context-Free Metrics: These metrics do not take the input context into account and instead score the output based solely on its similarity to reference outputs. They include:
    • Untrained Metrics: Such as BLEU and ROUGE, these rely on pre-defined heuristics (typically word- or n-gram-overlap statistics) and are limited in their interpretive power (see the sketch after this list).
    • Trained Metrics: These leverage machine learning to combine various features or are trained end-to-end using data, potentially adapting to task-specific nuances.
  • Context-Dependent Metrics: These metrics incorporate the input context, aiming to evaluate outputs in a manner that integrates an understanding of the task's demands.
    • Untrained Metrics: Include adaptations of existing methods to utilize context, such as those augmented with semantic embeddings.
    • Trained Metrics: Typically use deep learning to account for context in evaluating task-specific aspects, such as relevance and coherence in dialogue systems.
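To make the context-free, untrained branch of the taxonomy concrete, the following is a minimal sketch of a BLEU-style clipped n-gram precision. It is an illustrative simplification, not the full BLEU definition (no brevity penalty and no geometric mean over multiple n-gram orders), and the example sentences are invented for demonstration.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(candidate, references, n=2):
    """Clipped n-gram precision of a candidate against one or more references.

    The metric ignores the input context entirely: it scores the candidate
    purely by its overlap with the reference outputs, which is what makes it
    a context-free, untrained metric in the survey's taxonomy.
    """
    cand_counts = ngram_counts(candidate, n)
    if not cand_counts:
        return 0.0
    # Clip each candidate n-gram count by its maximum count across references,
    # so repeating a correct n-gram many times does not inflate the score.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngram_counts(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


candidate = "the cat sat on the mat".split()
references = ["the cat is on the mat".split()]
print(ngram_precision(candidate, references, n=2))  # bigram precision: 0.6
```

Trained and context-dependent metrics replace or augment this kind of surface overlap with learned representations of the output and, where applicable, the input context.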

Critical Appraisal of Existing Metrics

The paper critiques existing metrics on several fronts:

  • Correlation with Human Judgements: Many metrics exhibit poor correlation with human assessments, particularly at the sentence level (a short example of this evaluation methodology follows this list).
  • Bias and Uninterpretability: Several metrics are found to be biased towards specific models or characteristics, hindering their general applicability. Furthermore, the lack of interpretability limits their utility in providing actionable insights for system improvement.
  • Dataset Dependencies: The effectiveness of metrics can vary significantly with the datasets used to train and evaluate them, highlighting a need for robustness across diverse data sources.
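Metrics are themselves typically evaluated by correlating their scores with human judgments over a set of system outputs. The following is a minimal sketch of that methodology using standard correlation coefficients; the score and rating arrays are hypothetical placeholders, not data from the survey.

```python
from scipy.stats import pearsonr, spearmanr, kendalltau

# Hypothetical automatic metric scores and human ratings (e.g. 1-5 scale)
# for the same five system outputs.
metric_scores = [0.31, 0.52, 0.47, 0.66, 0.20]
human_ratings = [2.0, 3.5, 3.0, 4.5, 1.5]

pearson, _ = pearsonr(metric_scores, human_ratings)    # linear correlation
spearman, _ = spearmanr(metric_scores, human_ratings)  # rank correlation
kendall, _ = kendalltau(metric_scores, human_ratings)  # pairwise rank agreement

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}, Kendall: {kendall:.3f}")
```

Sentence-level (per-output) correlations computed this way are generally much lower than system-level correlations, which is the basis of the criticism noted above.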

Recommendations and Future Directions

The authors recommend several pathways for advancing the development of evaluation metrics:

  1. Development of Task-Specific Context-Dependent Metrics: By focusing on the unique requirements of different NLG tasks, these metrics could offer more reliable and relevant evaluations.
  2. Construction of Comprehensive Datasets: These datasets should include rich annotations reflecting various dimensions of quality, enabling the training of sophisticated metrics.
  3. Aggregating Diverse Evaluation Approaches: Combining the strengths of statistical and semantic-based evaluation methods could lead to more holistic assessments.
  4. Promoting Transparency and Standardization: Open codebases and standard protocols are vital for the reproducibility and comparability of evaluation efforts across different NLG tasks.

Conclusion

The paper underscores the complexity of designing evaluation metrics that are both adaptable and robust across diverse NLG tasks. By highlighting the challenges and recommending future research directions, the authors provide a roadmap for advancing the field of NLG evaluation, aiming for metrics that closely align with human judgments and facilitate the development of more effective generation systems. As the field progresses, continued scrutiny and iteration on metric development will be essential to meet the dynamic needs of NLG applications.

Authors (3)
  1. Ananya B. Sai (11 papers)
  2. Akash Kumar Mohankumar (9 papers)
  3. Mitesh M. Khapra (79 papers)
Citations (211)