Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text (2202.06935v1)

Published 14 Feb 2022 in cs.CL, cs.AI, and cs.LG

Abstract: Evaluation practices in natural language generation (NLG) have many known flaws, but improved evaluation approaches are rarely widely adopted. This issue has become more urgent, since neural NLG models have improved to the point where they can often no longer be distinguished based on the surface-level features that older metrics rely on. This paper surveys the issues with human and automatic model evaluations and with commonly used datasets in NLG that have been pointed out over the past 20 years. We summarize, categorize, and discuss how researchers have been addressing these issues and what their findings mean for the current state of model evaluations. Building on those insights, we lay out a long-term vision for NLG evaluation and propose concrete steps for researchers to improve their evaluation processes. Finally, we analyze 66 NLG papers from recent NLP conferences in how well they already follow these suggestions and identify which areas require more drastic changes to the status quo.

Critical Analysis of NLG Evaluation Challenges and Practices

The paper "Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text" surveys the longstanding challenges and inadequacies within the evaluation paradigms for natural language generation (NLG). The core issue identified is that evaluation methods have not kept pace with modern neural NLG models: as the models have improved, the gap between their sophistication and the simplistic evaluation techniques has widened, necessitating a comprehensive restructuring of the evaluation frameworks.

Key Challenges in Evaluation Practices

The paper categorizes evaluation challenges into several interrelated areas:

  1. Datasets: A significant portion of the problem originates from the datasets used for training and evaluation. These datasets are predominantly English-centric, limiting the generalizability of model capabilities across other languages. Moreover, they are often constructed without adequate consideration for robustness to tail effects, resulting in evaluations that are not representative of real-world applications.
  2. Automatic Metrics: Traditional metrics such as BLEU and ROUGE rely primarily on surface-level features like lexical overlap, which fail to capture qualities such as coherence, factual consistency, and relevance to the input context. The paper highlights the pitfalls of relying too heavily on these metrics, arguing that they can lead to overestimates of model proficiency (a toy illustration appears after this list).
  3. Human Evaluations: Although considered a gold standard, human evaluations are fraught with challenges including high variance, lack of replicability, and inconsistent reporting practices. Additionally, the subjective and often ambiguous nature of evaluation criteria further complicates the reliability of human assessment.
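
The limitation noted in item 2 is easy to demonstrate with a toy example. The sketch below computes a simple unigram-overlap F1 score as a rough stand-in for metrics in the ROUGE family; the sentences and the scoring function are invented for illustration and are not drawn from the paper or any specific toolkit. A faithful paraphrase scores far lower than an output that copies the reference but flips a key fact.

```python
from collections import Counter

def unigram_f1(hypothesis: str, reference: str) -> float:
    """Toy surface-overlap score: unigram F1 between hypothesis and reference.
    A rough stand-in for ROUGE-1; real metrics add stemming, n-grams, etc."""
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

reference  = "the company reported a loss of 5 million dollars in 2021"
paraphrase = "in 2021 the firm lost about five million dollars"            # faithful, low overlap
wrong_copy = "the company reported a profit of 5 million dollars in 2021"  # unfaithful, high overlap

print(f"faithful paraphrase:  {unigram_f1(paraphrase, reference):.2f}")   # ~0.50
print(f"factually wrong copy: {unigram_f1(wrong_copy, reference):.2f}")   # ~0.91
```

That the factually wrong output wins on overlap is precisely the failure mode the survey attributes to surface-level metrics.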

Survey Findings and Recommendations

The survey analyzed 66 recent NLG papers, revealing only marginal adherence to recommended practices, with an average of 27% compliance across 29 dimensions. This suggests a glaring need for more rigorous evaluation strategies that acknowledge model limitations and encourage transparency and reproducibility.

To that end, the authors propose multiple avenues for improving evaluation methodologies:

  • Enhanced Documentation: Adopting structured documentation practices for datasets and models, such as data and model cards, could enhance transparency and facilitate better understanding and replicability of experimental results.
  • Comprehensive Evaluation Reports: These reports should integrate various evaluation metrics, human assessments, and error analyses to offer a multi-dimensional view of model performance. This holistic approach would provide more actionable insights compared to monolithic evaluation scores that obscure nuanced performance details.
  • Development of Evaluation Suites: These suites would consist of curated test sets designed to target specific capabilities and limitations of NLG models, enabling finer-grained evaluation results (a minimal sketch appears after this list).
  • Robust and Inclusive Human Evaluation Practices: This includes considering the socio-demographic backgrounds of annotators to capture a wide array of perspectives and ensure fair representation of diverse linguistic nuances and cultural contexts.
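
To make the evaluation-suite idea concrete, here is a minimal sketch of what such a suite might look like in code. The case structure, capability labels, and string-containment checks are invented for illustration; the paper does not prescribe this format.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical suite: each case targets one capability (e.g. negation, numbers)
# so results can be reported per capability rather than as one aggregate score.
SUITE: List[Dict[str, str]] = [
    {"capability": "negation", "input": "Summarize: The drug did not reduce symptoms.",
     "must_contain": "not"},
    {"capability": "numbers", "input": "Summarize: Revenue grew from 2M to 8M.",
     "must_contain": "8M"},
]

def evaluate(generate: Callable[[str], str]) -> Dict[str, float]:
    """Run the suite and report pass rates broken down by targeted capability."""
    passed, total = defaultdict(int), defaultdict(int)
    for case in SUITE:
        total[case["capability"]] += 1
        if case["must_contain"] in generate(case["input"]):
            passed[case["capability"]] += 1
    return {cap: passed[cap] / total[cap] for cap in total}

# Usage with a placeholder model that simply echoes its input:
print(evaluate(lambda prompt: prompt))
```

Reporting pass rates per capability, rather than a single aggregate number, is what lets such a suite expose specific failure modes.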

Future Directions and Implications

In conclusion, the paper outlines a compelling case for overhauling the existing evaluation frameworks in NLG. The proposals emphasize a shift from competing for superior performance metrics to achieving clarity about model limitations and potential failure modes. As NLG models are increasingly deployed in high-stakes applications, the importance of robust evaluation practices that reflect the complexity and nuances of human language cannot be overstated. By fostering a culture of comprehensive and transparent evaluation, the research community can align more closely with societal needs and ethical standards, while also paving the way for innovative advancements in AI.

Authors (3)
  1. Sebastian Gehrmann (48 papers)
  2. Elizabeth Clark (16 papers)
  3. Thibault Sellam (19 papers)
Citations (169)