Unifying Human and Statistical Evaluation for Natural Language Generation

Published 4 Apr 2019 in cs.CL, cs.AI, and stat.ML | (1904.02792v1)

Abstract: How can we measure whether a natural language generation system produces both high quality and diverse outputs? Human evaluation captures quality but not diversity, as it does not catch models that simply plagiarize from the training set. On the other hand, statistical evaluation (i.e., perplexity) captures diversity but not quality, as models that occasionally emit low quality samples would be insufficiently penalized. In this paper, we propose a unified framework which evaluates both diversity and quality, based on the optimal error rate of predicting whether a sentence is human- or machine-generated. We demonstrate that this error rate can be efficiently estimated by combining human and statistical evaluation, using an evaluation metric which we call HUSE. On summarization and chit-chat dialogue, we show that (i) HUSE detects diversity defects which fool pure human evaluation and that (ii) techniques such as annealing for improving quality actually decrease HUSE due to decreased diversity.

Citations (215)

Summary

  • The paper proposes a unified metric, HUSE, that combines human judgment with statistical measures to assess both the quality and the diversity of NLG output.
  • Its methodology estimates the optimal error rate of a classifier that distinguishes human- from machine-generated text, across diverse NLG tasks.
  • Experiments reveal that while some tasks approach human-like output, text from the best models can still be distinguished from genuine human text roughly 74% of the time.

The paper presents a novel approach to evaluating natural language generation (NLG) systems by unifying human judgment, traditionally considered the gold standard, with statistical metrics such as perplexity. The core issue addressed is the complementary failure modes of these two evaluation methods: human evaluators cannot detect when a model lacks diversity, for example by reproducing sentences from its training set, while statistical measures such as perplexity cannot adequately penalize occasional low-quality samples.
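Concretely, the framework is built around the error rate of an optimal classifier asked to decide whether a sentence was written by a human or generated by the model. As a brief, standard sketch (the notation below is illustrative rather than copied from the paper): with both sources equally likely, this Bayes-optimal error is governed by the total variation distance between the human distribution and the model distribution,

$$ L^{*} \;=\; \tfrac{1}{2}\Bigl(1 - \lVert p_{\text{human}} - p_{\text{model}} \rVert_{TV}\Bigr), $$

so the error reaches chance level (1/2) only when the model matches the human distribution in both quality and diversity; a shortfall in either makes the two sources easier to tell apart.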

A significant contribution of the paper is a statistically rigorous evaluation metric, HUSE (Human Unified with Statistical Evaluation), that fuses human judgments with model probabilities. The combined method estimates the optimal error rate of predicting whether a given sentence is human- or machine-generated. The metric is characterized as interpretable, simple, and applicable across a wide range of NLG tasks.
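To make the estimation step concrete, the sketch below shows one plausible way such a score could be computed: every sentence is summarized by two features (its length-normalized model log-probability and an averaged human quality rating), and the leave-one-out error of a k-nearest-neighbor classifier over these features serves as the estimate of how well human and machine text can be told apart. The function name `huse_score`, the exact feature normalization, the choice of k, and the convention of reporting twice the estimated error (so that 1.0 means indistinguishable) are assumptions for illustration, not the authors' released implementation.

```python
# Hypothetical sketch of a HUSE-style estimator (not the authors' code).
# Each example is a 2-D feature vector: (normalized model log-prob, human judgment).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.preprocessing import StandardScaler

def huse_score(human_feats: np.ndarray, model_feats: np.ndarray, k: int = 15) -> float:
    """Estimate a HUSE-like score from two (n, 2) feature arrays.

    Each row: [log p_model(x) / len(x), mean human quality rating for x].
    Returns 2 * (leave-one-out error of a k-NN human-vs-machine classifier),
    so 1.0 ~ indistinguishable from human text and 0.0 ~ trivially separable.
    Assumes both arrays contain comfortably more than k examples.
    """
    X = np.vstack([human_feats, model_feats])
    y = np.concatenate([np.zeros(len(human_feats)), np.ones(len(model_feats))])
    X = StandardScaler().fit_transform(X)  # put both features on a comparable scale
    clf = KNeighborsClassifier(n_neighbors=k)
    # Leave-one-out accuracy of the discriminator; 1 - accuracy estimates the optimal error.
    acc = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
    return 2.0 * (1.0 - acc)
```

Restricting the same classifier to the human-rating feature alone would give a quality-only variant, and the gap to the full score indicates how much distinguishability comes from diversity defects rather than low-quality samples, roughly mirroring the quality/diversity decomposition the paper advocates.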

In the experiments, the authors applied the metric to multiple tasks, including single-sentence language modeling, chit-chat dialogue, story generation, and summarization. The study reveals that while outputs on tasks such as language modeling come closer to being indistinguishable from human-generated text, other tasks consistently show a significant gap; in particular, outputs from the best-performing models can still be distinguished from human-written text roughly 74% of the time. This finding suggests that, despite recent advances, state-of-the-art NLG models still fall short of replicating the complexity of human language use across these tasks.
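To connect that figure back to the metric (under the assumption that the score is reported as twice the optimal classification error, so that 1.0 means indistinguishable from human text), a discriminator that is right 74% of the time has an error rate of 0.26, giving

$$ \text{HUSE} \approx 2 \times (1 - 0.74) = 0.52, $$

roughly halfway between trivially distinguishable (0) and fully human-like (1).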

The implications of this work are twofold. Practically, it provides a more robust framework for evaluating NLG systems that balances human intuition and statistical rigor, potentially guiding improvements in model development. Theoretically, it challenges the perception of existing models’ capabilities, presenting a more nuanced understanding of their limitations.

Looking ahead, this methodology could spur further research into combining statistical and human evaluation to push NLG systems towards generating text that is indistinguishable from human writing. It also motivates exploration of more sophisticated metrics that continue to bridge the gap between human and machine assessment, ultimately aiming for models on par with human linguistic creativity and fluency.
