Unifying Human and Statistical Evaluation for Natural Language Generation (1904.02792v1)

Published 4 Apr 2019 in cs.CL, cs.AI, and stat.ML

Abstract: How can we measure whether a natural language generation system produces both high quality and diverse outputs? Human evaluation captures quality but not diversity, as it does not catch models that simply plagiarize from the training set. On the other hand, statistical evaluation (i.e., perplexity) captures diversity but not quality, as models that occasionally emit low quality samples would be insufficiently penalized. In this paper, we propose a unified framework which evaluates both diversity and quality, based on the optimal error rate of predicting whether a sentence is human- or machine-generated. We demonstrate that this error rate can be efficiently estimated by combining human and statistical evaluation, using an evaluation metric which we call HUSE. On summarization and chit-chat dialogue, we show that (i) HUSE detects diversity defects which fool pure human evaluation and that (ii) techniques such as annealing for improving quality actually decrease HUSE due to decreased diversity.

Unifying Human and Statistical Evaluation for Natural Language Generation

The paper presents a novel approach to evaluating natural language generation (NLG) systems by unifying human judgment, traditionally considered the gold standard, with statistical metrics such as perplexity. The core issue it addresses is that the two evaluation methods fail in complementary ways: human evaluators do not detect when a model plagiarizes from its training set, while statistical measures struggle to assess the quality of individual outputs.
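The quantity the framework targets can be phrased as the error of the best possible classifier that must decide whether a sentence came from the human distribution or from the model. The formulation below is a standard restatement of that idea (the notation is ours, not a quotation from the paper):

```latex
% Optimal error of deciding whether a sentence x was drawn from the human
% distribution p_hum or from the model distribution p_model, with either
% source equally likely a priori:
L^{*} \;=\; \min_{f}\ \Pr\!\left[f(x) \neq y\right]
      \;=\; \frac{1}{2}\left(1 - \mathrm{TV}\!\left(p_{\mathrm{hum}},\, p_{\mathrm{model}}\right)\right)
```

Because the total variation distance grows whenever the model either emits low-quality samples humans would never write (a quality failure) or misses whole regions of human text (a diversity failure), a score built on L* penalizes both failure modes at once.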

A significant contribution of the paper is a statistically rigorous evaluation metric, called HUSE, that fuses human judgments with model probabilities. The combined method estimates the optimal error rate of a classifier asked to decide whether a given sentence is human- or machine-generated. The proposed metric is characterized as interpretable, simple, and applicable across a wide range of NLG tasks.
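As a concrete illustration of how such a score could be computed, the sketch below pools human references with model samples, represents each sentence by two features (an averaged human judgment score and the model's length-normalized log-probability), and estimates the optimal classification error with a leave-one-out nearest-neighbor classifier. The neighborhood size, feature standardization, and function names are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of a HUSE-style score: estimate how well human text can be told
# apart from model samples using two features per sentence.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

def huse_score(human_feats, model_feats, k=15):
    """human_feats / model_feats: arrays of shape (n, 2) holding
    [human judgment score, log p_model(x) / len(x)] for human-written
    references and model samples, respectively."""
    X = np.vstack([human_feats, model_feats])
    y = np.concatenate([np.zeros(len(human_feats)), np.ones(len(model_feats))])

    # Standardize each feature so neither dominates the nearest-neighbor distance.
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)

    # Leave-one-out k-NN accuracy approximates the best achievable classifier
    # on this two-dimensional feature space.
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y,
                          cv=LeaveOneOut()).mean()
    error = 1.0 - acc

    # Twice the optimal error: 1.0 means model samples look statistically
    # indistinguishable from human text; 0.0 means they are trivially separable.
    return 2.0 * error
```

Under this framing, a score near 1.0 indicates that neither human judgments nor model probabilities can separate the two sources, while a score near 0 means the classifier separates them almost perfectly.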

In the experimental setup, the authors applied the evaluation metric to multiple tasks, including single-sentence language modeling, dialogue, story generation, and summarization. Interestingly, the paper reveals that while language modeling comes closest to being indistinguishable from human-generated text, the other tasks consistently show a significant gap; specifically, the best-performing models can still be distinguished from human counterparts 74% of the time. This finding suggests that, despite recent advances, state-of-the-art NLG models remain far from replicating the complexity of human language use across these tasks.

The implications of this work are twofold. Practically, it provides a more robust framework for evaluating NLG systems that balances human intuition and statistical rigor, potentially guiding improvements in model development. Theoretically, it challenges the perception of existing models’ capabilities, presenting a more nuanced understanding of their limitations.

Looking ahead, this methodology could spearhead further research into the convergence of statistical and human evaluation, refining NLG systems toward generating text indistinguishable from human writing. It also prompts future work on more sophisticated metrics that continue to bridge the gap between human and machine assessment paradigms, ultimately striving for models that perform on par with human linguistic creativity and fluency.

Authors (3)
  1. Tatsunori B. Hashimoto (23 papers)
  2. Hugh Zhang (13 papers)
  3. Percy Liang (239 papers)
Citations (215)