Unifying Human and Statistical Evaluation for Natural Language Generation (1904.02792v1)

Published 4 Apr 2019 in cs.CL, cs.AI, and stat.ML

Abstract: How can we measure whether a natural language generation system produces both high quality and diverse outputs? Human evaluation captures quality but not diversity, as it does not catch models that simply plagiarize from the training set. On the other hand, statistical evaluation (i.e., perplexity) captures diversity but not quality, as models that occasionally emit low quality samples would be insufficiently penalized. In this paper, we propose a unified framework which evaluates both diversity and quality, based on the optimal error rate of predicting whether a sentence is human- or machine-generated. We demonstrate that this error rate can be efficiently estimated by combining human and statistical evaluation, using an evaluation metric which we call HUSE. On summarization and chit-chat dialogue, we show that (i) HUSE detects diversity defects which fool pure human evaluation and that (ii) techniques such as annealing for improving quality actually decrease HUSE due to decreased diversity.

Unifying Human and Statistical Evaluation for Natural Language Generation

The paper presents a novel approach to evaluating natural language generation (NLG) systems by unifying human judgment, traditionally considered the gold standard, with statistical metrics such as perplexity. The core issue it addresses is that the two evaluation methods fail in complementary ways: human evaluators do not detect when a model plagiarizes from its training set, while statistical measures struggle to assess the quality of individual outputs.
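The quantity the framework targets can be phrased as the error of the best possible classifier that must decide whether a sentence came from the human distribution or from the model. The formulation below is a standard restatement of that idea (the notation is ours, not a quotation from the paper):

```latex
% Optimal error of deciding whether a sentence x was drawn from the human
% distribution p_hum or from the model distribution p_model, with either
% source equally likely a priori:
L^{*} \;=\; \min_{f}\ \Pr\!\left[f(x) \neq y\right]
      \;=\; \frac{1}{2}\left(1 - \mathrm{TV}\!\left(p_{\mathrm{hum}},\, p_{\mathrm{model}}\right)\right)
```

Because the total variation distance grows whenever the model either emits low-quality samples humans would never write (a quality failure) or misses whole regions of human text (a diversity failure), a score built on L* penalizes both failure modes at once.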

A significant contribution of the paper is a statistically rigorous evaluation metric, called HUSE, that fuses human judgments with model probabilities. The combined method estimates the optimal error rate of a classifier asked to decide whether a given sentence is human- or machine-generated. The proposed metric is characterized as interpretable, simple, and applicable across a wide range of NLG tasks.
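As a concrete illustration of how such a score could be computed, the sketch below pools human references with model samples, represents each sentence by two features (an averaged human judgment score and the model's length-normalized log-probability), and estimates the optimal classification error with a leave-one-out nearest-neighbor classifier. The neighborhood size, feature standardization, and function names are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of a HUSE-style score: estimate how well human text can be told
# apart from model samples using two features per sentence.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

def huse_score(human_feats, model_feats, k=15):
    """human_feats / model_feats: arrays of shape (n, 2) holding
    [human judgment score, log p_model(x) / len(x)] for human-written
    references and model samples, respectively."""
    X = np.vstack([human_feats, model_feats])
    y = np.concatenate([np.zeros(len(human_feats)), np.ones(len(model_feats))])

    # Standardize each feature so neither dominates the nearest-neighbor distance.
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)

    # Leave-one-out k-NN accuracy approximates the best achievable classifier
    # on this two-dimensional feature space.
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y,
                          cv=LeaveOneOut()).mean()
    error = 1.0 - acc

    # Twice the optimal error: 1.0 means model samples look statistically
    # indistinguishable from human text; 0.0 means they are trivially separable.
    return 2.0 * error
```

Under this framing, a score near 1.0 indicates that neither human judgments nor model probabilities can separate the two sources, while a score near 0 means the classifier separates them almost perfectly.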

In the experimental setup, the authors applied the evaluation metric to multiple tasks, including single-sentence language modeling, dialogue, story generation, and summarization. Interestingly, the paper reveals that while language modeling comes closest to being indistinguishable from human-generated text, the other tasks consistently show a significant gap; specifically, the best-performing models can still be distinguished from human counterparts 74% of the time. This finding suggests that, despite recent advances, state-of-the-art NLG models remain far from replicating the complexity of human language use across these tasks.

The implications of this work are twofold. Practically, it provides a more robust framework for evaluating NLG systems that balances human intuition and statistical rigor, potentially guiding improvements in model development. Theoretically, it challenges the perception of existing models’ capabilities, presenting a more nuanced understanding of their limitations.

Looking ahead, this methodology could spearhead further research into the convergence of statistical and human evaluation, refining NLG systems toward generating text indistinguishable from human writing. It also prompts future work on more sophisticated metrics that continue to bridge the gap between human and machine assessment paradigms, ultimately striving for models that perform on par with human linguistic creativity and fluency.

Authors (3)
  1. Tatsunori B. Hashimoto (23 papers)
  2. Hugh Zhang (13 papers)
  3. Percy Liang (239 papers)
Citations (215)