
All That's 'Human' Is Not Gold: Evaluating Human Evaluation of Generated Text (2107.00061v2)

Published 30 Jun 2021 in cs.CL

Abstract: Human evaluations are typically considered the gold standard in natural language generation, but as models' fluency improves, how well can evaluators detect and judge machine-generated text? We run a study assessing non-experts' ability to distinguish between human- and machine-authored text (GPT2 and GPT3) in three domains (stories, news articles, and recipes). We find that, without training, evaluators distinguished between GPT3- and human-authored text at random chance level. We explore three approaches for quickly training evaluators to better identify GPT3-authored text (detailed instructions, annotated examples, and paired examples) and find that while evaluators' accuracy improved up to 55%, it did not significantly improve across the three domains. Given the inconsistent results across text domains and the often contradictory reasons evaluators gave for their judgments, we examine the role untrained human evaluations play in NLG evaluation and provide recommendations to NLG researchers for improving human evaluations of text generated from state-of-the-art models.

An Evaluation of Human Assessments of Generated Text

The paper "All That's `Human' Is Not Gold: Evaluating Human Evaluation of Generated Text" presents a critical analysis of human evaluation processes applied to text generated by advanced natural language generation (NLG) models like GPT2 and GPT3. As the performance of NLG systems approaches human-like text fluency, the effectiveness of current human evaluation methods is brought into question. The authors conduct an empirical paper to assess whether non-expert evaluators can distinguish between human-authored and machine-generated text across different domains—stories, news articles, and recipes—and they explore strategies for improving evaluation accuracy.

The study reveals that untrained evaluators cannot reliably distinguish GPT3-generated text from human-authored text, performing at roughly random chance level (about 50% accuracy on the binary task). This outcome underscores significant hurdles in relying on non-expert human evaluations as the "gold standard" for assessing automatically generated text. The results were consistent across the three content domains, indicating a systemic issue with current evaluation practices. Evaluators largely focused on surface features such as grammar and style when attempting to differentiate human from machine authorship, attributes that state-of-the-art models like GPT3 have largely mastered.
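
To make the chance-level finding concrete, the sketch below (not from the paper) shows one way to check whether a pool of binary human-vs-machine judgments departs from the 50% chance baseline, using an exact two-sided binomial test. The judgment counts are hypothetical, chosen only to mirror the roughly 50% (chance) and 55% (best post-training) accuracy figures discussed here.

```python
from math import comb

def binomial_two_sided_p(correct: int, total: int, chance: float = 0.5) -> float:
    """Exact two-sided binomial test: probability, under the null hypothesis
    that each judgment is correct with probability `chance`, of an outcome
    at least as unlikely as the observed number of correct judgments."""
    pmf = [comb(total, k) * chance**k * (1 - chance)**(total - k)
           for k in range(total + 1)]
    observed = pmf[correct]
    # Sum the probabilities of all outcomes no more likely than the observed one.
    return min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-9)))

# Hypothetical counts: 312/624 correct (about 50%, i.e. chance level) versus
# 343/624 correct (about 55%, mirroring the best accuracy after training).
print(binomial_two_sided_p(312, 624))  # p = 1.0: exactly at chance
print(binomial_two_sided_p(343, 624))  # small p-value: evidence of above-chance accuracy
```

Comparing accuracy across training conditions or domains would call for a test of differences between proportions rather than this single against-chance check; the sketch only illustrates what "performing at chance" means for a binary judgment task.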

To address these evaluation challenges, the authors examined three evaluator training methods: detailed instructions, annotated examples, and paired examples of human- and machine-written text. The findings show that while example-based training improved evaluators' ability to identify machine-generated text, accuracy rose to at most 55% and the gains were not significant across the three domains. The authors therefore argue for moving away from simplistic evaluation tasks performed in small batches with little training, and they advise NLG researchers to adopt more comprehensive and interactive evaluation strategies.

One of the paper's key contributions is identifying evaluators' reliance on perceived limitations of machine-generated language, such as the assumption that machines cannot express emotion or humor, as a misleading factor. Such misconceptions often undermine evaluators' ability to judge the origin of a text accurately and point to a need to recalibrate evaluator expectations of what current models can generate.

The implications of this research are both practical and theoretical. Practically, it advises against using non-expert, small-batch human evaluations to assess the outputs of state-of-the-art NLG models. Theoretically, it calls for reexamining what constitutes human-likeness in generated text and motivates a shift toward assessment criteria that capture usefulness and content quality rather than superficial human-like characteristics. This shift is crucial as NLG systems are deployed more widely and become more deeply embedded in everyday applications.

Future directions could include refining evaluator training schemes to better address misconceptions about machine capabilities and exploring automated evaluation techniques that complement human judgments. The work brings into focus the need for robust evaluation methodologies that support the effective and ethical integration of NLG systems into real-world applications, keeping pace with both AI advances and human expectations.

Authors (6)
  1. Elizabeth Clark (16 papers)
  2. Tal August (18 papers)
  3. Sofia Serrano (4 papers)
  4. Nikita Haduong (6 papers)
  5. Suchin Gururangan (29 papers)
  6. Noah A. Smith (224 papers)
Citations (332)