An Evaluation of Human Assessments of Generated Text
The paper "All That's `Human' Is Not Gold: Evaluating Human Evaluation of Generated Text" presents a critical analysis of human evaluation processes applied to text generated by advanced natural language generation (NLG) models like GPT2 and GPT3. As the performance of NLG systems approaches human-like text fluency, the effectiveness of current human evaluation methods is brought into question. The authors conduct an empirical paper to assess whether non-expert evaluators can distinguish between human-authored and machine-generated text across different domains—stories, news articles, and recipes—and they explore strategies for improving evaluation accuracy.
The paper finds that untrained evaluators are particularly poor at distinguishing GPT3-generated text from human-authored text, with accuracy hovering around random chance. This outcome underscores significant problems with relying on non-expert human evaluation as the "gold standard" for assessing automatically generated text. The result held across all three content domains, pointing to a systemic issue with current evaluation practice. Evaluators largely focused on superficial features such as grammar and style when attempting to tell human from machine authorship, attributes that state-of-the-art models like GPT3 have largely mastered.
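To make the chance-level claim concrete, the sketch below (not the authors' code; the counts are purely illustrative) shows how evaluator accuracy on a binary human-vs-machine judgment task could be tested against the 50% chance baseline with a simple binomial test.

```python
# Minimal sketch of testing evaluator accuracy against chance.
# The counts below are hypothetical placeholders, not the paper's data.
from scipy.stats import binomtest

n_judgments = 780   # hypothetical total number of binary judgments collected
n_correct = 398     # hypothetical number of judgments that were correct

result = binomtest(n_correct, n_judgments, p=0.5, alternative="two-sided")
accuracy = n_correct / n_judgments
print(f"accuracy = {accuracy:.3f}, p-value vs. 50% chance = {result.pvalue:.3f}")
# A large p-value is consistent with evaluators guessing the text's source
# no better than random chance.
```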
To address these evaluation challenges, the authors examined three evaluator training methods: providing detailed instructions, showing annotated examples, and offering paired comparisons of human- and machine-written text. The findings show that while example-based training improved evaluators' ability to identify machine-generated text, the improvement was not consistent across text domains. The authors therefore argue for moving away from simple identification tasks performed in small batches with little training, and advise NLG researchers to adopt more comprehensive and interactive evaluation strategies.
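As a rough illustration of how such a comparison might be tabulated, the following sketch aggregates judgment records by domain and training condition. The field names and records are assumptions made for illustration, not the paper's data or schema.

```python
# Minimal sketch: per-domain accuracy under each training condition,
# computed from a flat list of hypothetical judgment records.
from collections import defaultdict

judgments = [
    {"domain": "stories", "training": "none",     "correct": True},
    {"domain": "stories", "training": "examples", "correct": True},
    {"domain": "news",    "training": "examples", "correct": False},
    # ... one record per evaluator judgment
]

# (domain, training) -> [n_correct, n_total]
totals = defaultdict(lambda: [0, 0])
for j in judgments:
    key = (j["domain"], j["training"])
    totals[key][0] += int(j["correct"])
    totals[key][1] += 1

for (domain, training), (n_correct, n_total) in sorted(totals.items()):
    print(f"{domain:8s} {training:9s} accuracy = {n_correct / n_total:.2f}")
```

Breaking accuracy out by domain in this way is what makes it possible to see that a training method may help in one domain while leaving others at chance.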
One of the paper's key contributions is identifying evaluators' reliance on presumed limitations of machine-generated language, such as the belief that machines cannot express emotion or humor, as a misleading heuristic. These misconceptions undermine evaluators' ability to judge a text's origin accurately and suggest a need to recalibrate expectations of what machine generation can do.
The implications of this research are twofold. Practically, it advises against using non-expert, small-batch human evaluations to assess the output of state-of-the-art NLG models. Theoretically, it prompts a reevaluation of what constitutes human-likeness in generated text and motivates a shift toward assessment criteria that capture usefulness and content quality rather than superficial human-like surface features. This shift becomes more important as NLG systems are deployed more widely and integrated more deeply into everyday applications.
Future directions include refining evaluator training to address misconceptions about machine capabilities and investigating automated evaluation techniques that can complement human judgment. The work highlights the need for robust evaluation methodologies so that NLG systems can be integrated into real-world applications responsibly, in a way that keeps pace with both AI advances and human expectations.