- The paper finds that only 13% of the reviewed human evaluation studies have sufficiently low barriers to reproduction; the rest are blocked by critical missing information and unresponsive authors.
- The study systematically annotates 177 papers, revealing widespread gaps and experimental flaws in human evaluation methodology.
- The research calls for standardized reporting practices and detailed documentation to improve reproducibility in NLP evaluations.
Evaluating the Reproducibility of Human Evaluations in NLP
This paper analyses the landscape of reproducibility of human evaluations in NLP. The authors critically assess the feasibility of reproducing human evaluations reported in previous research, documenting the shortcomings and methodological challenges they encountered in the attempt.
Background and Motivation
The reproducibility of results in scientific research, and particularly in NLP, is a topic of growing concern. Reproducible research is essential for verifying results and validating claims, and ultimately for the reliability of findings. While there have been significant efforts to standardize automatic evaluations, human evaluations have lagged behind, despite being considered the gold standard for assessing system quality.
Human evaluations are pivotal for benchmarking NLP systems, yet there is no standard for reporting methodology and evaluation practice. This paper is part of the ReproHum project, which aims to improve practices around human evaluation by analysing which factors contribute to reproducibility.
Methodology
The authors set out to examine existing human evaluations in NLP to understand how reproducible they are. This involved an extensive search for, and annotation of, 177 papers containing human evaluation studies, with a focus on the experimental details needed for reproduction.
The authors devised a structured process: a high-level annotation pass over the papers, followed by detailed annotation at the level of individual evaluation experiments. They then sought missing specifics from the original authors, such as evaluation criteria, participant details, and procedures used, but faced considerable obstacles due to missing information and unresponsive authors.
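The paper's actual annotation sheets are not reproduced in this summary; the following is a minimal, illustrative sketch of the kind of per-experiment record such an annotation process might produce. All field names and the completeness check are assumptions for illustration, not the authors' schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class HumanEvalRecord:
    """Illustrative per-experiment annotation record (fields are assumed, not the paper's schema)."""
    paper_id: str                                     # e.g. an ACL Anthology identifier
    evaluation_criteria: list[str] = field(default_factory=list)  # e.g. ["fluency", "adequacy"]
    criteria_defined_in_paper: bool = False           # are the criteria explicitly defined?
    num_evaluators: Optional[int] = None              # often unreported
    num_items_evaluated: Optional[int] = None
    response_scale: Optional[str] = None              # e.g. "5-point Likert"
    authors_contacted: bool = False
    authors_responded: bool = False

    def has_core_details(self) -> bool:
        """Crude check: are the details needed to rerun the study present?"""
        return (
            self.criteria_defined_in_paper
            and self.num_evaluators is not None
            and self.num_items_evaluated is not None
            and self.response_scale is not None
        )
```

A record like this makes it explicit, per experiment, which details are present in the paper and which had to be requested from the authors.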
Findings and Challenges
One of the central findings is a severe lack of detail and standardization across existing human evaluation studies. Only 13% of the reviewed studies had sufficiently low barriers to reproduction; the large majority were not readily reproducible. The most significant challenges were:
- Missing Information: Many papers omitted essential details, such as definitions of the evaluation criteria, the number of participants, and task specifics that are critical for reproduction.
- Unresponsive Authors: A large portion of authors did not respond to requests for additional information, significantly hindering the reproducibility efforts.
- Experimental Flaws: Attempts to reproduce studies uncovered methodological flaws in several of them, further calling the validity of the original evaluations into question.
Taken together, these issues led to the realization that the reproducibility of previously reported human evaluations cannot be reliably assessed, owing to fundamental flaws and missing data.
Implications and Future Work
The paper emphasizes the urgent need for the NLP community to adopt standardized methodologies for designing and reporting human evaluations. This includes maintaining detailed documentation of evaluation processes, as well as developing community-driven repositories or templates to ensure that critical experimental details are consistently recorded.
The authors propose a move towards standardization and common reporting practices, such as datasheet-style templates for human evaluations, to facilitate future reproduction efforts.
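To make the idea of a reporting template concrete, here is a minimal sketch of a completeness check against such a template. The field names are illustrative assumptions, not taken from any published datasheet or from the paper itself.

```python
# Assumed template fields for a human evaluation report (illustrative only).
REQUIRED_FIELDS = [
    "evaluation_criteria",        # names and definitions of the quality criteria
    "response_elicitation",       # e.g. rating scale, ranking, pairwise preference
    "num_evaluators",
    "evaluator_recruitment",      # e.g. crowdworkers, experts, students
    "num_items_per_evaluator",
    "instructions_to_evaluators",
    "compensation",
    "statistical_analysis",
]

def missing_fields(report: dict) -> list[str]:
    """Return the template fields that a human-evaluation report leaves unfilled."""
    return [f for f in REQUIRED_FIELDS if not report.get(f)]

if __name__ == "__main__":
    example_report = {
        "evaluation_criteria": ["fluency", "adequacy"],
        "num_evaluators": 3,
        # everything else missing, as was typical of the papers surveyed
    }
    print(missing_fields(example_report))
```

A check of this kind could be run at submission time, turning "did you report enough to reproduce this?" from an informal question into a mechanical one.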
Conclusion
In conclusion, the paper identifies a substantial obstacle to verifying the reliability of human evaluations in NLP: missing information and methodological discrepancies. Despite its negative assessment of current practice, it opens the door to rethinking and improving the design and documentation of human evaluations so as to enhance reproducibility in the field. The paper is a call to action for researchers to embrace open-science practices and contribute to a more robust framework for human evaluation of NLP systems.