What Will it Take to Fix Benchmarking in Natural Language Understanding? (2104.02145v3)

Published 5 Apr 2021 in cs.CL

Abstract: Evaluation for many natural language understanding (NLU) tasks is broken: Unreliable and biased systems score so highly on standard benchmarks that there is little room for researchers who develop better systems to demonstrate their improvements. The recent trend to abandon IID benchmarks in favor of adversarially-constructed, out-of-distribution test sets ensures that current models will perform poorly, but ultimately only obscures the abilities that we want our benchmarks to measure. In this position paper, we lay out four criteria that we argue NLU benchmarks should meet. We argue most current benchmarks fail at these criteria, and that adversarial data collection does not meaningfully address the causes of these failures. Instead, restoring a healthy evaluation ecosystem will require significant progress in the design of benchmark datasets, the reliability with which they are annotated, their size, and the ways they handle social bias.

Citations (144)

Summary

  • The paper critiques current NLU benchmarks for their saturation and inability to differentiate truly superior models.
  • It introduces four key benchmark criteria—validity, reliable annotation, statistical power, and social bias detection—to guide future evaluations.
  • The authors advocate for hybrid data collection methods and auxiliary bias metrics to improve model assessment frameworks.

Improving Benchmarking in Natural Language Understanding: An Examination of Methodological Challenges

The paper "What Will it Take to Fix Benchmarking in Natural Language Understanding?" by Samuel R. Bowman and George E. Dahl presents a critical examination of the current landscape of benchmark evaluations in Natural Language Understanding (NLU). The authors assert that existing benchmarks are deficient in effectively measuring advancements in NLU. They propose four fundamental criteria that benchmarks should fulfill to be more effective: validity, reliable annotation, statistical power, and the ability to reveal harmful social biases.

The paper opens with a discussion of the saturation of current benchmarks. The authors note that near-ceiling performance on widely used benchmarks such as GLUE and SuperGLUE by models like BERT highlights the inability of these benchmarks to distinguish truly superior models from those that have merely mastered benchmark-specific artifacts. Crucially, the authors critique the trend toward adversarial filtering and out-of-distribution test sets, arguing that these methods obscure the very abilities that benchmarks are supposed to measure.

The authors articulate their criteria as follows:

  1. Validity: Benchmarks should reflect the full spectrum of linguistic phenomena relevant to the task. This requires datasets to be free from annotation artifacts, representing a wide variation in linguistic constructions.
  2. Reliable Annotation: Test examples should be accurately annotated to ensure consistency and clarity, removing ambiguity and incorrect labels to enhance the trustworthiness of model evaluations.
  3. Statistical Power: Datasets need to be substantial in size and diversity to detect meaningful performance differences between models, especially as system performance approaches human-level accuracy (see the sample-size sketch after this list).
  4. Bias Identification: Benchmarks should expose harmful social biases within models and discourage the development of biased systems, employing auxiliary metrics to track these biases across relevant dimensions.
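
To make the statistical power criterion concrete, here is a minimal sketch of a standard two-proportion sample-size calculation; it is an illustration, not a procedure from the paper. It treats the two models' accuracies as independent proportions, which is a simplification relative to a paired comparison on a shared test set, but it shows why small accuracy gaps near the ceiling require very large evaluation sets.

```python
from math import sqrt
from statistics import NormalDist

def required_test_size(p1: float, p2: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate number of test examples needed to distinguish two models
    whose true accuracies are p1 and p2, via a two-sided two-proportion
    z-test (normal approximation, independent samples)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # desired power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return round(numerator / (p1 - p2) ** 2)

# A one-point gap near the ceiling already demands a very large test set.
print(required_test_size(0.90, 0.91))  # ~13,500 examples
print(required_test_size(0.90, 0.95))  # ~430 examples
```

Under this rough estimate, distinguishing 90% from 91% accuracy requires on the order of ten thousand test examples, far more than many existing NLU test sets provide.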

The authors highlight the challenges inherent in meeting these criteria, underscoring the limitations of standard data collection methods. They compare different approaches such as naturally occurring data distributions, expert-authored examples, crowdsourcing, and adversarial filtering, noting their respective shortcomings in creating valid and comprehensive datasets. Notably, they argue that crowdsourcing, despite its cost-effectiveness, often leads to datasets dominated by repetitive, easy cases, failing to test the intended phenomena thoroughly.

To address these challenges, the authors propose several research directions. These include hybrid data collection methods that combine crowdworkers with domain experts, along with more rigorous data validation procedures. They also recommend developing auxiliary bias evaluation metrics, attached to benchmarks, to better identify and address social bias.
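
As one illustration of what such an auxiliary metric might look like, the sketch below reports a model's overall accuracy alongside the largest accuracy gap between demographic subgroups; this particular metric is an assumption for illustration, not one prescribed by the paper.

```python
from collections import defaultdict

def subgroup_accuracy_gap(predictions, labels, groups):
    """Overall accuracy plus the largest accuracy gap across subgroups.
    A hypothetical auxiliary bias metric, chosen for illustration only."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, gold, group in zip(predictions, labels, groups):
        total[group] += 1
        correct[group] += int(pred == gold)
    per_group = {g: correct[g] / total[g] for g in total}
    overall = sum(correct.values()) / sum(total.values())
    gap = max(per_group.values()) - min(per_group.values())
    return {"overall": overall, "per_group": per_group, "max_gap": gap}

# Toy usage: a single headline score can mask a sizable gap between subgroups.
preds  = [1, 0, 1, 1, 0, 0, 1, 0]
golds  = [1, 0, 1, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(subgroup_accuracy_gap(preds, golds, groups))
# {'overall': 0.625, 'per_group': {'A': 0.75, 'B': 0.5}, 'max_gap': 0.25}
```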

The implications of this work are significant for the NLU research community. While the paper does not offer immediate solutions to these complex issues, it calls for a deliberate and organized approach to designing benchmarks that truly reflect the competence and limitations of models. Improvements in benchmarks would not only enhance scientific progress but also ensure models are competent across a broader spectrum of language understanding tasks.

Looking forward, integrating the proposed criteria will likely require collaborative efforts across researchers to standardize and adopt new practices and benchmarks. This paper serves as a guiding framework for such advances, aspiring to restore a healthy and effective evaluation ecosystem in NLU that could ultimately support more reliable, unbiased, and comprehensive language understanding technologies.
