- The paper critiques current NLU benchmarks for their saturation and inability to differentiate truly superior models.
- It introduces four key benchmark criteria—validity, reliable annotation, statistical power, and social bias detection—to guide future evaluations.
- The authors advocate for hybrid data collection methods and auxiliary bias metrics to improve model assessment frameworks.
Improving Benchmarking in Natural Language Understanding: An Examination of Methodological Challenges
The paper "What Will it Take to Fix Benchmarking in Natural Language Understanding?" by Samuel R. Bowman and George E. Dahl presents a critical examination of the current landscape of benchmark evaluations in Natural Language Understanding (NLU). The authors assert that existing benchmarks are deficient in effectively measuring advancements in NLU. They propose four fundamental criteria that benchmarks should fulfill to be more effective: validity, reliable annotation, statistical power, and the ability to reveal harmful social biases.
The paper opens with a discussion of the saturation of current benchmarks. The authors note that near-ceiling performance on widely used suites such as GLUE and SuperGLUE by BERT-style pretrained models and their successors shows that these benchmarks can no longer distinguish genuinely superior models from those that have merely mastered benchmark-specific artifacts. Crucially, the authors also critique the trend toward adversarial filtering and out-of-distribution test sets, arguing that these methods can obscure the very abilities that benchmarks are supposed to measure.
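To make that critique concrete, below is a minimal sketch of adversarial filtering as it is commonly described in the literature, not code from the paper: candidate examples are discarded whenever a baseline "adversary" model already answers them correctly, so only examples that fool that baseline survive. The `baseline_predict` callable and the example format here are hypothetical.

```python
# Minimal sketch of adversarial filtering (hypothetical interfaces, not from the paper).
# Each candidate example is a (text, label) pair; baseline_predict is any
# already-trained "adversary" model used to screen out easy examples.
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (input text, gold label)

def adversarial_filter(
    candidates: List[Example],
    baseline_predict: Callable[[str], str],
) -> List[Example]:
    """Keep only the examples the baseline model gets wrong."""
    kept = []
    for text, gold in candidates:
        if baseline_predict(text) != gold:
            kept.append((text, gold))  # hard for the baseline, so it survives
    return kept

# Usage sketch: a trivial "adversary" that always predicts "entailment".
always_entailment = lambda text: "entailment"
pool = [("A dog runs. => An animal moves.", "entailment"),
        ("A dog runs. => The dog sleeps.", "contradiction")]
print(adversarial_filter(pool, always_entailment))
# Only the contradiction example survives the filter.
```

Because what survives is defined relative to the adversary model, the filtered test set no longer represents the natural task distribution, which is the distortion the authors object to.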
The authors articulate their criteria as follows:
- Validity: A benchmark should reflect the full spectrum of linguistic phenomena relevant to its task. This requires datasets that are largely free of annotation artifacts and that cover a wide range of linguistic constructions.
- Reliable Annotation: Test examples should be labeled accurately and consistently, with ambiguous or mislabeled items removed, so that evaluation results can be trusted.
- Statistical Power: Datasets must be large and varied enough to detect meaningful performance differences between models, which becomes harder as system accuracy approaches the human level (a rough power calculation is sketched after this list).
- Bias Identification: Benchmarks should expose harmful social biases within models and discourage the development of biased systems, employing auxiliary metrics to track these biases across relevant dimensions.
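To make the statistical power criterion concrete, here is a back-of-the-envelope estimate, our illustration rather than a formula from the paper, of how large a test set a two-sided two-proportion z-test needs in order to reliably separate two models whose true accuracies differ by a single point. Since the two models would normally be scored on the same examples, this unpaired approximation somewhat overstates the requirement, but it conveys the order of magnitude.

```python
# Rough sample-size estimate for telling two models apart on a test set
# (illustrative only; the paper itself gives no such formula).
from scipy.stats import norm

def examples_needed(acc_a: float, acc_b: float,
                    alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate per-model test-set size for a two-sided two-proportion z-test."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the significance level
    z_beta = norm.ppf(power)            # critical value for the desired power
    p_bar = (acc_a + acc_b) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (acc_a * (1 - acc_a) + acc_b * (1 - acc_b)) ** 0.5) ** 2
    return int(numerator / (acc_a - acc_b) ** 2) + 1

print(examples_needed(0.90, 0.91))  # roughly 13,500 examples for a one-point gap near 90%
print(examples_needed(0.90, 0.95))  # a five-point gap needs only a few hundred
```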
The authors highlight the challenges inherent in meeting these criteria, underscoring the limitations of standard data collection methods. They compare different approaches such as naturally occurring data distributions, expert-authored examples, crowdsourcing, and adversarial filtering, noting their respective shortcomings in creating valid and comprehensive datasets. Notably, they argue that crowdsourcing, despite its cost-effectiveness, often leads to datasets dominated by repetitive, easy cases, failing to test the intended phenomena thoroughly.
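One widely used diagnostic for the kind of annotation artifacts and easy cases described above, offered here as our own illustration rather than a procedure from the paper, is a partial-input baseline: if a classifier that sees only the hypothesis of a natural language inference pair scores well above chance, the labels are partly predictable from artifacts rather than from the intended reasoning. `load_nli_split` below is a hypothetical data loader.

```python
# Partial-input (hypothesis-only) baseline as an artifact probe.
# load_nli_split is a hypothetical helper returning parallel lists of
# premises, hypotheses, and labels for some NLI-style dataset split.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def hypothesis_only_accuracy(train, test):
    """Train on hypotheses alone; high accuracy signals annotation artifacts."""
    _, train_hyps, train_labels = train
    _, test_hyps, test_labels = test
    probe = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                          LogisticRegression(max_iter=1000))
    probe.fit(train_hyps, train_labels)
    return probe.score(test_hyps, test_labels)

# Usage sketch (hypothetical loader):
# train, test = load_nli_split("train"), load_nli_split("dev")
# print(hypothesis_only_accuracy(train, test))  # chance for 3 classes is ~0.33
```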
To address these challenges, the authors propose several research directions. These include hybrid data collection methods that involve both crowdworkers and domain experts and the implementation of robust data validation procedures. They also recommend the development and integration of auxiliary bias evaluation metrics attached to benchmarks to better identify and address social bias.
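As one concrete shape such an auxiliary bias metric could take, and this is an assumption on our part since the paper does not prescribe a specific metric, a benchmark could report the largest accuracy gap between demographic subgroups alongside its headline score. The subgroup tags and record format below are hypothetical.

```python
# Sketch of an auxiliary subgroup-gap metric reported alongside overall accuracy.
from collections import defaultdict
from typing import Dict, List, Tuple

# Each record: (predicted label, gold label, subgroup tag) -- hypothetical format.
Record = Tuple[str, str, str]

def accuracy_by_group(records: List[Record]) -> Dict[str, float]:
    """Per-subgroup accuracy over (prediction, gold, group) records."""
    hits, totals = defaultdict(int), defaultdict(int)
    for pred, gold, group in records:
        totals[group] += 1
        hits[group] += int(pred == gold)
    return {g: hits[g] / totals[g] for g in totals}

def subgroup_gap(records: List[Record]) -> float:
    """Auxiliary metric: largest accuracy gap between any two subgroups."""
    per_group = accuracy_by_group(records)
    return max(per_group.values()) - min(per_group.values())

demo = [("pos", "pos", "group_a"), ("neg", "pos", "group_b"),
        ("pos", "pos", "group_a"), ("pos", "pos", "group_b")]
print(accuracy_by_group(demo), subgroup_gap(demo))
# {'group_a': 1.0, 'group_b': 0.5} 0.5
```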
The implications of this work are significant for the NLU research community. While the paper does not offer immediate solutions to these complex issues, it calls for a deliberate and organized approach to designing benchmarks that truly reflect the competence and limitations of models. Better benchmarks would not only sharpen the measurement of scientific progress but also encourage systems that are competent across a broader spectrum of language understanding tasks.
Looking forward, integrating the proposed criteria will likely require collaborative effort across the research community to standardize and adopt new practices and benchmarks. The paper serves as a guiding framework for such advances, aspiring to restore a healthy and effective evaluation ecosystem in NLU that could ultimately support more reliable, unbiased, and comprehensive language understanding technologies.