SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems (1905.00537v3)

Published 2 May 2019 in cs.CL and cs.AI

Abstract: In the last year, new models and methods for pretraining and transfer learning have driven striking performance improvements across a range of language understanding tasks. The GLUE benchmark, introduced a little over one year ago, offers a single-number metric that summarizes progress on a diverse set of such tasks, but performance on the benchmark has recently surpassed the level of non-expert humans, suggesting limited headroom for further research. In this paper we present SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard. SuperGLUE is available at super.gluebenchmark.com.

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

The paper "SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems" introduces SuperGLUE, an evolution of the GLUE benchmark designed to more rigorously gauge the capabilities of general-purpose NLP models. SuperGLUE addresses the diminishing headroom for improvement in GLUE tasks by introducing new, more challenging tasks that better reflect the current state of the art in language understanding.

Introduction

In recent years, self-supervised learning from large unlabelled text corpora, coupled with effective model adaptation techniques, has driven substantial performance improvements on NLP tasks. Early benchmarks like GLUE encapsulated this progress, providing a single-number performance metric over a suite of tasks. However, the rapid advancement of methods exemplified by ELMo, OpenAI GPT, and BERT quickly pushed scores past the performance of non-expert humans on GLUE's tasks. SuperGLUE was introduced to re-establish a rigorous metric for evaluating general-purpose language understanding systems by incorporating more challenging and diverse tasks.

SuperGLUE Benchmark Design

SuperGLUE maintains the core design principles of GLUE but introduces enhancements to challenge modern NLP models:

  1. Challenging Tasks: SuperGLUE includes eight tasks, two carried over from GLUE and six newly introduced tasks deemed difficult for current models. These tasks were identified through an open call for proposals and were chosen based on their difficulty for existing NLP systems.
  2. Diverse Task Formats: SuperGLUE expands beyond sentence and sentence-pair classification to include other formats like coreference resolution and question answering (QA).
  3. Comprehensive Human Baselines: Human performance metrics are included for all tasks to ensure that substantial headroom exists between a robust BERT-based baseline and human performance.
  4. Improved Code Support: A modular toolkit built on PyTorch and AllenNLP supports working with the benchmark, facilitating pretraining, multi-task training, and transfer learning (a minimal sketch of the shared-encoder pattern follows this list).
  5. Refined Usage Rules: Revised conditions for leaderboard inclusion ensure fair competition and proper credit assignment to data and task creators.
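
To make the shared-encoder, per-task-head pattern that such a toolkit supports concrete, here is a minimal PyTorch sketch. It is an illustrative assumption, not the actual SuperGLUE toolkit: the model name, task set, and toy batch are placeholders.

```python
# Minimal sketch (an assumption, not the SuperGLUE toolkit itself) of a shared
# encoder with one lightweight classification head per task, the pattern a
# multi-task / transfer-learning toolkit built on PyTorch supports.
import random
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MultiTaskModel(nn.Module):
    def __init__(self, encoder_name, num_labels_per_task):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # One linear head per task; the encoder parameters are shared.
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden, n) for task, n in num_labels_per_task.items()}
        )

    def forward(self, task, **batch):
        # Pool with the [CLS] token representation, as in BERT-style classifiers.
        pooled = self.encoder(**batch).last_hidden_state[:, 0]
        return self.heads[task](pooled)

# Task label counts are placeholders covering three SuperGLUE-style tasks.
model = MultiTaskModel("bert-large-cased", {"boolq": 2, "cb": 3, "rte": 2})
tokenizer = AutoTokenizer.from_pretrained("bert-large-cased")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One illustrative training step: sample a task, encode a toy sentence pair,
# and update the shared encoder plus that task's head.
task = random.choice(["boolq", "cb", "rte"])
batch = tokenizer("Is the sky blue?", "The sky appears blue on clear days.",
                  return_tensors="pt")
loss = nn.functional.cross_entropy(model(task, **batch), torch.tensor([1]))
loss.backward()
optimizer.step()
```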

Tasks in SuperGLUE

SuperGLUE includes eight tasks designed to test a broad spectrum of language understanding (a short data-loading sketch follows the list):

  • BoolQ: Yes/no questions about Wikipedia passages.
  • CB (CommitmentBank): Textual entailment task with three classes, based on embedded clauses in dialogues.
  • COPA (Choice of Plausible Alternatives): Causal reasoning posed as a binary choice between two candidate causes or effects of a premise, drawn from blogs and a photography-related encyclopedia.
  • MultiRC: Multiple correct answers for questions about a passage, requiring integrated text comprehension.
  • ReCoRD: Cloze-style QA task demanding commonsense reasoning, based on news articles.
  • RTE: Classic Recognizing Textual Entailment task, merged from multiple RTE datasets.
  • WiC (Word-in-Context): Word sense disambiguation cast as binary classification of whether a target word carries the same sense in two different contexts.
  • WSC (Winograd Schema Challenge): Pronoun resolution demanding real-world knowledge.
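
The official data for these tasks is distributed via super.gluebenchmark.com; as a convenience (an assumption here, not part of the paper's release), the benchmark is also mirrored on the Hugging Face Hub under the `super_glue` name, which makes it easy to inspect the task formats:

```python
# Minimal sketch of inspecting two SuperGLUE tasks via the Hugging Face
# `datasets` library (a community mirror, not the paper's own toolkit).
from datasets import load_dataset

boolq = load_dataset("super_glue", "boolq")   # yes/no questions over passages
wic = load_dataset("super_glue", "wic")       # same word sense in two contexts?

ex = boolq["train"][0]
print(ex["question"], "->", bool(ex["label"]))   # yes/no answer as a 0/1 label
print(ex["passage"][:100], "...")
```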

Experimental Results

BERT-based models were evaluated on these tasks to provide baseline performance metrics:

  • Performance: The BERT-based models, even with additional training on related datasets (BERT++), show significant performance gaps compared to human baselines. For instance, the average gap between BERT++ and human performance across tasks is about 20 points.
  • Individual Task Gaps: The largest disparity is on WSC, with a roughly 35-point gap, while BoolQ, CB, RTE, and WiC show smaller gaps of around 10 points each.
  • Diagnostics: SuperGLUE retains GLUE's hand-crafted diagnostic dataset, which probes models on a broad range of linguistic phenomena. Despite recent advances, models still lag notably behind human performance on this diagnostic set (a scoring sketch follows this list).
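
GLUE-style diagnostics are scored with a correlation coefficient from the Matthews / R3 family rather than accuracy, so chance performance is 0 regardless of label balance. Below is a minimal scoring sketch using scikit-learn's Matthews correlation; the label lists are purely illustrative placeholders, not results from the paper.

```python
# Minimal sketch of scoring diagnostic predictions with a Matthews-style
# correlation coefficient (0 = chance, 1 = perfect). The labels below are
# illustrative placeholders, not results from the paper.
from sklearn.metrics import matthews_corrcoef

gold = ["entailment", "not_entailment", "entailment", "not_entailment"]
pred = ["entailment", "entailment",     "entailment", "not_entailment"]

print(f"diagnostic MCC: {matthews_corrcoef(gold, pred):.3f}")
```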

Implications and Future Directions

SuperGLUE sets a new bar for NLP model evaluation, emphasizing the need for innovations in sample-efficient learning, transfer learning, multitask learning, and unsupervised/self-supervised learning methods. Successful models on SuperGLUE are expected to demonstrate strengths across diverse and challenging tasks, pushing the envelope of what is possible with general-purpose language understanding systems.

On the software side, the toolkit provided with SuperGLUE, built on PyTorch and AllenNLP, gives researchers robust tooling to benchmark and improve their models. The implications of these advancements extend beyond academic leaderboards, potentially enriching practical applications of natural language understanding.

Conclusion

The introduction of SuperGLUE marks a critical step in benchmarking for NLP, ensuring that model evaluations remain relevant and challenging in light of rapid advances. By providing a more rigorous and diverse set of tasks than its predecessor, GLUE, the benchmark fosters continued innovation and guides the next wave of research in language modeling and understanding.

Authors (8)
  1. Alex Wang
  2. Yada Pruksachatkun
  3. Nikita Nangia
  4. Amanpreet Singh
  5. Julian Michael
  6. Felix Hill
  7. Omer Levy
  8. Samuel R. Bowman
Citations (2,123)