- The paper applies an improved Adversarial Filtering approach, pairing stronger generators and discriminators, to build a challenging commonsense NLI dataset on which state-of-the-art models score below 50%.
- It shows that while humans achieve 95.6% accuracy, models like BERT-Large perform at just 47.3%, underscoring a substantial gap in reasoning capabilities.
- Ablation studies confirm that increased dataset complexity directly impedes model performance, highlighting the need for innovative architectures beyond surface-level learning.
Can a Machine Really Finish Your Sentence?
Introduction
The paper "Can a Machine Really Finish Your Sentence?" authored by Zellers et al. addresses the challenge of commonsense natural language inference (NLI), specifically whether contemporary machine learning models can equal human-level commonsense reasoning. The paper introduces a new dataset, which significantly raises the bar for evaluating such models. The prior SWAG dataset showed substantial progress but was unexpectedly solved by robust pre-trained models, notably BERT. The new dataset and adversarial filtering (AF) paradigm discussed in this paper are pertinent for evolving our understanding of machine commonsense reasoning.
Dataset and Methodology
The investigation begins by introducing a new dataset that extends the scope of commonsense NLI to more complex and diverse contexts, drawn from WikiHow articles and ActivityNet captions. The examples are constructed so that they are trivial for humans (95.6% accuracy) yet remain a significant hurdle for current state-of-the-art models (below 50% accuracy). This is achieved through Adversarial Filtering (AF), which iteratively trains a series of discriminators and uses them to select machine-generated wrong endings that models struggle to classify correctly.
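To make the filtering idea concrete, the toy sketch below mimics one AF-style loop with a logistic-regression discriminator over random feature vectors: each round, negatives the discriminator finds too easy are swapped for pool candidates it mistakes for real endings. The paper's actual procedure trains much stronger discriminators (e.g., BERT) on held-out splits of real (context, ending) pairs, so everything here, from the feature dimensionality to the 0.9 "easy" threshold, is an illustrative assumption rather than the authors' implementation.

```python
# Toy sketch of Adversarial Filtering (AF); not the authors' code.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Dummy feature vectors standing in for encoded (context, ending) pairs.
real_endings = rng.normal(0.5, 1.0, size=(200, 16))   # gold endings (label 1)
fake_pool    = rng.normal(0.0, 1.0, size=(2000, 16))  # generated endings (label 0)
fake_current = fake_pool[:200].copy()                  # negatives currently in the dataset

for af_round in range(5):
    X = np.vstack([real_endings, fake_current])
    y = np.array([1] * len(real_endings) + [0] * len(fake_current))
    disc = LogisticRegression(max_iter=1000).fit(X, y)

    # Probability that each current negative is (correctly) judged fake.
    p_fake = 1.0 - disc.predict_proba(fake_current)[:, 1]
    easy = p_fake > 0.9  # negatives the discriminator finds too easy

    # Replace easy negatives with pool candidates the discriminator
    # mistakes for real endings (i.e., the most adversarial ones).
    pool_scores = disc.predict_proba(fake_pool)[:, 1]
    hard_idx = np.argsort(pool_scores)[::-1][: easy.sum()]
    fake_current[easy] = fake_pool[hard_idx]

    print(f"round {af_round}: discriminator accuracy {disc.score(X, y):.3f}, "
          f"swapped {easy.sum()} negatives")
```

Across rounds, discriminator accuracy on the current dataset drops as easy negatives are replaced, which is the intended effect of the filtering loop.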
AF was applied using state-of-the-art discriminators, such as BERT, in conjunction with strong adversarial generators like OpenAI's GPT. The key insight was to scale up the length and complexity of the examples toward a critical 'Goldilocks' zone in which the generated endings are obviously absurd to humans yet are still frequently misclassified as plausible by machines. This setup ensures that the resulting dataset remains a formidable benchmark for evaluating current and future models' commonsense NLI capabilities.
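The generation side can be approximated with any off-the-shelf causal language model. The snippet below uses GPT-2 via the Hugging Face `transformers` pipeline purely as a stand-in for the paper's GPT generator; the example context and sampling parameters are illustrative, not the authors' settings.

```python
# Sketch: sampling candidate wrong endings with a GPT-style language model.
from transformers import pipeline, set_seed

set_seed(0)
generator = pipeline("text-generation", model="gpt2")

context = "A man is sitting on a roof. He"
candidates = generator(
    context,
    max_new_tokens=25,        # keep endings short and sentence-like
    num_return_sequences=4,   # several candidates per context for the AF pool
    do_sample=True,
    top_p=0.9,
)
for cand in candidates:
    # Strip the context to keep only the generated ending.
    print(cand["generated_text"][len(context):].strip())
```

In the paper's pipeline, candidates like these feed the filtering loop sketched above, which keeps only the endings that discriminators misclassify.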
Key Results
- Performance Metrics: Even with extensive fine-tuning, BERT and other models achieved less than 50% accuracy on the new dataset. BERT-Large performed best among the tested models at just 47.3%, while human accuracy was 95.6%, highlighting a substantial gap between humans and current models (a minimal evaluation sketch follows this list).
- Transfer Learning: Experiments showed limited generalization when models trained on SWAG were evaluated on the new dataset. BERT-Large fine-tuned on SWAG achieved just 34.6% accuracy when transferred directly, demonstrating the new benchmark's difficulty.
- Ablation Studies: The paper traced the dataset's difficulty to the combination of adversarial filtering and complex contexts. When simpler structures (such as single-sentence endings) were used or the filtering was relaxed, models performed noticeably better; with AF applied over longer, context-rich examples, model performance plummeted, underscoring the robustness of the new test.
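The reported accuracies come from a standard multiple-choice evaluation: each context is paired with its candidate endings and the model picks the highest-scoring pair. The sketch below shows that setup with Hugging Face's multiple-choice head; the checkpoint name, the toy example, and the untuned classification head are placeholders, since reproducing the paper's numbers would require fine-tuning on the dataset's training split.

```python
# Sketch of multiple-choice evaluation for sentence-completion NLI.
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
model = AutoModelForMultipleChoice.from_pretrained("bert-large-uncased")
model.eval()

def predict(context: str, endings: list[str]) -> int:
    """Return the index of the ending the model scores highest."""
    enc = tokenizer(
        [context] * len(endings), endings,
        return_tensors="pt", padding=True, truncation=True,
    )
    # Reshape to (batch=1, num_choices, seq_len) for the multiple-choice head.
    enc = {k: v.unsqueeze(0) for k, v in enc.items()}
    with torch.no_grad():
        logits = model(**enc).logits  # shape: (1, num_choices)
    return int(logits.argmax(dim=-1))

# Toy accuracy computation over a hypothetical labelled evaluation set.
examples = [
    {"ctx": "A man is sitting on a roof. He",
     "endings": ["starts pulling up roofing tiles.",
                 "is flying a kite in the rain.",
                 "eats a sandwich under the sea.",
                 "turns into a cat."],
     "label": 0},
]
correct = sum(predict(ex["ctx"], ex["endings"]) == ex["label"] for ex in examples)
print(f"accuracy: {correct / len(examples):.3f}")
```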
Implications and Future Directions
The implications of this work are twofold: practical and theoretical. Practically, this new dataset sets a harder benchmark for NLI systems and reinforces the need for datasets that co-evolve with algorithmic advancements. Theoretically, the demonstrated limitations of current models in handling true commonsense reasoning underline the need for novel model architectures and learning paradigms.
Future Directions:
- Improvement in Pretraining: Despite the recent successes of extensive pretraining, the paper highlights a point of diminishing returns, suggesting that merely increasing model size and pretraining resources will not suffice for human-level performance in NLI.
- Architectural Innovations: There is a compelling need for frameworks that go beyond surface-level learning and leverage more abstract representations of world knowledge.
- Dynamic Benchmarks: As the authors suggest, benchmarks must co-evolve with models, repeatedly incorporating adversarial examples that challenge the current state of the art.
Conclusion
The paper by Zellers et al. has created a formidable challenge for the natural language processing community with its adversarially filtered, context-rich dataset. It underscores the significant gap between current model capabilities and human-level commonsense reasoning. Despite the prowess of models like BERT, the paper convincingly shows that we have a long way to go before machines can "really finish your sentence" with the same understanding as humans. This work strikes a balance between revealing the capabilities of today's models and setting the stage for future advancements in NLP.