HellaSwag Benchmark

Updated 19 October 2025
  • HellaSwag Benchmark is a dataset for commonsense NLI, featuring multi-sentence contexts and adversarially generated distractor endings.
  • It employs an iterative adversarial filtering process that refines candidate endings to resist shortcut learning by language models.
  • Performance gaps between humans and models highlight its value for assessing reasoning, generalization, and robustness in NLP systems.

HellaSwag is a benchmark designed to probe commonsense natural language inference (NLI) and to rigorously evaluate the ability of neural models to perform plausible sentence completion. Initially motivated by limitations discovered in the SWAG dataset, HellaSwag exposes striking gaps between human and machine performance, using adversarially filtered distractors that are trivial for humans to reject but remain challenging for state-of-the-art LLMs. Its structure—multi-sentence contextual prompts with plausible multi-sentence endings—facilitates the diagnosis of models’ reliance on superficial data artifacts versus genuine world knowledge and reasoning capability.

1. Adversarial Filtering and Dataset Construction

HellaSwag’s most distinctive feature is its Adversarial Filtering (AF) collection protocol. This iterative procedure synthesizes a large pool of machine-generated distractor endings for each prompt (via pretrained LLMs). The AF loop operates as follows:

  • Splitting the candidate pool into a training set (𝒟_train) and a dummy test set (𝒟_test).
  • Training discriminator models (e.g., BERT) to separate authentic, human-written targets from automated distractors on 𝒟_train.
  • Identifying “easy” negatives in 𝒟_test—those that the classifier can reliably categorize.
  • Replacing those easy negatives and regenerating adversarial candidates until the final test split reaches a fixed point of model confusion.

This process is formalized by:

L = -\sum_i \log P(\text{real}_i \mid \text{context}_i, \text{ending}_i)

and a candidate ending is replaced whenever:

D(\text{context}, \text{candidate}) > \text{threshold}

AF ensures the dataset resists shortcut exploitation and continually reflects the state of the art. Contexts for HellaSwag are sourced from domains with longer, more information-rich structures—primarily WikiHow articles and ActivityNet video captions—yielding context passages averaging three sentences followed by two-sentence endings.
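A minimal sketch of this loop is given below, assuming a generic candidate generator and a trainable discriminator; generate_candidates, train_discriminator, and the dictionary fields are illustrative names, not the released implementation:

```python
import random

def adversarial_filtering(examples, generate_candidates, train_discriminator,
                          threshold=0.7, max_rounds=10):
    """Iteratively replace distractor endings that a discriminator finds easy.

    examples: list of dicts with a "context" and machine-generated "distractors"
    (field names are illustrative, not the released data format).
    """
    for _ in range(max_rounds):
        random.shuffle(examples)
        split = len(examples) // 2
        d_train, d_test = examples[:split], examples[split:]

        # Train a discriminator (e.g., a BERT classifier) on d_train to
        # separate human-written endings from generated distractors.
        discriminator = train_discriminator(d_train)

        replaced = 0
        for ex in d_test:
            new_distractors = []
            for cand in ex["distractors"]:
                # "Easy" negative: the discriminator confidently flags it as
                # machine-generated, so regenerate a fresh adversarial candidate.
                if discriminator(ex["context"], cand) > threshold:
                    new_distractors.append(generate_candidates(ex["context"], n=1)[0])
                    replaced += 1
                else:
                    new_distractors.append(cand)
            ex["distractors"] = new_distractors

        # Fixed point: the discriminator no longer finds easy negatives.
        if replaced == 0:
            break
    return examples
```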

2. Core Dataset Properties

HellaSwag’s examples are characterized by:

  • Extended context and answer lengths (three-sentence context, two-sentence endings).
  • Diversity of topic selection (e.g., “how to survive a bank robbery,” “making an origami owl”).
  • Category split into in-domain and zero-shot samples, testing a model’s ability to generalize to held-out topics.
  • Human accuracy exceeding 95%, with models such as BERT-Large achieving only ~47%—demonstrating the task’s discriminatory power.
  • Multi-stage manual annotation to maximize the clarity and correctness of positive examples as well as adversarial filtering of negatives.

The design intentionally places negative distractors in a “Goldilocks zone”: machine-generated endings that are semantically relevant yet grammatically odd or contextually implausible to humans.
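For concreteness, each item pairs a context with four candidate endings and a gold label. The sketch below inspects the publicly distributed release via the Hugging Face datasets library; the field names reflect the commonly distributed version and should be treated as assumptions if your copy differs:

```python
from datasets import load_dataset

# Validation split of the publicly distributed HellaSwag release.
hellaswag = load_dataset("hellaswag", split="validation")

example = hellaswag[0]
print(example["ctx"])         # multi-sentence context
print(example["endings"])     # four candidate endings, one human-written
print(example["label"])       # index of the correct ending (stored as a string)
print(example["split_type"])  # "indomain" or "zeroshot"
```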

3. Model Performance, Generalization, and Efficiency

Table: Model Accuracies on HellaSwag versus Human Performance

Model              | HellaSwag Accuracy | Human Accuracy
BERT-Large         | ~47%               | >95%
ESIM+ELMo          | <59%               | >95%
UNICORN (T5-11B)   | 93.9%              | >95%

Machine performance prior to massive scaling and architectural innovation falls far short of human baselines. For example, SWAG-trained models generalize poorly to HellaSwag and vice versa, emphasizing the unique difficulty induced by adversarial filtering. Ablation studies reveal that performance drops only modestly when word order is disrupted or context removed, indicating reliance on superficial cues.

UNICORN, based on T5-11B, demonstrates that sequential multitask transfer learning—pretraining on diverse commonsense benchmarks (RAINBOW), then fine-tuning on HellaSwag—can nearly close the gap (93.9% accuracy). The cost equivalent curve formalism illustrates dramatic data efficiency gains for transfer learning:

g(n) = f_t^{-1}(f_c(n))

where g(n) quantifies how many training samples the transfer method needs to match the single-task baseline performance f_c(n).
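Because f_t and f_c are observed only at a handful of training-set sizes, g(n) is estimated in practice by interpolating the two learning curves. A minimal sketch with made-up sample points:

```python
import numpy as np

# Hypothetical learning curves: accuracy as a function of training-set size.
sizes        = np.array([1_000, 5_000, 20_000, 70_000])
acc_single   = np.array([0.55, 0.66, 0.74, 0.80])   # f_c: single-task baseline
acc_transfer = np.array([0.70, 0.78, 0.84, 0.88])   # f_t: multitask transfer

def cost_equivalent(n):
    """g(n) = f_t^{-1}(f_c(n)): training examples the transfer method needs
    to match a single-task baseline trained on n examples (linear interpolation)."""
    target = np.interp(n, sizes, acc_single)       # f_c(n)
    return np.interp(target, acc_transfer, sizes)  # f_t^{-1}(target)

print(cost_equivalent(20_000))  # far fewer than 20,000 examples needed
```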

4. Validity Concerns and Benchmark Auditing

Recent investigations have exposed several validity issues undermining HellaSwag’s status as a diagnostic of commonsense reasoning (Chizhov et al., 10 Apr 2025):

  • Up to 40% of prompts are ungrammatical; the ActivityNet subset in particular suffers (95.7% ungrammatical).
  • Synthetic distractors possess lower grammaticality and more errors, biasing models toward correct answers via likelihood maximization.
  • Multiple correct or equally plausible answer choices occur in >21% of questions; some questions offer no valid answer.
  • Zero-prompt experiments—where only answers or even generic inputs (“Lorem ipsum…”) are given—yield identical predictions in ~68% of cases, indicating model reliance on answer selection cues over deep reasoning:

\mathcal{L} = \frac{1}{|\mathcal{V}|}\sum_{t\in\mathcal{V}} \log P(y_t \mid x_{<t})

These limitations motivated the creation of GoldenSwag, a filtered subset (1,525 out of 10,042 items), which removes ambiguous, ungrammatical, or otherwise compromised examples.
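For reference, HellaSwag accuracy is conventionally computed by scoring each candidate ending with a length-normalized log-likelihood like the one quoted above and picking the argmax. A minimal sketch with a Hugging Face causal LM follows; the model choice and tokenization details are illustrative, and boundary-token effects are ignored:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM works the same way; "gpt2" is only an illustrative choice.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def score_ending(context, ending):
    """Average log-probability of the ending tokens given the context."""
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    ids = tok(context + " " + ending, return_tensors="pt").input_ids
    log_probs = torch.log_softmax(model(ids).logits[0, :-1], dim=-1)
    # Logit position t-1 predicts token t; ending tokens start at ctx_len.
    token_lps = [log_probs[t - 1, ids[0, t]] for t in range(ctx_len, ids.shape[1])]
    return sum(token_lps) / len(token_lps)

def predict(context, endings):
    """Pick the ending with the highest average log-likelihood."""
    return max(range(len(endings)), key=lambda i: score_ending(context, endings[i]))
```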

5. Robustness and Linguistic Variability

Systematic studies of HellaSwag robustness reveal pronounced sensitivity to input paraphrasing (Lunardi et al., 4 Sep 2025, Li et al., 17 Feb 2025):

  • Generating five paraphrased variants per question results in consistent accuracy drops (Δ = Acc_0 − Acc_p).
  • Across paraphrases, only 70–85% of questions are answered identically; 15–30% receive inconsistent answers.
  • Model ranking is stable (Kendall’s τ > 0.9), but raw accuracy drops expose nontrivial vulnerability to surface-form variability.
  • Robustness evaluation metrics (Original Accuracy, OA; Average Robust Accuracy, ARA; Robust Loss Accuracy, RLA; Consistent Robust Accuracy, CRA), defined as

\text{OA} = \frac{\sum_{(x,y)\in\mathcal{D}} \mathbf{1}[\mathrm{LM}(x)=y]}{|\mathcal{D}|}

\text{ARA} = \frac{\sum_{(x',y')\in\mathcal{D}_r} \mathbf{1}[\mathrm{LM}(x')=y']}{|\mathcal{D}_r|}

\text{RLA} = \text{OA} - \text{ARA}

\text{CRA} = \frac{\sum_{(x',y')\in\mathcal{D}_r} \mathbf{1}[\mathrm{LM}(x)=y]\cdot\mathbf{1}[\mathrm{LM}(x')=y']}{|\mathcal{D}_r|}

quantify degradation when faced with linguistic perturbations. Some variant types—especially negative transformation and scenario refinement—are particularly adversarial. HellaSwag-Pro (Li et al., 17 Feb 2025) leverages bilinguality and question variation to probe true model robustness; performance gaps persist regardless of language and prompting strategies.
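A minimal sketch of these four metrics, assuming per-question predictions on the original set and on a paraphrased (robust) set that shares question ids; all field names are illustrative:

```python
def robustness_metrics(orig, para):
    """orig / para: lists of dicts {"qid", "pred", "gold"}; each paraphrased
    item in `para` shares a "qid" with its original question in `orig`."""
    orig_correct = {ex["qid"]: ex["pred"] == ex["gold"] for ex in orig}

    # OA: accuracy on the original benchmark.
    oa = sum(ex["pred"] == ex["gold"] for ex in orig) / len(orig)
    # ARA: accuracy on the paraphrased (robust) set.
    ara = sum(ex["pred"] == ex["gold"] for ex in para) / len(para)
    # RLA: accuracy lost when moving from originals to paraphrases.
    rla = oa - ara
    # CRA: fraction of paraphrased items where both the original question
    # and its paraphrase are answered correctly.
    cra = sum(ex["pred"] == ex["gold"] and orig_correct[ex["qid"]]
              for ex in para) / len(para)

    return {"OA": oa, "ARA": ara, "RLA": rla, "CRA": cra}
```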

6. Compression, Distillation, and Benchmark Profiling

Recent research has focused on compressing benchmarks like HellaSwag for efficient evaluation (Kipnis et al., 4 Jul 2024, Wang et al., 12 Oct 2025, Kim et al., 23 Sep 2025):

  • metabench (Kipnis et al., 4 Jul 2024) applies item-response theory (e.g., the 3PL response model f(\theta_j \mid \alpha_i) = \sigma(a_i\theta_j - \delta_i) with Fisher information I_i(\theta) = a_i^2 f(\theta \mid \alpha_i)(1 - f(\theta \mid \alpha_i))) and cross-validated subsampling to distill HellaSwag to <3% of its items while reconstructing scores with <1.24% RMSE.
  • EssenceBench (Wang et al., 12 Oct 2025) uses iterative genetic algorithms and sample attribution scores to preserve model rankings (within 5%) while reducing data requirements by up to 200×.
  • Benchmark profiling (Kim et al., 23 Sep 2025) employs gradient-based importance scoring and targeted parameter ablation to decompose HellaSwag performance into ten cognitively grounded abilities. The Ability Impact Score (AIS):

I_j^{(a)}(\theta) \approx \left|\frac{\partial L(D_a, \theta)}{\partial \theta_j} \cdot \theta_j\right|

\text{AIS}_b^{(a)} = \frac{P_b(\Theta) - P_b(\Theta^{(a)})}{P_b(\Theta) - P_b^{\text{chance}}}

dispels the notion that HellaSwag measures “pure” commonsense; instead, a distributed ability footprint is implicated (e.g., causal reasoning, contextual recall, analogical reasoning).
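As a concrete illustration of the metabench-style selection step, the sketch below evaluates the logistic response function and Fisher information quoted above on synthetic item parameters and keeps the most informative ~3% of items; all parameter values are illustrative:

```python
import numpy as np

def p_correct(theta, a, delta):
    """Logistic item response: P(correct | ability theta) = sigmoid(a*theta - delta)."""
    return 1.0 / (1.0 + np.exp(-(a * theta - delta)))

def fisher_information(theta, a, delta):
    """I_i(theta) = a_i^2 * f * (1 - f), as quoted above."""
    f = p_correct(theta, a, delta)
    return a ** 2 * f * (1.0 - f)

# Synthetic item parameters: discrimination a_i and difficulty delta_i.
rng = np.random.default_rng(0)
a = rng.uniform(0.5, 2.5, size=10_000)
delta = rng.normal(0.0, 1.5, size=10_000)

# Keep the ~3% of items that are most informative around a target ability level.
theta_target = 0.0
info = fisher_information(theta_target, a, delta)
keep = np.argsort(info)[::-1][: int(0.03 * len(info))]
```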

7. Future Directions and Research Implications

The evolution of HellaSwag affirms several directions for research and evaluation:

  • Continued adversarial filtering is necessary to maintain the diagnostic utility of benchmarks as models improve.
  • Benchmark co-evolution (generating new datasets that dynamically challenge state-of-the-art architectures) serves to monitor true language understanding progress.
  • The community must address construct validity issues, embracing rigorous filtering and multifactorial audit protocols (e.g., GoldenSwag) to eliminate scoring artifacts.
  • Robustness-aware and multilingual evaluation is essential for reliable deployment in practical scenarios, avoiding overfitting to static formulations.
  • Compression and distillation techniques enable scalable model assessment with minimal computation, as demonstrated by metabench and EssenceBench.

HellaSwag remains a foundational diagnostic for LLM commonsense inference; however, recent studies complicate its interpretative clarity and encourage ongoing innovation in benchmark design, auditing, and data-driven robustness evaluation.
