Hypothesis-Only Classifier in NLI

Updated 6 March 2026

Hypothesis-only classifiers are models that predict NLI labels solely from hypothesis text, highlighting key annotation artifacts.
They reveal that surface cues and statistical regularities in hypotheses can yield accuracies up to 96% on certain datasets.
Debiasing strategies such as adversarial training and data filtering are employed, yet residual biases often persist in model evaluations.

A hypothesis-only classifier is a model for Natural Language Inference (NLI) that predicts the entailment relation from the hypothesis sentence alone, without access to the paired premise. It serves as a diagnostic for quantifying annotation artifacts: spurious statistical regularities or surface cues in the hypothesis that enable label prediction without true premise-hypothesis reasoning. Across standard, crowd-sourced, and even LLM-elicited NLI datasets, hypothesis-only classifiers routinely achieve accuracies far exceeding majority-class baselines, exposing deep-rooted biases in dataset construction and threatening the validity of benchmark-driven evaluation.

1. Formal Definition and Diagnostic Role

Let $H$ denote the space of hypothesis sentences and $Y = \{\text{entailment}, \text{neutral}, \text{contradiction}\}$ the NLI label set. A hypothesis-only classifier is a function $f: H \rightarrow Y$ trained and evaluated by mapping each hypothesis $h$ to a predicted label $\hat y = f(h)$, deliberately disregarding the paired premise $p$ (Proebsting et al., 2024, Poliak et al., 2018). If $f(h) = y$ holds for a fraction of test examples well above the majority-class rate, the dataset permits label prediction via annotation artifacts. Such artifacts include lexical items, phraseology, or syntactic patterns highly correlated with specific labels, as operationalized via measures such as $p(\text{label}|w)$ or more elaborate pattern-induced splits (Liu et al., 2020, Hu et al., 2021).

The fundamental diagnostic function of hypothesis-only classifiers is to reveal the extent to which dataset design, annotation protocol, or LLM generation introduces superficial statistical regularities that undermine the premise-hypothesis reasoning objective of NLI tasks.
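The definition above can be illustrated with a minimal sketch that scores an arbitrary hypothesis-only function $f$ against the majority-class baseline. The toy examples, the hand-written `cue_rule`, and all function names are hypothetical and chosen only for illustration; they do not come from the cited papers.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Each example is (premise, hypothesis, gold label).
Example = Tuple[str, str, str]

def majority_baseline(examples: List[Example]) -> float:
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(label for _, _, label in examples)
    return max(counts.values()) / len(examples)

def hypothesis_only_accuracy(f: Callable[[str], str],
                             examples: List[Example]) -> float:
    """Accuracy of f when it sees only the hypothesis; the premise is ignored."""
    correct = sum(f(hypothesis) == label for _, hypothesis, label in examples)
    return correct / len(examples)

def cue_rule(hypothesis: str) -> str:
    """Hypothetical surface rule: a negation cue triggers 'contradiction'."""
    if "nobody" in hypothesis.lower():
        return "contradiction"
    return "entailment"

toy = [
    ("A man is playing guitar.", "Nobody is playing music.", "contradiction"),
    ("A dog runs on the beach.", "An animal is outdoors.", "entailment"),
    ("A woman reads a book.", "The woman is reading her favorite novel.", "neutral"),
    ("Two kids play soccer.", "Nobody is outside.", "contradiction"),
]

acc = hypothesis_only_accuracy(cue_rule, toy)
base = majority_baseline(toy)
print(f"hypothesis-only accuracy: {acc:.2f}, majority baseline: {base:.2f}")
```

A gap between the two numbers on held-out data is the signal the diagnostic looks for; on four hand-picked examples it only demonstrates the bookkeeping.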

2. Model Architectures and Training Protocols

A wide array of architectures has been applied to hypothesis-only classification:

Shallow Baselines: Naive Bayes models over case-sensitive unigram counts serve as interpretable baselines for quantifying surface-level artifact exploitation (Proebsting et al., 2024).
BiLSTM Encoders: Hypothesis tokens are embedded (e.g., fixed 300-dimensional GloVe vectors) and passed through a BiLSTM; max-pooling over time yields a fixed-length vector, which an MLP then classifies (Poliak et al., 2018, Belinkov et al., 2019).
Transformer-based Models: Pretrained BERT-base models are fine-tuned on hypotheses alone, using an unmodified lightweight classification head and optimizing standard cross-entropy over NLI labels with AdamW (learning rate $2 \times 10^{-5}$, batch size 16, one epoch) (Proebsting et al., 2024).

Training involves only hypotheses and their labels. Because the three-way NLI datasets considered here are roughly class-balanced, majority-class baselines are typically $33\%$ to $34\%$, close to random guessing. Hypothesis-only models nonetheless achieve $67\%$ to $69\%$ on SNLI, and up to $96\%$ on LLM-generated datasets, when no explicit artifact mitigation is applied (Proebsting et al., 2024, Poliak et al., 2018, Liu et al., 2020).
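As a rough illustration of the shallow baseline described above, the following sketch trains a multinomial Naive Bayes model over case-sensitive unigram counts using only hypotheses and labels. Whitespace tokenization, add-one smoothing (`alpha`), the class name, and the toy data are assumptions made for this example; the cited work may differ in these details.

```python
import math
from collections import Counter, defaultdict

class UnigramNaiveBayes:
    """Multinomial Naive Bayes over case-sensitive, whitespace-split unigrams."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha  # additive smoothing constant (assumed value)

    def fit(self, hypotheses, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for hypothesis, label in zip(hypotheses, labels):
            self.word_counts[label].update(hypothesis.split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, hypothesis: str) -> str:
        n_examples = sum(self.label_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label in sorted(self.label_counts):
            total = sum(self.word_counts[label].values())
            score = math.log(self.label_counts[label] / n_examples)
            for word in hypothesis.split():
                if word in self.vocab:  # out-of-vocabulary words are skipped
                    count = self.word_counts[label][word]
                    score += math.log((count + self.alpha)
                                      / (total + self.alpha * vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy training set: premises are never used.
train_h = ["Nobody is outside", "Nobody is eating",
           "A person is outdoors", "A person is moving",
           "The man is tall", "The woman is happy"]
train_y = ["contradiction", "contradiction",
           "entailment", "entailment",
           "neutral", "neutral"]

model = UnigramNaiveBayes().fit(train_h, train_y)
print(model.predict("Nobody is swimming"))
print(model.predict("A person is sleeping"))
```

Because the model is a sum of per-word log-probabilities, the words that drive each prediction can be read directly off `word_counts`, which is what makes this kind of baseline interpretable.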

3. Empirical Evidence of Hypothesis-Only Bias

Quantitative experiments consistently demonstrate substantial hypothesis-only signal:

Dataset/Method	Majority Baseline	Hypothesis-Only Accuracy	Naive Bayes Accuracy
SNLI (crowdsourced)	33–34%	67–69%	—
SNLI, LLM-elicited (BERT)	33%	86–96%	>90% (NB, 20 features)
SPR (recast, binary)	65%	86%	—

LLM-Elicited Data: BERT-based models trained on GPT-4, Llama, or Mistral-elicited SNLI reach up to 96% accuracy, while even a Naive Bayes classifier using 20 most-biased features achieves over 90% (Proebsting et al., 2024).
Crowdsourced Datasets: On SNLI, hypothesis-only models surpass 67% accuracy (Poliak et al., 2018); in the SPR recast, proto-role cues inflate accuracies by 8–10 points over the already-skewed majority baseline (Hu et al., 2021).
Pattern Probing: In “HypoNLI,” pattern-based probes split instances into “easy” and “hard” subsets. On hypotheses containing certain patterns, models achieve up to 97% accuracy, but on the remaining (“hard”) instances only 59%, and this gap appears even in full premise-hypothesis models (Liu et al., 2020).

These results establish that hypothesis-only models far outperform majority-class baselines whenever the data contain label-artifact correlations.

4. Nature and Detection of Annotation Artifacts

Annotation artifacts in NLI datasets manifest primarily as surface patterns within the hypothesis strongly predictive of the label:

Lexical Give-Aways: Unigrams or multi-word phrases with $p(\text{label}|w) \geq 0.8$ (e.g., “sleeping,” “Nobody,” or “swimming in a pool”); such cues make 15% of SNLI hypotheses perfectly diagnostic (Proebsting et al., 2024, Poliak et al., 2018).
Proto-Role and Semantic Biases: In recast SPR, certain proto-roles (e.g., “stationary”) have $>96\%$ correlation with a single label (Hu et al., 2021).
Context-Free Properties: Grammaticality, hypothesis length, or lexical semantics often carry label information independent of the paired premise (Poliak et al., 2018).

Statistical tests such as $\chi^2$ over word-label co-occurrence tables, or explicit computation of $p(\text{label}|w)$ for top-k words, systematically uncover these artifacts.
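Both statistics can be computed from a word-by-label co-occurrence table. The sketch below is one possible formulation, assuming whitespace tokenization, one count per word per hypothesis, and a Pearson $\chi^2$ over the $2 \times |Y|$ table of word presence against label. The function names and toy data are hypothetical.

```python
from collections import Counter, defaultdict

def word_label_counts(hypotheses, labels):
    """counts[w][y] = number of hypotheses with label y that contain word w."""
    counts = defaultdict(Counter)
    for hypothesis, label in zip(hypotheses, labels):
        for word in set(hypothesis.split()):  # count each word once per hypothesis
            counts[word][label] += 1
    return counts

def p_label_given_word(counts, word, label):
    """Empirical p(label | w) from the co-occurrence table."""
    total = sum(counts[word].values())
    return counts[word][label] / total if total else 0.0

def chi_square(counts, label_totals, word):
    """Pearson chi-square over the (word present / absent) x label table."""
    n = sum(label_totals.values())
    n_word = sum(counts[word].values())
    stat = 0.0
    for label, n_label in label_totals.items():
        for present in (True, False):
            observed = counts[word][label] if present else n_label - counts[word][label]
            row_total = n_word if present else n - n_word
            expected = row_total * n_label / n
            if expected > 0:  # skip empty rows, e.g. a word found in every hypothesis
                stat += (observed - expected) ** 2 / expected
    return stat

hyps = ["Nobody is outside", "Nobody is eating",
        "A person is outdoors", "A person is moving",
        "The man is tall", "The woman is happy"]
labs = ["contradiction", "contradiction",
        "entailment", "entailment",
        "neutral", "neutral"]

counts = word_label_counts(hyps, labs)
label_totals = Counter(labs)
p_nobody = p_label_given_word(counts, "Nobody", "contradiction")
chi_nobody = chi_square(counts, label_totals, "Nobody")
chi_is = chi_square(counts, label_totals, "is")
print(p_nobody, chi_nobody, chi_is)
```

In this toy table "Nobody" co-occurs only with contradiction and receives a large statistic, while "is" appears in every hypothesis and carries no label information. On real data, a minimum-frequency cutoff would normally be applied before ranking words, since rare words trivially reach $p(\text{label}|w) = 1$.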

5. Debiasing Strategies and Mitigation Algorithms

Mitigating hypothesis-only bias in NLI employs both data-level and model-level interventions:

Adversarial (Gradient Reversal) Training: An adversarial discriminator is trained to predict the label from the hypothesis-only representation, while encoder updates seek to “fool” this discriminator through gradient reversal. The combined objective weights the NLI loss against the adversarial loss. Although adversarial training (AdvCls, AdvDat) reduces hypothesis-only accuracy by up to 9 points, substantial residual bias remains, particularly in high-dimensional encoders (Belinkov et al., 2019, Stacey et al., 2020).
Ensemble Adversarial Approaches: Multiple adversarial branches ($K=5$ to $K=20$) are simultaneously optimized, more effectively erasing artifact signals in high-capacity sentence encoders. In “Avoiding the Hypothesis-Only Bias in Natural Language Inference via Ensemble Adversarial Training,” a 2048-d encoder requires $K=20$ adversaries to reach minimum re-learned bias (Stacey et al., 2020).
Data Filtering and Down-Sampling: Removing all training hypotheses with strong pattern-label associations, as identified by thresholding $p(\text{label}|b) \geq \lambda$, sharply reduces the “easy/hard” gap. Down-sampling eliminates up to $17\%$ of the SNLI data but also narrows the artifact-driven accuracy gap by over 8 points (Liu et al., 2020).
Pattern-Guided Reweighting: In adversarially trained models, over-weighting “biased” instances for the discriminator and under-weighting in the main classifier further enhances debiasing (Liu et al., 2020).
Control in LLM Data Creation: For LLM-elicited NLI data, mitigation recommendations include human-in-the-loop filtering, prompt diversification, adversarial/rebalanced post-processing, and explicit LLM instructions to avoid prompt-mirroring (Proebsting et al., 2024).
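The data-filtering strategy listed above can be sketched as follows. This is a simplified illustration, not the procedure of Liu et al. (2020): it assumes that patterns $b$ are single whitespace-split unigrams, adds a hypothetical `min_count` cutoff so that rare words are not flagged, and drops every hypothesis containing a flagged unigram. All function names and the toy data are invented for the example.

```python
from collections import Counter, defaultdict

def biased_patterns(hypotheses, labels, lam=0.8, min_count=2):
    """Return {pattern: label} for unigrams b with max_y p(y | b) >= lam."""
    counts = defaultdict(Counter)
    for hypothesis, label in zip(hypotheses, labels):
        for word in set(hypothesis.split()):
            counts[word][label] += 1
    patterns = {}
    for word, label_counts in counts.items():
        total = sum(label_counts.values())
        top_label, top_count = label_counts.most_common(1)[0]
        if total >= min_count and top_count / total >= lam:
            patterns[word] = top_label
    return patterns

def filter_dataset(hypotheses, labels, patterns):
    """Keep only (hypothesis, label) pairs that contain no flagged pattern."""
    return [(h, y) for h, y in zip(hypotheses, labels)
            if not any(word in patterns for word in h.split())]

hyps = ["Nobody is outside", "Nobody is eating", "Nobody is here",
        "A man is outside", "A man is eating", "A dog is here"]
labs = ["contradiction", "contradiction", "contradiction",
        "entailment", "neutral", "entailment"]

patterns = biased_patterns(hyps, labs, lam=0.8)
kept = filter_dataset(hyps, labs, patterns)
print(patterns)
print(kept)
```

Raising `lam` flags fewer patterns and removes less data; lowering it removes more, which mirrors the trade-off between dataset size and residual artifact signal described above.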

6. Impact on Evaluation, Benchmarking, and Applications

The prevalence of hypothesis-only artifacts fundamentally challenges the interpretation of high test accuracy on standard NLI datasets. When premise-ignorant models predict labels with high accuracy, benchmarks measure annotation-artifact exploitation rather than semantic inference. Empirical evidence suggests that even after adversarial removal techniques, bias may persist in the embedding geometry or word representations (Hu et al., 2021, Belinkov et al., 2019).

These findings motivate explicit reporting of hypothesis-only baselines alongside majority-class baselines in all future NLI work and dataset releases (Poliak et al., 2018, Hu et al., 2021). Model improvements in NLI should be judged relative to artifact-free splits or rigorously debiased benchmarks. Hypothesis-only evaluation has also been proposed for extension to other pairwise classification tasks, such as multimodal reasoning or question answering.

7. Limitations and Open Problems

Despite progress in adversarial debiasing, several issues remain unresolved:

Residual Hidden Bias: Even with multi-adversary or pattern-aware debiasing, retraining hypothesis-only probes on frozen encoders often recovers considerable predictive signal ($>50\%$ accuracy), indicating that current methods may hide artifacts rather than erase them (Stacey et al., 2020, Belinkov et al., 2019).
Embedding-Level Artifacts: Fixed pretrained embeddings (e.g., GloVe) already encode much of the bias, limiting the effectiveness of downstream removal (Belinkov et al., 2019).
Trade-off with NLI Accuracy: Strong debiasing often coincides with absolute drops in NLI accuracy, especially for adversarial data replacement, requiring delicate hyper-parameter tuning.
Dataset Construction: Both crowdsourced and LLM-elicited data generation protocols are highly susceptible to structural annotation artifacts, necessitating future refinement in instruction protocols, prompt engineering, and statistical post-filtering (Proebsting et al., 2024).
Adversary Complexity: The expressive power of adversarial branches must match the class of probes likely to be applied. For example, linear adversaries are more effective at removing linear-probe bias, but may miss non-linear correlations (Stacey et al., 2020).

A plausible implication is that optimal debiasing for NLI may demand advances in representation learning, prompt design, and continuous monitoring of artifact induction throughout dataset lifecycles.

References:

"Hypothesis-only Biases in LLM-Elicited Natural Language Inference" (Proebsting et al., 2024)
"Hypothesis Only Baselines in Natural Language Inference" (Poliak et al., 2018)
"Avoiding the Hypothesis-Only Bias in Natural Language Inference via Ensemble Adversarial Training" (Stacey et al., 2020)
"On Adversarial Removal of Hypothesis-only Bias in Natural Language Inference" (Belinkov et al., 2019)
"HypoNLI: Exploring the Artificial Patterns of Hypothesis-only Bias in Natural Language Inference" (Liu et al., 2020)
"Exploring Lexical Irregularities in Hypothesis-Only Models of Natural Language Inference" (Hu et al., 2021)