What Can We Learn from Collective Human Opinions on Natural Language Inference Data? (2010.03532v2)

Published 7 Oct 2020 in cs.CL, cs.AI, and cs.LG

Abstract: Despite the subjective nature of many NLP tasks, most NLU evaluations have focused on using the majority label with presumably high agreement as the ground truth. Less attention has been paid to the distribution of human opinions. We collect ChaosNLI, a dataset with a total of 464,500 annotations to study Collective HumAn OpinionS in oft-used NLI evaluation sets. This dataset is created by collecting 100 annotations per example for 3,113 examples in SNLI and MNLI and 1,532 examples in Abductive-NLI. Analysis reveals that: (1) high human disagreement exists in a noticeable amount of examples in these datasets; (2) the state-of-the-art models lack the ability to recover the distribution over human labels; (3) models achieve near-perfect accuracy on the subset of data with a high level of human agreement, whereas they can barely beat a random guess on the data with low levels of human agreement, which compose most of the common errors made by state-of-the-art models on the evaluation sets. This questions the validity of improving model performance on old metrics for the low-agreement part of evaluation datasets. Hence, we argue for a detailed examination of human agreement in future data collection efforts, and evaluating model outputs against the distribution over collective human opinions. The ChaosNLI dataset and experimental scripts are available at https://github.com/easonnie/ChaosNLI

Citations (114)

Summary

  • The paper reveals that collective human disagreement in NLI challenges the conventional single-label evaluation by exposing substantial annotation divergences.
  • The study demonstrates that models like BERT, RoBERTa, and XLNet excel on high-agreement examples but perform near-randomly on low-consensus cases.
  • The paper employs Jensen-Shannon and Kullback-Leibler divergence metrics to highlight the gap between model predictions and diverse human opinions, urging a paradigm shift in evaluation protocols.

Analysis of Collective Human Opinions in NLI Tasks

The paper "What Can We Learn from Collective Human Opinions on Natural Language Inference Data?" investigates the role of human disagreement in Natural Language Inference (NLI), challenging traditional evaluation practices in NLP. It starts from the recognition that NLI examples often admit subjective interpretations, at odds with the single majority-vote ground-truth label assumed in conventional assessments.

The authors introduce ChaosNLI, a comprehensive dataset comprising 464,500 human judgments across existing NLI benchmarks, including SNLI, MNLI, and Abductive-NLI (αNLI). Each example in this dataset is annotated with 100 independent labels, offering a robust representation of the distribution of human opinions.

Key Findings

  • Prevalence of Human Disagreement: The analysis indicates substantial human disagreement in a significant portion of examples within the examined datasets. The results show deviation from majority-agreed labels, with 10%, 25%, and 30% misalignments identified within ChaosNLI-α, ChaosNLI-S, and ChaosNLI-M, respectively.

  • Model Performance and Human Agreement: Contemporary state-of-the-art models (e.g., BERT, RoBERTa, XLNet) fail to adequately capture the distribution of human opinions across low-agreement examples. These models perform markedly well on high-agreement subsets, yet their accuracy plummets on data points with extensive human disagreement, often bordering random guess performance.
  • Assessment of Ensemble Measures: Using Jensen-Shannon Distance (JSD) and Kullback-Leibler (KL) divergence metrics, the paper reveals a stark disparity between model predictions and actual human judgment distributions, suggesting that current methodologies inadequately address human cognitive diversity in language comprehension (a minimal sketch of these distribution comparisons follows this list).
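
The distribution comparison behind these findings can be illustrated with a short Python sketch. The snippet below builds the empirical label distribution for one example from its 100 annotations and compares it with a hypothetical model's softmax output using Jensen-Shannon Distance and KL divergence; the counts, probabilities, and the direction of the KL term are illustrative assumptions, not values or conventions taken from the paper's release scripts.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy

# Illustrative values only: 100 crowd annotations for one SNLI-style example,
# ordered as (entailment, neutral, contradiction).
human_counts = np.array([55, 35, 10])
human_dist = human_counts / human_counts.sum()

# Hypothetical softmax output from a classifier for the same example.
model_dist = np.array([0.90, 0.08, 0.02])

# Jensen-Shannon Distance (square root of the JS divergence, base-2 logs).
jsd = jensenshannon(model_dist, human_dist, base=2)

# KL divergence from the human distribution to the model distribution;
# the direction chosen here is one reasonable convention, not necessarily the paper's.
kl = entropy(human_dist, model_dist)

print(f"JSD = {jsd:.4f}, KL = {kl:.4f}")
```

A model that confidently predicts only the majority label will score poorly on both measures whenever annotators are split, which is precisely the gap the paper quantifies on low-agreement examples.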
Implications for NLP Evaluation

This research posits that traditional NLI evaluation metrics, which focus on model alignment with a single ground-truth label, may overlook critical nuances of human inference. The findings advocate for a paradigm shift towards assessing models against the broader spectrum of human opinions, offering a more holistic reflection of how well models align with genuine linguistic understanding.

Future Directions

1. Refinement of Evaluation Protocols: The results encourage evaluation frameworks that integrate entropy-based measures of human agreement, giving a clearer picture of model reliability across varying levels of consensus (see the sketch following this list).
2. Model Calibration: The paper suggests exploring model calibration so that learning systems better account for human annotation distributions, improving their fidelity to collective human reasoning (also illustrated in the sketch below).
3. Crowdsourcing Methodologies: Insights from the annotation process underscore the importance of rigorous quality control in crowdsourcing, informing the design of future dataset collection efforts.
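
To make the first two directions concrete, here is a minimal sketch under assumed variable names and thresholds, none of which come from the paper or its repository: the entropy of an example's human label distribution serves as an agreement measure for bucketing an evaluation set, and a soft-label cross-entropy against that distribution is one simple calibration-oriented training objective.

```python
import numpy as np
from scipy.stats import entropy

def agreement_bucket(label_counts, threshold_bits=0.8):
    """Bucket an example by the entropy of its human label distribution.

    label_counts: per-label annotation counts (e.g. 100 votes over 3 NLI labels).
    threshold_bits: assumed cut-off; the paper does not prescribe this value.
    """
    dist = np.asarray(label_counts, dtype=float)
    dist /= dist.sum()
    h = entropy(dist, base=2)  # 0 bits = full agreement; log2(3) ~ 1.58 bits = uniform split
    return "high-agreement" if h < threshold_bits else "low-agreement"

def soft_label_loss(model_probs, human_dist, eps=1e-12):
    """Cross-entropy of the model's distribution against the human label distribution.

    Equivalent, up to a constant, to KL(human || model); minimising it pushes the
    model toward the collective opinion rather than a single majority label.
    """
    model_probs = np.clip(np.asarray(model_probs, dtype=float), eps, 1.0)
    human_dist = np.asarray(human_dist, dtype=float)
    return float(-np.sum(human_dist * np.log(model_probs)))

# Illustrative example: a fairly split set of 100 annotations.
counts = [48, 42, 10]
print(agreement_bucket(counts))                                  # -> "low-agreement"
print(soft_label_loss([0.85, 0.10, 0.05], np.array(counts) / 100))
```

In practice the threshold and the loss would need tuning for the task; the point is only that both directions reduce to simple operations on the per-example label distributions that ChaosNLI provides.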

This paper underscores a critical need for introspection in NLP research paradigms, advocating a descriptivist approach to modeling the subtleties of human language understanding. As NLP models advance, considering collective human judgment distributions may be vital to achieving nuanced and contextually aware language processing systems.
