Human Label Variation (HLV)

Updated 4 July 2025
  • Human Label Variation (HLV) is the inherent diversity in annotations arising from ambiguous instances, subjectivity, and uncertainty.
  • It challenges the traditional single-ground-truth assumption and calls for soft labels and new evaluation metrics.
  • Adopting HLV-aware methods, such as modeling full label distributions and crowd layers, enhances model robustness and fairness.

Human Label Variation (HLV) is defined as the inherent, plausible variation in labeling tasks where different human annotators provide different, yet legitimate, labels for the same data instance. Rather than mere “noise” or annotation error, HLV embodies disagreement due to ambiguous instances, subjectivity, uncertainty, multiple plausible answers, or irreconcilable perspectives among annotators. HLV critically challenges the standard “single ground truth” assumption in machine learning pipelines, impacting data quality, model training, and evaluation across a variety of domains, particularly in natural language processing.

1. Defining Human Label Variation

HLV is characterized by persistent, non-random differences in annotation, distinct from annotation errors (which are caused by attention lapses, carelessness, or misunderstanding of guidelines). Key sources of HLV include:

  • Ambiguity of Instances: Intrinsic ambiguity in language, vision, or task, leading to multiple defensible interpretations.
  • Annotator Uncertainty: Lack of knowledge or confidence shaping label choice.
  • Subjectivity and Perspective: Individual, cultural, or demographic differences in interpretation.
  • Multiple Correct Answers: Situations with legitimate, coexisting labels for a single item.
  • Genuine Disagreement: Irreconcilable yet valid conceptualizations.
  • Socio-demographic/Cultural Factors: Annotator background or worldview influencing judgment.

Crucially, HLV is recognized as “plausible variation in annotation... [arising when] humans usually provide their best judgements” and is distinguished from errors, which reflect annotation quality failures.

2. Consequences for Data Quality, Modeling, and Evaluation

Data Quality

HLV directly questions the notion of a unique “gold label.” Aggregating labels (e.g., by majority vote) collapses this diversity and may obscure minority or alternative viewpoints. For ambiguous cases (e.g., assessing toxicity or indirectness), a single ground truth may not exist, and the full range of human judgment is itself informative.
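
To make the contrast concrete, here is a minimal sketch (plain Python, with hypothetical toxicity annotations) that builds the empirical label distribution for one instance and shows what a majority vote discards:

```python
from collections import Counter

# Hypothetical raw annotations for one ambiguous instance (5 annotators).
annotations = ["toxic", "not_toxic", "toxic", "unsure", "not_toxic"]

# Empirical label distribution: preserves the full spread of judgments.
counts = Counter(annotations)
distribution = {label: n / len(annotations) for label, n in counts.items()}
print(distribution)  # {'toxic': 0.4, 'not_toxic': 0.4, 'unsure': 0.2}

# Majority vote: collapses the same annotations to one "gold" label,
# hiding the fact that annotators were split almost evenly.
majority_label = counts.most_common(1)[0][0]
print(majority_label)  # 'toxic' (ties broken arbitrarily by insertion order)
```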

Modeling

Traditional models trained on aggregated labels may overfit to dominant perspectives and fail to generalize to sensitive, ambiguous, or minority-relevant cases. Properly leveraging HLV—in the form of label distributions or unaggregated labels—leads to models that better capture human uncertainty and complexity.

Evaluation

Standard accuracy and F1 metrics, computed against a single gold label, are inadequate when plausible disagreement exists. A model may appear highly accurate but still conflict with substantial, valid minority perspectives. As a result, new evaluation metrics that incorporate the full spectrum of annotation (e.g., entropy, KL-divergence to human label distributions, fuzzy-set-based soft accuracy) are needed.
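
As an illustration, the sketch below contrasts hard accuracy against a single gold label with a simple distribution-aware score. Here "soft accuracy" is taken to be the probability mass annotators assigned to the predicted label; exact definitions vary across papers, so treat this as one possible instantiation:

```python
import numpy as np

# Hypothetical human label distributions over 3 classes for 2 instances,
# plus the model's predicted class for each instance.
human_dist = np.array([[0.6, 0.3, 0.1],    # clear-ish majority for class 0
                       [0.5, 0.5, 0.0]])   # genuine 50/50 split
predictions = np.array([0, 1])

# Hard accuracy against the majority ("gold") label: the second prediction
# gets no credit even though half the annotators chose it.
gold = human_dist.argmax(axis=1)
hard_acc = (predictions == gold).mean()                                  # 0.5

# Soft accuracy: credit equal to the share of annotators who chose the
# predicted label, so defensible minority answers are not scored as zero.
soft_acc = human_dist[np.arange(len(predictions)), predictions].mean()  # 0.55
print(hard_acc, soft_acc)
```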

Example Domains

  • Natural Language Inference (NLI): Over 20% of instances show inherent human disagreement.
  • Sentiment or Hate Speech: Labels reflect annotator cultural/political identity.
  • Computer Vision: Annotator uncertainty/disagreement is measurable in object classification.

3. Approaches for Addressing HLV

Aggregation and Filtering (Prevailing Approach)

  • Majority voting or probabilistic aggregation (e.g., Dawid & Skene) reduces multiple labels to one, often discarding ambiguous instances (a sketch of the latter follows this list).
  • Filtering ambiguous or low-consensus data tries to “purify” datasets but discards genuinely valuable, ambiguous data—sometimes harming downstream performance.
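
For readers unfamiliar with Dawid & Skene, the following compact EM sketch illustrates the general technique: estimate per-annotator confusion matrices and a posterior over the "true" label for each item. It assumes a complete vote matrix and uses a small smoothing constant; the function name and data are illustrative, not the paper's implementation:

```python
import numpy as np

def dawid_skene(votes, n_classes, n_iter=50):
    """EM estimate of 'true'-label posteriors from a full (n_items, n_annotators)
    matrix of integer votes (no missing labels, for simplicity)."""
    n_items, n_annot = votes.shape
    # Initialize posteriors with per-item vote frequencies (soft majority vote).
    post = np.zeros((n_items, n_classes))
    for c in range(n_classes):
        post[:, c] = (votes == c).sum(axis=1)
    post /= post.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class prior and one confusion matrix per annotator,
        # conf[a, t, o] = P(annotator a reports o | true class t).
        prior = post.mean(axis=0)
        conf = np.full((n_annot, n_classes, n_classes), 1e-6)
        for a in range(n_annot):
            for c in range(n_classes):
                conf[a, :, c] += post[votes[:, a] == c].sum(axis=0)
            conf[a] /= conf[a].sum(axis=1, keepdims=True)
        # E-step: recompute the posterior over true labels for every item.
        log_post = np.tile(np.log(prior + 1e-12), (n_items, 1))
        for a in range(n_annot):
            log_post += np.log(conf[a][:, votes[:, a]].T)
        post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
    return post

# 4 items, 3 annotators, 2 classes; the third item is genuinely contested.
votes = np.array([[0, 0, 0],
                  [1, 1, 0],
                  [0, 1, 1],
                  [1, 1, 1]])
print(dawid_skene(votes, n_classes=2).round(2))
```

Taking the argmax of each row yields the single aggregated label this section refers to, which is precisely where the information about disagreement is lost.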

Embracing HLV: Learning from Disagreement

  • Soft labels: Utilize empirical distributions over all annotator labels for training (see the training sketch after this list).
  • Repeated labeling frameworks: Train on all un-aggregated annotations to preserve diversity of perspectives.
  • Crowd layers, multi-task architectures: Jointly predict class labels and annotator-specific behaviors.
  • Regularization using disagreement: Incorporating human entropy or other uncertainty signals as part of the model objective.
  • Deployment of new evaluation metrics: Soft accuracy, fuzzy set-inspired F1, or divergence from human judgment distributions.
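
A minimal soft-label training sketch, assuming a PyTorch classifier trained against empirical annotator label distributions with a KL-divergence objective; the architecture, dimensions, and data below are placeholders rather than the paper's setup:

```python
import torch
import torch.nn as nn

# Placeholder classifier over 768-dim features and 3 classes; any model
# producing class logits could be substituted here.
model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 3))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
kl_loss = nn.KLDivLoss(reduction="batchmean")

def training_step(features, soft_labels):
    """features: (batch, 768) inputs; soft_labels: (batch, 3) empirical
    annotator label distributions (each row sums to 1)."""
    log_probs = torch.log_softmax(model(features), dim=-1)
    # KL(human distribution || model distribution): the model is pushed to
    # match the full spread of annotator judgments, not just the majority.
    loss = kl_loss(log_probs, soft_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example batch: 4 instances with soft targets (rows sum to 1).
x = torch.randn(4, 768)
y = torch.tensor([[0.6, 0.2, 0.2],
                  [0.4, 0.4, 0.2],
                  [0.2, 0.6, 0.2],
                  [0.8, 0.1, 0.1]])
print(training_step(x, y))
```

Replacing the one-hot cross-entropy target with the full label distribution is the smallest possible change to a standard pipeline, which is why soft-label training is often the first HLV-aware method adopted.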

Gaps Identified

  • Fragmented field: Work is siloed by domain (NLP, CV, HCI), with little unified theory or cross-domain benchmarking.
  • Evaluation regresses to single-label metrics: Even in research advocating disagreement, evaluations tend to default to gold standard labels.
  • Lack of resources and large-scale benchmarks: Few datasets with public, un-aggregated annotations; insufficient cross-task analyses.

4. Forward-Looking Strategies and Research Directions

  • Dataset Publication: A major recommendation is to release datasets that preserve all annotator labels and document annotator backgrounds, enabling HLV-aware modeling and analysis.
  • Hybrid Algorithms: Models that integrate human label distribution modeling, multi-task learning (e.g., auxiliary tasks for bias or uncertainty), and disagreement-based regularization.
  • Active Learning with Disagreement: Selectively sample ambiguous instances for further annotation, using acquisition functions (e.g., group entropy) designed for HLV, to capture perspectives that are maximally informative about real-world ambiguity (see the sketch after this list).
  • Evaluation Innovations: Develop metrics like instance-level calibration (comparing model predictions to true human label distributions), not just hard label accuracy.
  • Perspectivist Approaches: Support for both prescriptive and descriptive annotation and evaluation paradigms to understand subjectivity.
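
A sketch of one acquisition step for the active-learning idea above. The paper's "group entropy" may be defined differently; this stand-in simply scores pool instances by the entropy of the model's predicted label distribution and sends the most uncertain ones for further annotation:

```python
import numpy as np

def predictive_entropy(probs: np.ndarray) -> np.ndarray:
    """Entropy of each row of a (pool_size, num_classes) probability matrix."""
    eps = 1e-12
    return -(probs * np.log(probs + eps)).sum(axis=1)

def select_for_annotation(pool_probs: np.ndarray, k: int) -> np.ndarray:
    """Pick the k pool instances whose predicted label distribution is most
    uncertain; these are the ambiguous cases worth sending to more annotators."""
    scores = predictive_entropy(pool_probs)
    return np.argsort(-scores)[:k]

# Model predictions over an unlabeled pool of 4 instances, 3 classes.
pool_probs = np.array([[0.95, 0.03, 0.02],   # confident -> low priority
                       [0.40, 0.35, 0.25],   # ambiguous -> high priority
                       [0.70, 0.20, 0.10],
                       [0.34, 0.33, 0.33]])  # near-uniform -> highest priority
print(select_for_annotation(pool_probs, k=2))  # [3 1]
```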

5. Data Resources and Benchmarks

The paper provides the most extensive repository of datasets with unaggregated (per-annotator) labels, spanning multiple tasks and domains.

Such resources facilitate benchmarking algorithms, analyzing subjectivity, and improving robustness on ambiguous/hard-instance subsets.

6. Broader Implications and Recommendations

Embracing HLV has major ramifications:

  • Rethinking Ground Truth: Displaces the traditional objective “gold standard” in favor of distributional truth—especially in linguistically/socially complex domains.
  • Fairness and Inclusivity: Minority perspectives are maintained, not suppressed; datasets/model outputs reflect true diversity.
  • Human-Centered Systems: Models trained with HLV in mind are more aligned with real-world uncertainty and user perspectives.
  • Improved Trustworthiness: Evaluation metrics grounded in human label distributions allow model calibration and reliability assessment for ambiguous cases.

Future progress depends on inter-disciplinary collaboration (NLP, CV, HCI, social sciences), continued resource development, shared tasks (e.g., Learning With Disagreement at SemEval), and best-practices development in data creation and ML evaluation.


Table: HLV in the ML Pipeline

| Pipeline Step | HLV Challenge | HLV Opportunity |
|---|---|---|
| Data | No clear gold standard for subjective/ambiguous items | Can capture full human perspective and uncertainty |
| Modeling | Models ignore uncertainty if learning from one label | Models can capture ambiguity and improve robustness |
| Evaluation | Accuracy misrepresents reasonableness | Use soft metrics, instance-level calibration, and nuanced analysis |

Technical Formulations

  • Label Entropy:

H(y) = -\sum_{i} p(y_i) \log p(y_i)

where p(y_i) is the empirical label frequency.

  • KL Divergence:

D_{KL}(P \| Q) = \sum_{i} P(y_i) \log \frac{P(y_i)}{Q(y_i)}

  • Calibration: Compare model output distributions to the full human label distribution for each instance (not just majority-label accuracy).
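
A small numerical sketch of these formulations in NumPy; the distributions below are made up for illustration:

```python
import numpy as np

def label_entropy(p: np.ndarray) -> float:
    """H(y) = -sum_i p(y_i) log p(y_i), with 0 log 0 treated as 0."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """D_KL(P || Q) = sum_i P(y_i) log(P(y_i) / Q(y_i)); assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / q[mask])).sum())

# Empirical human label distribution for one instance (e.g., 10 annotators split 5/3/2).
human = np.array([0.5, 0.3, 0.2])
# Two model output distributions for the same instance.
overconfident = np.array([0.98, 0.01, 0.01])
calibrated    = np.array([0.55, 0.25, 0.20])

print(label_entropy(human))                 # ~1.03 nats: a genuinely split instance
# Instance-level calibration: the distributionally faithful model diverges far
# less from the human label distribution, even though both models would score
# identically under majority-label accuracy.
print(kl_divergence(human, overconfident))  # ~1.28
print(kl_divergence(human, calibrated))     # ~0.01
```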

By treating human label variation as signal—rather than error to suppress—data science and AI communities can develop deeper, fairer, and more robust ML systems. Datasets, methods, and evaluation frameworks must be designed from the ground up to preserve and take advantage of disagreement as an intrinsic property of complex annotation tasks.