Empirical Annotation Distributions

Updated 21 April 2026

Empirical annotation distributions are normalized frequency profiles of annotations that capture consensus, disagreement, and uncertainty in subjective labeling tasks.
They are constructed using methods like exhaustive labeling, sparse annotation, and imputation, and evaluated with metrics such as entropy, Gini impurity, and Jensen–Shannon divergence.
These distributions enhance models in affective computing, dialog systems, and semantic analysis while addressing challenges like annotator bias and calibration.

Empirical annotation distributions are probability distributions derived directly from the full set of human (or artificial) annotations assigned to data instances in subjective or ambiguous labeling tasks. Distinct from the conventional reduction to majority vote or mean values, these distributions capture the pattern of inter-annotator agreement, disagreement, ambiguity, and uncertainty intrinsic to the annotation process. They serve as foundational signals in modeling subjectivity, ambiguity, and human variability in domains such as affective computing, conversational systems, subjective classification, linguistic sense annotation, and biological curation.

1. Formal Definitions and Core Constructs

Empirical annotation distributions, also called "soft labels", consist of the normalized counts or frequencies of responses for each possible category (or score) on a given item, optionally including an abstention or "can't solve" (cs) category. For categorical labeling with $C$ classes, $R$ annotators, and explicit abstentions:

$q = (q_1, ..., q_C, q_{cs}), \qquad q_k = \frac{\#\text{annotators choosing }k}{R}$

where $q_{cs}$ denotes the fraction abstaining (Klugmann et al., 5 Oct 2025).
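
As a minimal sketch of the definition above, the snippet below turns one item's raw votes into the vector $q$, counting abstainers in the denominator $R$. The function name `soft_label`, the class names, and the example votes are illustrative assumptions and do not come from the cited work.

```python
from collections import Counter

def soft_label(votes, classes, abstain="cs"):
    """Return the empirical distribution q over the classes plus the abstention category."""
    if not votes:
        raise ValueError("need at least one annotation")
    categories = list(classes) + [abstain]
    counts = Counter(votes)
    unknown = set(counts) - set(categories)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(map(str, unknown))}")
    total = len(votes)  # R: every annotator counts, including those who abstained
    return {k: counts[k] / total for k in categories}

# Hypothetical votes from R = 5 annotators on a single item.
votes = ["joy", "joy", "anger", "cs", "joy"]
q = soft_label(votes, classes=["joy", "anger", "sadness"])
print(q)
```

Classes that nobody chose keep an explicit probability of zero, so every item yields a vector of the same length $C + 1$.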

For continuous annotation (e.g., affect dimensions), empirical distributions are often constructed by binning annotator scores (histogram) or by nonparametric kernel density estimation (KDE) in the label space (Tavernor et al., 2024).
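
The binning route can be sketched as follows. The rating scale, the number of bins, and the helper name `histogram_distribution` are assumptions chosen for illustration; the cited work may bin differently.

```python
def histogram_distribution(scores, low, high, n_bins):
    """Bin continuous annotator scores into equal-width bins and normalize the counts."""
    if not scores:
        raise ValueError("need at least one score")
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for s in scores:
        if not low <= s <= high:
            raise ValueError(f"score {s} lies outside [{low}, {high}]")
        # A score exactly equal to `high` is placed in the last bin.
        idx = min(int((s - low) / width), n_bins - 1)
        counts[idx] += 1
    total = len(scores)
    return [c / total for c in counts]

# Hypothetical valence ratings from five annotators on a 1-5 scale.
valence = [1.5, 2.0, 2.5, 4.0, 4.5]
p = histogram_distribution(valence, low=1.0, high=5.0, n_bins=4)
print(p)
```

With few annotators per item the result is sensitive to where the bin edges fall, which is one motivation for the kernel-based alternative.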

In more specialized contexts, empirical annotation distributions can refer to distributions over complex spaces:

Over slot-value sets in dialog state tracking (Chen et al., 2020)
Over semantic similarity ratings in sense annotation (Schlechtweg et al., 2023)
Over word type frequencies for text annotation meta-analysis (Bell et al., 2012)

These distributions retain all observable aspects of annotator variability, including consensus (sharp peaks), disagreement (spread), and abstention (mass on cs).

2. Methods of Collection and Construction

Empirical annotation distributions are constructed under various annotation regimes:

Exhaustive annotation: Each example is labeled by $R$ annotators. The empirical distribution is computed directly from counts (Klugmann et al., 5 Oct 2025).
Sparse/crowdsourced annotation: Each annotator labels only a subset. Soft distributions are defined over the observed subset, but this can introduce bias when annotators differ systematically or sample sizes are small (Lowmanstone et al., 2023).
Augmentation/Imputation: Missing annotations are predicted via imputation models (matrix factorization, neural collaborative filtering, or multitask annotator models), with the resulting distributions reflecting both observed and imputed responses. This enables estimation of full distributions even under partial observation at the cost of model-induced distributional bias (Lowmanstone et al., 2023).
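
The sparse regime can be illustrated with a short sketch, assuming annotations are stored as a nested mapping from item to annotator to label (a hypothetical layout). Each item is normalized by its own number of observed annotators, and that count is returned alongside the distribution so that small-sample items remain visible. No imputation is performed here.

```python
def sparse_soft_labels(annotations, classes):
    """Map each item to (distribution over classes, number of observed annotators)."""
    result = {}
    for item, by_annotator in annotations.items():
        n_observed = len(by_annotator)
        if n_observed == 0:
            continue  # no observed labels: leave the item out instead of guessing
        counts = {c: 0 for c in classes}
        for label in by_annotator.values():
            if label not in counts:
                raise ValueError(f"unexpected label {label!r} for item {item!r}")
            counts[label] += 1
        dist = {c: counts[c] / n_observed for c in classes}
        result[item] = (dist, n_observed)
    return result

# Hypothetical sparse annotations: not every annotator labels every item.
annotations = {
    "t1": {"a1": "offensive", "a2": "offensive", "a3": "ok"},
    "t2": {"a2": "ok"},
    "t3": {},
}
soft = sparse_soft_labels(annotations, classes=["offensive", "ok"])
print(soft)
```

Item `t2` receives a one-hot distribution from a single vote, which shows why distributions built from very few observations should be read with caution.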

Continuous annotation distributions may be constructed by:

Binning continuous annotator outputs and normalizing counts
Applying kernel density estimators (classically non-parametric, or via differentiable “soft-histogram” layers in an end-to-end model) (Tavernor et al., 2024)
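
A classical (non-differentiable) Gaussian kernel density estimate evaluated on a fixed grid can be sketched as below. The bandwidth, grid, and function name are illustrative assumptions, and this is not the soft-histogram layer of the cited work. The result is normalized over the grid points so that it can be compared with a binned distribution.

```python
import math

def gaussian_kde_on_grid(scores, grid, bandwidth):
    """Evaluate a Gaussian KDE at the grid points and normalize it to sum to 1."""
    if not scores:
        raise ValueError("need at least one score")
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    density = []
    for g in grid:
        # Unnormalized kernel sum; constant factors cancel in the normalization below.
        density.append(sum(math.exp(-0.5 * ((g - s) / bandwidth) ** 2) for s in scores))
    z = sum(density)
    return [d / z for d in density]

# Hypothetical ratings and an evenly spaced grid on a 1-5 scale.
scores = [2.0, 2.2, 4.0]
grid = [1.0 + 0.5 * i for i in range(9)]
kde = gaussian_kde_on_grid(scores, grid, bandwidth=0.5)
peak = grid[max(range(len(kde)), key=kde.__getitem__)]
print(peak)
```

A smaller bandwidth preserves multi-modality at the cost of noisier estimates; the value 0.5 here is arbitrary.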

In dialog or semantic annotation (e.g., NUANCED, DURel), distributions reflect ontological preference or proximity ratings, respectively, possibly structured as products of distributions across multiple slots or semantic axes (Chen et al., 2020, Schlechtweg et al., 2023).
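
The product structure can be illustrated with a small sketch that combines per-slot distributions into a joint distribution over slot-value tuples. It assumes the slots are independent, which is a simplification made here and not a claim about NUANCED; the slot names and probabilities are hypothetical.

```python
from itertools import product

def joint_from_slots(slot_dists):
    """Combine per-slot distributions into a joint distribution, assuming independence."""
    names = list(slot_dists)
    joint = {}
    for combo in product(*(slot_dists[name].items() for name in names)):
        values = tuple(value for value, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        joint[values] = prob
    return names, joint

# Hypothetical user-preference distributions for two slots.
slot_dists = {
    "cuisine": {"italian": 0.7, "thai": 0.3},
    "price": {"cheap": 0.5, "moderate": 0.5},
}
names, joint = joint_from_slots(slot_dists)
print(names, joint)
```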

3. Modeling, Inference, and Evaluation Frameworks

Empirical annotation distributions underpin several classes of statistical and machine learning models:

Direct distribution modeling: Models are trained to predict the full response distribution rather than a single label. For categorical outputs, this can be done via cross-entropy or mean squared error on soft labels (Chen et al., 2020, Klugmann et al., 5 Oct 2025). For continuous outputs, models predict discretized histograms or parametric (e.g., Beta) distributions over the response space (Tavernor et al., 2024, Pinitas et al., 8 Apr 2026).
Annotator modeling: To preserve fine-grained human variability, models with per-annotator heads or learned annotator embeddings are trained to predict each annotator's response; the empirical distribution is then obtained by aggregating these predictions. This improves the modeling of variance, skew, and multi-modality, as demonstrated in speech emotion recognition (Tavernor et al., 2024).
Ambiguity and uncertainty quantification: The full distribution enables calculation of entropy, Gini impurity, or specifically designed ambiguity measures (e.g., $\alpha(q)$, which reflects both class confusion and unresolvability via "can't solve") (Klugmann et al., 5 Oct 2025).
Agreement/Calibration: Empirical distributions enable measurement of inter-annotator reliability (Cohen’s $\kappa$, Krippendorff’s $\alpha$), divergence from computational annotator distributions (KL, Jensen–Shannon), and calibration of automated prediction models (Schlechtweg et al., 2023, Pavlovic et al., 2024).
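
Two of the generic quantities named in this list, the uncertainty summaries and the soft-label training loss, can be sketched in a few lines. These are the textbook definitions of Shannon entropy, Gini impurity, and cross-entropy; the specialized measure $\alpha(q)$ is not reproduced because its formula is not given here. The example vectors are hypothetical.

```python
import math

def entropy(q):
    """Shannon entropy in nats; zero-probability categories contribute nothing."""
    return -sum(p * math.log(p) for p in q if p > 0)

def gini_impurity(q):
    """Probability that two independent draws from q disagree."""
    return 1.0 - sum(p * p for p in q)

def soft_cross_entropy(q, p_model, eps=1e-12):
    """Cross-entropy of model probabilities p_model against the soft label q."""
    return -sum(qk * math.log(max(pk, eps)) for qk, pk in zip(q, p_model))

q = [0.6, 0.2, 0.2]        # hypothetical empirical soft label
p_model = [0.5, 0.3, 0.2]  # hypothetical model prediction
print(entropy(q), gini_impurity(q), soft_cross_entropy(q, p_model))
```

The cross-entropy is minimized when the prediction equals the soft label, where it reduces to the entropy of $q$, so the loss rewards matching the disagreement pattern and not only the top class.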

4. Empirical Characterization and Statistical Summaries

Empirical annotation distributions support detailed statistical analyses:

Error and disagreement distribution: In geometric annotation (e.g., polygon boundaries), per-sample distributions of distances (e.g., boundary mean $d_B$) reveal systematic effects of object shape, quality assurance, and image complexity. Skewness, kurtosis, and fitted parametric forms (log-normal, gamma) characterize these patterns (Zimmermann et al., 2023).
Ambiguity measures and inference: $\alpha(q)$ provides a scalar summary of distributional ambiguity (aleatoric uncertainty), with Bayesian inference under Dirichlet priors yielding credible intervals and posterior densities for $\alpha(q)$ (Klugmann et al., 5 Oct 2025).
Sense frequency and semantic change: In sense annotation, empirical distributions over clusters of usage types yield sense-frequency distributions, time-series of change, and entropy-based measures of semantic variation (Schlechtweg et al., 2023).
Zipfian distribution analysis: In large-scale text annotation, the empirical word-frequency distribution is used as a diagnostic for annotation "effort" and information density, fitting discrete power-laws to compare manual vs automatic curation (Bell et al., 2012).
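
The Bayesian step mentioned above can be sketched generically: with a symmetric Dirichlet prior, the posterior over $q$ given vote counts is again Dirichlet, and a credible interval for any scalar summary follows from posterior samples. Because the formula for $\alpha(q)$ is not given here, Shannon entropy is used as a stand-in statistic; the prior strength, sample size, and counts are assumptions.

```python
import math
import random

def dirichlet_posterior_samples(counts, prior=1.0, n_samples=2000, seed=0):
    """Draw samples of q from Dirichlet(counts + prior) via normalized gamma variates."""
    rng = random.Random(seed)
    params = [c + prior for c in counts]
    samples = []
    for _ in range(n_samples):
        draws = [rng.gammavariate(a, 1.0) for a in params]
        total = sum(draws)
        samples.append([d / total for d in draws])
    return samples

def entropy(q):
    return -sum(p * math.log(p) for p in q if p > 0)

def credible_interval(values, mass=0.95):
    """Equal-tailed interval read off the sorted sample values."""
    ordered = sorted(values)
    lo_idx = int((1 - mass) / 2 * (len(ordered) - 1))
    hi_idx = int((1 + mass) / 2 * (len(ordered) - 1))
    return ordered[lo_idx], ordered[hi_idx]

counts = [6, 3, 1]  # hypothetical votes over three classes from R = 10 annotators
post_mean = [(c + 1.0) / (sum(counts) + len(counts)) for c in counts]
samples = dirichlet_posterior_samples(counts)
low, high = credible_interval([entropy(s) for s in samples])
print(post_mean, (low, high))
```

The interval narrows as more annotators are added, which gives a direct way to judge whether an item's apparent ambiguity is well supported by the available votes.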

Empirical measures such as mean absolute error (MAE), total variation distance (TVD), Jensen–Shannon divergence (JSD), and Earth Mover’s Distance (EMD) are applied to compare predicted to gold soft-label distributions (Chen et al., 2020, Tavernor et al., 2024, Lowmanstone et al., 2023, Pavlovic et al., 2024).
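
For reference, the comparison measures listed above can be written out for discrete distributions over the same categories. This is a sketch using standard definitions; the EMD variant assumes ordered bins with unit spacing, which is an assumption suited to ordinal or binned continuous labels and not to nominal classes.

```python
import math

def mae(p, q):
    """Mean absolute error between two probability vectors."""
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)

def tvd(p, q):
    """Total variation distance: half the L1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl(p, q):
    """KL divergence in nats; terms with p_k = 0 contribute nothing."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def jsd(p, q):
    """Jensen-Shannon divergence; the mixture m is positive wherever p or q is."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def emd_1d(p, q):
    """Earth Mover's Distance for ordered bins with unit spacing (sum of |CDF differences|)."""
    cumulative, total = 0.0, 0.0
    for a, b in zip(p, q):
        cumulative += a - b
        total += abs(cumulative)
    return total

gold = [1.0, 0.0, 0.0]  # hypothetical gold soft label
pred = [0.0, 0.0, 1.0]  # hypothetical prediction
print(mae(gold, pred), tvd(gold, pred), jsd(gold, pred), emd_1d(gold, pred))
```

TVD and JSD treat all categories as unordered, while the one-dimensional EMD penalizes mass placed in distant bins more heavily than mass placed in adjacent ones.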

5. Applications Across Domains

Empirical annotation distributions are essential in:

Subjectivity modeling: Training and evaluating models that reflect genuine human disagreement in NLP, sentiment, offensiveness, or dialog state tasks (Chen et al., 2020, Lowmanstone et al., 2023, Pavlovic et al., 2024).
Downstream machine learning: Soft labels derived from empirical distributions enable models to learn aleatoric uncertainty directly, either via a probabilistic loss or through example weighting (e.g., weights derived from the ambiguity score $\alpha(q)$) (Klugmann et al., 5 Oct 2025).
Quality assurance and dataset curation: Distributions highlight high-ambiguity or poorly resolved items for re-annotation, guideline revision, or exclusion (Klugmann et al., 5 Oct 2025, Zimmermann et al., 2023).
Evaluation of AI annotators (LLMs): Direct elicitation of opinion distributions from LLMs enables benchmarking of their ability to reproduce the patterns of human disagreement, revealing systematic deviations such as distributional flattening, bimodality, or bias toward specific classes (Pavlovic et al., 2024).
Sense and semantic change analysis: Aggregated distributions over semantic similarity ratings provide robust tools for tracking shifts in meaning, frequency, or sense usage across time and corpora (Schlechtweg et al., 2023).
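
The quality-assurance use case can be sketched as a simple triage step: score each item's soft label and flag the most ambiguous ones for review. Normalized entropy is used here as a generic stand-in for an ambiguity score, and the threshold of 0.8 is an arbitrary assumption that a real project would tune.

```python
import math

def normalized_entropy(q):
    """Shannon entropy divided by its maximum log(K), giving a value in [0, 1]."""
    k = len(q)
    if k < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in q if p > 0)
    return h / math.log(k)

def flag_for_review(soft_labels, threshold=0.8):
    """Return (item, score) pairs at or above the threshold, most ambiguous first."""
    scored = [(item, normalized_entropy(q)) for item, q in soft_labels.items()]
    flagged = [(item, score) for item, score in scored if score >= threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)

# Hypothetical soft labels over three classes.
soft_labels = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.4, 0.3, 0.3],
    "c": [0.5, 0.5, 0.0],
}
flagged = flag_for_review(soft_labels)
print(flagged)
```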

Empirical distributions have further been used in imputation to expand the set of available annotations, with careful analysis revealing that many imputation methods tend to contract distributional diversity and overemphasize consensus (Lowmanstone et al., 2023).

6. Limitations, Bias, and Methodological Cautions

Several empirical and methodological challenges are intrinsic to annotation distributions:

Bias from survey design or annotator sampling: Sparse coverage or imputation can bias the soft label toward over-smoothing and majority collapse, as in collaborative filtering or multitask imputation models (Lowmanstone et al., 2023).
Artificial flattening or bimodality in AI annotators: LLMs prompted to output direct probability assignments tend toward a limited set of probability patterns and may systematically favor the positive class ("false positive" bias >90%), failing to capture the spectrum of human uncertainty (Pavlovic et al., 2024).
Representativeness and calibration: Naive approaches to probability elicitation from LLMs or end-to-end predictors may yield uncalibrated distributions with altered entropy structures, deviating from the true empirical human response pattern (Pavlovic et al., 2024).
Ambiguity metric limitations: Most scalar indices (entropy, Gini, etc.) fail to capture the full structure of ambiguity, hence the need for measures like $\alpha(q)$ that treat abstention and class confusion asymmetrically (Klugmann et al., 5 Oct 2025).
Task and domain dependence: Empirical annotation distributions and their statistical regularities are sensitive to task structure, annotation guidelines, and data domain (e.g., shape complexity in polygon annotation, or ontology design in dialog) (Zimmermann et al., 2023, Chen et al., 2020).

7. Summary Table: Key Empirical Annotation Distribution Frameworks

Domain/Task	Empirical Distribution Definition	Notable Evaluation Metrics / Effects
Categorical Classification	$q_k = \#\{\text{annotators choosing } k\} / R$	Entropy, Gini impurity, ambiguity $\alpha(q)$, KL/JSD divergence
Continuous Affect	KDE or binned histogram of annotator scores	TVD, JSD, CCC, soft-histogram cross-entropy
Dialog/Slot-filling	Distributions over slot values per utterance	MAE between predicted/empirical priors, slot-update accuracy
Semantic Sense Annotation	Median-based similarity ratings; cluster frequencies	Sense-frequency distributions, temporal distance between distributions
Polygon Annotation (vision)	Distributions of boundary distances/errors	Mean, std. dev., skewness, fitted parametric models
Bulk Biological Annotation	Word frequency distributions (Zipfian analysis)	Power-law exponent $\alpha$, audience-/annotator-effort balance

References

(Chen et al., 2020) NUANCED: Natural Utterance Annotation for Nuanced Conversation with Estimated Distributions
(Schlechtweg et al., 2023) The DURel Annotation Tool: Human and Computational Measurement of Semantic Proximity, Sense Clusters and Semantic Change
(Tavernor et al., 2024) The Whole Is Bigger Than the Sum of Its Parts: Modeling Individual Annotators to Capture Emotional Variability
(Klugmann et al., 5 Oct 2025) Quantifying Ambiguity in Categorical Annotations: A Measure and Statistical Inference Framework
(Zimmermann et al., 2023) An Empirical Study of Uncertainty in Polygon Annotation and the Impact of Quality Assurance
(Cortez et al., 2023) Incorporating Annotator Uncertainty into Representations of Discourse Relations
(Lowmanstone et al., 2023) Annotation Imputation to Individualize Predictions: Initial Studies on Distribution Dynamics and Model Predictions
(Bell et al., 2012) An approach to describing and analysing bulk biological annotation quality: a case study using UniProtKB
(Pavlovic et al., 2024) The Effectiveness of LLMs as Annotators: A Comparative Overview and Empirical Analysis of Direct Representation