CROWDLAB: Supervised learning to infer consensus labels and quality scores for data with multiple annotators (2210.06812v2)

Published 13 Oct 2022 in cs.LG, cs.HC, and stat.ML

Abstract: Real-world data for classification is often labeled by multiple annotators. For analyzing such data, we introduce CROWDLAB, a straightforward approach to utilize any trained classifier to estimate: (1) A consensus label for each example that aggregates the available annotations; (2) A confidence score for how likely each consensus label is correct; (3) A rating for each annotator quantifying the overall correctness of their labels. Existing algorithms to estimate related quantities in crowdsourcing often rely on sophisticated generative models with iterative inference. CROWDLAB instead uses a straightforward weighted ensemble. Existing algorithms often rely solely on annotator statistics, ignoring the features of the examples from which the annotations derive. CROWDLAB utilizes any classifier model trained on these features, and can thus better generalize between examples with similar features. On real-world multi-annotator image data, our proposed method provides superior estimates for (1)-(3) than existing algorithms like Dawid-Skene/GLAD.

Summary

  • The paper introduces CROWDLAB, a non-iterative, weighted aggregation method that fuses classifier and annotator outputs to improve consensus label accuracy.
  • The method uses normalized accuracies to effectively estimate both label confidence and annotator trustworthiness, outperforming traditional generative models.
  • Experiments on datasets like CIFAR-10H show superior performance in consensus quality estimation and reliable annotator ranking.

An Analysis of CROWDLAB: Supervised Learning to Infer Consensus Labels and Quality Scores in Multi-Annotator Data

The paper "CROWDLAB: Supervised Learning to Infer Consensus Labels and Quality Scores for Data with Multiple Annotators," authored by Hui Wen Goh, Ulyana Tkachenko, and Jonas Mueller, proposes a new method for handling classification data labeled by multiple annotators, a common scenario in real-world machine learning applications. The proposed method, CROWDLAB (Classifier Refinement Of croWDsourced LABels), is designed to work with any trained classifier to perform three tasks: infer accurate consensus labels, estimate the confidence of these labels, and assess the quality of individual annotators.

Methodology Overview

CROWDLAB introduces a non-iterative, computationally efficient approach that leverages weighted ensemble aggregation to combine classifier predictions with individual annotations. This contrasts with traditional generative models like Dawid-Skene or GLAD, which rely on iterative, often complex inference procedures. CROWDLAB weights each annotator's input based on its estimated trustworthiness relative to the classifier, thus incorporating useful instance-level features that generative models typically ignore.

Detailed Approach

Notation and Problem Setup

The authors establish a formal setting where a dataset composed of feature-class pairs $(X, Y)$ is labeled by multiple annotators. Each annotator may only label a subset of instances, and each instance may receive a varying number of labels. The true classes $Y_i$ are unknown, and the objective is to estimate consensus labels $\widehat{Y}_i$, their quality scores $q_i$, and annotator quality scores $a_j$.
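
To make the setup concrete, the sketch below shows one minimal way such data might be represented (illustrative only; the variable names and toy values are assumptions, not taken from the paper): an $n \times m$ matrix of annotator labels with a sentinel for missing entries, alongside the feature matrix the classifier is trained on.

```python
import numpy as np

# Illustrative toy setup (not from the paper): n examples, m annotators, K classes.
n, m, K = 6, 3, 3
rng = np.random.default_rng(0)

X = rng.normal(size=(n, 4))   # features; any classifier M can be trained on these

# labels[i, j] is annotator j's label for example i, or -1 if annotator j
# did not label example i (annotators label only subsets, and each example
# receives a varying number of labels).
labels = np.full((n, m), -1, dtype=int)
labels[0] = [0, 0, 1]
labels[1, :2] = [2, 2]
labels[2, 1:] = [1, 1]
labels[3, 0] = 0
labels[4] = [2, 1, 2]
labels[5, 2] = 0

# J_i: the set of annotators who labeled example i.
J = [np.flatnonzero(labels[i] != -1) for i in range(n)]

# pred_probs (n, K) would come from any classifier trained on these features,
# ideally produced out-of-sample (e.g., via cross-validation).
```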

Core Components

  1. Weighted Aggregation: CROWDLAB combines classifier predictions $\widehat{p}_\mathcal{M}$ with annotator-derived probabilities $\widehat{p}_{\mathcal{A}_j}$, weighted by $w_j$ and $w_\mathcal{M}$:

    $\widehat{p}_\text{CR} = \frac{w_\mathcal{M} \cdot \widehat{p}_\mathcal{M} + \sum_{j \in \mathcal{J}_i} w_j \cdot \widehat{p}_{\mathcal{A}_j}}{w_\mathcal{M} + \sum_{j \in \mathcal{J}_i} w_j},$

where $w_j$ represents annotator $j$'s trustworthiness and $w_\mathcal{M}$ the corresponding weight for the classifier $\mathcal{M}$.

  2. Estimation of Trustworthiness: Annotator trustworthiness $w_j$ and the classifier's weight $w_\mathcal{M}$ are estimated using normalized accuracy measures:

    $w_j = 1 - \frac{1 - s_j}{1 - A_\text{MLC}},$

    $w_\mathcal{M} = \left( 1 - \frac{1 - A_\mathcal{M}}{1 - A_\text{MLC}} \right) \cdot \sqrt{\frac{1}{n} \sum_i |\mathcal{J}_i|},$

where $s_j$ is annotator $j$'s accuracy against the majority-vote consensus, $A_\mathcal{M}$ is the model's accuracy against that consensus, and $A_\text{MLC}$ is the accuracy of a baseline that always predicts the overall most common class.

  3. Scoring Consensus and Annotators: The quality of consensus labels is measured using a Label Quality Score (LQS):

    $q_i = L(\widehat{Y}_i, \widehat{p}_{\text{CR}}),$

and annotator quality uses a weighted average of label quality and agreement with consensus:

    $a_j = \bar{w} Q_j + (1 - \bar{w}) A_j,$

where $\bar{w}$ is a composite weight balancing classifier and annotator trustworthiness.
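
The sketch below ties these pieces together in NumPy. It is an illustration under simplifying assumptions, not the authors' reference implementation: each annotator label is turned into a probability vector with a fixed smoothing parameter `eps` (the paper derives $\widehat{p}_{\mathcal{A}_j}$ from estimated annotator agreement), weights are clipped at zero, the label quality score is taken as the self-confidence of the consensus label, and the full annotator score $a_j$ is omitted in favor of the raw weights $w_j$.

```python
import numpy as np

def crowdlab_sketch(pred_probs, labels, eps=0.1):
    """Simplified CROWDLAB-style weighted ensemble (illustrative sketch).

    pred_probs : (n, K) class probabilities from any trained classifier M.
    labels     : (n, m) annotator labels, -1 where no label was given.
    eps        : smoothing used to turn one annotator label into a probability
                 vector (a simplifying assumption of this sketch).
    """
    n, K = pred_probs.shape
    m = labels.shape[1]
    labeled = labels != -1                                  # observed annotations

    # Majority-vote consensus labels (ties broken by argmax).
    votes = np.stack([((labels == k) & labeled).sum(1) for k in range(K)], axis=1)
    majority = votes.argmax(1)

    # s_j: each annotator's agreement with the majority-vote consensus
    # (assumes every annotator labeled at least one example).
    s = np.array([(labels[labeled[:, j], j] == majority[labeled[:, j]]).mean()
                  for j in range(m)])

    # A_M: classifier accuracy against the consensus;
    # A_MLC: accuracy of always predicting the most common consensus class
    # (assumed < 1 here so the normalization below is well defined).
    A_M = (pred_probs.argmax(1) == majority).mean()
    A_MLC = (majority == np.bincount(majority, minlength=K).argmax()).mean()

    # Normalized-accuracy weights (clipped at zero to stay non-negative).
    w_j = np.clip(1 - (1 - s) / (1 - A_MLC), 0, None)
    w_M = np.clip(1 - (1 - A_M) / (1 - A_MLC), 0, None) * np.sqrt(labeled.sum(1).mean())

    # Weighted ensemble of classifier predictions and annotator labels.
    consensus_probs = np.empty_like(pred_probs)
    for i in range(n):
        num, den = w_M * pred_probs[i], w_M
        for j in np.flatnonzero(labeled[i]):
            p_Aj = np.full(K, eps / (K - 1))                # smoothed one-hot label
            p_Aj[labels[i, j]] = 1 - eps
            num, den = num + w_j[j] * p_Aj, den + w_j[j]
        consensus_probs[i] = num / den

    consensus = consensus_probs.argmax(1)                   # refined consensus labels
    quality = consensus_probs[np.arange(n), consensus]      # self-confidence as LQS
    return consensus, quality, w_j
```

With the toy `labels` and out-of-sample `pred_probs` from the earlier sketch, `crowdlab_sketch(pred_probs, labels)` returns refined consensus labels, one quality score per example, and one weight per annotator.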

Experimental Results

Multiple datasets derived from CIFAR-10H were used to evaluate CROWDLAB. Key findings include:

  • Consensus Label Accuracy: CROWDLAB outperformed traditional methods by producing more accurate consensus labels.
  • Consensus Quality Estimation: In terms of AUROC, AUPRC, and Lift metrics, CROWDLAB consistently provided superior estimates for consensus label quality compared to baseline methods.
  • Annotator Quality Estimation: High Spearman correlation scores indicated that CROWDLAB effectively ranked annotators by accuracy.
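
The headline metrics can be reproduced in outline with standard libraries. The sketch below is illustrative only: `evaluate`, `true_labels`, and `true_annotator_acc` are hypothetical names, and ground-truth labels are assumed to be held out purely for benchmarking.

```python
from scipy.stats import spearmanr
from sklearn.metrics import roc_auc_score, average_precision_score

def evaluate(consensus, quality, w_j, true_labels, true_annotator_acc):
    """Benchmark consensus labels, quality scores, and annotator weights."""
    # Consensus label accuracy against held-out ground truth.
    accuracy = (consensus == true_labels).mean()

    # Consensus quality estimation: do the scores q_i separate correctly from
    # incorrectly labeled examples? Lower quality should flag errors.
    is_error = (consensus != true_labels).astype(int)
    auroc = roc_auc_score(is_error, 1 - quality)
    auprc = average_precision_score(is_error, 1 - quality)

    # Annotator quality estimation: rank agreement (Spearman) between the
    # estimated annotator weights and the annotators' true accuracies.
    rho, _ = spearmanr(w_j, true_annotator_acc)

    return {"accuracy": accuracy, "AUROC": auroc, "AUPRC": auprc, "Spearman": rho}
```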

Practical Implications and Future Directions

CROWDLAB's ability to work with any classifier makes it adaptable to advances in machine learning, including high-performance models such as Swin Transformers. This flexibility is crucial as models continue to improve and diversify. The method's computational efficiency and non-iterative nature make it suitable for scalable applications in crowdsourced labeling tasks, active learning, and beyond.

Future research could explore integrating CROWDLAB into active learning systems, where it could iteratively improve dataset quality by querying the most uncertain or likely mislabeled instances for re-annotation. Furthermore, extending CROWDLAB to other data types, such as text and audio annotations, could broaden its applicability across diverse fields.

In conclusion, CROWDLAB offers a robust, flexible, and computationally efficient solution for improving the management and quality assessment of multi-annotator data, providing a significant advancement in the field of supervised learning from crowdsourced annotations.
