CROWDLAB: Supervised learning to infer consensus labels and quality scores for data with multiple annotators
(2210.06812v2)
Published 13 Oct 2022 in cs.LG, cs.HC, and stat.ML
Abstract: Real-world data for classification is often labeled by multiple annotators. For analyzing such data, we introduce CROWDLAB, a straightforward approach to utilize any trained classifier to estimate: (1) A consensus label for each example that aggregates the available annotations; (2) A confidence score for how likely each consensus label is correct; (3) A rating for each annotator quantifying the overall correctness of their labels. Existing algorithms to estimate related quantities in crowdsourcing often rely on sophisticated generative models with iterative inference. CROWDLAB instead uses a straightforward weighted ensemble. Existing algorithms often rely solely on annotator statistics, ignoring the features of the examples from which the annotations derive. CROWDLAB utilizes any classifier model trained on these features, and can thus better generalize between examples with similar features. On real-world multi-annotator image data, our proposed method provides superior estimates for (1)-(3) than existing algorithms like Dawid-Skene/GLAD.
The paper introduces CROWDLAB, a non-iterative, weighted aggregation method that fuses classifier and annotator outputs to improve consensus label accuracy.
The method uses normalized accuracies to effectively estimate both label confidence and annotator trustworthiness, outperforming traditional generative models.
Experiments on datasets like CIFAR-10H show superior performance in consensus quality estimation and reliable annotator ranking.
An Analysis of CROWDLAB: Supervised Learning to Infer Consensus Labels and Quality Scores in Multi-Annotator Data
The paper "CROWDLAB: Supervised Learning to Infer Consensus Labels and Quality Scores for Data with Multiple Annotators," authored by Hui Wen Goh, Ulyana Tkachenko, and Jonas Mueller, proposes a new method for handling classification data labeled by multiple annotators, a common scenario in real-world machine learning applications. The proposed method, CROWDLAB (Classifier Refinement Of croWDsourced LABels), is designed to work with any trained classifier to perform three tasks: infer accurate consensus labels, estimate the confidence of these labels, and assess the quality of individual annotators.
Methodology Overview
CROWDLAB introduces a non-iterative, computationally efficient approach that uses weighted ensemble aggregation to combine classifier predictions with the individual annotations. This contrasts with traditional generative models such as Dawid-Skene or GLAD, which rely on iterative and often complex inference procedures. CROWDLAB weights each annotator's labels by that annotator's estimated trustworthiness relative to the classifier; because the classifier is trained on the examples' features, CROWDLAB also exploits instance-level information that generative models typically ignore.
Detailed Approach
Notation and Problem Setup
The authors establish a formal setting in which a dataset of feature-label pairs $(X_i, Y_i)$ is labeled by multiple annotators. Each annotator may label only a subset of the examples, and each example may receive a varying number of labels; the set of annotators who labeled example $i$ is denoted $J_i$. The true classes $Y_i$ are unknown, and the objective is to estimate a consensus label $\hat{Y}_i$ for each example, a quality score $q_i$ for that consensus label, and a quality score $a_j$ for each annotator $j$.
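To make this setup concrete, the sketch below shows one common way to represent such data in Python: a table of annotator labels with missing entries plus a matrix of classifier-predicted class probabilities. The array names and toy values are hypothetical, not from the paper.

```python
import numpy as np
import pandas as pd

# Hypothetical multi-annotator dataset: n examples, m annotators, K classes.
# Each annotator labels only a subset of examples; missing entries are NaN.
n, m, K = 5, 3, 3
labels_multiannotator = pd.DataFrame(
    [[0,      np.nan, 0],
     [1,      1,      np.nan],
     [np.nan, 2,      2],
     [0,      0,      1],
     [np.nan, np.nan, 2]],
    columns=[f"annotator_{j}" for j in range(m)],
)

# Out-of-sample predicted class probabilities from any trained classifier M,
# one row per example (rows sum to 1).
pred_probs = np.full((n, K), 1.0 / K)  # placeholder: a maximally uncertain classifier

# J_i = set of annotators who labeled example i
J = [labels_multiannotator.iloc[i].dropna().index.tolist() for i in range(n)]
print(J[0])  # ['annotator_0', 'annotator_2']
```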
Core Components
Weighted Aggregation: CROWDLAB forms consensus class probabilities by combining the classifier's predictions $p_M$ with annotator-derived probabilities $p_{A_j}$ in a weighted average:
$p_i^{\text{CR}} = \dfrac{w_M \cdot p_M + \sum_{j \in J_i} w_j \cdot p_{A_j}}{w_M + \sum_{j \in J_i} w_j},$
where $w_j$ reflects annotator $j$'s estimated trustworthiness and $w_M$ that of the classifier $M$.
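The sketch below illustrates this aggregation for a single example, assuming the annotator-derived probabilities $p_{A_j}$ are built as soft one-hot vectors over the annotator's chosen class; the specific encoding, the `spread` parameter, and the helper name are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def aggregate_crowd_and_model(p_M, annotator_labels, w_M, w_j, K, spread=0.75):
    """Weighted ensemble of classifier probabilities and annotator labels
    for a single example (a sketch of the aggregation formula above).

    p_M              : (K,) classifier-predicted class probabilities for this example
    annotator_labels : dict {annotator_id: chosen class index} for annotators in J_i
    w_M              : scalar weight for the classifier
    w_j              : dict {annotator_id: scalar weight}
    spread           : probability mass placed on the annotator's chosen class
                       (remaining mass spread uniformly over the other classes);
                       this encoding is an illustrative assumption
    """
    numerator = w_M * np.asarray(p_M, dtype=float)
    denominator = w_M
    for j, label in annotator_labels.items():
        p_Aj = np.full(K, (1.0 - spread) / (K - 1))
        p_Aj[label] = spread                      # soft one-hot for annotator j's label
        numerator += w_j[j] * p_Aj
        denominator += w_j[j]
    return numerator / denominator                # p_i^CR: a distribution over K classes

# Example: 3 classes, two annotators agree on class 0, classifier leans toward class 1.
p_CR = aggregate_crowd_and_model(
    p_M=[0.2, 0.7, 0.1],
    annotator_labels={"annotator_0": 0, "annotator_2": 0},
    w_M=0.8,
    w_j={"annotator_0": 0.6, "annotator_2": 0.5},
    K=3,
)
print(p_CR, p_CR.sum())  # consensus label = argmax(p_CR)
```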
Estimation of Trustworthiness: The annotator weights $w_j$ and the classifier weight $w_M$ are estimated from normalized accuracy measures:
$w_j = 1 - \dfrac{1 - s_j}{1 - A_{\text{MLC}}},$
$w_M = \left(1 - \dfrac{1 - A_M}{1 - A_{\text{MLC}}}\right) \cdot \dfrac{1}{n}\sum_i |J_i|,$
where $s_j$ is annotator $j$'s accuracy (agreement with the majority-vote consensus), $A_M$ is the model's accuracy against that consensus, and $A_{\text{MLC}}$ is the accuracy of a trivial baseline that always predicts the overall most common class, which serves as the normalization floor.
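A minimal sketch of these weight estimates follows, assuming $s_j$, $A_M$, and $A_{\text{MLC}}$ have already been computed against an initial majority-vote consensus; the clipping of negative weights is an assumption added for robustness in this sketch, not a detail taken from the paper.

```python
import numpy as np

def estimate_weights(s, A_M, A_MLC, avg_num_annotations, eps=1e-8):
    """Normalized-accuracy weights for annotators and the classifier (sketch).

    s                   : array of per-annotator agreement rates with the
                          majority-vote consensus (s_j)
    A_M                 : classifier accuracy against the majority-vote consensus
    A_MLC               : accuracy of a trivial baseline that always predicts the
                          overall most common class (the normalization floor)
    avg_num_annotations : (1/n) * sum_i |J_i|, the average number of labels per example
    """
    denom = max(1.0 - A_MLC, eps)
    w_j = 1.0 - (1.0 - np.asarray(s, dtype=float)) / denom   # annotator weights
    w_M = (1.0 - (1.0 - A_M) / denom) * avg_num_annotations  # classifier weight
    # Clip so weights never go negative (an assumption for this sketch):
    return np.clip(w_j, 0.0, None), max(w_M, 0.0)

# Example: three annotators with agreement rates 0.9, 0.7, 0.55, a classifier with
# 0.85 accuracy, a 0.4 majority-class baseline, and 2 labels per example on average.
w_j, w_M = estimate_weights(s=[0.9, 0.7, 0.55], A_M=0.85, A_MLC=0.4,
                            avg_num_annotations=2.0)
print(w_j, w_M)
```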
Scoring Consensus and Annotators: The quality of consensus labels is measured using a Label Quality Score (LQS):
$q_i = \mathcal{L}\big(\hat{Y}_i,\, p_i^{\text{CR}}\big),$
and annotator quality uses a weighted average of label quality and agreement with consensus:
$a_j = \bar{w}\, Q_j + (1-\bar{w})\, A_j,$
where $\bar{w}$ is a composite weight balancing classifier and annotator trustworthiness, $Q_j$ is the average label quality score over the examples annotator $j$ labeled, and $A_j$ is annotator $j$'s agreement rate with the consensus labels.
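The sketch below implements both scores under the assumption that $\mathcal{L}$ is the self-confidence score (the probability assigned to the consensus label by $p_i^{\text{CR}}$) and that $Q_j$ and $A_j$ are, respectively, the mean label quality over annotator $j$'s labeled examples and $j$'s agreement rate with the consensus; these concrete choices are illustrative, not necessarily the paper's exact definitions.

```python
import numpy as np

def consensus_quality_scores(p_CR, consensus_labels):
    """q_i = L(Y_hat_i, p_i^CR): here the self-confidence score, i.e. the
    estimated probability that each consensus label is correct (a sketch)."""
    p_CR = np.asarray(p_CR, dtype=float)
    return p_CR[np.arange(len(consensus_labels)), consensus_labels]

def annotator_quality_score(quality_of_labeled_examples, agreement_with_consensus, w_bar):
    """a_j = w_bar * Q_j + (1 - w_bar) * A_j, with Q_j the mean consensus-quality
    score over annotator j's examples and A_j their agreement with the consensus."""
    Q_j = np.mean(quality_of_labeled_examples)
    A_j = agreement_with_consensus
    return w_bar * Q_j + (1.0 - w_bar) * A_j

# Example: two examples, three classes.
p_CR = [[0.7, 0.2, 0.1],
        [0.4, 0.5, 0.1]]
consensus = [0, 1]
q = consensus_quality_scores(p_CR, consensus)
print(q)                                                          # [0.7, 0.5]
print(annotator_quality_score(q, agreement_with_consensus=1.0, w_bar=0.6))
```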
Experimental Results
Multiple datasets derived from CIFAR-10H were used to evaluate CROWDLAB. Key findings include:
Consensus Label Accuracy: CROWDLAB outperformed traditional methods by producing more accurate consensus labels.
Consensus Quality Estimation: In terms of AUROC, AUPRC, and Lift metrics, CROWDLAB consistently provided superior estimates for consensus label quality compared to baseline methods.
Annotator Quality Estimation: High Spearman rank correlations indicated that CROWDLAB reliably ranks annotators by their true accuracy (see the evaluation sketch after this list).
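To illustrate how such results are typically quantified, the sketch below computes AUROC for detecting incorrect consensus labels from their quality scores, and the Spearman rank correlation between estimated and true annotator accuracy; the data here are synthetic placeholders, not the paper's results.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Synthetic placeholders (not the paper's data): per-example consensus-quality
# scores q_i and indicators of whether each consensus label is actually wrong.
quality_scores = rng.uniform(size=1000)
consensus_is_wrong = (rng.uniform(size=1000) > quality_scores).astype(int)

# A higher quality score should mean a lower chance the consensus label is wrong,
# so use the negated quality score as the detection score for the "wrong" class.
auroc = roc_auc_score(consensus_is_wrong, -quality_scores)

# Estimated annotator-quality scores a_j vs. (held-out) true annotator accuracies.
estimated_a = rng.uniform(size=50)
true_accuracy = np.clip(estimated_a + rng.normal(scale=0.1, size=50), 0, 1)
rho, _ = spearmanr(estimated_a, true_accuracy)

print(f"AUROC = {auroc:.3f}, Spearman rho = {rho:.3f}")
```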
Practical Implications and Future Directions
CROWDLAB's ability to work with any trained classifier makes it adaptable to advances in machine learning, including integration with high-performance models such as Swin Transformers. This flexibility matters as models continue to improve and diversify. The method's computational efficiency and non-iterative nature also make it suitable for scalable applications in crowdsourced labeling, active learning, and beyond.
Future research could explore integrating CROWDLAB into active-learning pipelines, where it could iteratively improve dataset quality by routing the most uncertain or likely mislabeled examples for re-annotation. Extending CROWDLAB to other data types, such as text and audio annotations, could further broaden its applicability across fields.
In conclusion, CROWDLAB offers a robust, flexible, and computationally efficient solution for improving the management and quality assessment of multi-annotator data, providing a significant advancement in the field of supervised learning from crowdsourced annotations.