- The paper proposes a multi-annotator framework that treats each annotator’s input as a distinct subtask to preserve subjective diversity.
- This methodology matches or improves prediction performance and yields better uncertainty estimates across seven binary classification tasks.
- It offers a scalable approach for inclusive AI by retaining minority perspectives and mitigating biases inherent in majority vote aggregation.
Analyzing Annotation Discrepancies in Subjective NLP Tasks
The paper "Dealing with Disagreements: Looking Beyond the Majority Vote in Subjective Annotations" addresses the challenges and strategies for handling subjective annotator disagreements in NLP tasks. The authors present a detailed investigation into how traditional approaches, such as majority voting, fail to preserve the diversity and richness of human perspectives, particularly in subjective tasks such as hate speech detection and emotion recognition.
Key Contributions
One of the core contributions of the paper is a multi-annotator architecture that treats each annotator's judgments as a separate subtask within a unified multi-task framework. This design preserves systematic differences in annotator perspectives, which are flattened by common practices like majority voting. The multi-task framework not only captures these subtleties but also improves predictive performance compared to models trained on labels aggregated by majority vote before training.
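To make the architecture concrete, here is a minimal sketch of the multi-task idea: a shared encoder feeding one binary classification head per annotator. The encoder stand-in, layer sizes, and annotator count are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiAnnotatorClassifier(nn.Module):
    """Shared representation with one binary head per annotator (illustrative sizes)."""

    def __init__(self, encoder_dim: int = 768, num_annotators: int = 5):
        super().__init__()
        # Stand-in for a pretrained text encoder (e.g. BERT); here a simple projection.
        self.encoder = nn.Linear(encoder_dim, 256)
        # One head per annotator: each learns that annotator's labeling behavior.
        self.heads = nn.ModuleList([nn.Linear(256, 1) for _ in range(num_annotators)])

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        shared = torch.relu(self.encoder(features))
        # Shape: (batch, num_annotators) of per-annotator logits.
        return torch.cat([head(shared) for head in self.heads], dim=-1)

model = MultiAnnotatorClassifier()
logits = model(torch.randn(4, 768))          # 4 example inputs
per_annotator_probs = torch.sigmoid(logits)  # each annotator's predicted label probability
```

Training such a model amounts to applying a binary loss per head against that annotator's label, wherever that annotator actually labeled the example.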
The empirical analysis, conducted across seven binary classification tasks, shows that the multi-annotator approach matches or surpasses traditional methods in predictive performance. More importantly, it provides better estimates of uncertainty in predictions, a critical property in scenarios where a model must know when to abstain from making a conclusive decision.
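As a rough illustration of how per-annotator outputs can yield an uncertainty signal, the sketch below treats the spread of the heads' predicted probabilities as disagreement. The specific statistic (variance across heads) is an assumption for illustration, not necessarily the exact measure used in the paper.

```python
import torch

def prediction_and_uncertainty(per_annotator_probs: torch.Tensor):
    """per_annotator_probs: (batch, num_annotators) probabilities in [0, 1]."""
    mean_prob = per_annotator_probs.mean(dim=-1)   # aggregate prediction
    uncertainty = per_annotator_probs.var(dim=-1)  # disagreement among annotator heads
    return mean_prob, uncertainty

probs = torch.tensor([[0.90, 0.85, 0.95, 0.20, 0.88],   # one dissenting annotator
                      [0.55, 0.45, 0.50, 0.60, 0.40]])  # heads genuinely unsure
pred, unc = prediction_and_uncertainty(probs)
```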
The Implications of Multi-Annotator Models
The implications of this research are manifold. Firstly, the methodological innovation allows for the modeling of annotator disagreements, which can lead to a richer understanding of subjectivity in tasks such as hate speech and emotion detection. It avoids sacrificing the nuances hidden in minority perspectives and can mitigate biases that are harmful to marginalized communities.
Furthermore, the paper emphasizes the importance of preserving individual annotator judgments to improve decision-making processes. For instance, the uncertainty estimates derived from the multi-annotator models provide insights into when predictions should be withheld or escalated for human review, enhancing the deployment of AI systems in real-world applications.
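A hedged sketch of how such an uncertainty score might gate deployment decisions: predictions whose uncertainty exceeds a threshold are routed to human review. The threshold value and routing policy are illustrative choices, not prescriptions from the paper.

```python
def route_prediction(prob: float, uncertainty: float, threshold: float = 0.15) -> str:
    """Return an automatic decision unless annotator-level disagreement is too high."""
    if uncertainty > threshold:
        return "escalate_to_human"
    return "positive" if prob >= 0.5 else "negative"

print(route_prediction(prob=0.92, uncertainty=0.05))  # confident -> automatic decision
print(route_prediction(prob=0.60, uncertainty=0.22))  # disputed -> human review
```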
Theoretical and Practical Insights
The theoretical contribution of this work is to highlight the limitations of assigning a single ground-truth label in subjective domains, charting a direction for future studies. It suggests that integrating the diversity of human judgment into machine learning models can lead to more inclusive and representative AI systems.
Practically, the multi-annotator model can help adapt AI systems to varying cultural norms or moral frameworks: a single trained model exposes per-annotator predictions that can be aggregated differently depending on the deployment context. Additionally, the paper's insights into modeling disagreements and prediction uncertainty could inform the development of more nuanced systems in contentious fields like content moderation and sentiment analysis.
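One way to picture this adjustability, under the assumption that a deployment context can be expressed as weights over annotators, is a weighted aggregation of the per-annotator probabilities. The weighting scheme below is hypothetical; the paper does not prescribe a specific one.

```python
import torch

def weighted_aggregate(per_annotator_probs: torch.Tensor,
                       weights: torch.Tensor) -> torch.Tensor:
    """per_annotator_probs: (batch, num_annotators); weights: (num_annotators,)."""
    weights = weights / weights.sum()    # normalize to a distribution over annotators
    return per_annotator_probs @ weights  # context-weighted prediction per example

probs = torch.tensor([[0.9, 0.2, 0.8],
                      [0.4, 0.6, 0.5]])
uniform = torch.ones(3)                   # plain average of perspectives
stricter = torch.tensor([0.2, 0.2, 0.6])  # emphasize one perspective for this context
print(weighted_aggregate(probs, uniform), weighted_aggregate(probs, stricter))
```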
Future Directions
The work opens several avenues for further research. Clustering annotators or employing unsupervised learning techniques could reduce the computational cost of maintaining one subtask per annotator in crowdsourcing settings with large annotator pools. Moreover, integrating multi-annotator architectures into active learning pipelines could optimize data acquisition by identifying which annotators' perspectives would most improve the model.
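As a sketch of the clustering direction, one could group annotators by how similarly they label a shared set of items and train one head per cluster instead of one per annotator. The toy label matrix, cluster count, and use of k-means are assumptions for illustration, not part of the paper's method.

```python
import numpy as np
from sklearn.cluster import KMeans

# Rows = annotators, columns = items; entries are binary labels (toy data).
annotator_labels = np.array([
    [1, 0, 1, 1, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 1],
    [0, 1, 1, 1, 1],
])
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(annotator_labels)
print(clusters)  # annotators sharing a cluster could share a prediction head
```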
Conclusion
By advancing methodologies that incorporate the full spectrum of annotators' perspectives, this paper provides a substantial contribution to the field of subjective NLP tasks. Its findings and proposed models encourage a rethinking of how disagreements are handled, emphasizing the value of diversity in perspective and the importance of uncertainty modeling in AI predictions. As such, the work challenges prevailing norms and presents a scalable solution that honors the complexity of human judgment in machine learning settings.