Diverging Preferences: When do Annotators Disagree and do Models Know? (2410.14632v2)
Abstract: We examine diverging preferences in human-labeled preference datasets. We develop a taxonomy of disagreement sources spanning 10 categories across four high-level classes: task underspecification, response style, refusals, and annotation errors. We find that the majority of disagreements are at odds with standard reward modeling approaches, which are designed on the assumption that annotator disagreement is noise. We then explore how these findings affect two areas of LLM development: reward modeling and evaluation. In our experiments, we demonstrate how standard reward modeling methods, such as the Bradley-Terry model, fail to differentiate whether a given preference judgment reflects unanimous agreement among annotators or merely the majority opinion among diverging user preferences. We find that these tendencies are echoed by popular LLM-as-judge evaluation methods, which consistently identify a winning response even in cases of diverging preferences. These findings highlight remaining challenges both in LLM evaluation, which is heavily influenced by divisive features such as response style, and in developing pluralistically aligned LLMs. To address these issues, we develop methods for identifying diverging preferences and mitigating their influence on evaluation and training.
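For readers unfamiliar with the objective the abstract refers to, the sketch below illustrates why a standard Bradley-Terry reward-modeling loss cannot distinguish unanimous preferences from majority opinions: after majority-vote aggregation, both collapse to the same binary chosen/rejected label. This is a minimal illustrative sketch (the annotator counts and reward scores are hypothetical), not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Standard Bradley-Terry reward-modeling objective:
    -log sigmoid(r_chosen - r_rejected), averaged over the batch.
    Note that the loss depends only on the binary chosen/rejected label."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Hypothetical annotation outcomes for two preference pairs: one unanimous
# (5 of 5 annotators prefer response A) and one diverging (3 of 5 prefer A).
# After majority-vote aggregation both reduce to the same label, so a reward
# model trained with this loss receives identical supervision for both.
reward_scores = {
    "unanimous (5-0)": (torch.tensor([1.2]), torch.tensor([0.3])),
    "diverging (3-2)": (torch.tensor([1.2]), torch.tensor([0.3])),
}
for name, (r_chosen, r_rejected) in reward_scores.items():
    print(name, bradley_terry_loss(r_chosen, r_rejected).item())  # identical losses
```

Because both cases yield identical losses, any signal about annotator disagreement is discarded before training, which is the gap the paper's proposed methods for identifying diverging preferences aim to address.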