Diverging Preferences: When do Annotators Disagree and do Models Know? (2410.14632v2)

Published 18 Oct 2024 in cs.CL

Abstract: We examine diverging preferences in human-labeled preference datasets. We develop a taxonomy of disagreement sources spanning 10 categories across four high-level classes -- task underspecification, response style, refusals, and annotation errors. We find that the majority of disagreements are in opposition with standard reward modeling approaches, which are designed with the assumption that annotator disagreement is noise. We then explore how these findings impact two areas of LLM development: reward modeling and evaluation. In our experiments, we demonstrate how standard reward modeling methods, like the Bradley-Terry model, fail to differentiate whether a given preference judgment is the result of unanimous agreement among annotators or the majority opinion among diverging user preferences. We find that these tendencies are also echoed by popular LLM-as-Judge evaluation methods, which consistently identify a winning response in cases of diverging preferences. These findings highlight remaining challenges in LLM evaluations, which are greatly influenced by divisive features like response style, and in developing pluralistically aligned LLMs. To address these issues, we develop methods for identifying diverging preferences to mitigate their influence on evaluation and training.


Summary

  • The paper develops a comprehensive taxonomy of annotator disagreement sources, revealing that most divergences arise from individual user preferences rather than errors.
  • It demonstrates that traditional reward models treat diverging annotations as noise, while distributional reward models improve disagreement-detection AUROC by 0.16 by capturing the variance across annotator judgments.
  • The study critiques LLM-as-Judge evaluation biases and advocates for models that embrace pluralistic human values to yield fairer, more nuanced outputs.

Diverging Preferences: When do Annotators Disagree and do Models Know?

The paper "Diverging Preferences: When do Annotators Disagree and do Models Know?" presents an insightful analysis into the phenomenon of annotator disagreements within human-labeled preference datasets for training LLMs. The authors develop a comprehensive taxonomy of disagreement sources and assess the implications of these disagreements on reward modeling and LLM evaluation.

The taxonomy organizes sources of annotator disagreement into four high-level classes: task underspecification, response style, refusals, and annotation errors. Within these classes, ten specific sources of disagreement were identified through analysis of two multi-annotator datasets, MultiPref-Disagreements and HelpSteer2-Disagreements. A key finding is that most disagreements do not stem from errors but from genuine differences in individual user preferences, which poses a challenge for standard reward modeling approaches that treat disagreement as noise.
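
To make the notion of a diverging example concrete, the sketch below flags an example as diverging when its multi-annotator preference labels are split. The per-annotator "A"/"B" label encoding is an assumption for illustration, not the documented schema of MultiPref or HelpSteer2.

```python
from typing import List

def is_diverging(labels: List[str]) -> bool:
    """An example counts as diverging if at least one annotator prefers each response."""
    return ("A" in labels) and ("B" in labels)

print(is_diverging(["A", "A", "B", "A"]))  # True: 3-1 split among annotators
print(is_diverging(["A", "A", "A", "A"]))  # False: unanimous preference
```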

The taxonomy shows that task underspecification leads annotators to interpret the same prompt differently, so multiple responses can be legitimate yet favored by different annotators. Response style variations, such as verbosity, aesthetic taste, and response complexity, further contribute to diverging preferences. Refusals, whether motivated by safety considerations or capability limits, are likewise judged differently across annotators.

Empirical results show that standard reward models treat disagreement as noise and produce decisively one-sided scores even when preferences diverge. This is a critical obstacle for developing pluralistically aligned models: training on majority labels teaches the model to favor the more popular opinion in cases where human preferences are genuinely diverse.
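
A minimal sketch of the standard Bradley-Terry objective makes the failure mode concrete: the loss sees only a single chosen/rejected pair per example, so a unanimous judgment and a 3-1 split among annotators yield the same training signal. This is generic reward-modeling code, not the paper's training setup.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Standard pairwise preference loss: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Whether annotators were unanimous or split 3-1, the label collapses to a
# single (chosen, rejected) pair, so both cases contribute identically.
r_chosen = torch.tensor([1.2, 0.3])
r_rejected = torch.tensor([-0.5, 0.1])
print(bradley_terry_loss(r_chosen, r_rejected))
```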

To address these issues, the authors propose distributional reward models, which predict a distribution over preference judgments rather than a single scalar value. These models are better at identifying diverging preferences because they can capture the variance in annotator perspectives that produces disagreement. Notably, they improve the area under the receiver operating characteristic curve (AUROC) for identifying diverging preferences by 0.16 over standard reward models, showing their potential for more nuanced preference modeling and evaluation.
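
One way such a distributional head could be realized is sketched below: a mean and a log-variance are predicted per example, and the predicted spread serves as a disagreement score whose AUROC is measured against gold labels of annotator divergence. The parameterization and the toy data here are assumptions for illustration; the paper's exact formulation may differ.

```python
import torch
import torch.nn as nn
from sklearn.metrics import roc_auc_score

class DistributionalRewardHead(nn.Module):
    """Predicts a mean reward and a log-variance instead of a single scalar."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.mean = nn.Linear(hidden_size, 1)
        self.log_var = nn.Linear(hidden_size, 1)

    def forward(self, h: torch.Tensor):
        return self.mean(h).squeeze(-1), self.log_var(h).squeeze(-1)

# Toy evaluation: use the predicted spread as a disagreement score and compare
# against (hypothetical) binary labels of whether annotators diverged.
head = DistributionalRewardHead(hidden_size=16)
features = torch.randn(8, 16)            # stand-in for pooled response features
_, log_var = head(features)
gold_diverging = [0, 1, 0, 1, 1, 0, 0, 1]
print(roc_auc_score(gold_diverging, log_var.detach().numpy()))
```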

The paper also critiques the pervasive use of LLM-as-Judge evaluation systems, which tend to declare a definitive winner even when the comparison is inherently ambiguous. These judges show a bias against responses that are contextually safe or ask for clarification, as opposed to responses that directly fulfill ambiguous or controversial prompts. Incorporating distributional signals into these evaluation pipelines could mitigate such biases by recognizing when different responses are legitimately preferred by different users.
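
As one illustration of how an evaluation protocol might surface disagreement instead of forcing a verdict, the sketch below adds an explicit tie option to a pairwise judge prompt. The wording and helper function are hypothetical and do not reproduce the paper's evaluation setup.

```python
JUDGE_TEMPLATE = """You are comparing two assistant responses to the same prompt.

Prompt: {prompt}
Response A: {response_a}
Response B: {response_b}

If reasonable users would disagree about which response is better (for example,
the prompt is underspecified or the differences are purely stylistic), answer TIE.
Otherwise answer A or B. Reply with a single token: A, B, or TIE."""

def build_judge_prompt(prompt: str, response_a: str, response_b: str) -> str:
    # Hypothetical helper: fills the template for one pairwise comparison.
    return JUDGE_TEMPLATE.format(prompt=prompt, response_a=response_a, response_b=response_b)
```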

Overall, the paper calls for a shift in how preference datasets are interpreted and used when training and evaluating LLMs. Accounting for divergent opinions, rather than suppressing them, supports the broader goal of building models that are fairer and more applicable across diverse user bases, and that remain sensitive to the complexity of human values and judgments. Future work can explore the use of distributional reward models in real-world applications and refine evaluation metrics to accommodate a spectrum of human perspectives.
