Diverging Preferences: When do Annotators Disagree and do Models Know? (2410.14632v2)
Abstract: We examine diverging preferences in human-labeled preference datasets. We develop a taxonomy of disagreement sources spanning 10 categories across four high-level classes: task underspecification, response style, refusals, and annotation errors. We find that the majority of disagreements are at odds with standard reward modeling approaches, which are designed on the assumption that annotator disagreement is noise. We then explore how these findings affect two areas of LLM development: reward modeling and evaluation. In our experiments, we demonstrate how standard reward modeling methods, such as the Bradley-Terry model, fail to differentiate whether a given preference judgment reflects unanimous agreement among annotators or merely the majority opinion among diverging user preferences. We find that these tendencies are echoed by popular LLM-as-judge evaluation methods, which consistently identify a winning response even in cases of diverging preferences. These findings highlight remaining challenges both in LLM evaluation, which is heavily influenced by divisive features such as response style, and in developing pluralistically aligned LLMs. To address these issues, we develop methods for identifying diverging preferences and mitigating their influence on evaluation and training.
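For readers unfamiliar with the objective the abstract refers to, the sketch below illustrates why a standard Bradley-Terry reward-modeling loss cannot distinguish unanimous preferences from majority opinions: after majority-vote aggregation, both collapse to the same binary chosen/rejected label. This is a minimal illustrative sketch (the annotator counts and reward scores are hypothetical), not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Standard Bradley-Terry reward-modeling objective:
    -log sigmoid(r_chosen - r_rejected), averaged over the batch.
    Note that the loss depends only on the binary chosen/rejected label."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Hypothetical annotation outcomes for two preference pairs: one unanimous
# (5 of 5 annotators prefer response A) and one diverging (3 of 5 prefer A).
# After majority-vote aggregation both reduce to the same label, so a reward
# model trained with this loss receives identical supervision for both.
reward_scores = {
    "unanimous (5-0)": (torch.tensor([1.2]), torch.tensor([0.3])),
    "diverging (3-2)": (torch.tensor([1.2]), torch.tensor([0.3])),
}
for name, (r_chosen, r_rejected) in reward_scores.items():
    print(name, bradley_terry_loss(r_chosen, r_rejected).item())  # identical losses
```

Because both cases yield identical losses, any signal about annotator disagreement is discarded before training, which is the gap the paper's proposed methods for identifying diverging preferences aim to address.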