Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards (2402.18571v3)

Published 28 Feb 2024 in cs.LG, cs.AI, cs.CL, and stat.ML

Abstract: Fine-grained control over LLMs remains a significant challenge, hindering their adaptability to diverse user needs. While Reinforcement Learning from Human Feedback (RLHF) shows promise in aligning LLMs, its reliance on scalar rewards often limits its ability to capture diverse user preferences in real-world applications. To address this limitation, we introduce the Directional Preference Alignment (DPA) framework. Unlike the scalar-reward RLHF, DPA incorporates multi-objective reward modeling to represent diverse preference profiles. Additionally, DPA models user preferences as directions (i.e., unit vectors) in the reward space to achieve user-dependent preference control. Our method involves training a multi-objective reward model and then fine-tuning the LLM with a preference-conditioned variant of Rejection Sampling Finetuning (RSF), an RLHF method adopted by Llama 2. This method enjoys a better performance trade-off across various reward objectives. In comparison with the scalar-reward RLHF, DPA offers users intuitive control over LLM generation: they can arithmetically specify their desired trade-offs (e.g., more helpfulness with less verbosity). We also validate the effectiveness of DPA with real-world alignment experiments on Mistral-7B. Our method provides straightforward arithmetic control over the trade-off between helpfulness and verbosity while maintaining competitive performance with strong baselines such as Direct Preference Optimization (DPO).

Directional Preference Alignment for Fine-Grained Control over LLMs

The paper "Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards," addresses the challenge of aligning LLMs with diverse user preferences using a novel framework called Directional Preference Alignment (DPA). This research is situated in the context of using Reinforcement Learning from Human Feedback (RLHF) to align LLMs, which typically relies on scalar rewards, often failing to capture the complexity of varied human preferences.

Key Concepts and Framework

The paper introduces DPA as a framework that brings fine-grained control to LLMs through multi-objective reward modeling. Unlike traditional scalar-reward RLHF, which enforces a single, averaged preference, DPA uses a directional model of user preferences: preferences are represented as unit vectors in a multi-objective reward space, allowing diverse trade-offs and more personalized behavior.
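To make the directional view concrete, the display below gives a minimal formalization consistent with the abstract; the symbols r, v, R_v, and k are our own notation rather than taken verbatim from the paper.

```latex
% Minimal formalization (our notation) of directional preferences.
% A multi-objective reward model scores each prompt-response pair (x, y)
% on k objectives (e.g., helpfulness and verbosity):
\[
  r(x, y) \;=\; \bigl(r_1(x, y), \dots, r_k(x, y)\bigr) \in \mathbb{R}^k
\]
% A user preference is a direction, i.e., a unit vector in the reward space:
\[
  v \in \mathbb{R}^k, \qquad \lVert v \rVert_2 = 1
\]
% The user-dependent objective is the projection of the reward onto v:
\[
  R_v(x, y) \;=\; v^{\top} r(x, y) \;=\; \sum_{i=1}^{k} v_i \, r_i(x, y)
\]
```

Under this view, asking for "more helpfulness with less verbosity" simply rotates v toward the helpfulness axis, which is what makes the control arithmetic.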

The DPA framework involves two stages: training a multi-objective reward model, and then fine-tuning the LLM with a preference-conditioned variant of Rejection Sampling Finetuning (RSF), an RLHF method adopted by Llama 2. This setup lets users arithmetically specify the balance they desire between objectives, such as helpfulness and verbosity, offering more intuitive control over LLM outputs.
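The sketch below illustrates one plausible data-collection round for the preference-conditioned RSF stage, assuming the multi-objective reward model is already trained; `sample_fn`, `reward_fn`, the non-negative restriction on directions, and the data layout are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def preference_conditioned_rsf_batch(prompts, sample_fn, reward_fn,
                                     k=2, n_samples=8, rng=None):
    """Sketch of one data-collection round for preference-conditioned
    rejection sampling finetuning (RSF), assuming:
      - sample_fn(prompt, v) draws a candidate response from the current
        LLM conditioned on the preference direction v (e.g., via the prompt),
      - reward_fn(prompt, response) returns a k-dimensional reward vector
        (e.g., [helpfulness, verbosity]) as an array-like.
    Both callables are placeholders, not the paper's actual API.
    """
    rng = rng or np.random.default_rng(0)
    finetune_data = []
    for prompt in prompts:
        # Sample a user preference as a unit vector in the reward space
        # (restricting to non-negative weights is a simplifying assumption).
        v = np.abs(rng.normal(size=k))
        v /= np.linalg.norm(v)
        # Draw several candidates and keep the one with the highest
        # direction-weighted reward v . r(x, y).
        candidates = [sample_fn(prompt, v) for _ in range(n_samples)]
        scores = [float(v @ np.asarray(reward_fn(prompt, y))) for y in candidates]
        best = candidates[int(np.argmax(scores))]
        finetune_data.append({"prompt": prompt, "direction": v, "response": best})
    return finetune_data  # supervised fine-tuning then trains on these triples
```

Supervised fine-tuning on the resulting (prompt, direction, response) triples is what teaches the model to follow a user-supplied direction at inference time.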

Experimental Validation

To demonstrate the effectiveness of the proposed framework, the authors validate DPA on Mistral-7B. The experiments show that DPA captures and aligns with user-specific preferences more effectively than scalar-reward RLHF methods. For example, DPA lets users dial in less verbose responses while preserving helpfulness, a form of arithmetic control that scalar-reward baselines such as Direct Preference Optimization (DPO) do not expose, while remaining competitive with DPO in overall quality. The combination of personalized control and multi-objective modeling positions DPA as a practical route to personalizing LLM interactions.
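For intuition only, the user-facing arithmetic control could look like the hypothetical snippet below, where the preference direction is parameterized by an angle over the helpfulness and verbosity axes; the generator stub and its interface are placeholders, not the paper's API.

```python
import numpy as np

def sample_fn(prompt, v):
    """Stub for a preference-conditioned generator (see sketch above)."""
    return f"[response to {prompt!r} conditioned on direction {np.round(v, 2)}]"

# Hypothetical arithmetic control of the helpfulness/verbosity trade-off:
# rotating the unit vector toward the helpfulness axis asks for less verbosity.
t = np.deg2rad(20)                       # smaller angle -> less weight on verbosity
v = np.array([np.cos(t), np.sin(t)])     # [helpfulness weight, verbosity weight]
print(sample_fn("Summarize the DPA framework.", v))
```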

Implications and Future Directions

The implications of this research are twofold. Practically, DPA enhances LLMs' ability to adapt to diverse user preferences, improving user satisfaction in human-AI interaction. Theoretically, DPA offers a novel approach to reward modeling by moving from scalar rewards to directional preference vectors, fostering a richer representation of complex human preferences.

Looking ahead, challenges remain in optimizing DPA's performance across different domains and models. Further research could explore scalarization strategies for high-dimensional preference vectors and how such choices affect long-term alignment and performance. Improvements in directional preference learning could also help mitigate biases prevalent in current LLMs.

In conclusion, this paper advances the alignment of LLMs with user preferences by proposing a robust multi-objective alignment framework. The introduction of directional preference vectors in reward modeling, combined with preference-conditioned fine-tuning, marks a notable step toward more adaptable and personalized LLMs in real-world applications.

References (82)
  1. Anthropic. Introducing claude. 2023. URL https://www.anthropic.com/index/introducing-claude.
  2. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861, 2021.
  3. A general theoretical paradigm to understand learning from human preferences. arXiv preprint arXiv:2310.12036, 2023.
  4. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022a.
  5. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022b.
  6. Fine-tuning language models to find agreement among humans with diverse preferences. Advances in Neural Information Processing Systems, 35:38176–38189, 2022.
  7. Peering through preferences: Unraveling feedback acquisition for aligning large language models. arXiv preprint arXiv:2308.15812, 2023.
  8. E. Biyik and D. Sadigh. Batch active preference-based learning of reward functions. In Conference on robot learning, pages 519–528. PMLR, 2018.
  9. Aligning robot and human representations. arXiv preprint arXiv:2302.01928, 2023.
  10. Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations. In International conference on machine learning, pages 783–792. PMLR, 2019.
  11. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  12. R. Caruana. Multitask learning. Machine learning, 28:41–75, 1997.
  13. Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217, 2023.
  14. Odin: Disentangled reward mitigates hacking in rlhf, 2024.
  15. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
  16. On the weaknesses of reinforcement learning for neural machine translation. arXiv preprint arXiv:1907.01752, 2019.
  17. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023.
  18. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.
  19. C. C. Coello. Handling preferences in evolutionary multiobjective optimization: A survey. In Proceedings of the 2000 congress on evolutionary computation. CEC00 (Cat. No. 00TH8512), volume 1, pages 30–37. IEEE, 2000.
  20. Ultrafeedback: Boosting language models with high-quality feedback, 2023.
  21. Lmflow: An extensible toolkit for finetuning and inference of large foundation models. arXiv preprint arXiv:2306.12420, 2023.
  22. Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233, 2023.
  23. RAFT: Reward ranked finetuning for generative foundation model alignment. Transactions on Machine Learning Research, 2023a. ISSN 2835-8856. URL https://openreview.net/forum?id=m7p5O7zblY.
  24. Steerlm: Attribute conditioned sft as an (user-steerable) alternative to rlhf. arXiv preprint arXiv:2310.05344, 2023b.
  25. Implementation matters in deep policy gradients: A case study on ppo and trpo. arXiv preprint arXiv:2005.12729, 2020.
  26. Moral machine or tyranny of the majority? arXiv preprint arXiv:2305.17319, 2023.
  27. P. C. Fishburn. Probabilistic social choice based on simple voting comparisons. The Review of Economic Studies, 51(4):683–692, 1984.
  28. W. V. Gehrlein. Condorcet’s paradox and the likelihood of its occurrence: different perspectives on balanced preferences. Theory and decision, 52:171–199, 2002.
  29. A. Ghane-Kanafi and E. Khorram. A new scalarization method for finding the efficient frontier in non-convex multi-objective problems. Applied Mathematical Modelling, 39(23-24):7483–7498, 2015.
  30. Google. Bard. 2023. URL https://bard.google.com/.
  31. Reinforced self-training (rest) for language modeling. arXiv preprint arXiv:2308.08998, 2023.
  32. Optimizing prompts for text-to-image generation. arXiv preprint arXiv:2212.09611, 2022.
  33. Revisiting scalarization in multi-task learning: A theoretical perspective. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=6EqUpqMnwl.
  34. Personalized soups: Personalized large language model alignment via post-hoc parameter merging. arXiv preprint arXiv:2310.11564, 2023.
  35. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
  36. Who answers it better? an in-depth analysis of chatgpt and stack overflow answers to software engineering questions. arXiv preprint arXiv:2308.02312, 2023.
  37. Openassistant conversations - democratizing large language model alignment. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=VSJotgbPHF.
  38. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
  39. Rlaif: Scaling reinforcement learning from human feedback with ai feedback. arXiv preprint arXiv:2309.00267, 2023.
  40. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023.
  41. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022.
  42. Speciality vs generality: An empirical study on catastrophic forgetting in fine-tuning foundation models. arXiv preprint arXiv:2309.06256, 2023.
  43. I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.
  44. K. O. May. Intransitivity, utility, and the aggregation of preference patterns. Econometrica: Journal of the Econometric Society, pages 1–13, 1954.
  45. Nash learning from human feedback. arXiv preprint arXiv:2312.00886, 2023.
  46. OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774, 2023.
  47. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  48. Do the rewards justify the means? measuring trade-offs between rewards and ethical behavior in the machiavelli benchmark. In International Conference on Machine Learning, pages 26837–26867. PMLR, 2023.
  49. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
  50. Online learning to rank for sequential music recommendation. In Proceedings of the 13th ACM Conference on Recommender Systems, pages 237–245, 2019.
  51. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
  52. Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards. arXiv preprint arXiv:2306.04488, 2023.
  53. Verbosity bias in preference labeling by large language models. arXiv preprint arXiv:2310.10076, 2023.
  54. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  55. Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617, 2023a.
  56. A long way to go: Investigating length correlations in rlhf. arXiv preprint arXiv:2310.03716, 2023b.
  57. Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990, 2022.
  58. S. H. Sternberg. Mathematics and Social Sciences: Proceedings of the Seminars of Menthon-Saint-Bernard, France (1-27 July, 1960) and of Gösing, Austria (3-27 July, 1961), volume 1. Mouton, 1965.
  59. A minimaximalist approach to reinforcement learning from human feedback. arXiv preprint arXiv:2401.04056, 2024.
  60. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html, 3(6):7, 2023.
  61. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  62. Large language models in medicine. Nature medicine, 29(8):1930–1940, 2023.
  63. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  64. Zephyr: Direct distillation of lm alignment. arXiv preprint arXiv:2310.16944, 2023.
  65. A. Tversky. Intransitivity of preferences. Psychological review, 76(1):31, 1969.
  66. Trl: Transformer reinforcement learning. https://github.com/huggingface/trl, 2020.
  67. Pre-trained language models in biomedical domain: A systematic survey. ACM Computing Surveys, 56(3):1–52, 2023a.
  68. Is rlhf more difficult than standard rl? arXiv preprint arXiv:2306.14111, 2023b.
  69. Helpsteer: Multi-attribute helpfulness dataset for steerlm. arXiv preprint arXiv:2311.09528, 2023c.
  70. Helpsteer: Multi-attribute helpfulness dataset for steerlm. arXiv preprint arXiv:2311.09528, 2023d.
  71. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
  72. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.
  73. Better aligning text-to-image models with human preference. arXiv preprint arXiv:2303.14420, 2023a.
  74. Fine-grained human feedback gives better rewards for language model training. arXiv preprint arXiv:2306.01693, 2023b.
  75. Gibbs sampling from human feedback: A provable kl-constrained framework for rlhf. arXiv preprint arXiv:2312.11456, 2023.
  76. A theoretical analysis of nash learning from human feedback under general kl-regularized preference. arXiv preprint arXiv:2402.07314, 2024.
  77. Self-rewarding language models. arXiv preprint arXiv:2401.10020, 2024.
  78. Rrhf: Rank responses to align language models with human feedback without tears. arXiv preprint arXiv:2304.05302, 2023.
  79. Slic-hf: Sequence likelihood calibration with human feedback. arXiv preprint arXiv:2305.10425, 2023.
  80. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023.
  81. Beyond one-preference-for-all: Multi-objective direct preference optimization. arXiv preprint arXiv:2310.03708, 2023.
  82. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.
Authors (8)
  1. Haoxiang Wang (35 papers)
  2. Yong Lin (77 papers)
  3. Wei Xiong (172 papers)
  4. Rui Yang (221 papers)
  5. Shizhe Diao (47 papers)
  6. Shuang Qiu (46 papers)
  7. Han Zhao (159 papers)
  8. Tong Zhang (569 papers)