Provable Multi-Party Reinforcement Learning with Diverse Human Feedback (2403.05006v1)
Abstract: Reinforcement learning with human feedback (RLHF) is an emerging paradigm to align models with human preferences. Typically, RLHF aggregates preferences from multiple individuals who have diverse viewpoints that may conflict with each other. Our work *initiates* the theoretical study of multi-party RLHF that explicitly models the diverse preferences of multiple individuals. We show how traditional RLHF approaches can fail since learning a single reward function cannot capture and balance the preferences of multiple individuals. To overcome such limitations, we incorporate meta-learning to learn multiple preferences and adopt different social welfare functions to aggregate the preferences across multiple parties. We focus on the offline learning setting and establish sample complexity bounds, along with efficiency and fairness guarantees, for optimizing diverse social welfare functions such as Nash, Utilitarian, and Leximin welfare functions. Our results show a separation between the sample complexities of multi-party RLHF and traditional single-party RLHF. Furthermore, we consider a reward-free setting, where each individual's preference is no longer consistent with a reward model, and give pessimistic variants of the von Neumann Winner based on offline preference data. Taken together, our work showcases the advantage of multi-party RLHF but also highlights its more demanding statistical complexity.
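To make the welfare functions named in the abstract concrete, here is a minimal sketch (not the paper's algorithm) assuming each party's preferences have already been summarized as a per-policy reward estimate. It shows how Utilitarian, Nash, and Leximin welfare would each rank a set of candidate policies; all function and variable names are hypothetical.

```python
import numpy as np

def utilitarian_welfare(rewards):
    """Utilitarian welfare: sum of the parties' rewards."""
    return float(np.sum(rewards))

def nash_welfare(rewards, eps=1e-12):
    """Nash welfare: product of (non-negative) rewards, compared in log form to avoid underflow."""
    return float(np.sum(np.log(np.maximum(rewards, eps))))

def leximin_key(rewards):
    """Leximin: compare the sorted reward vectors lexicographically, worst-off party first."""
    return tuple(sorted(rewards))

# Hypothetical example: estimated rewards of 3 candidate policies for 2 parties.
policy_rewards = {
    "pi_1": np.array([0.90, 0.15]),  # great for party 1, poor for party 2
    "pi_2": np.array([0.50, 0.50]),  # balanced
    "pi_3": np.array([0.60, 0.30]),
}

print(max(policy_rewards, key=lambda p: utilitarian_welfare(policy_rewards[p])))  # pi_1
print(max(policy_rewards, key=lambda p: nash_welfare(policy_rewards[p])))         # pi_2
print(max(policy_rewards, key=lambda p: leximin_key(policy_rewards[p])))          # pi_2
```

The toy numbers illustrate the separation the paper relies on: the Utilitarian criterion favors the lopsided policy with the largest total reward, while the Nash and Leximin criteria prefer the balanced policy that protects the worse-off party.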
Authors: Huiying Zhong, Zhun Deng, Weijie J. Su, Zhiwei Steven Wu, Linjun Zhang