Generalizing Reward Modeling for Out-of-Distribution Preference Learning (2402.14760v2)

Published 22 Feb 2024 in cs.LG and cs.CL

Abstract: Preference learning (PL) with LLMs aims to align the LLMs' generations with human preferences. Previous work on reinforcement learning from human feedback (RLHF) has demonstrated promising results in in-distribution PL. However, due to the difficulty of obtaining human feedback, separately training a reward model for every encountered distribution is challenging. Thus, out-of-distribution (OOD) PL is practically useful for enhancing the generalization ability of LLMs with limited preference feedback. This work addresses OOD PL by optimizing a general reward model through a meta-learning approach. During meta-training, a bilevel optimization algorithm is utilized to learn a reward model capable of guiding policy learning to align with human preferences across various distributions. When encountering a test distribution, the meta-test procedure conducts regularized policy optimization using the learned reward model for PL. We theoretically demonstrate the convergence rate of the bilevel optimization algorithm under reasonable assumptions. Additionally, we conduct experiments on two text generation tasks across 20 held-out domains and outperform a variety of strong baselines across various evaluation metrics.

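To make the meta-training/meta-test structure described in the abstract concrete, below is a minimal, schematic sketch of a bilevel loop of this general shape. It is not the paper's algorithm: the linear "policy" and reward model, the feature dimensions, the first-order approximation of the outer gradient, and the L2 proxy for the KL-style regularizer at meta-test time are all illustrative assumptions.

```python
# Schematic sketch only (hypothetical stand-ins, not the paper's implementation).
# Meta-training: per sampled domain, adapt a policy against the current reward
# model (lower level), then update the reward model from preference pairs
# (upper level, first-order approximation). Meta-test: regularized policy
# optimization against the frozen, meta-learned reward model.
import torch
import torch.nn.functional as F

FEAT_DIM = 16                                        # hypothetical response-feature size
reward_model = torch.nn.Linear(FEAT_DIM, 1)          # r_phi(x, y) -> scalar reward
reward_opt = torch.optim.SGD(reward_model.parameters(), lr=1e-2)

def preference_loss(rm, chosen, rejected):
    """Bradley-Terry style loss on (chosen, rejected) response features."""
    margin = rm(chosen) - rm(rejected)
    return -F.logsigmoid(margin).mean()

# --- meta-training over sampled domains ---
for step in range(100):
    # toy stand-ins for one domain's prompts and a freshly initialized policy
    policy = torch.nn.Linear(FEAT_DIM, FEAT_DIM)
    policy_opt = torch.optim.SGD(policy.parameters(), lr=1e-2)
    prompts = torch.randn(32, FEAT_DIM)

    # lower level: adapt the policy to maximize the current reward model
    for _ in range(5):
        responses = policy(prompts)
        policy_loss = -reward_model(responses).mean()
        policy_opt.zero_grad(); policy_loss.backward(); policy_opt.step()

    # upper level: update the reward model so that adapted-policy outputs are
    # preferred over alternatives (gradient through the policy is dropped here,
    # i.e. a first-order approximation of the bilevel gradient)
    chosen = policy(prompts).detach()
    rejected = torch.randn_like(chosen)               # placeholder "dispreferred" responses
    outer_loss = preference_loss(reward_model, chosen, rejected)
    reward_opt.zero_grad(); outer_loss.backward(); reward_opt.step()

# --- meta-test: regularized policy optimization on a new domain ---
ref_policy = torch.nn.Linear(FEAT_DIM, FEAT_DIM)      # frozen reference policy
test_policy = torch.nn.Linear(FEAT_DIM, FEAT_DIM)
test_policy.load_state_dict(ref_policy.state_dict())
test_opt = torch.optim.SGD(test_policy.parameters(), lr=1e-2)
beta = 0.1                                            # regularization strength
test_prompts = torch.randn(32, FEAT_DIM)

for _ in range(20):
    out = test_policy(test_prompts)
    with torch.no_grad():
        ref_out = ref_policy(test_prompts)
    # maximize the learned reward while staying close to the reference
    # (L2 penalty used here as a simple proxy for a KL-style regularizer)
    loss = -reward_model(out).mean() + beta * F.mse_loss(out, ref_out)
    test_opt.zero_grad(); loss.backward(); test_opt.step()
```

The key structural point the sketch illustrates is that the reward model is the meta-learned object: it is updated across many training distributions so that, on an unseen test distribution, only the regularized policy optimization step is needed, with no new human feedback.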
Authors (1)
  1. Chen Jia (42 papers)