Fine-Tuning Language Models with Reward Learning on Policy (2403.19279v1)

Published 28 Mar 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Reinforcement learning from human feedback (RLHF) has emerged as an effective approach to aligning LLMs to human preferences. RLHF consists of three steps, i.e., human preference collection, reward learning, and policy optimization, which are usually performed serially. Despite its popularity, however, a (fixed) reward model may become inaccurate off-distribution, since policy optimization continuously shifts the LLM's data distribution. Repeatedly collecting new preference data from the latest LLMs may alleviate this issue, but it makes the resulting system more complicated and difficult to optimize. In this paper, we propose reward learning on policy (RLP), an unsupervised framework that refines a reward model using policy samples to keep it on-distribution. Specifically, an unsupervised multi-view learning method is introduced to learn robust representations of policy samples. Meanwhile, a synthetic preference generation approach is developed to simulate high-quality preference data with policy outputs. Extensive experiments on three benchmark datasets show that RLP consistently outperforms the state-of-the-art. Our code is available at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/rlp.
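
The abstract outlines two components that keep the reward model on-distribution: a multi-view objective over representations of policy samples, and synthetic preference pairs built from policy outputs. The sketch below is a minimal illustration of what such a refinement step could look like, assuming toy feature vectors in place of real policy generations; every name in it (ToyRewardModel, synthetic_preferences, rlp_refine_step, the frozen proxy scorer) is a hypothetical stand-in and not taken from the authors' released code.

```python
# Minimal, illustrative sketch of the idea in the abstract: refine a reward
# model on unlabeled policy samples via (1) a multi-view representation
# objective and (2) synthetic preference pairs. All names are hypothetical
# stand-ins; this is NOT the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyRewardModel(nn.Module):
    """Toy reward model: an encoder plus a scalar reward head."""

    def __init__(self, dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.head = nn.Linear(dim, 1)

    def forward(self, x):
        z = self.encoder(x)
        return self.head(z).squeeze(-1), z  # scalar reward and representation


def multi_view_loss(z1, z2):
    """Pull two stochastic 'views' of the same policy sample together
    (a stand-in for the paper's multi-view representation objective)."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    return -(z1 * z2).sum(-1).mean()


def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry style ranking loss on (synthetic) preference pairs."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()


def synthetic_preferences(a, b, proxy_scorer):
    """Rank two batches of policy samples with a frozen proxy scorer --
    a placeholder for the paper's synthetic preference generation."""
    with torch.no_grad():
        prefer_a = proxy_scorer(a).squeeze(-1) >= proxy_scorer(b).squeeze(-1)
    mask = prefer_a.unsqueeze(-1)
    return torch.where(mask, a, b), torch.where(mask, b, a)  # chosen, rejected


def rlp_refine_step(reward_model, proxy_scorer, policy_samples, optimizer,
                    noise=0.1, alpha=1.0):
    """One reward-refinement step on (toy) features of policy samples."""
    a, b = policy_samples.chunk(2)  # form candidate pairs from policy outputs
    chosen, rejected = synthetic_preferences(a, b, proxy_scorer)
    r_c, _ = reward_model(chosen)
    r_r, _ = reward_model(rejected)

    # Two noisy views of the same samples feed the representation term.
    _, z1 = reward_model(chosen + noise * torch.randn_like(chosen))
    _, z2 = reward_model(chosen + noise * torch.randn_like(chosen))

    loss = preference_loss(r_c, r_r) + alpha * multi_view_loss(z1, z2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    torch.manual_seed(0)
    rm = ToyRewardModel()
    proxy = nn.Linear(32, 1).requires_grad_(False)  # stand-in preference oracle
    opt = torch.optim.Adam(rm.parameters(), lr=1e-3)
    policy_samples = torch.randn(16, 32)            # stand-in for encoded policy outputs
    for step in range(5):
        print(f"step {step}: loss = {rlp_refine_step(rm, proxy, policy_samples, opt):.4f}")
```

In this toy version the frozen proxy scorer stands in for whatever procedure actually produces the synthetic preferences, and the noise-perturbed "views" stand in for a proper multi-view construction; the point is only to show how a ranking loss on policy samples and a representation-learning term can be combined into a single unsupervised reward-refinement update.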

Authors (3)
  1. Hao Lang (10 papers)
  2. Fei Huang (408 papers)
  3. Yongbin Li (128 papers)
Citations (2)