Cost-Effective Proxy Reward Model Construction with On-Policy and Active Learning (2407.02119v2)

Published 2 Jul 2024 in cs.LG, cs.AI, and cs.CL

Abstract: Reinforcement learning with human feedback (RLHF), as a widely adopted approach in current LLM pipelines, is bottlenecked by the size of human preference data. While traditional methods rely on offline preference dataset construction, recent approaches have shifted towards online settings, where a learner uses a small amount of labeled seed data and a large pool of unlabeled prompts to iteratively construct new preference data through self-generated responses and high-quality reward/preference feedback. However, most current online algorithms still focus on preference labeling during policy model updating with given feedback oracles, which incurs significant expert query costs. We are the first to explore cost-effective proxy reward oracle construction strategies for labeling preferences or rewards with extremely limited labeled data and expert query budgets. Our approach introduces two key innovations: (1) on-policy queries to avoid OOD and imbalance issues in seed data, and (2) active learning to select the most informative data for preference queries. Using these methods, we train an evaluation model with minimal expert-labeled data, which then effectively labels nine times more preference pairs for further RLHF training. For instance, our model using Direct Preference Optimization (DPO) gains over 1% average improvement on AlpacaEval2, MMLU-5shot and MMLU-0shot, with only a 1.7K query cost. Our methodology is orthogonal to other direct expert query-based strategies and therefore might be integrated with them to further reduce query costs.
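As a rough illustration of the pipeline the abstract describes (a small expert-labeled seed set, on-policy candidate pairs, active selection of which pairs to send to the expert, then bulk labeling with the learned proxy), here is a minimal self-contained sketch. The toy feature space, the logistic Bradley-Terry proxy reward model, the synthetic oracle, and the budget numbers are all illustrative assumptions, not the authors' implementation.

```python
# Sketch of cost-effective proxy reward model construction:
# seed labels -> active selection of expert queries -> proxy labels the rest.
# Everything below (features, proxy model, oracle, budgets) is a toy stand-in.
import numpy as np

rng = np.random.default_rng(0)

def train_proxy(X, y, steps=500, lr=0.1):
    """Fit a logistic Bradley-Terry proxy: P(a preferred over b) = sigmoid(w . x)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w += lr * X.T @ (y - p) / len(y)   # gradient ascent on the log-likelihood
    return w

def uncertainty(w, X):
    """Entropy of the predicted preference probability; high when the proxy is unsure."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return -(p * np.log(p + 1e-9) + (1 - p) * np.log(1 - p + 1e-9))

# Synthetic "on-policy" pool: feature differences between two sampled responses.
true_w = rng.normal(size=8)
pool = rng.normal(size=(2000, 8))
oracle = lambda X: (X @ true_w + rng.normal(scale=0.5, size=len(X)) > 0).astype(float)

# Small expert-labeled seed set.
X_lab = rng.normal(size=(50, 8))
y_lab = oracle(X_lab)

budget, rounds = 300, 3
for _ in range(rounds):
    w = train_proxy(X_lab, y_lab)
    # Active learning: spend the expert budget only on the most uncertain pairs.
    idx = np.argsort(-uncertainty(w, pool))[: budget // rounds]
    X_lab = np.vstack([X_lab, pool[idx]])
    y_lab = np.concatenate([y_lab, oracle(pool[idx])])
    pool = np.delete(pool, idx, axis=0)

# The trained proxy then labels the much larger remaining pool for DPO-style training.
w = train_proxy(X_lab, y_lab)
proxy_labels = (pool @ w > 0).astype(float)
print(f"proxy labeled {len(proxy_labels)} pairs from {len(y_lab)} expert queries")
```

The design point mirrored here is that expert queries go only to pairs the proxy is least certain about, while the bulk of the preference data is labeled for free by the learned proxy.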

Authors (9)
  1. Yifang Chen (31 papers)
  2. Shuohang Wang (69 papers)
  3. Ziyi Yang (77 papers)
  4. Hiteshi Sharma (12 papers)
  5. Nikos Karampatziakis (28 papers)
  6. Donghan Yu (18 papers)
  7. Kevin Jamieson (72 papers)
  8. Simon Shaolei Du (20 papers)
  9. Yelong Shen (83 papers)