Cost-Effective Proxy Reward Model Construction with On-Policy and Active Learning (2407.02119v2)
Abstract: Reinforcement learning with human feedback (RLHF), as a widely adopted approach in current LLM pipelines, is *bottlenecked by the size of human preference data*. While traditional methods rely on offline preference dataset construction, recent approaches have shifted towards online settings, where a learner uses a small amount of labeled seed data and a large pool of unlabeled prompts to iteratively construct new preference data through self-generated responses and high-quality reward/preference feedback. However, most current online algorithms still focus on preference labeling during policy model updating with given feedback oracles, which incurs significant expert query costs. *We are the first to explore cost-effective proxy reward oracle construction strategies for labeling further preferences or rewards with extremely limited labeled data and expert query budgets.* Our approach introduces two key innovations: (1) on-policy querying to avoid out-of-distribution (OOD) and imbalance issues in the seed data, and (2) active learning to select the most informative data for preference queries. Using these methods, we train an evaluation model with minimal expert-labeled data, which then effectively labels nine times more preference pairs for further RLHF training. For instance, our model trained with Direct Preference Optimization (DPO) achieves over a 1% average improvement on AlpacaEval2, MMLU-5shot, and MMLU-0shot, with an expert query cost of only 1.7K. Our methodology is orthogonal to other direct expert query-based strategies and might therefore be integrated with them to further reduce query costs.
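To make the two ingredients concrete, the toy sketch below illustrates the loop the abstract describes: draw response pairs on-policy, spend a small expert-query budget on the pairs the current proxy preference model is most uncertain about, then let the trained proxy label the much larger remaining pool. Everything here is an assumption for illustration (feature-vector "responses", a logistic Bradley-Terry proxy in place of an LLM-based evaluator, and a simulated expert); it is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8

def sample_on_policy_pair():
    """Stand-in for sampling two candidate responses to a prompt from the current policy."""
    return rng.normal(size=DIM), rng.normal(size=DIM)

def expert_prefers_first(a, b, w_true):
    """Stand-in for an expensive expert preference query."""
    return float(a @ w_true > b @ w_true)

def fit_proxy_reward(pairs, labels, lr=0.1, steps=200):
    """Logistic Bradley-Terry proxy: P(first preferred) = sigmoid(w @ (a - b))."""
    w = np.zeros(DIM)
    X = np.array([a - b for a, b in pairs])
    y = np.array(labels)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w += lr * X.T @ (y - p) / len(y)   # gradient ascent on the log-likelihood
    return w

def informativeness(w, pair):
    """Uncertainty sampling: pairs whose predicted preference is closest to 0.5."""
    a, b = pair
    p = 1.0 / (1.0 + np.exp(-w @ (a - b)))
    return -abs(p - 0.5)

# Hidden "expert" preference direction, used only to simulate labels.
w_true = rng.normal(size=DIM)

# (1) On-policy generation: the unlabeled pool comes from the current policy.
pool = [sample_on_policy_pair() for _ in range(1000)]

# (2) Active learning under a small expert-query budget.
budget = 50
labeled_pairs, labels = [], []
w = np.zeros(DIM)
for _ in range(budget):
    idx = max(range(len(pool)), key=lambda i: informativeness(w, pool[i]))
    a, b = pool.pop(idx)
    labeled_pairs.append((a, b))
    labels.append(expert_prefers_first(a, b, w_true))
    w = fit_proxy_reward(labeled_pairs, labels)

# (3) The trained proxy labels the much larger remaining pool for DPO-style training.
proxy_labels = [1.0 / (1.0 + np.exp(-w @ (a - b))) > 0.5 for a, b in pool]
agreement = np.mean([pl == expert_prefers_first(a, b, w_true)
                     for pl, (a, b) in zip(proxy_labels, pool)])
print(f"proxy agreement with the simulated expert on the unlabeled pool: {agreement:.2f}")
```

Uncertainty sampling is only one possible acquisition rule; the same loop could use the Fisher-embedding or gradient-based selection strategies cited in the references below.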
- Gone fishing: Neural active learning with Fisher embeddings. Advances in Neural Information Processing Systems, 34:8927–8939.
- Deep batch active learning by diverse, uncertain gradient lower bounds. arXiv preprint arXiv:1906.03671.
- Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
- An experimental design framework for label-efficient supervised finetuning of large language models. arXiv preprint arXiv:2401.06692.
- Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335.
- Batch active learning at scale. Advances in Neural Information Processing Systems, 34:11933–11944.
- Combinatorial optimisation. Wiley-Interscience Series in Discrete Mathematics and Optimization, USA, 1:998.
- UltraFeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377.
- RLHF workflow: From reward modeling to online RLHF. arXiv preprint arXiv:2405.07863.
- REBEL: Reinforcement learning via regressing relative rewards.
- Yonatan Geifman and Ran El-Yaniv. 2017. Deep active learning over the long tail. arXiv preprint arXiv:1711.00941.
- Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
- LLM-Blender: Ensembling large language models with pairwise ranking and generative fusion. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14165–14178.
- OpenAssistant Conversations: Democratizing large language model alignment. Advances in Neural Information Processing Systems, 36.
- RewardBench: Evaluating reward models for language modeling. arXiv preprint arXiv:2403.13787.
- AlpacaEval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval.
- Confronting reward model overoptimization with constrained RLHF. arXiv preprint arXiv:2310.04373.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
- Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36.
- Direct nash optimization: Teaching language models to self-improve with general preferences. arXiv preprint arXiv:2404.03715.
- Ozan Sener and Silvio Savarese. 2018. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations.
- Challenging big-bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13003–13051.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Iterative DPO alignment. Technical report, Snorkel AI.
- Self-instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13484–13508.
- Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. In 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022.
- Self-play preference optimization for language model alignment. arXiv preprint arXiv:2405.00675.
- Exploratory preference optimization: Harnessing implicit Q*-approximation for sample-efficient RLHF.
- Iterative preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint. In ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models.
- Some things are more cringe than others: Preference optimization with the pairwise cringe loss. arXiv preprint arXiv:2312.16682.
- Self-rewarding language models. arXiv preprint arXiv:2401.10020.
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36.
- Starling-7B: Improving LLM helpfulness & harmlessness with RLAIF.
- Yifang Chen
- Shuohang Wang
- Ziyi Yang
- Hiteshi Sharma
- Nikos Karampatziakis
- Donghan Yu
- Kevin Jamieson
- Simon Shaolei Du
- Yelong Shen