Proxy-RLHF: Decoupling Generation and Alignment in Large Language Model with Proxy (2403.04283v1)
Abstract: Reinforcement Learning from Human Feedback (RLHF) is the prevailing approach to ensuring that LLMs align with human values. However, existing RLHF methods incur high computational costs, in large part because RLHF assigns both the generation and the alignment task to the LLM simultaneously. In this paper, we introduce Proxy-RLHF, which decouples the generation and alignment processes of LLMs, achieving alignment with human values at a much lower computational cost. We start with a novel Markov Decision Process (MDP) designed for the alignment process and employ Reinforcement Learning (RL) to train a streamlined proxy model that oversees the token generation of the LLM, without altering the LLM itself. Experiments show that our method achieves a comparable level of alignment with only 1% of the training parameters of other methods.
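To make the decoupling concrete, below is a minimal, hypothetical sketch of the kind of decoding loop the abstract describes: a frozen LLM proposes top-k candidate tokens at each step, and a small proxy policy (which would be trained with RL under the paper's MDP, not shown here) chooses which candidate to emit. All names, shapes, and the toy stand-in for the LLM are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: a frozen LLM generates candidates, a small proxy
# policy oversees which token is accepted. Only the proxy would ever be
# trained; the LLM's weights are never touched.
import torch
import torch.nn as nn

VOCAB_SIZE = 32000   # assumed vocabulary size
HIDDEN_DIM = 64      # assumed size of a prefix summary fed to the proxy
TOP_K = 8            # assumed number of candidates the proxy re-ranks


class ProxyPolicy(nn.Module):
    """Tiny policy that scores the LLM's top-k candidate tokens."""

    def __init__(self, hidden_dim: int = HIDDEN_DIM, top_k: int = TOP_K):
        super().__init__()
        # Inputs: the LLM's probability mass on each candidate plus a small
        # summary of the prefix (both assumed features, for illustration).
        self.scorer = nn.Sequential(
            nn.Linear(top_k + hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, top_k),
        )

    def forward(self, candidate_probs: torch.Tensor, prefix_state: torch.Tensor) -> torch.Tensor:
        # Returns a distribution over the k candidates.
        features = torch.cat([candidate_probs, prefix_state], dim=-1)
        return torch.softmax(self.scorer(features), dim=-1)


@torch.no_grad()
def frozen_llm_step(prefix: list[int]) -> torch.Tensor:
    """Stand-in for the frozen LLM: returns next-token logits for a prefix.
    In practice this would be a real model whose weights are never updated."""
    g = torch.Generator().manual_seed(len(prefix))  # deterministic toy logits
    return torch.randn(VOCAB_SIZE, generator=g)


def proxy_guided_decode(proxy: ProxyPolicy, prefix: list[int], steps: int = 5) -> list[int]:
    """Generation stays with the LLM; alignment decisions stay with the proxy."""
    prefix = list(prefix)
    prefix_state = torch.zeros(HIDDEN_DIM)  # assumed prefix summary
    for _ in range(steps):
        logits = frozen_llm_step(prefix)
        top_probs, top_ids = torch.topk(torch.softmax(logits, dim=-1), TOP_K)
        choice_dist = proxy(top_probs, prefix_state)       # proxy re-ranks candidates
        choice = torch.multinomial(choice_dist, 1).item()  # sample an accepted candidate
        prefix.append(top_ids[choice].item())
    return prefix


if __name__ == "__main__":
    proxy = ProxyPolicy()
    print(proxy_guided_decode(proxy, prefix=[1, 2, 3]))
```

In such a setup only the proxy's parameters would receive gradients during RL training, which is consistent with the abstract's claim of using roughly 1% of the training parameters required by methods that fine-tune the LLM itself.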
Authors: Yu Zhu, Chuxiong Sun, Wenfei Yang, Wenqiang Wei, Bo Tang, Tianzhu Zhang, Zhiyu Li, Shifeng Zhang, Feiyu Xiong, Jie Hu, Mingchuan Yang