Proxy-RLHF: Decoupling Generation and Alignment in Large Language Model with Proxy (2403.04283v1)

Published 7 Mar 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Reinforcement Learning from Human Feedback (RLHF) is the prevailing approach to ensuring that LLMs align with human values. However, existing RLHF methods incur a high computational cost, largely because RLHF assigns both the generation and alignment tasks to the LLM simultaneously. In this paper, we introduce Proxy-RLHF, which decouples the generation and alignment processes of LLMs, achieving alignment with human values at a much lower computational cost. We start with a novel Markov Decision Process (MDP) designed for the alignment process and employ Reinforcement Learning (RL) to train a streamlined proxy model that oversees the token generation of the LLM without altering the LLM itself. Experiments show that our method achieves a comparable level of alignment with only 1% of the training parameters of other methods.
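The abstract's core idea, a frozen base LLM proposing tokens while a small RL-trained proxy decides which candidate to emit, can be illustrated with a toy decoding loop. The sketch below is a minimal illustration under assumed design choices: a top-k candidate scheme, a GRU-based stand-in for the frozen LLM, and an MLP proxy policy. None of these names, shapes, or choices come from the paper, and the RL training of the proxy (against an alignment reward, per the abstract) is omitted.

```python
# Hypothetical sketch of "frozen generator + small proxy overseer".
# All module names, shapes, and the top-k selection scheme are illustrative
# assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn

class TinyFrozenLM(nn.Module):
    """Stand-in for the frozen base LLM; its weights are never updated."""
    def __init__(self, vocab_size=100, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    @torch.no_grad()  # generation only; the LLM itself is not altered
    def propose(self, tokens, k=5):
        h, _ = self.rnn(self.embed(tokens))
        last = h[:, -1, :]                      # state summary given to the proxy
        logits = self.head(last)
        topk = logits.topk(k, dim=-1)
        return last, topk.indices, topk.values  # state + k candidate tokens

class ProxyPolicy(nn.Module):
    """Small trainable policy that scores the LLM's candidate tokens."""
    def __init__(self, hidden=32, k=5):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(hidden + k, 64), nn.ReLU(),
                                    nn.Linear(64, k))

    def forward(self, state, cand_logits):
        return self.scorer(torch.cat([state, cand_logits], dim=-1))

def generate(lm, proxy, prompt, steps=20, k=5):
    """Decoding loop: the LLM proposes, the proxy accepts one candidate per step."""
    tokens = prompt
    for _ in range(steps):
        state, cands, cand_logits = lm.propose(tokens, k)
        action_logits = proxy(state, cand_logits)   # MDP action over k candidates
        choice = torch.distributions.Categorical(logits=action_logits).sample()
        next_tok = cands.gather(-1, choice.unsqueeze(-1))
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens

if __name__ == "__main__":
    lm, proxy = TinyFrozenLM(), ProxyPolicy()
    out = generate(lm, proxy, prompt=torch.randint(0, 100, (1, 4)))
    print(out.shape)
```

In this framing, the proxy's candidate selection plays the role of the MDP's action, and only the proxy's parameters (here a small MLP) would be updated during RL training, which is consistent with the abstract's claim of needing only about 1% of the training parameters of other methods.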

Authors (11)
  1. Yu Zhu (123 papers)
  2. Chuxiong Sun (12 papers)
  3. Wenfei Yang (18 papers)
  4. Wenqiang Wei (5 papers)
  5. Bo Tang (111 papers)
  6. Tianzhu Zhang (60 papers)
  7. Zhiyu Li (69 papers)
  8. Shifeng Zhang (46 papers)
  9. Feiyu Xiong (53 papers)
  10. Jie Hu (187 papers)
  11. Mingchuan Yang (10 papers)
Citations (3)