Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback (2401.11458v3)

Published 21 Jan 2024 in cs.CL

Abstract: The success of AI assistants based on LLMs hinges on Reinforcement Learning from Human Feedback (RLHF) to comprehend and align with user intentions. However, traditional alignment algorithms, such as PPO, are hampered by complex annotation and training requirements. This reliance limits the applicability of RLHF and hinders the development of professional assistants tailored to diverse human preferences. In this work, we introduce Linear Alignment, a novel algorithm that aligns LLMs with human preferences in a single inference step, eliminating the reliance on data annotation and model training. Linear Alignment incorporates a new parameterization for policy optimization under divergence constraints, which enables the extraction of the optimal policy in closed form and facilitates direct estimation of the aligned response. Extensive experiments on both general and personalized preference datasets demonstrate that Linear Alignment significantly enhances the performance and efficiency of LLM alignment across diverse scenarios. Our code and dataset are published at https://github.com/Wizardcoast/Linear_Alignment.git.
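For intuition, policy optimization under a KL-divergence constraint admits the well-known closed-form optimum π*(y|x) ∝ π₀(y|x)·exp(r(x, y)/β), which requires no gradient updates if the correction term can be estimated at inference time. The sketch below illustrates one way such a training-free, inference-time adjustment could look: the base model's next-token logits are shifted along the direction induced by a preference-describing prompt. This is a minimal illustration of the general idea, not the paper's exact algorithm; the model name, the `alpha` scale, and the principle text are all assumptions.

```python
# Minimal sketch of inference-time preference steering (illustrative, not the
# paper's exact derivation): at each decoding step, the base logits are shifted
# toward the logits produced when a preference "principle" is prepended.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any causal LM; a small model keeps the sketch runnable
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

principle = "Respond helpfully, honestly, and concisely."  # hypothetical preference
prompt = "How should I prepare for a technical interview?"

plain_ids = tok(prompt, return_tensors="pt").input_ids
pref_ids = tok(principle + "\n" + prompt, return_tensors="pt").input_ids

alpha = 2.0          # assumed steering strength; values > 1 extrapolate past plain prompting
max_new_tokens = 64

with torch.no_grad():
    for _ in range(max_new_tokens):
        base_logits = model(plain_ids).logits[:, -1, :]
        pref_logits = model(pref_ids).logits[:, -1, :]
        # Move the base distribution along the direction implied by the
        # preference-conditioned distribution, with no reward model or tuning.
        steered = base_logits + alpha * (pref_logits - base_logits)
        next_id = steered.argmax(dim=-1, keepdim=True)
        if next_id.item() == tok.eos_token_id:
            break
        plain_ids = torch.cat([plain_ids, next_id], dim=-1)
        pref_ids = torch.cat([pref_ids, next_id], dim=-1)

print(tok.decode(plain_ids[0], skip_special_tokens=True))
```

With alpha = 1 this reduces to ordinary prompting with the principle prepended; larger values push further in the preference direction, which is the kind of knob a closed-form, tuning-free method can expose. For the authors' actual formulation, see the repository linked in the abstract.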

Authors (12)
  1. Songyang Gao
  2. Qiming Ge
  3. Wei Shen
  4. Shihan Dou
  5. Junjie Ye
  6. Xiao Wang
  7. Rui Zheng
  8. Yicheng Zou
  9. Zhi Chen
  10. Hang Yan
  11. Qi Zhang
  12. Dahua Lin
Citations (8)