
Abstract

The success of AI assistants based on Large Language Models (LLMs) hinges on Reinforcement Learning from Human Feedback (RLHF) to comprehend and align with user intentions. However, traditional alignment algorithms, such as PPO, are hampered by complex annotation and training requirements. This reliance limits the applicability of RLHF and hinders the development of professional assistants tailored to diverse human preferences. In this work, we introduce Linear Alignment, a novel algorithm that aligns language models with human preferences in a single inference step, eliminating the need for data annotation and model training. Linear Alignment introduces a new parameterization for policy optimization under divergence constraints, which yields the optimal policy in closed form and enables direct estimation of the aligned response. Extensive experiments on both general and personalized preference datasets demonstrate that Linear Alignment significantly improves the performance and efficiency of LLM alignment across diverse scenarios. Our code and dataset are available at https://github.com/Wizardcoast/Linear_Alignment.git.
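For intuition, the abstract's "closed-form" claim echoes the standard solution to KL-constrained policy optimization; the sketch below states that textbook result, not necessarily the paper's exact parameterization. Here r(x, y) denotes a reward or preference signal, \(\pi_{\mathrm{ref}}\) the unaligned base policy, and \(\beta\) the strength of the divergence constraint; these symbols are illustrative assumptions, not notation taken from the paper.

\[
\pi^{*} \;=\; \arg\max_{\pi}\; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[\, r(x, y) \,\big] \;-\; \beta\, \mathrm{KL}\big(\pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big)
\quad\Longrightarrow\quad
\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big),
\]

where \(Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\, \exp\big(\tfrac{1}{\beta} r(x, y)\big)\) normalizes the distribution. Estimating the right-hand side directly at decoding time, rather than training \(\pi\) toward it, is the kind of single-inference-step alignment the abstract describes.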

