Fine-Tuning Language Models with Advantage-Induced Policy Alignment (2306.02231v3)

Published 4 Jun 2023 in cs.CL, cs.AI, cs.LG, cs.SY, and eess.SY

Abstract: Reinforcement learning from human feedback (RLHF) has emerged as a reliable approach to aligning LLMs to human preferences. Among the plethora of RLHF techniques, proximal policy optimization (PPO) is one of the most widely used methods. Despite its popularity, however, PPO may suffer from mode collapse, instability, and poor sample efficiency. We show that these issues can be alleviated by a novel algorithm that we refer to as Advantage-Induced Policy Alignment (APA), which leverages a squared error loss function based on the estimated advantages. We demonstrate empirically that APA consistently outperforms PPO in language tasks by a large margin when a separate reward model is employed as the evaluator. In addition, compared with PPO, APA offers a more stable form of control over the deviation from the model's initial policy, ensuring that the model improves its performance without collapsing to deterministic output. Beyond these empirical results, we also provide a theoretical justification supporting the design of our loss function.
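
The abstract only states that APA replaces PPO's clipped objective with a squared-error loss on estimated advantages while controlling deviation from the initial policy; the exact loss is defined in the paper. The sketch below is one plausible instantiation of that idea, not the authors' verbatim formulation: it regresses the log-ratio between the current and initial policies onto advantages scaled by a hypothetical coefficient `lam`.

```python
# Hedged sketch of an APA-style objective (illustrative, assumed form):
# penalize the squared difference between the policy's log-ratio to its
# initial policy and the scaled advantage estimate, with no PPO-style clipping.
import torch


def apa_style_loss(logp_current: torch.Tensor,
                   logp_init: torch.Tensor,
                   advantages: torch.Tensor,
                   lam: float = 0.1) -> torch.Tensor:
    """Squared-error alignment loss on estimated advantages (illustrative only).

    logp_current: log pi_theta(a|s) under the policy being fine-tuned
    logp_init:    log pi_init(a|s) under the frozen initial policy
    advantages:   advantage estimates A(s, a) from a value/reward model
    lam:          scale controlling how far the policy may drift from pi_init
    """
    log_ratio = logp_current - logp_init      # deviation from the initial policy
    target = advantages / lam                 # advantage-induced target for that deviation
    return ((log_ratio - target) ** 2).mean() # squared error, no clipping as in PPO


if __name__ == "__main__":
    # Toy usage with random tensors standing in for token-level log-probs and advantages.
    torch.manual_seed(0)
    logp_cur = torch.randn(8, requires_grad=True)
    logp_init = torch.randn(8)
    adv = torch.randn(8)
    loss = apa_style_loss(logp_cur, logp_init, adv)
    loss.backward()
    print(float(loss))
```

Under this assumed form, the squared error keeps the log-ratio bounded by the (scaled) advantages, which matches the abstract's claim of more stable control over deviation from the initial policy than PPO's ratio clipping.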

Authors (7)
  1. Banghua Zhu (38 papers)
  2. Hiteshi Sharma (12 papers)
  3. Felipe Vieira Frujeri (8 papers)
  4. Shi Dong (20 papers)
  5. Chenguang Zhu (100 papers)
  6. Michael I. Jordan (438 papers)
  7. Jiantao Jiao (83 papers)
Citations (36)