Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning (2407.00617v3)

Published 30 Jun 2024 in cs.LG, cs.AI, cs.CL, and cs.GT

Abstract: Reinforcement Learning with Human Feedback (RLHF) has achieved great success in aligning LLMs with human preferences. Prevalent RLHF approaches are reward-based, following the Bradley-Terry (BT) model assumption, which may not fully capture the complexity of human preferences. In this paper, we explore RLHF under a general preference framework and approach it from a game-theoretic perspective. Specifically, we formulate the problem as a two-player game and propose a novel online algorithm, iterative Nash policy optimization (INPO). The key idea is to let the policy play against itself via no-regret learning, thereby approximating the Nash policy. Unlike previous methods, INPO bypasses the need for estimating the expected win rate for individual responses, which typically incurs high computational or annotation costs. Instead, we introduce a new loss objective that is directly minimized over a preference dataset. We provide theoretical analysis for our approach and demonstrate its effectiveness through experiments on various representative benchmarks. With an LLaMA-3-8B-based SFT model, INPO achieves a 42.6% length-controlled win rate on AlpacaEval 2.0 and a 37.8% win rate on Arena-Hard, showing substantial improvement over the state-of-the-art online RLHF algorithms.
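
To illustrate the game-theoretic idea in the abstract, the sketch below runs self-play with a standard no-regret rule (multiplicative weights) on a toy three-response preference game. It is not INPO itself and is not taken from the paper: the preference matrix `P`, the step size `eta`, the step count, and every function name are assumptions chosen for illustration. The preferences are cyclic, so no per-response reward under a Bradley-Terry model can reproduce them, yet the time-averaged self-play policy still approaches a Nash policy, against which no single response wins much more than half the time.

```python
import math

# Hypothetical toy preferences: P[i][j] = probability response i is preferred to j.
# They are cyclic (0 beats 1, 1 beats 2, 2 beats 0), so no scalar reward per
# response (Bradley-Terry) can reproduce them.
P = [
    [0.5, 0.9, 0.1],
    [0.1, 0.5, 0.9],
    [0.9, 0.1, 0.5],
]

def win_rates(policy):
    """Expected win rate of each pure response against an opponent drawn from `policy`."""
    return [sum(p_ij * q for p_ij, q in zip(row, policy)) for row in P]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def self_play(initial_policy, steps=5000, eta=0.05):
    """Multiplicative-weights self-play; returns the time-averaged policy."""
    logits = [math.log(p) for p in initial_policy]
    avg = [0.0] * len(initial_policy)
    for _ in range(steps):
        policy = softmax(logits)
        avg = [a + p / steps for a, p in zip(avg, policy)]
        # Up-weight responses that beat the current policy more often.
        gains = win_rates(policy)
        logits = [l + eta * g for l, g in zip(logits, gains)]
    return avg

avg_policy = self_play([0.6, 0.3, 0.1])
# How much better than 50% the best single response does against the average policy.
exploitability = max(win_rates(avg_policy)) - 0.5
```

In this toy game the Nash policy is uniform over the three responses, so `avg_policy` should end up close to one third each and `exploitability` close to zero. The sketch computes exact win rates from a known matrix, which is only possible in a tabular toy setting; according to the abstract, INPO instead minimizes a loss directly over a preference dataset and avoids estimating per-response expected win rates.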

Authors (9)
  1. Yuheng Zhang
  2. Dian Yu
  3. Baolin Peng
  4. Linfeng Song
  5. Ye Tian
  6. Mingyue Huo
  7. Nan Jiang
  8. Haitao Mi
  9. Dong Yu