
Exploration-Driven Policy Optimization in RLHF: Theoretical Insights on Efficient Data Utilization (2402.10342v2)

Published 15 Feb 2024 in cs.LG, cs.AI, and cs.CL

Abstract: Reinforcement Learning from Human Feedback (RLHF) has achieved impressive empirical success while relying on only a small amount of human feedback, yet there is limited theoretical justification for this phenomenon. Moreover, most recent theoretical studies focus on value-based algorithms despite the empirical successes of policy-based algorithms. In this work, we consider an RLHF algorithm based on policy optimization (PO-RLHF). The algorithm builds on the popular Policy Cover-Policy Gradient (PC-PG) algorithm, which assumes knowledge of the reward function. In PO-RLHF, knowledge of the reward function is not assumed; instead, the algorithm uses trajectory-based comparison feedback to infer the reward function. We provide performance bounds for PO-RLHF with low query complexity, offering insight into why a small amount of human feedback may be sufficient to achieve good performance with RLHF. A key novelty is a trajectory-level elliptical potential analysis, which bounds the reward estimation error when comparison feedback (rather than numerical reward observation) is given. We provide and analyze the algorithms PG-RLHF and NN-PG-RLHF for two settings: linear and neural function approximation, respectively.
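For intuition, here is a minimal, hypothetical sketch (not the paper's implementation) of the kind of reward inference the abstract describes: fitting a linear reward r(s, a) = ⟨θ, φ(s, a)⟩ from trajectory-comparison feedback via maximum likelihood under a Bradley-Terry preference model. All dimensions, feature maps, and variable names below are invented for illustration.

```python
# Hypothetical sketch: Bradley-Terry maximum-likelihood estimation of a
# linear reward from trajectory-comparison feedback. Not the paper's
# algorithm; dimensions and the synthetic feature model are assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, n_pairs, horizon = 8, 200, 10

theta_true = rng.normal(size=d)            # unknown reward parameter
theta_true /= np.linalg.norm(theta_true)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def trajectory_features():
    """Sum of per-step features phi(s_h, a_h) over one synthetic trajectory."""
    return rng.normal(size=(horizon, d)).sum(axis=0)

# Comparison data: for each pair, the annotator prefers trajectory 1 over
# trajectory 2 with probability sigmoid(r(tau1) - r(tau2)).
X1 = np.stack([trajectory_features() for _ in range(n_pairs)])
X2 = np.stack([trajectory_features() for _ in range(n_pairs)])
prefs = rng.random(n_pairs) < sigmoid((X1 - X2) @ theta_true)

# MLE of theta by gradient ascent on the Bradley-Terry log-likelihood,
# i.e. logistic regression on trajectory feature differences.
diff = X1 - X2
y = prefs.astype(float)
theta, lr = np.zeros(d), 0.05
for _ in range(2000):
    p = sigmoid(diff @ theta)
    theta += lr * diff.T @ (y - p) / n_pairs  # gradient of mean log-likelihood

cos = theta @ theta_true / (np.linalg.norm(theta) * np.linalg.norm(theta_true))
print(f"cosine similarity to true theta: {cos:.3f}")
```

With enough comparison pairs, the estimated direction of θ aligns closely with the true reward parameter; controlling this estimation error when only comparisons (not numerical rewards) are observed is what the paper's trajectory-level elliptical potential analysis quantifies.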

Authors (5)
  1. Yihan Du (17 papers)
  2. Anna Winnicki (9 papers)
  3. Gal Dalal (30 papers)
  4. Shie Mannor (228 papers)
  5. R. Srikant (90 papers)
Citations (7)