
Sample Efficient Reinforcement Learning from Human Feedback via Active Exploration (2312.00267v1)

Published 1 Dec 2023 in cs.LG, cs.AI, and stat.ML

Abstract: Preference-based feedback is important for many applications in reinforcement learning where direct evaluation of a reward function is not feasible. A notable recent example arises in reinforcement learning from human feedback (RLHF) on LLMs. For many applications of RLHF, the cost of acquiring the human feedback can be substantial. In this work, we take advantage of the fact that one can often choose contexts at which to obtain human feedback in order to most efficiently identify a good policy, and formalize this as an offline contextual dueling bandit problem. We give an upper-confidence-bound style algorithm for this problem and prove a polynomial worst-case regret bound. We then provide empirical confirmation in a synthetic setting that our approach outperforms existing methods. Finally, we extend the setting and methodology for practical use in RLHF training of LLMs. Here, our method is able to reach better performance with fewer samples of human preferences than multiple baselines on three real-world datasets.
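
Since the abstract only describes the approach at a high level, the following is a minimal, hedged sketch of what a UCB-style active-context selection loop for an offline contextual dueling bandit could look like. Everything in it (the concatenation feature map phi, the Bradley-Terry preference oracle, the least-squares update, and the confidence scale beta) is an illustrative assumption, not the authors' actual algorithm.

```python
import numpy as np

# Sketch: actively choose which context (and action pair) to query for
# human preference feedback by picking the comparison whose estimated
# preference gap has the widest confidence interval under a linear model.
# All modeling choices below are assumptions for illustration only.

rng = np.random.default_rng(0)
n_contexts, n_actions, d_x, d_a = 20, 5, 4, 4
d = d_x + d_a

contexts = rng.normal(size=(n_contexts, d_x))   # pool of queryable contexts
actions = rng.normal(size=(n_actions, d_a))     # shared candidate actions
theta_star = rng.normal(size=d)                 # unknown "true" reward parameter

def phi(x, a):
    # Assumed joint feature map: concatenate context and action features.
    return np.concatenate([x, a])

def preference_oracle(x, a1, a2):
    # Bradley-Terry style feedback: True if a1 is preferred over a2.
    gap = phi(x, a1) @ theta_star - phi(x, a2) @ theta_star
    return rng.random() < 1.0 / (1.0 + np.exp(-gap))

lam, beta = 1.0, 2.0        # ridge regularizer and confidence scale (assumed)
V = lam * np.eye(d)         # regularized design matrix over feature differences
b = np.zeros(d)             # accumulated (feature difference * label) sum

for t in range(50):
    V_inv = np.linalg.inv(V)

    # Find the context and action pair with the widest confidence width
    # on the preference gap; that is the comparison we choose to label.
    best = None
    for i, x in enumerate(contexts):
        for j in range(n_actions):
            for k in range(j + 1, n_actions):
                z = phi(x, actions[j]) - phi(x, actions[k])
                width = beta * np.sqrt(z @ V_inv @ z)
                if best is None or width > best[0]:
                    best = (width, i, j, k, z)

    _, i, j, k, z = best
    label = 1.0 if preference_oracle(contexts[i], actions[j], actions[k]) else -1.0
    V += np.outer(z, z)
    b += label * z          # simple least-squares update on {-1, +1} labels

print("estimated reward parameter:", np.linalg.solve(V, b).round(2))
```

In this toy version, the "active exploration" is simply querying the most uncertain comparison; the paper's actual algorithm, regret bound, and RLHF extension should be taken from the paper itself.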

Authors (7)
  1. Viraj Mehta (12 papers)
  2. Vikramjeet Das (2 papers)
  3. Ojash Neopane (7 papers)
  4. Yijia Dai (5 papers)
  5. Ilija Bogunovic (44 papers)
  6. Jeff Schneider (99 papers)
  7. Willie Neiswanger (68 papers)
Citations (17)