BRAIn: Bayesian Reward-conditioned Amortized Inference for natural language generation from feedback (2402.02479v2)

Published 4 Feb 2024 in cs.LG, cs.AI, cs.CL, and cs.HC

Abstract: Distribution matching methods for LLM alignment such as Generation with Distributional Control (GDC) and Distributional Policy Gradient (DPG) have not received the same level of attention in reinforcement learning from human feedback (RLHF) as contrastive methods such as Sequence Likelihood Calibration (SLiC), Direct Preference Optimization (DPO) and its variants. We identify high variance of the gradient estimate as the primary reason for the lack of success of these methods and propose a self-normalized baseline to reduce the variance. We further generalize the target distribution in DPG, GDC and DPO by using Bayes' rule to define the reward-conditioned posterior. The resulting approach, referred to as BRAIn (Bayesian Reward-conditioned Amortized Inference), acts as a bridge between distribution matching methods and DPO and significantly outperforms prior art in summarization and Anthropic HH tasks.
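As a rough sketch of the idea (the notation here is illustrative, not quoted from the paper), the reward-conditioned posterior follows from Bayes' rule by treating the reference LLM as the prior and the reward as a likelihood term:

$$
p(y \mid x, \mathcal{G}) \;\propto\; p_{\mathrm{ref}}(y \mid x)\, p(\mathcal{G} \mid x, y),
$$

where $p_{\mathrm{ref}}$ is the base (reference) model and $p(\mathcal{G} \mid x, y)$ is an assumed likelihood that output $y$ is "good" under the reward signal. Amortized inference then trains the policy to match this posterior, which is where the self-normalized baseline for variance reduction comes in.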

Authors (9)
  1. Gaurav Pandey (51 papers)
  2. Yatin Nandwani (12 papers)
  3. Tahira Naseem (27 papers)
  4. Mayank Mishra (38 papers)
  5. Guangxuan Xu (13 papers)
  6. Dinesh Raghu (19 papers)
  7. Sachindra Joshi (32 papers)
  8. Asim Munawar (29 papers)
  9. Ramón Fernandez Astudillo (29 papers)
Citations (2)