Averaging log-likelihoods in direct alignment (2406.19188v1)

Published 27 Jun 2024 in cs.LG

Abstract: To better align LLMs with human judgment, Reinforcement Learning from Human Feedback (RLHF) learns a reward model and then optimizes it using regularized RL. Recently, direct alignment methods were introduced to learn such a fine-tuned model directly from a preference dataset without computing a proxy reward function. These methods are built upon contrastive losses involving the log-likelihood of (dis)preferred completions according to the trained model. However, completions have various lengths, and the log-likelihood is not length-invariant. On the other hand, the cross-entropy loss used in supervised training is length-invariant, as batches are typically averaged token-wise. To reconcile these approaches, we introduce a principled approach for making direct alignment length-invariant. Formally, we introduce a new averaging operator, to be composed with the optimality operator giving the best policy for the underlying RL problem. It translates into averaging the log-likelihood within the loss. We empirically study the effect of such averaging, observing a trade-off between the length of generations and their scores.
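The core idea, averaging the per-token log-likelihood of each completion inside a contrastive preference loss rather than summing it, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the DPO-style sigmoid loss, the function names, and the `beta` parameter are chosen for concreteness and are not taken from the paper.

```python
# Sketch: length-normalized (token-averaged) log-likelihoods in a
# DPO-style contrastive preference loss. Names and the specific loss
# form are illustrative assumptions, not the paper's exact method.
import torch
import torch.nn.functional as F

def sequence_logprob(logits, labels, mask, average=True):
    """Log-likelihood of `labels` under `logits`.

    logits: (batch, seq_len, vocab)
    labels: (batch, seq_len) token ids
    mask:   (batch, seq_len), 1 on completion tokens, 0 elsewhere
    Returns the summed log-likelihood, or the token-averaged
    (length-invariant) log-likelihood when `average=True`.
    """
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = torch.gather(logps, -1, labels.unsqueeze(-1)).squeeze(-1)
    total = (token_logps * mask).sum(dim=-1)
    if average:
        return total / mask.sum(dim=-1).clamp(min=1)  # averaged log-likelihood
    return total  # standard summed log-likelihood (length-dependent)

def preference_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Contrastive loss on (averaged) log-likelihood margins vs. a reference model."""
    margins = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    return -F.logsigmoid(margins).mean()
```

With `average=True`, longer completions no longer contribute larger log-likelihood magnitudes by construction, which is the length-invariance property the abstract contrasts with the token-averaged cross-entropy of supervised training.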

Authors (11)
  1. Nathan Grinsztajn (17 papers)
  2. Yannis Flet-Berliac (16 papers)
  3. Mohammad Gheshlaghi Azar (31 papers)
  4. Florian Strub (39 papers)
  5. Bill Wu (1 paper)
  6. Eugene Choi (9 papers)
  7. Chris Cremer (5 papers)
  8. Arash Ahmadian (18 papers)
  9. Yash Chandak (32 papers)
  10. Olivier Pietquin (90 papers)
  11. Matthieu Geist (93 papers)
Citations (1)
