From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function (2404.12358v2)

Published 18 Apr 2024 in cs.LG

Abstract: Reinforcement Learning From Human Feedback (RLHF) has been critical to the success of the latest generation of generative AI models. In response to the complex nature of the classical RLHF pipeline, direct alignment algorithms such as Direct Preference Optimization (DPO) have emerged as an alternative approach. Although DPO solves the same objective as the standard RLHF setup, there is a mismatch between the two approaches. Standard RLHF deploys reinforcement learning in a specific token-level MDP, while DPO is derived as a bandit problem in which the whole response of the model is treated as a single arm. In this work we rectify this difference. We theoretically show that we can derive DPO in the token-level MDP as a general inverse Q-learning algorithm, which satisfies the Bellman equation. Using our theoretical results, we provide three concrete empirical insights. First, we show that because of its token level interpretation, DPO is able to perform some type of credit assignment. Next, we prove that under the token level formulation, classical search-based algorithms, such as MCTS, which have recently been applied to the language generation space, are equivalent to likelihood-based search on a DPO policy. Empirically we show that a simple beam search yields meaningful improvement over the base DPO policy. Finally, we show how the choice of reference policy causes implicit rewards to decline during training. We conclude by discussing applications of our work, including information elicitation in multi-turn dialogue, reasoning, agentic applications and end-to-end training of multi-model systems.

Revisiting Token-Level Optimization in LLMs with Direct Preference Optimization

Introduction to DPO in Token-Level MDPs

Reinforcement Learning From Human Feedback (RLHF) remains a mainstay for aligning LLMs with human-defined objectives. Classical pipelines first fit an explicit reward model to human feedback and then optimize the policy against it with reinforcement learning, which adds considerable complexity. Direct Preference Optimization (DPO), a more recently formulated direct alignment method, simplifies the pipeline by bypassing the explicit reward-model stage. Standard RLHF applies reinforcement learning in a token-level MDP, optimizing value functions against a sparse reward delivered only once response generation is complete. DPO, by contrast, is derived from a single-decision-point perspective: each complete response is treated as one arm in a contextual-bandit problem. This mismatch raises theoretical questions about how, and whether, DPO can be interpreted at the token level.
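
For reference, the bandit-level DPO objective that the paper takes as its starting point can be written as follows (standard DPO notation: $x$ is the prompt, $y_w$ and $y_l$ the preferred and dispreferred responses, $\pi_{\mathrm{ref}}$ the reference policy, $\beta$ the KL-regularization strength, and $\sigma$ the logistic function):

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

The entire response $y$ is scored as one unit here, which is exactly the bandit framing the paper re-examines.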

DPO and Token-Level Derivations

A closer examination reveals that DPO can be derived directly within the token-level Markov Decision Process (MDP) setting traditionally used for training LLMs. Under this view, DPO is a general inverse Q-learning algorithm: the policy's per-token log-ratios against the reference policy act as implicit value estimates that satisfy the Bellman equation, so the reward signal expressed in human preferences is realized across the sequential, token-by-token decision process of language generation.
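
A compact way to see this connection, sketched here in standard soft-RL notation consistent with the paper's setup ($s_t$ is the prompt plus the tokens generated so far, $a_t$ the next token, $r$ the token-level reward, and $V^*$ the optimal soft value function of the KL-regularized token-level MDP): at the optimum, the per-token log-ratio satisfies $\beta \log \frac{\pi^*(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)} = r(s_t, a_t) + V^*(s_{t+1}) - V^*(s_t)$, and telescoping over a length-$T$ response (with $V^*$ equal to zero at the terminal state) gives

$$\sum_{t=0}^{T-1} \beta \log \frac{\pi^*(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)} \;=\; \sum_{t=0}^{T-1} r(s_t, a_t) \;-\; V^*(s_0).$$

Because $V^*(s_0)$ depends only on the prompt, it cancels in the Bradley-Terry comparison between two responses to the same prompt; the summed per-token log-ratio therefore plays the role of the sequence-level reward, and the bandit-style DPO loss carries over to the token-level MDP.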

Key Theoretical Insights

  • Token-Level Interpretation: When re-formulated for the sequential token-generation process, DPO performs a form of credit assignment at each token, attributing weight to individual decision points according to their contribution to the final outcome.
  • Equivalence to Search-Based Methods: Under the token-level framework, search-based algorithms such as Monte Carlo Tree Search (MCTS), which have recently been applied to language generation, become equivalent to likelihood-based search on a DPO policy: optimizing the implicit reward during decoding amounts to searching for high-likelihood decision paths token by token (see the beam-search sketch after this list).
  • Role of the Reference Policy: The choice of reference policy during training shapes the trajectory of implicit rewards. Understanding this relationship can guide training regimens that preserve or enhance the model's adherence to desired behavior.
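
The search connection referenced above is easy to exercise in practice. The sketch below is a minimal illustration using Hugging Face transformers; the checkpoint name is hypothetical, and the only assumption is that the model has already been fine-tuned with DPO, so that (per the paper's analysis) its summed log-likelihood tracks the summed implicit reward:

```python
# Minimal sketch: likelihood-based (beam) search over a DPO-trained policy.
# The checkpoint name is hypothetical; any DPO-fine-tuned causal LM would do.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "example-org/dpo-tuned-model"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "Summarize the following post:\n..."
inputs = tokenizer(prompt, return_tensors="pt")

# Beam search maximizes the summed log-probability of the DPO policy, which the
# paper's analysis ties to the summed implicit (per-token) reward.
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        num_beams=8,          # wider beams ~ deeper likelihood-based search
        max_new_tokens=128,
        early_stopping=True,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

The paper reports that a simple beam search of this kind yields meaningful improvement over standard decoding from the same DPO policy.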

Empirical Findings and Practical Applications

Applying these theoretical insights in practical scenarios, the paper confirms several key phenomena:

  • Credit Assignment and Policy Improvement: Empirical results validate that token-level DPO assigns credit to individual tokens in an interpretable way, and that a simple beam search over the DPO policy yields meaningful improvement over the base policy (a sketch of the per-token implicit-reward computation follows this list).
  • Dynamic Behavior of Implicit Rewards: The choice of reference policy significantly affects how implicit rewards evolve during training; in particular, implicit rewards tend to decline, underscoring the care required in setting up initial conditions for DPO.
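
To make the credit-assignment observation concrete, the per-token implicit reward $\beta\,(\log \pi_\theta(a_t \mid s_t) - \log \pi_{\mathrm{ref}}(a_t \mid s_t))$ can be computed directly from a policy/reference pair. The sketch below assumes two hypothetical checkpoints (a DPO-tuned policy and its reference model) and a $\beta$ matching the one used in training; it simply scores each response token:

```python
# Sketch: per-token implicit rewards beta * (log pi_theta - log pi_ref) for a response.
# Both checkpoint names are hypothetical; beta should match the value used in DPO training.
# Simplification: assumes tokenizing the prompt alone yields a prefix of tokenizing prompt+response.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

policy_name = "example-org/dpo-tuned-model"        # hypothetical
reference_name = "example-org/sft-reference-model" # hypothetical
tokenizer = AutoTokenizer.from_pretrained(policy_name)
policy = AutoModelForCausalLM.from_pretrained(policy_name).eval()
reference = AutoModelForCausalLM.from_pretrained(reference_name).eval()
beta = 0.1  # assumed KL-regularization strength

prompt = "Explain why the sky is blue."
response = " Sunlight scatters off air molecules, and blue light scatters the most."
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids

def token_logprobs(model, ids):
    # Log-probability of each token given its prefix (shift logits by one position).
    with torch.no_grad():
        logits = model(ids).logits[:, :-1, :]
    return F.log_softmax(logits, dim=-1).gather(-1, ids[:, 1:, None]).squeeze(-1)

lp_policy = token_logprobs(policy, full_ids)
lp_reference = token_logprobs(reference, full_ids)
start = prompt_ids.shape[1] - 1  # score only the response tokens
implicit_rewards = beta * (lp_policy - lp_reference)[0, start:]

for tok_id, reward in zip(full_ids[0, prompt_ids.shape[1]:], implicit_rewards):
    print(f"{tokenizer.decode([tok_id.item()])!r:>15}  {reward.item():+.3f}")
```

Inspecting these per-token values is one way to visualize which tokens the trained policy credits or penalizes within a single response.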

Future Directions and Speculations

This work invites further exploration of the granular controls and theoretical implications of direct preference optimization methods, and of their potential to streamline and enhance the training of LLMs for complex, nuanced tasks. Casting DPO in a token-level MDP framework bridges a critical gap between bandit-style preference learning and sequential decision making, and it reinforces the model's ability to engage in nuanced, contextually driven language generation.

Exploring the intersection of DPO with other RL techniques, and extending these ideas to multimodal settings or more complex interaction frameworks, could uncover further potential in generative model training. The continued evolution of LLMs and their training algorithms promises fertile ground for advancing AI-driven natural language understanding and generation.

Authors (4)
  1. Rafael Rafailov (37 papers)
  2. Joey Hejna (19 papers)
  3. Ryan Park (10 papers)
  4. Chelsea Finn (264 papers)
Citations (85)