Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model (2504.15843v2)

Published 22 Apr 2025 in cs.CL

Abstract: Direct Preference Optimization (DPO) simplifies reinforcement learning from human feedback (RLHF) for LLMs by directly optimizing human preferences without an explicit reward model. We find that during DPO training, the reference model plays the role of a data weight adjuster. However, the common practice of initializing the policy and reference models identically in DPO can lead to inefficient data utilization and impose a performance ceiling. Meanwhile, the lack of a reference model in Simple Preference Optimization (SimPO) reduces training robustness and necessitates stricter conditions to prevent catastrophic forgetting. In this work, we propose Pre-DPO, a simple yet effective DPO-based training paradigm that enhances preference optimization performance by leveraging a guiding reference model. This reference model provides foresight into the optimal policy state achievable through the training preference data, serving as a guiding mechanism that adaptively assigns higher weights to samples more suitable for the model and lower weights to those less suitable. Extensive experiments on AlpacaEval 2.0 and Arena-Hard v0.1 benchmarks demonstrate that Pre-DPO consistently improves the performance of both DPO and SimPO, without relying on external models or additional data.
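
To make the abstract's point about the reference model acting as a "data weight adjuster" concrete, here is a minimal sketch of the standard DPO loss and of how a guiding reference model would plug into it. This is not the authors' code; the function names, the per-response log-probability inputs, and the described Pre-DPO setup are illustrative assumptions inferred from the abstract.

```python
# Minimal sketch (assumptions, not the paper's implementation): the standard
# DPO loss written so the reference model's per-sample weighting role is visible.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss over summed per-response log-probabilities."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    logits = beta * (policy_margin - ref_margin)
    # Each pair's gradient is scaled by sigmoid(-logits), so the reference
    # margin effectively re-weights how much that preference pair contributes.
    return -F.logsigmoid(logits).mean()

# Pre-DPO, as described in the abstract (hypothetical sketch): rather than
# initializing the reference model identically to the policy, first optimize a
# policy on the preference data (e.g., with DPO or SimPO), then reuse that
# trained policy as the *guiding* reference and re-train the original policy
# against it with the same loss above.
```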

Authors (5)
  1. Junshu Pan (4 papers)
  2. Wei Shen (181 papers)
  3. Shulin Huang (12 papers)
  4. Qiji Zhou (8 papers)
  5. Yue Zhang (620 papers)