2D-DPO: Scaling Direct Preference Optimization with 2-Dimensional Supervision (2410.19720v1)
Abstract: Thanks to its simplicity and effectiveness, Direct Preference Optimization (DPO) has significantly advanced the alignment of large language models (LLMs) with human preferences. However, existing methods typically optimize a scalar score or ranking reward, thereby overlooking the multi-dimensional nature of human preferences. In this work, we propose to extend the preference signal used by DPO to two dimensions: segments and aspects. We first introduce a 2D supervision dataset called HelpSteer-2D. For the segment dimension, we divide each response into sentences and assign a score to each segment. For the aspect dimension, we meticulously design several criteria covering response-quality rubrics. Using these two-dimensional signals as feedback, we develop a 2D-DPO framework that decomposes the overall objective into multi-segment and multi-aspect objectives. Extensive experiments on popular benchmarks demonstrate that 2D-DPO outperforms methods that optimize scalar or one-dimensional preferences.
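The abstract describes decomposing the overall DPO objective into multi-segment and multi-aspect objectives. Below is a minimal, hedged sketch of how such a decomposition could look in code, assuming the 2D supervision arrives as a per-segment, per-aspect score matrix that weights segment-level log-probability ratios inside a DPO-style sigmoid loss. The function name, tensor shapes, the `aspect_weights` parameter, and the weighted aggregation are illustrative assumptions, not the paper's exact objective.

```python
# Illustrative sketch only: a segment- and aspect-weighted DPO-style loss.
import torch
import torch.nn.functional as F

def two_d_dpo_loss(
    policy_logps_w, ref_logps_w,   # (num_segments,) summed log-probs per segment, chosen response
    policy_logps_l, ref_logps_l,   # (num_segments,) same for the rejected response
    scores_w, scores_l,            # (num_segments, num_aspects) 2D supervision scores in [0, 1]
    aspect_weights,                # (num_aspects,) assumed relative importance of each aspect
    beta: float = 0.1,
):
    """Hypothetical 2D-DPO-style loss: per-segment log-ratios are weighted by
    aspect-aggregated segment scores before entering the Bradley-Terry term."""
    # Collapse the aspect dimension into a single weight per segment.
    w_chosen = (scores_w * aspect_weights).sum(dim=-1)     # (num_segments,)
    w_rejected = (scores_l * aspect_weights).sum(dim=-1)

    # Segment-level implicit rewards, as in token/segment-level DPO variants.
    ratio_w = policy_logps_w - ref_logps_w
    ratio_l = policy_logps_l - ref_logps_l

    # Weighted aggregation over segments: higher-scored chosen segments and
    # lower-scored rejected segments contribute more to the preference margin.
    margin = beta * ((w_chosen * ratio_w).sum() - (w_rejected * ratio_l).sum())
    return -F.logsigmoid(margin)

# Toy usage with random tensors, just to show the expected shapes.
torch.manual_seed(0)
S, A = 4, 3   # 4 segments (sentences), 3 aspects
loss = two_d_dpo_loss(
    policy_logps_w=torch.randn(S), ref_logps_w=torch.randn(S),
    policy_logps_l=torch.randn(S), ref_logps_l=torch.randn(S),
    scores_w=torch.rand(S, A), scores_l=torch.rand(S, A),
    aspect_weights=torch.full((A,), 1.0 / A),
)
print(loss.item())
```

The key design point the sketch tries to convey is that supervision is no longer a single scalar per response pair: each sentence contributes its own log-ratio, scaled by how the 2D scores rate it across aspects.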