Token-level Direct Preference Optimization (2404.11999v5)

Published 18 Apr 2024 in cs.CL and cs.AI

Abstract: Fine-tuning pre-trained LLMs is essential to align them with human values and intentions. This process often utilizes methods like pairwise comparisons and KL divergence against a reference LLM, focusing on the evaluation of full answers generated by the models. However, the generation of these responses occurs at the token level, following a sequential, auto-regressive fashion. In this paper, we introduce Token-level Direct Preference Optimization (TDPO), a novel approach to align LLMs with human preferences by optimizing policy at the token level. Unlike previous methods, which face challenges in divergence efficiency, TDPO incorporates forward KL divergence constraints for each token, improving alignment and diversity. Utilizing the Bradley-Terry model for a token-based reward system, TDPO enhances the regulation of KL divergence, while preserving simplicity without the need for explicit reward modeling. Experimental results across various text tasks demonstrate TDPO's superior performance in balancing alignment with generation diversity. Notably, fine-tuning with TDPO strikes a better balance than DPO in the controlled sentiment generation and single-turn dialogue datasets, and significantly improves the quality of generated responses compared to both DPO and PPO-based RLHF methods. Our code is open-sourced at https://github.com/Vance0124/Token-level-Direct-Preference-Optimization.

Token-level Direct Preference Optimization: Enhancing LLM Alignment with Human Preferences

Introduction

LLMs have become central to contemporary AI research due to their ability to generalize across a wide range of textual tasks. Established alignment methods such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have made significant strides in aligning these models with human preferences. However, these approaches constrain divergence from the reference model at the level of whole responses and offer little control over token-level generation diversity. This paper introduces Token-level Direct Preference Optimization (TDPO), an approach that optimizes the policy at the token level to improve alignment while managing generative diversity more effectively.
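To make the contrast concrete, the KL-constrained objective underlying RLHF (and, implicitly, DPO) regularizes the policy π_θ toward a reference model π_ref with a single divergence term per response, whereas a token-level view accumulates divergence position by position. The notation below (y^{<t} for the response prefix before position t) is standard, but the paper's own definitions should be taken as authoritative.

```latex
% Sequence-level KL-constrained objective underlying RLHF and DPO:
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big]
\;-\; \beta\, D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot \mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)

% Token-level view: divergence accumulated over positions t of the response y:
D_{\mathrm{SeqKL}}(x, y;\, \pi_1 \,\|\, \pi_2)
\;=\; \sum_{t=1}^{|y|} D_{\mathrm{KL}}\!\big(\pi_1(\cdot \mid x, y^{<t})\,\big\|\,\pi_2(\cdot \mid x, y^{<t})\big)
```

As the abstract notes, TDPO applies this per-token constraint in the forward direction, with the reference model as the first argument of each KL term.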

Advances in Direct Preference Optimization

TDPO offers a conceptual leap from traditional DPO by shifting the focus from sentence-level evaluation to token-level optimization. This method integrates forward KL divergence constraints at the token level, allowing for finer control over the model's output, aligning more closely with human preferences, and preserving generative diversity. This paper leverages the Bradley-Terry model for token-based preference assessment, enhancing traditional methods without the necessity for explicit reward model training or policy sampling during training.
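As a rough illustration of how such an objective can be computed, the PyTorch sketch below pairs the familiar DPO log-ratio margin with a sequential forward-KL offset between the rejected and chosen responses. This is a minimal sketch under stated assumptions, not the authors' released code: the function name, tensor shapes, the weighting coefficient alpha, and the stop-gradient placement on the chosen-response term are all illustrative, and the open-sourced repository linked in the abstract is the reference implementation.

```python
import torch
import torch.nn.functional as F

def tdpo_style_loss(policy_logits_w, policy_logits_l,
                    ref_logits_w, ref_logits_l,
                    labels_w, labels_l, mask_w, mask_l,
                    beta=0.1, alpha=0.5):
    """Illustrative token-level DPO-style loss (not the authors' exact code).

    *_logits: [B, T, V] logits, already aligned so position t predicts labels[:, t].
    labels_*: [B, T] response token ids; mask_*: [B, T], 1 on response positions.
    Suffix w = chosen (preferred) response, l = rejected response.
    """
    def gather_logps(logits, labels):
        # Log-probability of each target token under the given model.
        logps = F.log_softmax(logits, dim=-1)
        return torch.gather(logps, -1, labels.unsqueeze(-1)).squeeze(-1)  # [B, T]

    def seq_kl(ref_logits, policy_logits, mask):
        # Forward KL per position, KL(pi_ref || pi_theta), summed over the response.
        ref_lp = F.log_softmax(ref_logits, dim=-1)
        pol_lp = F.log_softmax(policy_logits, dim=-1)
        kl = (ref_lp.exp() * (ref_lp - pol_lp)).sum(-1)  # [B, T]
        return (kl * mask).sum(-1)                       # [B]

    # DPO-style log-ratio margin between chosen and rejected responses.
    logratio_w = ((gather_logps(policy_logits_w, labels_w)
                   - gather_logps(ref_logits_w, labels_w)) * mask_w).sum(-1)
    logratio_l = ((gather_logps(policy_logits_l, labels_l)
                   - gather_logps(ref_logits_l, labels_l)) * mask_l).sum(-1)
    margin = beta * (logratio_w - logratio_l)

    # Sequential-KL offset: penalize growing divergence on the rejected response
    # relative to the chosen one (chosen term stop-gradiented, an assumption here).
    kl_w = seq_kl(ref_logits_w, policy_logits_w, mask_w)
    kl_l = seq_kl(ref_logits_l, policy_logits_l, mask_l)
    delta = beta * (kl_l - kl_w.detach())

    return -F.logsigmoid(margin - alpha * delta).mean()
```

In use, the policy and a frozen reference model would be run teacher-forced over each preference pair's chosen and rejected responses, and the resulting logits, labels, and response masks passed to this function at every training step.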

Experimental Framework and Results

The authors conducted extensive experiments to validate TDPO's effectiveness across different datasets, including IMDb for controlled sentiment generation and Anthropic HH for single-turn dialogue. The results show that TDPO outperforms both DPO and PPO-based RLHF methods in response quality. Specifically, TDPO achieved a better balance between alignment and diversity, tighter control over KL divergence, and superior divergence efficiency.

Key insights include:

  • Alignment and Diversity: TDPO significantly improves the balance between model alignment with human preferences and generative diversity compared to existing methods.
  • KL Divergence Control: By optimizing at the token level, TDPO provides more nuanced control of KL divergence, leading to more stable and consistent model performance across different textual tasks (a small diagnostic sketch follows this list).
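The KL-control claim suggests a simple diagnostic that can be tracked during or after fine-tuning: the sequential forward KL to the reference model and the policy's per-token entropy on held-out responses, the latter serving as a rough proxy for generation diversity. The utility below is an illustrative sketch, not the paper's evaluation code; tensor names and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def divergence_diagnostics(policy_logits, ref_logits, response_mask):
    """Illustrative alignment/diversity diagnostics.

    policy_logits, ref_logits: [B, T, V] logits over the same token positions;
    response_mask: [B, T], 1 on generated-response positions.
    Returns (sequential forward KL summed over the response, mean token entropy).
    """
    pol_lp = F.log_softmax(policy_logits, dim=-1)
    ref_lp = F.log_softmax(ref_logits, dim=-1)

    # Forward KL per position: KL(pi_ref || pi_theta).
    kl = (ref_lp.exp() * (ref_lp - pol_lp)).sum(-1)            # [B, T]
    seq_kl = (kl * response_mask).sum(-1)                      # [B]

    # Policy entropy per position as a simple diversity signal.
    entropy = -(pol_lp.exp() * pol_lp).sum(-1)                 # [B, T]
    mean_entropy = (entropy * response_mask).sum(-1) / response_mask.sum(-1).clamp(min=1)

    return seq_kl, mean_entropy
```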

Implications and Future Directions

The introduction of TDPO marks a pivotal development in the training of LLMs. Looking forward, this method opens new avenues for research into fine-tuning LLMs at a more granular level. Future work may explore the potential of token-level optimization in other aspects of language modeling, such as reducing toxicity or bias in generated texts. Additionally, further research could extend this approach to other forms of media, like generative models for audio or video, where fine-grained control over generative processes is crucial.

Conclusion

Token-level Direct Preference Optimization represents a significant refinement over sentence-level optimization techniques used in LLMs. By effectively addressing the challenges of divergence efficiency and maintaining a balance between alignment and diversity, TDPO sets a new standard for the development of human-aligned AI systems. The method's ability to fine-tune generative attributes at a token level will likely influence future LLM research and applications, making it a cornerstone in the ongoing evolution of machine learning models.

Authors (6)
  1. Yongcheng Zeng (5 papers)
  2. Guoqing Liu (42 papers)
  3. Weiyu Ma (5 papers)
  4. Ning Yang (49 papers)
  5. Haifeng Zhang (58 papers)
  6. Jun Wang (990 papers)