
$i$REPO: $i$mplicit Reward Pairwise Difference based Empirical Preference Optimization (2405.15230v2)

Published 24 May 2024 in cs.AI and cs.LG

Abstract: While astonishingly capable, Large Language Models (LLMs) can sometimes produce outputs that deviate from human expectations. Such deviations necessitate an alignment phase to prevent disseminating untruthful, toxic, or biased information. Traditional alignment methods based on reinforcement learning often struggle with training instability, whereas preference optimization methods are limited by their overfitting to pre-collected hard-label datasets. In this paper, we propose a novel LLM alignment framework named $i$REPO, which utilizes implicit Reward pairwise difference regression for Empirical Preference Optimization. In particular, $i$REPO employs self-generated datasets labeled by empirical human (or AI annotator) preferences to iteratively refine the aligned policy through a novel regression-based loss function. Furthermore, we introduce an innovative algorithm backed by theoretical guarantees for achieving optimal results under ideal assumptions and providing a practical performance-gap result without such assumptions. Experimental results with Phi-2 and Mistral-7B demonstrate that $i$REPO effectively achieves self-alignment using soft-label, self-generated responses and the logits of empirical AI annotators. Moreover, our approach surpasses preference optimization baselines in evaluations with the LLM Evaluation Harness and multi-turn benchmarks.

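The abstract is the only method description on this page, so the following is a minimal, hedged sketch of what a regression-based loss over implicit reward pairwise differences could look like. It assumes a DPO-style implicit reward, $r(x, y) = \beta\,(\log \pi_\theta(y \mid x) - \log \pi_{\text{ref}}(y \mid x))$, and uses the logit of the annotator's soft preference as the regression target; the names `irepo_style_loss`, `soft_pref_a`, and `beta` are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F


def irepo_style_loss(policy_logps_a: torch.Tensor,
                     policy_logps_b: torch.Tensor,
                     ref_logps_a: torch.Tensor,
                     ref_logps_b: torch.Tensor,
                     soft_pref_a: torch.Tensor,
                     beta: float = 0.1) -> torch.Tensor:
    """Regress an implicit-reward pairwise difference onto a soft empirical
    preference target. A sketch of the idea, not the paper's exact objective."""
    # Implicit rewards in the DPO sense: scaled log-ratio of policy to reference.
    reward_a = beta * (policy_logps_a - ref_logps_a)
    reward_b = beta * (policy_logps_b - ref_logps_b)

    # Pairwise implicit reward difference predicted by the current policy.
    pred_diff = reward_a - reward_b

    # Empirical target: the logit of the soft preference label, i.e. the reward
    # gap a Bradley-Terry model would assign to that preference probability.
    p = soft_pref_a.clamp(1e-4, 1 - 1e-4)
    target_diff = torch.log(p) - torch.log1p(-p)

    # Squared-error regression toward the empirical difference.
    return F.mse_loss(pred_diff, target_diff)


# Toy usage with sequence-level log-probabilities for a batch of 4 response pairs.
if __name__ == "__main__":
    torch.manual_seed(0)
    policy_a, policy_b = -torch.rand(4) * 10, -torch.rand(4) * 10
    ref_a, ref_b = -torch.rand(4) * 10, -torch.rand(4) * 10
    soft_labels = torch.tensor([0.9, 0.6, 0.5, 0.2])  # annotator P(a preferred over b)
    print(irepo_style_loss(policy_a, policy_b, ref_a, ref_b, soft_labels))
```

In this reading, the soft label replaces the hard binary preference used by DPO-style objectives: a pair of responses the annotator judges nearly equivalent (p ≈ 0.5) pulls the implicit reward gap toward zero instead of forcing a winner.
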
Authors (5)
  1. Long Tan Le (7 papers)
  2. Han Shu (14 papers)
  3. Tung-Anh Nguyen (6 papers)
  4. Choong Seon Hong (165 papers)
  5. Nguyen H. Tran (45 papers)