
Don't Forget Your Reward Values: Language Model Alignment via Value-based Calibration (2402.16030v1)

Published 25 Feb 2024 in cs.CL and cs.AI

Abstract: While Reinforcement Learning from Human Feedback (RLHF) significantly enhances the generation quality of LLMs, recent studies have raised concerns regarding the complexity and instability associated with the Proximal Policy Optimization (PPO) algorithm, proposing a series of order-based calibration methods as viable alternatives. This paper delves further into current order-based methods, examining their inefficiencies in utilizing reward values and addressing misalignment issues. Building upon these findings, we propose a novel Value-based CaliBration (VCB) method to better align LLMs with human preferences. Experimental results demonstrate that VCB surpasses existing alignment methods on AI assistant and summarization datasets, providing impressive generalizability, robustness, and stability in diverse settings.
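To make the order-based vs. value-based distinction in the abstract concrete, below is a minimal sketch in Python. The `order_based_loss` is a simplified DPO-style pairwise objective (reference-model terms omitted), which depends only on which response is preferred; the `value_based_loss` is an illustrative stand-in that also uses the reward magnitudes, calibrating the policy's margin to the reward gap. It is an assumption for intuition only, not the paper's actual VCB formulation.

```python
# Hypothetical illustration of order-based vs. value-based preference objectives.
# Not the paper's exact VCB objective; names and formulas here are assumptions.
import torch
import torch.nn.functional as F


def order_based_loss(logp_w, logp_l, beta=0.1):
    """Simplified DPO-style pairwise loss (reference-model terms omitted).

    Uses only the *ordering* of the two responses: any positive margin is
    rewarded, regardless of how much better the chosen response actually is.
    """
    return -F.logsigmoid(beta * (logp_w - logp_l)).mean()


def value_based_loss(logp_w, logp_l, reward_w, reward_l, beta=0.1):
    """Illustrative value-calibrated loss (assumed stand-in, not the paper's VCB).

    The policy's implied preference margin is pushed toward the reward *gap*,
    so pairs with larger reward differences demand larger margins.
    """
    reward_gap = reward_w - reward_l            # reward values, not just ranks
    margin = beta * (logp_w - logp_l)           # policy's implied preference margin
    return ((margin - reward_gap) ** 2).mean()  # calibrate margin to reward gap


if __name__ == "__main__":
    # Toy sequence log-probs and reward values for two (chosen, rejected) pairs.
    logp_w = torch.tensor([-12.0, -15.0])
    logp_l = torch.tensor([-14.0, -15.5])
    reward_w = torch.tensor([0.9, 0.6])
    reward_l = torch.tensor([0.1, 0.5])
    print("order-based loss:", order_based_loss(logp_w, logp_l).item())
    print("value-based loss:", value_based_loss(logp_w, logp_l, reward_w, reward_l).item())
```

Note that in the order-based loss the second pair (reward gap 0.1) contributes the same way as the first (reward gap 0.8) once the ordering is fixed, which is exactly the reward-value information the abstract says order-based methods fail to exploit.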
