Preference Ranking Optimization for Human Alignment (2306.17492v2)

Published 30 Jun 2023 in cs.CL and cs.AI

Abstract: LLMs often contain misleading content, emphasizing the need to align them with human values to ensure secure AI systems. Reinforcement learning from human feedback (RLHF) has been employed to achieve this alignment. However, it encompasses two main drawbacks: (1) RLHF exhibits complexity, instability, and sensitivity to hyperparameters in contrast to SFT. (2) Despite massive trial-and-error, multiple sampling is reduced to pair-wise contrast, thus lacking contrasts from a macro perspective. In this paper, we propose Preference Ranking Optimization (PRO) as an efficient SFT algorithm to directly fine-tune LLMs for human alignment. PRO extends the pair-wise contrast to accommodate preference rankings of any length. By iteratively contrasting candidates, PRO instructs the LLM to prioritize the best response while progressively ranking the rest responses. In this manner, PRO effectively transforms human alignment into aligning the probability ranking of n responses generated by LLM with the preference ranking of humans towards these responses. Experiments have shown that PRO outperforms baseline algorithms, achieving comparable results to ChatGPT and human responses through automatic-based, reward-based, GPT-4, and human evaluations.

Preference Ranking Optimization for Human Alignment: A Comprehensive Review

The paper "Preference Ranking Optimization for Human Alignment" introduces Preference Ranking Optimization (PRO) as an innovative technique for aligning LLMs with human preferences, addressing issues inherent in existing frameworks such as Reinforcement Learning from Human Feedback (RLHF). The authors emphasize the need for more efficient and stable optimization methods to ensure LLMs produce outputs that align with human values and preferences, aiming to mitigate the complexities and instabilities associated with RLHF.

Key Contributions

The authors propose PRO as an efficient supervised fine-tuning (SFT) algorithm that directly optimizes over preference rankings derived from human feedback, generalizing human alignment from pair-wise contrasts to full ranking sequences. Instead of relying on the trial-and-error exploration typical of RLHF, PRO trains the LLM directly on rankings in which candidate responses have been ordered by human evaluators. Concretely, PRO extends the pair-wise contrast of the Bradley-Terry model to rankings of arbitrary length, so that more samples from the response space can inform alignment; the resulting objective is sketched below.
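
In schematic form, given a prompt $x$ and responses ranked $y^1 \succ y^2 \succ \dots \succ y^n$, the pair-wise Bradley-Terry comparison generalizes to a list-wise objective along the following lines (this is a condensed paraphrase rather than the paper's exact notation; $r_{\pi_\theta}(x, y)$ denotes the policy's length-normalized log-likelihood of response $y$):

$$
\mathcal{L}_{\mathrm{PRO}} \;=\; -\sum_{k=1}^{n-1} \log \frac{\exp\!\big(r_{\pi_\theta}(x, y^k)\big)}{\sum_{i=k}^{n} \exp\!\big(r_{\pi_\theta}(x, y^i)\big)},
\qquad
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{PRO}} + \beta\, \mathcal{L}_{\mathrm{SFT}},
$$

where $\mathcal{L}_{\mathrm{SFT}}$ is the standard negative log-likelihood on the top-ranked response and $\beta$ trades off ranking fidelity against generation quality.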

Methodology

PRO uses an iterative one-to-N contrast: at each step, the highest-ranked remaining response is treated as the positive example and all lower-ranked responses as negatives; the positive is then removed and the procedure repeats until the ranking is exhausted. By enlarging the candidate set, PRO exposes the LLM to more samples from the response space and helps it separate human-preferred features from those of the negative examples. The algorithm also incorporates a dynamic temperature that modulates how strongly each candidate pair is contrasted, based on how far apart their preference scores are; a minimal sketch of the resulting loss follows.
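
The following minimal PyTorch sketch illustrates this iterative contrast. It is not the authors' implementation: `policy_scores` are assumed to be length-normalized log-likelihoods of the candidates (already sorted best to worst), and the dynamic-temperature scaling shown here is only one plausible rendering of the idea that larger preference gaps should be contrasted more sharply.

```python
import torch

def pro_loss(policy_scores: torch.Tensor, reward_scores: torch.Tensor = None) -> torch.Tensor:
    """Illustrative PRO-style ranking loss.

    policy_scores: (n,) length-normalized log-likelihoods, sorted from most to
                   least preferred candidate.
    reward_scores: optional (n,) preference scores used for dynamic temperature.
    """
    n = policy_scores.shape[0]
    loss = policy_scores.new_zeros(())
    for k in range(n - 1):
        scores = policy_scores[k:]  # positive = rank k, negatives = ranks k+1..n-1
        if reward_scores is not None:
            # Assumed dynamic-temperature form: scale each negative by its preference
            # gap to the positive (and the positive by the largest gap), so clearly
            # worse candidates are pushed away harder.
            gaps = (reward_scores[k] - reward_scores[k + 1:]).clamp(min=1e-6)
            scale = torch.cat([gaps.max().unsqueeze(0), gaps])
            scores = scores * scale
        # Negative log-probability that the k-th candidate beats all lower-ranked ones.
        loss = loss - torch.log_softmax(scores, dim=0)[0]
    return loss / (n - 1)
```

In the full objective this ranking term would be added to a standard SFT negative log-likelihood on the top-ranked response, weighted by a hyperparameter as in the equation above.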

To further enhance PRO, the authors graft desirable properties of RLHF onto the ranking framework, such as cheaply extendable preference rankings and reward-guided differentiated contrasts, giving additional flexibility in the alignment process. Through a self-bootstrapping procedure, PRO can sample new candidates from the model being trained, extend the preference ranking with them, and thereby expose the model to a richer picture of human preferences; the sketch after this paragraph outlines the idea.
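
At a high level, self-bootstrapping can be pictured as the loop below. The function and class names are placeholders for illustration, not the authors' API; the key point is simply that freshly sampled responses are scored by a reward model and merged into the existing human ranking.

```python
def expand_ranking(prompt, ranked_responses, policy, reward_model, num_new=1):
    """Extend a human preference ranking with self-generated candidates (sketch)."""
    # Sample fresh candidates from the current policy.
    new_responses = [policy.generate(prompt) for _ in range(num_new)]
    # Score the pooled candidates with the reward model and re-sort best-to-worst,
    # so the new samples take their place in the preference ranking.
    pool = list(ranked_responses) + new_responses
    return sorted(pool, key=lambda resp: reward_model.score(prompt, resp), reverse=True)
```

The expanded ranking is then fed back into the same PRO loss, lengthening the ranking sequence without additional human annotation.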

Experimental Results

In evaluations comparing PRO against several baselines, including SFT, RLHF, and Chain of Hindsight (CoH), PRO demonstrated superior alignment with human preferences on the HH-RLHF benchmark, covering both its Helpful and Harmless subsets. Notably, PRO surpassed the baseline algorithms and achieved results comparable to ChatGPT and human-written responses under automatic, reward-based, GPT-4, and human evaluations.

Across the board, from zero-shot baselines to models fine-tuned from LLaMA-7B, PRO delivered substantial gains in reward scores while maintaining competitive BLEU scores. Further experiments extended the candidate ranking length from 2 to 5 using responses generated by LLMs such as Alpaca and ChatGPT; longer rankings yielded better results, underscoring that the diversity and quality of the added candidates drive the performance improvements.

Implications and Future Directions

The findings suggest that PRO offers a promising alternative to RLHF, providing a robust framework for aligning LLMs with human preferences without the associated complexity and instability. The ability to expand preference rankings dynamically, coupled with differentiated contrast, presents immense potential for refining LLMs' alignment with human values.

The paper opens avenues for future research in optimizing reward models and exploring the scaling of self-bootstrapping benefits across larger LLMs. Enhancing contextual understanding and developing finer-grained supervisory strategies could further bolster the efficacy and reliability of human-aligned LLMs.

Conclusion

Preference Ranking Optimization emerges as a significant methodological advancement in fine-tuning LLMs with human alignment, demonstrating proficiency in capturing human preferences while ensuring high-quality outputs. By leveraging extended rankings and dynamic contrast mechanisms, PRO positions itself as a foundational technique in the pursuit of secure and ethically aligned AI systems. The research poses intriguing possibilities for advancing AI alignment strategies and broadening the scope of human-centric AI development.

References (45)
  1. Rl4f: Generating natural language feedback with reinforcement learning for repairing model outputs. arXiv preprint arXiv:2305.08844.
  2. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
  3. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073.
  4. Ralph Allan Bradley and Milton E Terry. 1952. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345.
  5. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
  6. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  7. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712.
  8. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
  9. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
  10. GLM: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335, Dublin, Ireland. Association for Computational Linguistics.
  11. Scaling laws for reward model overoptimization. arXiv preprint arXiv:2210.10760.
  12. Reward design with language models. arXiv preprint arXiv:2303.00001.
  13. The bigscience roots corpus: A 1.6 tb composite multilingual dataset. Advances in Neural Information Processing Systems, 35:31809–31826.
  14. Pebble: Feedback-efficient interactive reinforcement learning via relabeling experience and unsupervised pre-training. arXiv preprint arXiv:2106.05091.
  15. Interacting with non-cooperative user: A new paradigm for proactive dialogue policy. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’22, page 212–222, New York, NY, USA. Association for Computing Machinery.
  16. Api-bank: A benchmark for tool-augmented llms. arXiv preprint arXiv:2304.08244.
  17. Chain of hindsight aligns language models with feedback. arXiv preprint arXiv:2302.02676.
  18. Interactive learning from policy-dependent human feedback. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2285–2294. PMLR.
  19. Scaling data-constrained language models. arXiv preprint arXiv:2305.16264.
  20. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.
  21. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
  22. OpenAI. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  23. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, volume 35, pages 27730–27744. Curran Associates, Inc.
  24. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
  25. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
  26. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277.
  27. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290.
  28. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  29. Offline rl for natural language generation with implicit language q learning. arXiv preprint arXiv:2206.11871.
  30. Kihyuk Sohn. 2016. Improved deep metric learning with multi-class n-pair loss objective. Advances in neural information processing systems, 29.
  31. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021.
  32. Learning to summarize with human feedback. In Advances in Neural Information Processing Systems, volume 33, pages 3008–3021. Curran Associates, Inc.
  33. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca.
  34. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  35. Large language models are not fair evaluators. arXiv preprint arXiv:2305.17926.
  36. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560.
  37. Deep tamer: Interactive agent shaping in high-dimensional state spaces. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
  38. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
  39. Fine-grained human feedback gives better rewards for language model training. arXiv preprint arXiv:2306.01693.
  40. Reinforcement learning from diverse human preferences. arXiv preprint arXiv:2301.11774.
  41. Rrhf: Rank responses to align language models with human feedback without tears. arXiv preprint arXiv:2304.05302.
  42. The wisdom of hindsight makes language models better instruction followers. arXiv preprint arXiv:2302.05206.
  43. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685.
  44. Lima: Less is more for alignment. arXiv preprint arXiv:2305.11206.
  45. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.
Authors (7)
  1. Feifan Song (14 papers)
  2. Bowen Yu (89 papers)
  3. Minghao Li (44 papers)
  4. Haiyang Yu (109 papers)
  5. Fei Huang (408 papers)
  6. Yongbin Li (128 papers)
  7. Houfeng Wang (43 papers)
Citations (197)