Preference Ranking Optimization for Human Alignment (2306.17492v2)

Published 30 Jun 2023 in cs.CL and cs.AI

Abstract: LLMs often contain misleading content, emphasizing the need to align them with human values to ensure secure AI systems. Reinforcement learning from human feedback (RLHF) has been employed to achieve this alignment. However, it encompasses two main drawbacks: (1) RLHF exhibits complexity, instability, and sensitivity to hyperparameters in contrast to SFT. (2) Despite massive trial-and-error, multiple sampling is reduced to pair-wise contrast, thus lacking contrasts from a macro perspective. In this paper, we propose Preference Ranking Optimization (PRO) as an efficient SFT algorithm to directly fine-tune LLMs for human alignment. PRO extends the pair-wise contrast to accommodate preference rankings of any length. By iteratively contrasting candidates, PRO instructs the LLM to prioritize the best response while progressively ranking the rest responses. In this manner, PRO effectively transforms human alignment into aligning the probability ranking of n responses generated by LLM with the preference ranking of humans towards these responses. Experiments have shown that PRO outperforms baseline algorithms, achieving comparable results to ChatGPT and human responses through automatic-based, reward-based, GPT-4, and human evaluations.

Preference Ranking Optimization for Human Alignment: A Comprehensive Review

The paper "Preference Ranking Optimization for Human Alignment" introduces Preference Ranking Optimization (PRO) as an innovative technique for aligning LLMs with human preferences, addressing issues inherent in existing frameworks such as Reinforcement Learning from Human Feedback (RLHF). The authors emphasize the need for more efficient and stable optimization methods to ensure LLMs produce outputs that align with human values and preferences, aiming to mitigate the complexities and instabilities associated with RLHF.

Key Contributions

The authors propose PRO as an efficient supervised fine-tuning (SFT) algorithm that directly optimizes over preference rankings derived from human feedback, generalizing human alignment from pair-wise contrasts to full ranking sequences. Instead of relying on the trial-and-error exploration typical of RLHF, PRO trains the LLM directly on rankings in which candidate responses have been ordered by human evaluators. Concretely, PRO extends the pair-wise contrast of the Bradley-Terry model to rankings of arbitrary length, so that more samples from the response space can inform alignment; the resulting objective is sketched below.
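
In schematic form, given a prompt $x$ and responses ranked $y^1 \succ y^2 \succ \dots \succ y^n$, the pair-wise Bradley-Terry comparison generalizes to a list-wise objective along the following lines (this is a condensed paraphrase rather than the paper's exact notation; $r_{\pi_\theta}(x, y)$ denotes the policy's length-normalized log-likelihood of response $y$):

$$
\mathcal{L}_{\mathrm{PRO}} \;=\; -\sum_{k=1}^{n-1} \log \frac{\exp\!\big(r_{\pi_\theta}(x, y^k)\big)}{\sum_{i=k}^{n} \exp\!\big(r_{\pi_\theta}(x, y^i)\big)},
\qquad
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{PRO}} + \beta\, \mathcal{L}_{\mathrm{SFT}},
$$

where $\mathcal{L}_{\mathrm{SFT}}$ is the standard negative log-likelihood on the top-ranked response and $\beta$ trades off ranking fidelity against generation quality.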

Methodology

PRO uses an iterative one-to-N contrast: at each step, the highest-ranked remaining response is treated as the positive example and all lower-ranked responses as negatives; the positive is then removed and the procedure repeats until the ranking is exhausted. By enlarging the candidate set, PRO exposes the LLM to more samples from the response space and helps it separate human-preferred features from those of the negative examples. The algorithm also incorporates a dynamic temperature that modulates how strongly each candidate pair is contrasted, based on how far apart their preference scores are; a minimal sketch of the resulting loss follows.
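
The following minimal PyTorch sketch illustrates this iterative contrast. It is not the authors' implementation: `policy_scores` are assumed to be length-normalized log-likelihoods of the candidates (already sorted best to worst), and the dynamic-temperature scaling shown here is only one plausible rendering of the idea that larger preference gaps should be contrasted more sharply.

```python
import torch

def pro_loss(policy_scores: torch.Tensor, reward_scores: torch.Tensor = None) -> torch.Tensor:
    """Illustrative PRO-style ranking loss.

    policy_scores: (n,) length-normalized log-likelihoods, sorted from most to
                   least preferred candidate.
    reward_scores: optional (n,) preference scores used for dynamic temperature.
    """
    n = policy_scores.shape[0]
    loss = policy_scores.new_zeros(())
    for k in range(n - 1):
        scores = policy_scores[k:]  # positive = rank k, negatives = ranks k+1..n-1
        if reward_scores is not None:
            # Assumed dynamic-temperature form: scale each negative by its preference
            # gap to the positive (and the positive by the largest gap), so clearly
            # worse candidates are pushed away harder.
            gaps = (reward_scores[k] - reward_scores[k + 1:]).clamp(min=1e-6)
            scale = torch.cat([gaps.max().unsqueeze(0), gaps])
            scores = scores * scale
        # Negative log-probability that the k-th candidate beats all lower-ranked ones.
        loss = loss - torch.log_softmax(scores, dim=0)[0]
    return loss / (n - 1)
```

In the full objective this ranking term would be added to a standard SFT negative log-likelihood on the top-ranked response, weighted by a hyperparameter as in the equation above.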

To further enhance PRO, the authors graft desirable properties of RLHF onto the ranking framework, such as cheaply extendable preference rankings and reward-guided differentiated contrasts, giving additional flexibility in the alignment process. Through a self-bootstrapping procedure, PRO can sample new candidates from the model being trained, extend the preference ranking with them, and thereby expose the model to a richer picture of human preferences; the sketch after this paragraph outlines the idea.
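
At a high level, self-bootstrapping can be pictured as the loop below. The function and class names are placeholders for illustration, not the authors' API; the key point is simply that freshly sampled responses are scored by a reward model and merged into the existing human ranking.

```python
def expand_ranking(prompt, ranked_responses, policy, reward_model, num_new=1):
    """Extend a human preference ranking with self-generated candidates (sketch)."""
    # Sample fresh candidates from the current policy.
    new_responses = [policy.generate(prompt) for _ in range(num_new)]
    # Score the pooled candidates with the reward model and re-sort best-to-worst,
    # so the new samples take their place in the preference ranking.
    pool = list(ranked_responses) + new_responses
    return sorted(pool, key=lambda resp: reward_model.score(prompt, resp), reverse=True)
```

The expanded ranking is then fed back into the same PRO loss, lengthening the ranking sequence without additional human annotation.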

Experimental Results

In evaluations comparing PRO against several baselines, including SFT, RLHF, and Chain of Hindsight (CoH), PRO demonstrated superior alignment with human preferences on the HH-RLHF benchmark, covering both its Helpful and Harmless subsets. Notably, PRO surpassed the baseline algorithms and achieved results comparable to ChatGPT and human-written responses under automatic, reward-based, GPT-4, and human evaluations.

Across the board, from zero-shot baselines to models fine-tuned from LLaMA-7B, PRO delivered substantial gains in reward scores while maintaining competitive BLEU scores. Further experiments extended the candidate ranking length from 2 to 5 using responses generated by LLMs such as Alpaca and ChatGPT; longer rankings yielded better results, underscoring that the diversity and quality of the added candidates drive the performance improvements.

Implications and Future Directions

The findings suggest that PRO offers a promising alternative to RLHF, providing a robust framework for aligning LLMs with human preferences without the associated complexity and instability. The ability to expand preference rankings dynamically, coupled with differentiated contrast, presents immense potential for refining LLMs' alignment with human values.

The paper opens avenues for future research in optimizing reward models and exploring the scaling of self-bootstrapping benefits across larger LLMs. Enhancing contextual understanding and developing finer-grained supervisory strategies could further bolster the efficacy and reliability of human-aligned LLMs.

Conclusion

Preference Ranking Optimization emerges as a significant methodological advancement in fine-tuning LLMs with human alignment, demonstrating proficiency in capturing human preferences while ensuring high-quality outputs. By leveraging extended rankings and dynamic contrast mechanisms, PRO positions itself as a foundational technique in the pursuit of secure and ethically aligned AI systems. The research poses intriguing possibilities for advancing AI alignment strategies and broadening the scope of human-centric AI development.

References (45)
  1. Rl4f: Generating natural language feedback with reinforcement learning for repairing model outputs. arXiv preprint arXiv:2305.08844.
  2. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
  3. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073.
  4. Ralph Allan Bradley and Milton E Terry. 1952. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345.
  5. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
  6. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  7. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712.
  8. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
  9. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
  10. GLM: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335, Dublin, Ireland. Association for Computational Linguistics.
  11. Scaling laws for reward model overoptimization. arXiv preprint arXiv:2210.10760.
  12. Reward design with language models. arXiv preprint arXiv:2303.00001.
  13. The bigscience roots corpus: A 1.6 tb composite multilingual dataset. Advances in Neural Information Processing Systems, 35:31809–31826.
  14. Pebble: Feedback-efficient interactive reinforcement learning via relabeling experience and unsupervised pre-training. arXiv preprint arXiv:2106.05091.
  15. Interacting with non-cooperative user: A new paradigm for proactive dialogue policy. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’22, page 212–222, New York, NY, USA. Association for Computing Machinery.
  16. Api-bank: A benchmark for tool-augmented llms. arXiv preprint arXiv:2304.08244.
  17. Chain of hindsight aligns language models with feedback. arXiv preprint arXiv:2302.02676.
  18. Interactive learning from policy-dependent human feedback. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2285–2294. PMLR.
  19. Scaling data-constrained language models. arXiv preprint arXiv:2305.16264.
  20. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.
  21. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
  22. OpenAI. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  23. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, volume 35, pages 27730–27744. Curran Associates, Inc.
  24. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
  25. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
  26. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277.
  27. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290.
  28. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  29. Offline rl for natural language generation with implicit language q learning. arXiv preprint arXiv:2206.11871.
  30. Kihyuk Sohn. 2016. Improved deep metric learning with multi-class n-pair loss objective. Advances in neural information processing systems, 29.
  31. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021.
  32. Learning to summarize with human feedback. In Advances in Neural Information Processing Systems, volume 33, pages 3008–3021. Curran Associates, Inc.
  33. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca.
  34. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  35. Large language models are not fair evaluators. arXiv preprint arXiv:2305.17926.
  36. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560.
  37. Deep tamer: Interactive agent shaping in high-dimensional state spaces. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
  38. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
  39. Fine-grained human feedback gives better rewards for language model training. arXiv preprint arXiv:2306.01693.
  40. Reinforcement learning from diverse human preferences. arXiv preprint arXiv:2301.11774.
  41. Rrhf: Rank responses to align language models with human feedback without tears. arXiv preprint arXiv:2304.05302.
  42. The wisdom of hindsight makes language models better instruction followers. arXiv preprint arXiv:2302.05206.
  43. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685.
  44. Lima: Less is more for alignment. arXiv preprint arXiv:2305.11206.
  45. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.
Authors (7)
  1. Feifan Song (14 papers)
  2. Bowen Yu (89 papers)
  3. Minghao Li (44 papers)
  4. Haiyang Yu (109 papers)
  5. Fei Huang (408 papers)
  6. Yongbin Li (128 papers)
  7. Houfeng Wang (43 papers)
Citations (197)