A Summary of "UltraFeedback: Boosting LLMs with High-quality Feedback"
The paper introduces UltraFeedback, a large, diversified preference dataset aimed at improving reinforcement learning from human feedback (RLHF) for LLMs. LLMs such as GPT-4 and ChatGPT have become increasingly capable of generating coherent text and handling diverse language tasks. However, because they are trained by likelihood maximization over their training data, their outputs can drift from human preferences, producing text that appears accurate but is incorrect or harmful.
Motivation and Objectives
Reinforcement learning from human feedback (RLHF) has been instrumental in aligning LLMs with human values, as demonstrated by major AI labs such as OpenAI and Anthropic. Despite these proprietary successes, a significant gap persists in the open-source community due to the lack of comprehensive, high-quality preference datasets. Existing datasets often fall short in scale, diversity, or annotation accuracy, hindering further research on and application of RLHF to open-source LLMs.
UltraFeedback Dataset
UltraFeedback addresses these challenges with a large, high-quality, diversified dataset designed to advance RLHF development. It compiles instructions from multiple sources and samples responses from a broad pool of models, yielding over 340k pieces of comparison data annotated with feedback from GPT-4. This approach ensures data diversity while maintaining high annotation quality across four key aspects: instruction-following, truthfulness, honesty, and helpfulness.
- Scale and Diversity: UltraFeedback is the largest dataset of its kind, featuring 64k instructions, each paired with four diverse model responses, for a total of 256k completions.
- Fine-grained Annotations: Each response is annotated by GPT-4 along the four aspects above, with both a numerical rating and a textual critique, providing high-quality data and a reliable basis for comparing responses (see the sketch after this list).
- Open Source and Extendable: Featuring a reproducible and expandable dataset generation pipeline, UltraFeedback is positioned as a foundational tool for researchers aiming to extend the frontier of RLHF.
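To make the annotation scheme concrete, the following is a minimal Python sketch of how per-aspect GPT-4 ratings could be aggregated into a (chosen, rejected) preference pair for downstream training. The record layout, field names, and averaging rule here are illustrative assumptions, not the dataset's actual schema or the paper's exact procedure.

```python
from statistics import mean

# Hypothetical per-response annotation records, mirroring the kind of
# fine-grained feedback UltraFeedback provides (field names are assumptions,
# not the actual dataset schema).
responses = [
    {
        "text": "Response A ...",
        "ratings": {"instruction_following": 4, "truthfulness": 5,
                    "honesty": 5, "helpfulness": 4},
        "critique": "Accurate but slightly verbose.",
    },
    {
        "text": "Response B ...",
        "ratings": {"instruction_following": 2, "truthfulness": 3,
                    "honesty": 4, "helpfulness": 2},
        "critique": "Misses part of the instruction.",
    },
]

def overall_score(resp):
    """Average the four fine-grained aspect ratings into one scalar."""
    return mean(resp["ratings"].values())

# Turn the scored completions into a single (chosen, rejected) preference
# pair, e.g. best vs. worst, as is commonly done for reward-model training.
ranked = sorted(responses, key=overall_score, reverse=True)
chosen, rejected = ranked[0], ranked[-1]
print(f"chosen: {chosen['text'][:20]!r}  rejected: {rejected['text'][:20]!r}")
```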
Model Training and Performance
The authors use UltraFeedback to train several models that showcase the dataset's effectiveness: the reward model UltraRM, the chat model UltraLM-13B-PPO, and the critique model UltraCM. Experiments indicate these models outperform existing open-source alternatives, achieving top results on benchmarks such as AlpacaEval, Evol-Instruct, and UltraChat.
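A reward model such as UltraRM is typically trained on preference pairs like the ones above with a pairwise ranking (Bradley-Terry style) objective, pushing the reward of the chosen response above that of the rejected one. The PyTorch sketch below illustrates only that objective; the toy model, shapes, and training details are assumptions, not the paper's actual architecture or hyperparameters.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(reward_model, chosen_ids, rejected_ids):
    """Bradley-Terry style loss: push r(chosen) above r(rejected).

    reward_model is assumed to map token-id tensors of shape
    (batch, seq_len) to scalar rewards of shape (batch,); this is a
    generic sketch, not UltraRM's actual implementation.
    """
    r_chosen = reward_model(chosen_ids)      # (batch,)
    r_rejected = reward_model(rejected_ids)  # (batch,)
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Tiny stand-in reward model so the sketch runs end to end.
class ToyRewardModel(torch.nn.Module):
    def __init__(self, vocab_size=1000, dim=32):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, dim)
        self.head = torch.nn.Linear(dim, 1)

    def forward(self, ids):
        pooled = self.embed(ids).mean(dim=1)   # mean-pool token embeddings
        return self.head(pooled).squeeze(-1)   # scalar reward per sequence

model = ToyRewardModel()
chosen = torch.randint(0, 1000, (4, 16))
rejected = torch.randint(0, 1000, (4, 16))
loss = pairwise_ranking_loss(model, chosen, rejected)
loss.backward()
print(loss.item())
```

Once trained, such a reward model can rerank candidate completions or supply the reward signal for PPO-style fine-tuning, which is the general recipe behind a chat model like UltraLM-13B-PPO.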
Implications and Future Directions
UltraFeedback offers a scalable, high-quality dataset tailored to improve the efficacy of RLHF for LLM alignment. Beyond facilitating further research, it serves as a catalyst for developing more human-aligned LLMs, which matters for applications ranging from casual user interactions to high-stakes settings such as healthcare and autonomous systems.
The paper further highlights UltraFeedback's potential role in LLM evaluation by training critique models (like UltraCM) that deliver precise, actionable feedback on model outputs. As LLMs continue to evolve, UltraFeedback offers a route to more nuanced and human-compatible models.
Future development on this front will likely focus on expanding dataset diversity, especially regarding multi-turn dialogues and tasks involving complex reasoning, coding, and safety scenarios. Moreover, further exploration into refining RLHF processes and feedback learning methods may emerge as a direct continuation of this work.
In summary, UltraFeedback stands as a pivotal step toward aligning LLMs more closely with human expectations, while providing a robust foundation for continued advances in open-source LLM development.