Training LLMs to Follow Instructions with Human Feedback: A Study Summary
This essay provides an expert overview of the paper "Training Language Models to Follow Instructions with Human Feedback," authored by Long Ouyang and colleagues at OpenAI.
Overview
The central premise of the paper is that merely scaling up model size, as with GPT-3, does not by itself produce models that follow user intent. LLMs often produce outputs that are untruthful, toxic, or simply not aligned with the user's instructions. The authors address this problem by presenting a methodology for aligning LLMs with user intent using Reinforcement Learning from Human Feedback (RLHF).
Methodology
Data Collection
The methodology begins with collecting a dataset of human demonstrations and rankings. The authors hired a team of about 40 contractors to write demonstrations of the desired model behavior and to rank candidate model outputs. The demonstrations are used to fine-tune the model with supervised learning, while the rankings are used to train a reward model that drives the subsequent RLHF step.
Supervised Fine-Tuning (SFT)
The first step fine-tunes GPT-3 on the human demonstrations using supervised learning. The result is a supervised fine-tuned (SFT) baseline that imitates the behavior exhibited in the provided demonstrations.
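To make this step concrete, here is a minimal sketch of supervised fine-tuning on (prompt, demonstration) pairs, assuming a Hugging Face-style causal language model. The "gpt2" checkpoint and the toy demonstration are stand-ins: the paper fine-tunes GPT-3-scale models on labeler-written demonstrations, and its exact training setup is not reproduced here.

```python
# Minimal SFT sketch: fine-tune a causal LM on (prompt, demonstration) pairs.
# "gpt2" and the toy example below are stand-ins for GPT-3 and the labeler data.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

demonstrations = [  # hypothetical (prompt, labeler-written completion) pairs
    ("Explain the moon landing to a 6 year old.", "Some people traveled to the moon ..."),
]

model.train()
for prompt, completion in demonstrations:
    batch = tokenizer(prompt + " " + completion, return_tensors="pt")
    # Standard next-token-prediction loss over the concatenated text;
    # the labels are simply the input ids shifted internally by the model.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```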
Reward Modeling (RM)
The next step trains a reward model (RM) to predict human preferences. Using human-labeled comparisons of model outputs, the RM learns to assign a higher scalar score to the output that labelers prefer.
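A minimal sketch of the pairwise ranking loss this step relies on is shown below. Here `score_preferred` and `score_rejected` are assumed to be scalar scores the reward model has already produced for the preferred and rejected outputs of each comparison; the reward model itself is elided.

```python
# Sketch of the pairwise reward-model loss used in RLHF-style training.
# Scores are assumed to come from a reward model mapping (prompt, response) -> scalar.
import torch
import torch.nn.functional as F

def reward_model_loss(score_preferred: torch.Tensor,
                      score_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss: push the preferred output's score above the rejected one's.

    Corresponds to -log sigmoid(r(x, y_w) - r(x, y_l)), averaged over comparison pairs.
    """
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# Hypothetical scores for a batch of three comparisons.
score_preferred = torch.tensor([1.2, 0.3, 2.0])
score_rejected = torch.tensor([0.4, 0.9, 1.1])
print(reward_model_loss(score_preferred, score_rejected))
```

In the paper, labelers rank several outputs per prompt, and every pair drawn from that ranking contributes one such comparison to the loss.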
Reinforcement Learning via Proximal Policy Optimization (PPO)
Finally, the authors employ Proximal Policy Optimization (PPO) to fine-tune the SFT model further, treating the reward model's output as a scalar reward signal. The resulting PPO-trained model, termed "InstructGPT," is optimized to produce outputs that the reward model, and by proxy human labelers, score highly.
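The sketch below illustrates the reward signal typically used in this kind of PPO fine-tuning: the reward model's score minus a KL penalty that keeps the policy close to the SFT model. The function name, tensor shapes, and the `beta` value are illustrative assumptions rather than the paper's exact implementation.

```python
# Sketch of the RL reward used during PPO fine-tuning: the reward model's score
# minus a KL penalty toward the SFT reference policy.
# beta is an illustrative coefficient, not the paper's tuned value.
import torch

def rl_reward(rm_score: torch.Tensor,
              logprobs_policy: torch.Tensor,
              logprobs_sft: torch.Tensor,
              beta: float = 0.02) -> torch.Tensor:
    """Sequence-level reward = RM score - beta * sum of per-token log ratios
    (a sample-based estimate of the KL penalty against the SFT model)."""
    log_ratio = logprobs_policy - logprobs_sft  # per generated token
    return rm_score - beta * log_ratio.sum()

# Hypothetical numbers for one sampled response of four tokens.
print(rl_reward(torch.tensor(1.5),
                torch.tensor([-0.2, -1.1, -0.4, -0.9]),
                torch.tensor([-0.3, -1.0, -0.5, -1.2])))
```

The KL term is what prevents the policy from drifting into outputs that exploit the reward model while becoming incoherent as language.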
Key Findings
InstructGPT vs. GPT-3
The paper demonstrates that outputs from the InstructGPT models are significantly preferred over those from GPT-3, even when InstructGPT has over 100x fewer parameters. This indicates that aligning a model with human preferences can matter more than sheer model size.
- The 1.3B-parameter InstructGPT model is preferred over the 175B-parameter GPT-3 model in human evaluations.
Improvements in Truthfulness and Toxicity
InstructGPT models show noticeable improvements in truthfulness and reductions in toxicity. For instance, on the TruthfulQA benchmark, InstructGPT generates truthful and informative answers about twice as often as GPT-3. However, improvements on bias benchmarks remain limited.
Mitigating Performance Regression
The authors note that mixing pretraining gradients into the PPO updates (their "PPO-ptx" variant) mitigates performance regressions on public NLP datasets, so the aligned models remain capable across diverse tasks. A sketch of this mixing follows.
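The sketch below illustrates the mixing idea, assuming the per-batch PPO loss and pretraining language-modeling loss have already been computed; `gamma` is the pretraining-mix coefficient, set here to an arbitrary illustrative value rather than the paper's tuned one.

```python
# Sketch of the "PPO-ptx" idea: add a weighted pretraining language-modeling
# loss to the PPO objective so alignment training does not erode performance
# on the original pretraining/NLP distribution.
import torch

def ppo_ptx_loss(ppo_loss: torch.Tensor,
                 pretraining_lm_loss: torch.Tensor,
                 gamma: float = 1.0) -> torch.Tensor:
    """Combined objective: PPO loss plus gamma times the pretraining LM loss."""
    return ppo_loss + gamma * pretraining_lm_loss

# Hypothetical per-batch loss values.
print(ppo_ptx_loss(torch.tensor(0.7), torch.tensor(2.3)))
```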
Generalization and Safety
The paper finds that InstructGPT models can generalize to handle a variety of new instructions, ranging from following instructions in other languages to handling coding tasks. However, it also acknowledges that the models can still make simple mistakes and follow harmful instructions when explicitly requested.
Implications and Future Directions
Theoretical and Practical Advances
From a theoretical perspective, the paper highlights the importance of human feedback in guiding LLMs towards aligned goals. This empirical evidence strengthens the case for RLHF as a promising direction for future alignment research.
Practically, the approach demonstrated in the paper has implications for deploying safer, more reliable AI systems in real-world applications. Notably, it shows that alignment techniques can significantly improve user satisfaction and safety without incurring prohibitive computational costs.
Challenges and Open Questions
The work raises important questions about the scalability and ethics of aligning AI models with human feedback. Key challenges include:
- Extending the alignment process to different languages and cultural contexts.
- Improving the models' robustness to adversarial inputs.
- Balancing alignment with performance on traditional NLP benchmarks.
Future research may explore more complex human feedback methods, such as adversarial data collection and multi-step feedback loops. Additionally, integrating alignment with other safety measures, like filtering training data and using specialized control codes, could be promising avenues.
Conclusion
The paper "Training LLMs to Follow Instructions with Human Feedback" provides compelling evidence that reinforcing LLMs with human preferences significantly enhances their alignment with user intents. While much work remains to improve the safety and reliability of such models, this paper marks an important step in the ongoing endeavor to align AI systems with human values and intentions.