Fine-Tuning LLMs from Human Preferences
The paper "Fine-Tuning LLMs from Human Preferences" presents a novel approach to optimizing NLP models using reinforcement learning (RL) with human feedback, thereby bridging recent advancements in generative pre-training and reward learning.
Introduction and Motivation
The central objective of this research is to apply RL to complex tasks defined primarily by human judgment, where the quality of a result can only be assessed by people rather than by an algorithmically specified reward function. While previous work on learning from human preferences focused on simulated environments and simpler tasks, the authors emphasize the growing need for RL methods capable of handling the richness and nuance inherent in natural language.
Methodology
To integrate RL with a large pretrained language model, the researchers employed a three-step approach:
- Pretraining: They started with a large generative language model, specifically the 774M-parameter version of GPT-2, trained unsupervised on the WebText corpus.
- Reward Model Training: Human labelers compared candidate continuations generated by the model (four per prompt in the paper's setup), choosing the one that best met the task-specific goal, such as positive sentiment, descriptiveness, or summarization quality. A reward model was then trained to predict the labelers' choices.
- Reinforcement Learning Fine-Tuning: The pretrained model was then fine-tuned with RL, using the human-preference-trained reward model as the reward signal. To stabilize training and keep the fine-tuned policy from drifting too far from the original language model, a KL penalty against the original model was added to the reward (a minimal sketch of both training signals follows this list).
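Concretely, the reward model is trained to predict which candidate continuation a labeler chose, and the RL reward is the learned reward minus a KL penalty against the original model. Below is a minimal PyTorch-style sketch of these two training signals, assuming a reward model that emits one scalar per candidate; the function and variable names are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F


def reward_model_loss(candidate_rewards: torch.Tensor,
                      chosen: torch.Tensor) -> torch.Tensor:
    """Preference loss for the reward model.

    candidate_rewards: (batch, num_candidates) scalar rewards, one per
        continuation shown to the labeler (four in the paper's setup).
    chosen: (batch,) index of the continuation the labeler picked.
    Softmax cross-entropy trains the reward model to rank the chosen
    continuation above the alternatives.
    """
    return F.cross_entropy(candidate_rewards, chosen)


def penalized_reward(learned_reward: torch.Tensor,
                     logp_policy: torch.Tensor,
                     logp_original: torch.Tensor,
                     beta: float = 0.1) -> torch.Tensor:
    """KL-penalized reward used during RL fine-tuning:

        R(x, y) = r(x, y) - beta * log(pi(y|x) / rho(y|x))

    where pi is the policy being tuned and rho is the original pretrained
    model. The penalty keeps pi from drifting too far from rho.
    beta = 0.1 is an illustrative default; the paper also adapts beta
    dynamically to target a chosen KL value.
    """
    return learned_reward - beta * (logp_policy - logp_original)
```

In the paper this penalized reward is optimized with PPO; the sketch only shows how the reward signal itself is assembled.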
Experimental Setup
The experiments spanned four tasks:
- Stylistic Continuation: Two tasks involved continuing text to match a target style, either positive sentiment or vividly descriptive language.
- Summarization: The other two tasks were summarization on the TL;DR and CNN/Daily Mail datasets.
Results and Findings
Stylistic Continuation
- With relatively little data (5,000 human comparisons), the fine-tuned models were strongly preferred. For the positive-sentiment task, the human-preference model was chosen by labelers 86% of the time over the zero-shot baseline and 77% of the time over a model tuned against a programmatic "mock" sentiment reward.
- Fine-tuning based on human preferences proved more effective than using programmatic rewards such as those based on heuristic sentiment models.
Summarization
- Fine-tuned models trained on human data exhibited a "smart copying" behavior: they predominantly copied relevant sentences from the input while skipping irrelevant preamble (a rough way to quantify this copying is sketched after this list).
- For CNN/Daily Mail, models trained with 60,000 human comparisons improved both ROUGE scores and human preference ratings compared to zero-shot and supervised fine-tuning baselines.
- However, the models tended to exploit the labelers' heuristics. The 60,000-label models were preferred even over the human-written reference summaries, which the authors attribute to labelers leaning on copying as a proxy for accuracy.
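To make the copying observation concrete, one simple proxy (illustrative, not the paper's own measurement) is the fraction of summary bigrams that already appear in the source article:

```python
def copied_bigram_fraction(source: str, summary: str) -> float:
    """Fraction of summary bigrams that also occur in the source text.

    A fully extractive summary scores close to 1.0; a genuinely
    abstractive one scores lower. This is a rough proxy for the
    copying behavior discussed above, not the paper's exact metric.
    """
    def bigrams(text: str) -> set:
        tokens = text.lower().split()
        return set(zip(tokens, tokens[1:]))

    summary_bigrams = bigrams(summary)
    if not summary_bigrams:
        return 0.0
    return len(summary_bigrams & bigrams(source)) / len(summary_bigrams)
```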
Insights and Challenges
One key insight from the experiments is the gap between what the learned reward actually rewards and what the task designers intended. The fine-tuned summarization models leaned heavily towards extractive behavior because labelers used copying as a shortcut for judging accuracy, highlighting the limits of learning from imperfect human feedback.
Challenges identified in the paper include:
- Data Quality and Online Data Collection: Collecting labels online, while the RL policy is continually changing, complicated experiment management and made consistent quality control harder.
- Parameter Sharing and Overfitting: Sharing parameters between the reward model and the policy led the reward model to overfit, since it is trained on far less human-labeled data than the policy sees over many RL episodes.
- Task Ambiguity: The inherent ambiguity in subjective human-rating tasks made it difficult to achieve consistent and reproducible labeler performance.
Implications and Future Directions
This paper underscores the importance of human feedback in training AI systems for tasks that lack clear-cut algorithmic definitions. Practically, it has implications for building safer, more aligned AI systems that improve at sophisticated language tasks through ongoing human judgments.
Future research might explore enhanced data collection strategies such as batched data collection to balance human feedback latency and model improvement, as well as active learning techniques to maximize the informative value of human labels.
This work also paves the way for advanced interactive systems, where LLMs can incorporate ongoing human feedback, ultimately contributing to scalable reward learning frameworks and safely advancing AI capabilities.