Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
This paper, authored by Anthropic researchers, details the process and efficacy of applying preference modeling and reinforcement learning from human feedback (RLHF) to fine-tune LLMs as helpful and harmless assistants. The authors systematically explore the alignment of large-scale LLMs with human-defined goals through iterated online training, in which models and datasets are updated on a weekly cadence using fresh human feedback.
Key Contributions
Data Collection and Crowdworker Interface
The authors employed a multi-phase approach to collect diverse, high-quality human feedback. Crowdworkers interacted with the models through a chat interface, asking for assistance with a variety of tasks. At each conversational turn the model offered two candidate responses; the crowdworker selected the more helpful one for helpfulness tasks and, for harmlessness (red-teaming) tasks, the more harmful one. Data were collected in three stages, summarized below (a minimal sketch of the resulting comparison format follows the list):
- Base Dataset: 44k helpfulness comparisons and 42k harmlessness comparisons.
- Rejection Sampling (RS) Dataset: 52k helpfulness comparisons and 2k harmlessness comparisons.
- Iterated Online Dataset: 22k helpfulness comparisons collected with updated RLHF models.
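To make the collected comparisons concrete, the snippet below sketches how a single preference comparison might be represented. The field names and example text are illustrative assumptions, not the paper's exact data schema.

```python
# Minimal sketch of a single pairwise comparison record; field names are
# illustrative, not the paper's exact schema. Each record pairs a shared
# conversation context with two candidate responses, where "chosen" marks
# the response the crowdworker selected (the more helpful one for
# helpfulness tasks, the more harmful one for red-teaming tasks).
comparison = {
    "task": "helpfulness",  # or "harmlessness" (red-teaming)
    "context": [
        {"role": "human", "text": "Can you help me plan a week of vegetarian meals?"},
    ],
    "chosen": "Sure! Here's a simple plan: Monday, lentil soup; Tuesday, ...",
    "rejected": "I don't really know much about cooking.",
}
```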
Preference Modeling and Reinforcement Learning
The authors trained and evaluated preference models (PMs) ranging from 13M to 52B parameters, measuring their accuracy and calibration in identifying helpful and harmless responses. PMs trained on a mixture of helpfulness and harmlessness data consistently performed better, underscoring the compatibility of these objectives at larger model scales.
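As a concrete illustration, here is a minimal sketch of the standard pairwise ranking loss used to train preference models on chosen/rejected comparisons. The function name and toy scores are assumptions for illustration; the PM architecture and actual training code are not specified here.

```python
import torch
import torch.nn.functional as F

def preference_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss for a preference model.

    score_chosen / score_rejected are scalar PM scores assigned to the
    preferred and dispreferred response for the same context. Minimizing
        -log sigmoid(r_chosen - r_rejected) = log(1 + exp(r_rejected - r_chosen))
    pushes the PM to score preferred responses higher.
    """
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Toy usage with stand-in scores; in practice the scores come from a
# transformer head that maps (context + response) to a single scalar.
chosen_scores = torch.tensor([1.2, 0.3])
rejected_scores = torch.tensor([0.4, 0.9])
print(preference_loss(chosen_scores, rejected_scores))
```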
The reinforcement learning stage used Proximal Policy Optimization (PPO) with a reward derived from the PM score, combined with a small KL-divergence penalty that keeps the policy close to its initial snapshot; the authors report that an entropy bonus was unnecessary and that this setup improved alignment without compromising model performance. Iterated online training, in which PMs and RLHF policies were periodically retrained on freshly collected comparisons, improved both data quality and policy robustness, enabling the models to be fine-tuned continually on evolving high-quality datasets.
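The reward structure described above can be sketched as the PM score minus a KL penalty against the frozen initial policy. The snippet below is an illustrative approximation under that assumption; the function name and coefficient value are placeholders rather than details from the paper.

```python
import torch

def rlhf_reward(pm_score: torch.Tensor,
                logprobs_policy: torch.Tensor,
                logprobs_init: torch.Tensor,
                kl_coef: float = 0.001) -> torch.Tensor:
    """Illustrative RLHF reward for one sampled response.

    pm_score:        scalar preference-model score for the full response.
    logprobs_policy: per-token log-probabilities under the current policy.
    logprobs_init:   per-token log-probabilities under the frozen initial policy.
    kl_coef:         KL-penalty weight (value here is a placeholder).

    The summed log-probability difference is a sample-based estimate of the
    KL divergence between the policy and its initial snapshot; subtracting
    it discourages drifting too far from the starting model while chasing
    preference-model reward.
    """
    approx_kl = (logprobs_policy - logprobs_init).sum()
    return pm_score - kl_coef * approx_kl
```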
Evaluation and Results
NLP Evaluations: The authors assessed the models on standard NLP benchmarks, including MMLU, Lambada, HellaSwag, OpenBookQA, ARC, and TriviaQA. For larger models, RLHF fine-tuning improved zero-shot performance on all of these tasks except TriviaQA, so the aligned models matched or exceeded their base language-model counterparts. The RLHF models also retained specialized skills (e.g., Python coding) after alignment training, affirming that alignment training is compatible with training for specialized skills.
Alignment Evaluations: Effectiveness of the models' alignment was measured using both static and dynamic evaluations:
- TruthfulQA and BBQ-Lite: These benchmarks revealed improvements in model honesty and bias mitigation through RLHF training.
- Human Evaluations: Elo scores computed from crowdworker preference rates showed that both the helpful-only and helpful + harmless RLHF models outperformed a context-distilled base model, approaching or slightly surpassing responses written by human writers on helpfulness tasks (a sketch of the Elo-to-win-rate conversion follows this list).
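Elo scores relate directly to pairwise preference rates via the standard Elo formula, which is what makes them a convenient summary of crowdworker comparisons. The sketch below shows that conversion; it is the generic Elo relation, not code from the paper.

```python
import math

def win_probability(elo_a: float, elo_b: float) -> float:
    """Expected rate at which model A is preferred over model B given their
    Elo scores (standard Elo formula: a ~100-point gap corresponds to
    roughly a 64% preference rate)."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))

def elo_gap(p_win: float) -> float:
    """Inverse mapping: convert an observed pairwise preference rate into
    an Elo-score difference."""
    return 400.0 * math.log10(p_win / (1.0 - p_win))

print(win_probability(1100, 1000))  # ~0.64
print(elo_gap(0.64))                # ~100
```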
Implications and Future Directions
The paper substantiates the effectiveness of RLHF in training LLMs to act as both helpful and harmless assistants. These technologies have practical applications ranging from improved customer service experiences to ensuring safe deployment of AI in sensitive domains. The key results suggest that alignment interventions do not impose a performance tax on large models; rather, they may confer an "alignment bonus."
Promising future directions include:
- Iterated Online Training: Continued refinement of this method could yield progressively better alignment performance.
- Enhanced Robustness: Identifying and mitigating robustness failures in preference models and reward overfitting during reinforcement learning.
- Worst-case Behavior Mitigation: Addressing harmful outputs even in out-of-distribution or adversarial settings to ensure safety and reliability, particularly for deployment in high-stakes environments.
- Real-World Applications: Extending these findings to specialized contexts, such as medically relevant interactions or high-risk decision-making support.
The authors also highlight the need for publicly available normative datasets and evaluations for broader societal alignment and safety research. Sharing such datasets facilitates collaboration, reproducibility, and transparency in advancing AI alignment.
Conclusion
This paper provides a comprehensive roadmap for employing RLHF to align LLMs with human-defined helpful and harmless objectives. By addressing the tension between helpfulness and harmlessness in alignment training, the research sets a precedent for future iterations and applications of ethically sound AI models.