- The paper introduces Sparrow, a dialogue agent that leverages targeted human judgments and RLHF to enhance helpfulness and reduce rule violations.
- The methodology combines per-response preference feedback with rule-specific violation assessments; Sparrow's responses are preferred over baselines 68% of the time, and its evidenced factual claims are supported and plausible 78% of the time.
- The paper demonstrates that human oversight and evidence integration can significantly improve dialogue system reliability while addressing residual bias challenges.
Improving Alignment of Dialogue Agents via Targeted Human Judgments
The paper presents a framework for improving the alignment of dialogue agents, focusing on a system named Sparrow. The approach uses reinforcement learning from human feedback (RLHF) to train an agent that is more helpful, correct, and harmless than prompted baselines. Two mechanisms are introduced to make human judgments more targeted and agent behavior easier to assess: breaking the requirements for good dialogue into distinct natural language rules, and having the agent cite evidence from retrieved sources to support its factual claims.
Methodology and Models
Sparrow starts from Chinchilla-70B with a dialogue prompt and is then fine-tuned with reinforcement learning from human feedback. The process involves:
- Targeted Human Judgments: Instead of a single overall rating, annotators judge whether a response violates specific predefined rules, such as "do not make threatening statements" or "do not give financial advice." This decomposition yields precise data for training rule-violation classifiers.
- Multi-objective RLHF: By combining preference feedback with rule-violation feedback in the reward (see the reward-combination sketch after this list), Sparrow balances helpfulness against rule compliance. This combination is shown to outperform simpler approaches such as prompting or supervised fine-tuning alone.
- Inline Evidence: Sparrow retrieves and quotes evidence supporting its responses, improving both correctness and verifiability; when it answers with evidence, the claim is plausible and backed by the quoted source 78% of the time (a retrieval sketch also follows below).
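The following is a minimal sketch of how preference and rule-violation signals might be folded into a single RL reward. The callables, the max-over-rules aggregation, and the penalty weight are illustrative assumptions, not the paper's exact formulation.

```python
from typing import Callable, List


def combined_reward(
    context: str,
    response: str,
    preference_model: Callable[[str, str], float],        # scalar "how good is this reply" score
    rule_classifiers: List[Callable[[str, str], float]],  # each returns P(this rule is violated)
    rule_penalty_weight: float = 1.0,                      # assumed trade-off coefficient
) -> float:
    """Score a candidate reply: reward helpfulness, penalise likely rule violations."""
    preference_score = preference_model(context, response)

    # Penalise by the most probable violation across all rules, so a reply that
    # clearly breaks any single rule is strongly discouraged.
    worst_violation = max(clf(context, response) for clf in rule_classifiers)

    return preference_score - rule_penalty_weight * worst_violation
```

In practice this score would feed a standard RL training loop; the sketch only covers the reward shaping that trades helpfulness off against rule compliance.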
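Below is a minimal sketch of the inline-evidence flow: the agent first emits a search query, retrieved snippets are added to the context, and the final reply is generated conditioned on that evidence so raters can check the claim against the quoted passage. The `generate` and `search` callables and the prompt format are hypothetical stand-ins, not the paper's actual interfaces.

```python
from typing import Callable, List


def answer_with_evidence(
    dialogue_context: str,
    generate: Callable[[str], str],       # language-model completion for a prompt
    search: Callable[[str], List[str]],   # returns text snippets for a query
    max_snippets: int = 2,
) -> str:
    # Step 1: ask the model for a search query grounded in the conversation so far.
    query = generate(f"{dialogue_context}\nSearch Query:")

    # Step 2: retrieve supporting snippets and keep only the top few.
    snippets = search(query)[:max_snippets]
    evidence = "\n".join(snippets)

    # Step 3: generate the reply conditioned on the retrieved evidence,
    # so the factual claim can be verified against the quoted passages.
    prompt = f"{dialogue_context}\nSearch Results:\n{evidence}\nAgent:"
    return generate(prompt)
```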
Results and Analysis
The paper presents several key results highlighting Sparrow's efficacy:
- Preference and Rule Compliance: Sparrow's responses are preferred over baseline models in 68% of comparisons, and under adversarial probing annotators manage to elicit a rule violation only 8% of the time.
- Supported and Plausible Claims: When Sparrow answers with evidence, its responses are plausible and supported by that evidence 78% of the time, a marked improvement in factual grounding.
- Generalization and Bias: Despite improved rule-following, Sparrow still exhibits distributional biases, which per-instance rules do not address and which remain an open fairness concern.
Implications
Practically, the work shows how dialogue systems can be made more reliable by pairing targeted human oversight with evidence retrieval. Theoretically, it represents a step towards aligned AI systems that balance helpfulness with harm avoidance.
Future Directions
Several avenues for future research are proposed:
- Dialogue as Supervision: Using dialogue itself as a supervision channel, for example debate-style setups in which agents justify their answers, could further refine agent behavior.
- Extended Rule Sets: Implementing larger, more complex sets of rules might enable finer control over agent behavior.
- Cognitive Science Integration: Applying findings from cognitive science can help in understanding user interaction and training more intuitive agents.
Overall, this work exemplifies how structured human oversight can enhance the alignment and performance of AI dialogue systems, providing a foundation for future advancements in aligned AI development.