Overview of "Rewarding Chatbots for Real-World Engagement with Millions of Users"
The paper "Rewarding Chatbots for Real-World Engagement with Millions of Users" presents a rigorous examination of social chatbots deployed in conversational scenarios, emphasizing user engagement and retention. The paper proposes leveraging human feedback to enhance the engagement level of chatbots, focusing on the efficiency of pseudo-label-based training for reward models which rejects low-scoring responses generated by chatbot models at inference time.
Key Contributions
This work introduces a method that uses automatic pseudo-labels derived from user interactions to train a reward model. The reward model improves the chatbot's performance by evaluating candidate responses and selecting those most likely to maximize user engagement. The authors propose intuitive metrics such as mean conversation length (MCL) as proxies for engagement. Empirical validation through A/B testing on a platform with a sizable user base shows a notable increase in MCL and consequent improvements in user retention.
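To make the engagement metric concrete, the following is a minimal sketch of how MCL could be computed from logged conversations. The data layout (a list of conversations, each a list of speaker-tagged turns) and the convention of counting user messages are assumptions made for the example, not details taken from the paper.

```python
from statistics import mean

def mean_conversation_length(conversations):
    """Mean number of user messages per conversation (one possible MCL definition)."""
    lengths = [
        sum(1 for turn in conv if turn["speaker"] == "user")
        for conv in conversations
    ]
    return mean(lengths) if lengths else 0.0

# Example: conversations with 3 and 1 user messages respectively -> MCL = 2.0
logs = [
    [{"speaker": "user"}, {"speaker": "bot"},
     {"speaker": "user"}, {"speaker": "bot"},
     {"speaker": "user"}],
    [{"speaker": "user"}, {"speaker": "bot"}],
]
print(mean_conversation_length(logs))  # 2.0
```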
Methodology
The paper deploys a three-stage pipeline similar to the strategy used in training InstructGPT models. The process begins with fine-tuning pre-trained LLMs on domain-specific conversational and literary data. Next, a reward model is trained to learn the engagement value of responses. Finally, the paper introduces best-of-N rejection sampling at inference time: multiple candidate responses are generated, and the reward model selects the one with the highest engagement score.
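A minimal sketch of the best-of-N rejection sampling step is shown below. The `generate` and `score` callables are hypothetical stand-ins for the chatbot LLM's sampler and the trained reward model; neither name comes from the paper.

```python
def best_of_n_reply(context, generate, score, n=16):
    """Return the candidate reply that the reward model scores highest.

    `generate(context)` is assumed to draw one stochastic reply from the chatbot LLM;
    `score(context, reply)` is assumed to return the reward model's engagement score.
    Both are placeholders, not the paper's actual interfaces.
    """
    candidates = [generate(context) for _ in range(n)]
    return max(candidates, key=lambda reply: score(context, reply))
```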
The pseudo-labeling strategy is a notable contribution: engaging responses are inferred from behavioural signals such as whether the user continues the conversation and whether they retry (regenerate) a response. This bypasses the expensive, labor-intensive process of manual annotation by deriving labels automatically from user interactions.
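The snippet below sketches one plausible way such pseudo-labels could be assigned from logged signals. The field names (`user_responded`, `user_retried`) and the exact labelling rule are illustrative assumptions; the paper may combine these signals differently.

```python
def pseudo_label(reply_log):
    """Binary engagement pseudo-label for one chatbot reply, from logged user behaviour.

    Illustrative rule: label a reply engaging (1) if the user sent another message
    afterwards and did not retry (regenerate) the reply; otherwise 0.
    """
    continued = reply_log.get("user_responded", False)
    retried = reply_log.get("user_retried", False)
    return 1 if continued and not retried else 0

# Example usage with hypothetical log fields:
print(pseudo_label({"user_responded": True, "user_retried": False}))  # 1
print(pseudo_label({"user_responded": False, "user_retried": True}))  # 0
```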
Experimental Validation and Results
Experiments are performed with a GPT-J 6B model on the Chai Research platform, which handles millions of daily interactions. A suite of experiments shows significant improvements in MCL, with up to a 70% increase in conversation length compared to a baseline without a reward mechanism. Crucially, these improvements translate into a more than 30% increase in user retention, supporting the premise that reward models informed by human feedback substantially boost engagement.
Implications and Future Directions
The paper emphasizes bridging the gap between language fluency and engagement in chatbots. By incorporating user engagement as a metric for chatbot evaluation, this work moves beyond conventional model training focused solely on language coherence. Practically, this enhances the value propositions of commercial social chatbots, aligning with goals like increased user retention and platform longevity.
Theoretically, this research hints at the ability to scale and refine feedback-loop-driven training of LLMs. The potential for further automation in feedback collection, possibly through advanced interaction analytics, presents a fertile avenue for subsequent investigation. Moreover, exploring hybrid approaches that combine human and automatic feedback models represents a promising direction to improve engagement while balancing computational resources.
Overall, the paper advocates for a shift in chatbot design philosophy—prioritizing user engagement directly in automated response selection mechanisms. As this field advances, insights from this work could inform broader AI systems where user interaction modulation is critical.