
Towards a Human-like Open-Domain Chatbot (2001.09977v3)

Published 27 Jan 2020 in cs.CL, cs.LG, cs.NE, and stat.ML

Abstract: We present Meena, a multi-turn open-domain chatbot trained end-to-end on data mined and filtered from public domain social media conversations. This 2.6B parameter neural network is simply trained to minimize perplexity of the next token. We also propose a human evaluation metric called Sensibleness and Specificity Average (SSA), which captures key elements of a human-like multi-turn conversation. Our experiments show strong correlation between perplexity and SSA. The fact that the best perplexity end-to-end trained Meena scores high on SSA (72% on multi-turn evaluation) suggests that a human-level SSA of 86% is potentially within reach if we can better optimize perplexity. Additionally, the full version of Meena (with a filtering mechanism and tuned decoding) scores 79% SSA, 23% higher in absolute SSA than the existing chatbots we evaluated.


The paper "Towards a Human-like Open-Domain Chatbot" presents Meena, an end-to-end trained generative neural chatbot. Developed by researchers from Google Research, Brain Team, Meena is designed to hold multi-turn open-domain conversations with human-like sensibleness and specificity.

Architectural Overview and Dataset

Meena is based on a seq2seq model using the Evolved Transformer architecture, comprising 2.6 billion parameters. It is trained on a diverse dataset of 40 billion words sourced and filtered from public domain social media conversations. The filtering process ensures the removal of nonsensical, unsafe, or overly repetitive content, resulting in a high-quality dataset that aids in training an effective conversational model.

The model is trained to predict the next token in a sequence so as to minimize perplexity, the exponential of the average per-token negative log-likelihood. Lower perplexity means the model assigns higher probability to the observed text given the preceding context, i.e. it is less "surprised" by what comes next.
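The relationship between per-token log-likelihoods and perplexity can be sketched as follows (an illustrative helper, not code from the paper):

```python
import math

def perplexity(token_log_probs):
    """Perplexity of a sequence from its per-token log-probabilities.

    Perplexity is the exponential of the average negative log-likelihood:
    lower values mean the model assigned higher probability to the
    tokens that actually occurred.
    """
    n = len(token_log_probs)
    avg_nll = -sum(token_log_probs) / n
    return math.exp(avg_nll)

# A model that assigns probability 0.5 to every token has perplexity 2:
print(perplexity([math.log(0.5)] * 10))  # → 2.0
```

Intuitively, a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k tokens at each step.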

Sensibleness and Specificity Average (SSA)

A novel human evaluation metric, the Sensibleness and Specificity Average (SSA), is introduced to quantify chatbot dialogue quality. Sensibleness requires that a response make logical sense in the conversational context, while specificity requires that it engage with that particular context rather than fall back on vague, generic replies. The researchers demonstrate a strong correlation between model perplexity and SSA, validating perplexity as a surrogate for human judgment of conversation quality.
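As the name suggests, SSA averages two rates over the labeled responses. A minimal sketch, assuming crowdworker labels are given as (sensible, specific) boolean pairs per response and that a non-sensible response also counts as non-specific:

```python
def ssa(labels):
    """Sensibleness and Specificity Average over labeled responses.

    labels: list of (sensible, specific) boolean pairs, one per response.
    SSA is the arithmetic mean of the fraction of sensible responses
    and the fraction of specific responses; a response that is not
    sensible is also counted as not specific.
    """
    n = len(labels)
    sensible_rate = sum(1 for s, _ in labels if s) / n
    specific_rate = sum(1 for s, sp in labels if s and sp) / n
    return (sensible_rate + specific_rate) / 2

# 80% sensible, 60% specific → SSA of 70%:
labels = [(True, True)] * 6 + [(True, False)] * 2 + [(False, False)] * 2
print(ssa(labels))  # → 0.7
```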

In static evaluation (using pre-set conversation prompts), Meena achieves an SSA of 72%. Interactive evaluations, where humans engage in free-form conversations with the chatbot, further confirm these findings, substantiating the robustness of the correlations observed.

Performance Comparison

Meena's performance is benchmarked against other well-known chatbots such as Mitsuku, Cleverbot, and XiaoIce. Meena's SSA of 72% outperforms the existing chatbots, with Mitsuku and Cleverbot scoring 56% each, and XiaoIce scoring significantly lower at 31%. The full version of Meena, which incorporates additional filtering mechanisms and tuned decoding, attains an even higher SSA of 79%. The paper highlights that Meena’s performance, while still below the human-level SSA (86%), is a substantial step forward in open-domain conversational AI.
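The "tuned decoding" behind the full version refers to a sample-and-rank style scheme: rather than greedily decoding one response, the model samples several candidates (with a tuned temperature) and returns the one it scores highest. A minimal sketch, where `sample_fn` and `score_fn` are hypothetical stand-ins for the model's sampler and likelihood scorer:

```python
def sample_and_rank(sample_fn, score_fn, n=20):
    """Sketch of sample-and-rank decoding.

    sample_fn() -> one candidate response (sampled with temperature)
    score_fn(candidate) -> model score, higher is better
                           (e.g. length-normalized log-likelihood)
    Draws n independent candidates and keeps the best-scoring one.
    """
    candidates = [sample_fn() for _ in range(n)]
    return max(candidates, key=score_fn)

# Toy usage: candidates drawn from a fixed pool, scored by length.
pool = iter(["hi", "hello there", "hey"])
print(sample_and_rank(lambda: next(pool), len, n=3))  # → hello there
```

Ranking by the model's own likelihood favors responses that are both fluent and specific to the context, which is consistent with the SSA gain the full Meena achieves over plain decoding.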

Implications and Future Directions

The findings of this research hold significant theoretical and practical implications. From a theoretical perspective, the strong correlation between model perplexity and human-likeness metrics suggests that further improvements in lowering perplexity can potentially bring AI closer to human-level conversational abilities. Practically, Meena's architecture and training approach provide a scalable method for developing conversational agents capable of maintaining engaging and contextually appropriate multi-turn dialogues.

Future research avenues could focus on expanding the set of evaluation metrics beyond SSA to include attributes such as empathy, humor, and deeper question-answering capabilities. Exploring these dimensions can provide a more holistic measure of human-likeness and pave the way for further refinements in conversational AI.

Additionally, integrating mechanisms to handle long-term memory and enhancing model robustness against adversarial inputs are potential areas of development. Such advancements would better equip conversational agents to handle prolonged engagements and diverse user interactions, consolidating their utility in real-world applications.

Conclusion

The "Towards a Human-like Open-Domain Chatbot" paper demonstrates significant advancements in the field of conversational AI. Through meticulous design and evaluation, it underscores the potential of neural network-based end-to-end models in achieving near-human conversational quality. By setting a new benchmark with Meena, the research opens promising pathways for the future development of more sophisticated and human-like chatbots.

Authors (11)
  1. Daniel Adiwardana (1 paper)
  2. Minh-Thang Luong (32 papers)
  3. David R. So (11 papers)
  4. Jamie Hall (5 papers)
  5. Noah Fiedel (22 papers)
  6. Romal Thoppilan (2 papers)
  7. Zi Yang (33 papers)
  8. Apoorv Kulshreshtha (2 papers)
  9. Gaurav Nemade (2 papers)
  10. Yifeng Lu (16 papers)
  11. Quoc V. Le (128 papers)
Citations (887)