An Expert Analysis of RUBER: An Unsupervised Method for Automatic Evaluation of Open-Domain Dialog Systems
The paper "RUBER: An Unsupervised Method for Automatic Evaluation of Open-Domain Dialog Systems" introduces an approach to evaluating conversational AI, particularly open-domain dialog systems. The authors propose RUBER (a Referenced metric and Unreferenced metric Blended Evaluation Routine), an evaluation metric that combines referenced and unreferenced signals to gauge the quality of a dialog system's replies.
Core Contributions
- Dual Metric Approach: RUBER's framework is built around two core components, a referenced metric and an unreferenced metric. The referenced metric evaluates the similarity between a generated reply and the groundtruth reply using an embedding-based similarity measure (a minimal sketch of this scoring appears after this list). This contrasts with traditional metrics such as BLEU and ROUGE, which rely heavily on word overlap and have shown weak correlation with human evaluation in dialog settings.
- Unreferenced Metric: The unreferenced metric in RUBER measures the semantic relatedness between the reply and the conversational query using a neural network trained with negative sampling (a hedged training sketch also follows the list). This component is particularly noteworthy because it does not depend on human-annotated scores for training, allowing flexibility and adaptability across datasets and languages.
- Blended Evaluation Routine: By combining the two metrics, RUBER capitalizes both on closeness to the groundtruth and on query-reply relatedness. Several blending strategies, such as averaging the two scores or taking their minimum, are explored to improve evaluation robustness; a short sketch of these strategies closes the examples below.
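The following is a minimal sketch of the referenced metric's core idea: pool word embeddings for the generated reply and the groundtruth, then compare the pooled vectors with cosine similarity. The max-pooling choice, 300-dimensional vectors, and the `embeddings` lookup table are illustrative assumptions; the paper's exact pooling strategy may differ.

```python
import numpy as np

def pooled_vector(tokens, embeddings, dim=300):
    """Max-pool word embeddings over each dimension (one common pooling choice)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.max(np.stack(vectors), axis=0)

def referenced_score(reply_tokens, groundtruth_tokens, embeddings):
    """Cosine similarity between the pooled reply and groundtruth vectors."""
    r = pooled_vector(reply_tokens, embeddings)
    g = pooled_vector(groundtruth_tokens, embeddings)
    denom = np.linalg.norm(r) * np.linalg.norm(g)
    return float(r @ g / denom) if denom > 0 else 0.0
```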
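The unreferenced metric can be sketched as a small neural scorer over (query, reply) pairs trained with a margin ranking loss, where the negative reply is sampled at random from the corpus. The encoder, bilinear interaction term, layer sizes, and margin below are assumptions for illustration, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class RelatednessScorer(nn.Module):
    """Scores how related a reply is to its query; trained without human labels."""
    def __init__(self, vocab_size, emb_dim=128, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.bilinear = nn.Bilinear(2 * hidden, 2 * hidden, 1)   # query-reply interaction feature
        self.mlp = nn.Sequential(nn.Linear(4 * hidden + 1, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 1), nn.Sigmoid())

    def encode(self, ids):
        _, h = self.encoder(self.embed(ids))      # h: (2, batch, hidden) for a 1-layer BiGRU
        return torch.cat([h[0], h[1]], dim=-1)    # (batch, 2 * hidden)

    def forward(self, query_ids, reply_ids):
        q, r = self.encode(query_ids), self.encode(reply_ids)
        quad = self.bilinear(q, r)                # scalar interaction term
        return self.mlp(torch.cat([q, quad, r], dim=-1)).squeeze(-1)

def ranking_loss(model, query, true_reply, negative_reply, margin=0.5):
    """Push the true reply's score above a randomly sampled reply's by a margin."""
    pos = model(query, true_reply)
    neg = model(query, negative_reply)
    return torch.clamp(margin - pos + neg, min=0.0).mean()
```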
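Finally, a hedged sketch of the blending step: once the referenced and unreferenced scores are normalized to a common range, they can be combined with simple operators such as the minimum or the arithmetic mean.

```python
# Combine a referenced and an unreferenced score into a single blended score.
# Normalization of both scores to [0, 1] is assumed to have happened upstream.
def blended_score(referenced: float, unreferenced: float, strategy: str = "min") -> float:
    if strategy == "min":
        return min(referenced, unreferenced)
    if strategy == "max":
        return max(referenced, unreferenced)
    return 0.5 * (referenced + unreferenced)  # arithmetic mean as the fallback
```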
Empirical Validation
The authors validate RUBER on widely used dialog system architectures, including a retrieval-based system and a sequence-to-sequence generator, and show that it correlates more strongly with human evaluation scores than existing automatic metrics. Agreement is quantified with Pearson and Spearman coefficients, and RUBER aligns with human judgment more closely than BLEU and ROUGE, as well as prior neural network-based metrics that require human annotation for training. A sketch of this correlation analysis follows.
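For reference, metric-human agreement of this kind is typically computed as below: Pearson and Spearman correlations between a metric's scores and averaged human ratings over the same responses. The score lists here are placeholders, not the paper's data.

```python
from scipy.stats import pearsonr, spearmanr

metric_scores = [0.71, 0.42, 0.88, 0.15, 0.63]  # hypothetical metric outputs
human_scores = [4.0, 2.5, 4.5, 1.0, 3.5]        # hypothetical averaged human ratings

pearson_r, pearson_p = pearsonr(metric_scores, human_scores)
spearman_rho, spearman_p = spearmanr(metric_scores, human_scores)
print(f"Pearson r = {pearson_r:.3f} (p = {pearson_p:.3g})")
print(f"Spearman rho = {spearman_rho:.3f} (p = {spearman_p:.3g})")
```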
Practical Implications
From a practical standpoint, RUBER's unsupervised nature and strong correlation with human judgments make it an attractive tool for dialog system developers. It reduces reliance on costly human annotation when evaluating conversational models and can be configured for new conversational data with minimal adjustment: only word embeddings and query-reply pairs from the target corpus are needed, not human-rated examples. This adaptability extends its utility across domains and offers an efficient evaluation mechanism for rapidly evolving dialog technologies.
Theoretical Implications and Future Work
Theoretically, RUBER presents an intriguing approach by emphasizing the relevance of unreferenced metrics in dialog evaluation: because a single query admits many acceptable replies, comparing a generated reply only against one groundtruth systematically understates quality, and a query-reply relatedness signal compensates for this variability. While currently applied to single-turn dialog, the authors suggest that RUBER could be extended to multi-turn conversational contexts by adapting the neural network component to incorporate dialog history and contextual information.
Future research directions may include refining the training of the unreferenced metric to better capture nuanced conversational contexts and integrating sentiment and pragmatic aspects to further align with human evaluative criteria. Exploring RUBER's cross-linguistic applicability could also broaden its adoption in non-English conversational AI systems.
In conclusion, RUBER represents a substantial step forward in the automatic evaluation of dialog systems, offering a more human-aligned and computationally efficient alternative that stands to improve both the development and assessment of open-domain conversational AI.