An Examination of MMDialog: A Comprehensive Multi-modal Dialogue Dataset
This paper presents MMDialog, an extensive dataset designed to advance research in multi-modal, open-domain dialogue systems. MMDialog addresses several limitations of previous datasets, offering 1.08 million dialogue sessions and 1.53 million unique images spanning 4,184 topics. Unlike earlier datasets, which often relied on artificially constructed dialogues or were limited in scale and diversity, MMDialog is derived from real conversations on social media, providing a more authentic and expansive foundation for training conversational agents.
Key Characteristics of MMDialog
MMDialog stands out as the largest multi-modal dialogue dataset to date, exceeding the previous largest by a factor of 88 in dialogue count. Beyond its scale, the dataset interleaves textual and visual content within each conversation, giving dialogue systems a richer context to learn from. Each session averages 4.56 turns and roughly 2.59 images, reflecting the natural flow of real human communication. The breadth of 4,184 topics further ensures coverage of a wide variety of conversational subjects, supporting the development of robust open-domain conversational agents.
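To make these statistics concrete, the following is a minimal sketch of how a session with interleaved text and image turns might be represented and how the per-dialogue averages are computed. The class and field names (`Turn`, `image_urls`, and so on) are illustrative assumptions, not the schema of the released data.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Turn:
    """One utterance in a session; either the text or the image list may be empty."""
    speaker: str
    text: str = ""
    image_urls: List[str] = field(default_factory=list)  # hypothetical field name

@dataclass
class Dialogue:
    topic: str
    turns: List[Turn]

def session_stats(dialogues: List[Dialogue]) -> Tuple[float, float]:
    """Average turns and images per dialogue (MMDialog reports 4.56 and 2.59)."""
    n = len(dialogues)
    avg_turns = sum(len(d.turns) for d in dialogues) / n
    avg_images = sum(len(t.image_urls) for d in dialogues for t in d.turns) / n
    return avg_turns, avg_images
```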
Task Definition and Baseline Models
The authors propose two primary tasks on MMDialog: multi-modal response generation and multi-modal response retrieval. In the generation setting, a model must synthesize a response, which may consist of text, images, or both, conditioned on the preceding multi-modal dialogue context. In the retrieval setting, the model instead selects the most appropriate response elements from a pool of candidates.
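A rough way to see the difference between the two tasks is as two interfaces over the same multi-modal context. The types below are an illustrative sketch, not the paper's formalization; in particular, representing an image element as raw bytes is an assumption made for brevity.

```python
from typing import List, Protocol, Union

TextElement = str
ImageElement = bytes                 # assumed representation of an image element
Element = Union[TextElement, ImageElement]
Context = List[Element]              # flattened multi-modal dialogue history

class Generator(Protocol):
    def generate(self, context: Context) -> List[Element]:
        """Synthesize a response that may mix text and image elements."""
        ...

class Retriever(Protocol):
    def retrieve(self, context: Context, candidates: List[Element]) -> List[Element]:
        """Select the best-matching response elements from a candidate pool."""
        ...
```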
To support these tasks, the authors provide baseline models. For generation, they adapt Divter, a multi-modal response generation model, employing DialoGPT for text generation and DALL-E for image generation. For retrieval, the DE++ model uses a CLIP-inspired dual-encoder that encodes the dialogue context and each candidate element separately and ranks candidates by their relevance to the context.
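As an illustration of the dual-encoder idea (not the authors' DE++ implementation), the sketch below ranks candidate text replies by cosine similarity to the dialogue context using the public Hugging Face CLIP text encoder. The context string is naively truncated to CLIP's input limit, and image candidates would be scored analogously with the image encoder.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_text_candidates(context: str, candidates: List[str]) -> List[tuple]:
    """Rank candidate text responses by cosine similarity to the dialogue context."""
    with torch.no_grad():
        inputs = processor(text=[context] + candidates, return_tensors="pt",
                           padding=True, truncation=True)
        feats = model.get_text_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)   # L2-normalize embeddings
    scores = (feats[1:] @ feats[0]).tolist()            # cosine similarity to context
    return sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)

# Example: pick the more plausible textual reply for a short dialogue context.
print(rank_text_candidates("Just hiked up to the summit, the view is unreal!",
                           ["Wow, which trail did you take?", "I had pasta for dinner."]))
```

Because each candidate is encoded independently of the context, candidate embeddings can be precomputed once and reused across queries, which is the main practical appeal of the dual-encoder design.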
Evaluation Metrics and Findings
The paper introduces MM-Relevance, a novel evaluation metric that uses the CLIP model to measure the alignment between generated or retrieved responses and ground-truth responses across modalities. This addresses a core difficulty in evaluating multi-modal responses: a prediction may convey appropriate content in a different modality than the reference (for example, text where the ground truth contains an image), so purely text-based or image-based metrics cannot compare the two directly.
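A simplified illustration of the idea, not the paper's exact formulation: embed each predicted and ground-truth response element with CLIP (text or image encoder, depending on modality) and average the pairwise cosine similarities in the shared embedding space. The function names and the averaging scheme below are assumptions made for clarity.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_embed(element) -> torch.Tensor:
    """Embed a response element: str -> text encoder, PIL.Image -> image encoder."""
    with torch.no_grad():
        if isinstance(element, str):
            inp = processor(text=[element], return_tensors="pt", truncation=True)
            feat = model.get_text_features(**inp)
        else:
            inp = processor(images=element, return_tensors="pt")
            feat = model.get_image_features(**inp)
    return feat[0] / feat[0].norm()

def mm_relevance_sketch(predicted: list, reference: list) -> float:
    """Average pairwise CLIP similarity between predicted and reference elements."""
    sims = [float(clip_embed(p) @ clip_embed(r)) for p in predicted for r in reference]
    return sum(sims) / len(sims)
```

Because both modalities land in the same embedding space, this kind of score remains defined even when a predicted text element is compared against a reference image, which is exactly the modality mismatch MM-Relevance is designed to handle.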
Experimental results highlight the difficulty of multi-modal dialogue modeling. The generative baseline Divter achieves moderate BLEU and ROUGE scores for text and an Inception Score (IS) of 20.53 for generated images, leaving clear room for improvement in producing coherent, contextually rich responses. The retrieval-based DE++ model performs well on recall-based metrics, reflecting its ability to select contextually appropriate responses from the candidate pool.
Implications and Future Directions
The introduction of MMDialog has significant implications for the development of comprehensive multi-modal conversational agents capable of interpreting and generating nuanced dialogue in an open-domain setting. The dataset's scale and diversity are likely to drive innovations in how AI systems understand and interact across different modalities.
Future research directions suggested by the authors include improving the alignment and quality of generated multi-modal responses and exploring more sophisticated techniques for integrating modalities within dialogue systems. The dataset's release is positioned as a valuable resource for ongoing efforts to improve the responsiveness and versatility of conversational AI. By tackling the challenges MMDialog lays out, researchers can work toward dialogue systems that perceive and interact across modalities as fluidly as human communicators.