MMChat: Multi-Modal Chat Dataset on Social Media

Published 16 Aug 2021 in cs.CL and cs.CV | arXiv:2108.07154v3

Abstract: Incorporating multi-modal contexts in conversation is important for developing more engaging dialogue systems. In this work, we explore this direction by introducing MMChat: a large-scale Chinese multi-modal dialogue corpus (32.4M raw dialogues and 120.84K filtered dialogues). Unlike previous corpora that are crowd-sourced or collected from fictitious movies, MMChat contains image-grounded dialogues collected from real conversations on social media, in which the sparsity issue is observed. Specifically, image-initiated dialogues in common communications may deviate to some non-image-grounded topics as the conversation proceeds. To better investigate this issue, we manually annotate 100K dialogues from MMChat and further filter the corpus accordingly, which yields MMChat-hf. We develop a benchmark model to address the sparsity issue in dialogue generation tasks by adapting the attention routing mechanism on image features. Experiments demonstrate the usefulness of incorporating image features and the effectiveness of handling the sparsity of image features.

Summary

  • The paper presents the MMChat dataset, comprising 120.84K high-quality image-grounded dialogue sessions extracted from 32.4M raw social media interactions.
  • It details a manual filtering process that yields the MMChat-hf subset, whose stronger image-dialogue correlation translates into improved BLEU scores in experiments.
  • The study introduces an attention routing mechanism that effectively mitigates image sparsity, advancing multi-modal dialogue systems.

An Analytical Overview of "MMChat: Multi-Modal Chat Dataset on Social Media"

The paper introduces MMChat, a comprehensive Chinese multi-modal dialogue dataset collected from social media to advance the development of Multi-Modal Dialogue Systems (MMDSs). Unlike previous datasets that rely on crowd-sourcing or fictional movie scripts, MMChat contains image-grounded dialogues drawn from authentic social media interactions, which exposes a sparsity challenge: conversations initiated by an image may drift away from image-grounded topics as they proceed.

Key Contributions and Dataset Construction

Key contributions outlined in the paper include:

  1. Dataset Construction: The MMChat corpus includes 32.4 million raw dialogues, from which 120.84K sessions were filtered as high-quality image-grounded dialogues. The collection emphasizes conversations initiated by images but allows for topic drift, highlighting the realistic nature of social media interactions.
  2. Manual Filtering to Create MMChat-hf: Through manual annotation of 100K dialogue sessions, a subset named MMChat-hf was derived, further refining the corpus to include 19.90K sessions with enhanced image-dialogue correlation.
  3. Benchmark Model: A benchmark model was developed that incorporates an attention routing mechanism to handle image sparsity, demonstrating its utility in open-domain dialogue generation (a hedged sketch of the routing idea follows this list).
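
The overview describes attention routing only at a high level. The PyTorch sketch below illustrates one plausible reading of the idea, under stated assumptions: a decoder layer maintains two attention routes, one over the textual dialogue context and one over image features, and blends them with a learned gate so generation can fall back to text alone when the conversation has drifted away from the image. All class, parameter, and tensor names here are hypothetical; this is not the paper's exact architecture.

```python
# Minimal sketch of attention routing over sparse image features.
# Hypothetical simplification of the mechanism the paper adapts; module
# names and shapes are illustrative, not taken from the paper's code.
import torch
import torch.nn as nn

class AttentionRouter(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.image_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Scalar gate estimating how image-grounded the current step is.
        self.gate = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())

    def forward(self, hidden, text_ctx, image_feats):
        # Route 1: attend over the encoded dialogue history.
        text_out, _ = self.text_attn(hidden, text_ctx, text_ctx)
        # Route 2: attend over visual features extracted from the image.
        img_out, _ = self.image_attn(hidden, image_feats, image_feats)
        # Blend the routes; a gate near 0 ignores the image, which is the
        # desired behaviour once the dialogue has drifted off-topic.
        alpha = self.gate(hidden)  # (batch, seq_len, 1)
        return alpha * img_out + (1 - alpha) * text_out

# Toy usage: batch of 2, decoder length 10, 20 context tokens,
# 49 projected image-region features, model width 512.
router = AttentionRouter(d_model=512)
out = router(torch.randn(2, 10, 512),
             torch.randn(2, 20, 512),
             torch.randn(2, 49, 512))  # -> (2, 10, 512)
```

In the paper the routing operates inside a pre-trained Transformer decoder; this standalone module is only meant to convey the gating idea.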

Experimental Findings

The experimental section reveals notable findings:

  • Integration of visual contexts into dialogue systems markedly enhances response quality, as demonstrated by improvements in BLEU scores compared to text-only models.
  • The attention routing mechanism significantly mitigates the sparsity challenge, offering a technique for more nuanced multi-modal dialogue generation.
  • The MMChat-hf data yielded higher BLEU and distinct scores than MMChat, underscoring the importance of rigorous filtering in improving dataset quality (a hedged evaluation sketch follows this list).
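
As a point of reference, the snippet below shows how corpus BLEU and distinct-n are commonly computed for this kind of evaluation. It is a hedged sketch, not the paper's evaluation code: sacrebleu with its Chinese tokenizer is assumed for BLEU, and distinct-n is implemented at the character level, a common choice for Chinese text.

```python
# Hypothetical evaluation sketch, not the paper's scripts: corpus BLEU via
# sacrebleu (Chinese tokenizer) plus a simple character-level distinct-n.
import sacrebleu

def distinct_n(responses, n):
    """Fraction of unique n-grams across all generated responses."""
    ngrams, total = set(), 0
    for resp in responses:
        tokens = list(resp)  # character-level tokens for Chinese text
        for i in range(len(tokens) - n + 1):
            ngrams.add(tuple(tokens[i:i + n]))
            total += 1
    return len(ngrams) / max(total, 1)

hypotheses = ["生成的回复一", "生成的回复二"]   # model outputs (toy data)
references = [["参考回复一", "参考回复二"]]     # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references, tokenize="zh")
print(f"BLEU = {bleu.score:.2f}")
print(f"distinct-1 = {distinct_n(hypotheses, 1):.3f}")
print(f"distinct-2 = {distinct_n(hypotheses, 2):.3f}")
```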

Theoretical and Practical Implications

The introduction of MMChat has significant implications:

  • Theoretical: Provides a robust foundation for examining the phenomenon of topic drift in multi-modal conversations, propelling research into the dynamics of natural human dialogue in digital environments.
  • Practical: By offering a dataset grounded in real-world interactions, MMChat facilitates the creation of MMDSs that can better mimic genuine human dialogue, enhancing user engagement in conversational agents.

Future Directions

Future work stemming from this research could involve:

  • Extending the study to include audio and gesture modalities, enriching the understanding of multi-modal interactions.
  • Further exploration into the scalability of MMChat for transfer learning in low-resource dialogue systems.
  • Addressing ethical considerations such as privacy and bias, towards developing more inclusive and responsible AI systems.

In conclusion, MMChat represents a significant stride towards authentic and sophisticated MMDSs by presenting a dataset grounded in real-world interactions, paired with a benchmark approach for handling the sparsity of image grounding in real conversations. This work lays a foundation for future advancements in AI dialogue research and applications.
