
MMChat: Multi-Modal Chat Dataset on Social Media (2108.07154v3)

Published 16 Aug 2021 in cs.CL and cs.CV

Abstract: Incorporating multi-modal contexts in conversation is important for developing more engaging dialogue systems. In this work, we explore this direction by introducing MMChat: a large-scale Chinese multi-modal dialogue corpus (32.4M raw dialogues and 120.84K filtered dialogues). Unlike previous corpora that are crowd-sourced or collected from fictitious movies, MMChat contains image-grounded dialogues collected from real conversations on social media, in which the sparsity issue is observed. Specifically, image-initiated dialogues in common communications may deviate to some non-image-grounded topics as the conversation proceeds. To better investigate this issue, we manually annotate 100K dialogues from MMChat and further filter the corpus accordingly, which yields MMChat-hf. We develop a benchmark model to address the sparsity issue in dialogue generation tasks by adapting the attention routing mechanism on image features. Experiments demonstrate the usefulness of incorporating image features and the effectiveness of handling the sparsity of image features.

An Analytical Overview of "MMChat: Multi-Modal Chat Dataset on Social Media"

The paper introduces MMChat, a large-scale Chinese multi-modal dialogue dataset collected from social media to advance the development of Multi-Modal Dialogue Systems (MMDSs). Unlike previous datasets that rely on crowd-sourced or fictional movie dialogues, MMChat contains image-grounded dialogues drawn from authentic social media interactions, which surfaces the sparsity challenge: conversations that begin around an image often drift toward topics the image no longer grounds.

Key Contributions and Dataset Construction

Key contributions outlined in the paper include:

  1. Dataset Construction: The MMChat corpus includes 32.4 million raw dialogues, from which 120.84K sessions were filtered as high-quality image-grounded dialogues. The collection emphasizes conversations initiated by images but allows for topic drift, highlighting the realistic nature of social media interactions.
  2. Manual Filtering to Create MMChat-hf: Through manual annotation of 100K dialogue sessions, a subset named MMChat-hf was derived, further refining the corpus to include 19.90K sessions with enhanced image-dialogue correlation.
  3. Benchmark Model: A benchmark model was developed that adapts an attention routing mechanism to image features, allowing it to handle their sparsity in open-domain dialogue generation (a hedged sketch of such a mechanism follows this list).
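
The paper's own code is not reproduced here, but the sketch below illustrates one plausible form of an attention-routing layer: the decoder attends separately to image features and to the encoded dialogue context, and a learned gate decides how much visual information to route into each generated token, letting the model down-weight the image once a conversation has drifted off-topic. All names (`AttentionRouter`, `d_model`, the gating MLP) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AttentionRouter(nn.Module):
    """Hedged sketch of an attention-routing decoder layer.

    Routes between attention over image features and attention over
    encoded dialogue history with a per-position gate, so sparse or
    irrelevant image features can be suppressed. Illustrative only;
    not the authors' implementation.
    """

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.img_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ctx_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Gate in [0, 1]: how strongly each decoder position uses the image.
        self.gate = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())

    def forward(self, dec_states, img_feats, ctx_feats):
        # dec_states: (B, T, d); img_feats: (B, I, d); ctx_feats: (B, C, d)
        img_out, _ = self.img_attn(dec_states, img_feats, img_feats)
        ctx_out, _ = self.ctx_attn(dec_states, ctx_feats, ctx_feats)
        g = self.gate(dec_states)                 # (B, T, 1)
        return g * img_out + (1.0 - g) * ctx_out

# Toy usage: batch of 2 dialogues, 10 decoding steps, 512-dim features.
router = AttentionRouter(d_model=512)
dec = torch.randn(2, 10, 512)   # decoder hidden states
img = torch.randn(2, 49, 512)   # e.g. projected 7x7 CNN grid features
ctx = torch.randn(2, 30, 512)   # encoded dialogue history
out = router(dec, img, ctx)     # -> (2, 10, 512)
```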

Experimental Findings

The experimental section reveals notable findings:

  • Integration of visual contexts into dialogue systems markedly enhances response quality, as demonstrated by improvements in BLEU scores compared to text-only models.
  • The attention routing mechanism significantly mitigates the sparsity challenge, offering a technique for more nuanced multi-modal dialogue generation.
  • The MMChat-hf data yielded higher BLEU and Distinct-n scores than MMChat, underscoring the importance of rigorous filtering in improving dataset quality (the snippet following this list illustrates how these metrics are typically computed).
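
As an aid to reproducing this style of evaluation, the snippet below shows how BLEU and Distinct-n are commonly computed over generated responses. It uses NLTK's sentence-level BLEU and a straightforward Distinct-n count; the paper's exact evaluation scripts and tokenization (the dialogues are Chinese) may differ, so treat this as a generic illustration.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def distinct_n(responses, n):
    """Distinct-n: unique n-grams divided by total n-grams across all
    generated responses; higher values indicate more diverse output."""
    ngrams, total = set(), 0
    for tokens in responses:
        for i in range(len(tokens) - n + 1):
            ngrams.add(tuple(tokens[i:i + n]))
            total += 1
    return len(ngrams) / total if total else 0.0

# Toy pre-tokenized data standing in for model outputs.
references = [[["what", "a", "cute", "cat"]]]  # one reference list per hypothesis
hypotheses = [["such", "a", "cute", "cat"]]

smooth = SmoothingFunction().method1
bleu = sum(
    sentence_bleu(refs, hyp, smoothing_function=smooth)
    for refs, hyp in zip(references, hypotheses)
) / len(hypotheses)

print(f"BLEU: {bleu:.3f}")
print(f"Distinct-1: {distinct_n(hypotheses, 1):.3f}")
print(f"Distinct-2: {distinct_n(hypotheses, 2):.3f}")
```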

Theoretical and Practical Implications

The introduction of MMChat has significant implications:

  • Theoretical: Provides a robust foundation for examining the phenomenon of topic drift in multi-modal conversations, propelling research into the dynamics of natural human dialogue in digital environments.
  • Practical: By offering a dataset grounded in real-world interactions, MMChat facilitates the creation of MMDSs that can better mimic genuine human dialogue, enhancing user engagement in conversational agents.

Future Directions

Future work stemming from this research could involve:

  • Extending the corpus to include audio and gesture modalities, enriching the understanding of multi-modal interactions.
  • Further exploration into the scalability of MMChat for transfer learning in low-resource dialogue systems.
  • Addressing ethical considerations such as privacy and bias, towards developing more inclusive and responsible AI systems.

In conclusion, MMChat represents a significant stride towards authentic and sophisticated MMDSs by presenting a dataset grounded in real-world interactions, paired with a proposed solution to tackle the sparsity of image-grounded dialogue. This work lays a foundation for future advancements in AI dialogue research and applications.

Authors (4)
  1. Yinhe Zheng
  2. Guanyi Chen
  3. Xin Liu
  4. Jian Sun
Citations (31)