DialogCC: An Automated Pipeline for Creating High-Quality Multi-Modal Dialogue Dataset (2212.04119v2)

Published 8 Dec 2022 in cs.CV and cs.CL

Abstract: As sharing images in instant messages is a crucial behavior, there has been active research on learning image-text multi-modal dialogue models. However, training a well-generalized multi-modal dialogue model remains challenging due to the low quality and limited diversity of images per dialogue in existing multi-modal dialogue datasets. In this paper, we propose an automated pipeline for constructing a multi-modal dialogue dataset that ensures both dialogue quality and image diversity with minimal human effort. In our pipeline, to guarantee coherence between images and dialogue, we prompt GPT-4 to infer potential image-sharing moments, specifically the utterance, speaker, rationale, and image description. Furthermore, we leverage CLIP similarity to maintain consistency among the multiple images aligned to each utterance. Through this pipeline, we introduce DialogCC, a high-quality and diverse multi-modal dialogue dataset that surpasses existing datasets in human evaluations of quality and diversity. Our comprehensive experiments show that multi-modal dialogue models trained on our dataset generalize significantly better to unseen dialogue datasets. We make our source code and dataset publicly available.
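The CLIP-based step is concrete enough to sketch. Below is a minimal, hypothetical illustration in the spirit of the abstract: candidate images are scored against the target utterance with CLIP, and outliers are then dropped so the images attached to a single utterance stay mutually consistent. The checkpoint, thresholds, and function name are assumptions for illustration, not the authors' released code.

```python
# Sketch of CLIP-similarity filtering as described in the abstract.
# Checkpoint and thresholds below are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def align_images(utterance: str, image_paths: list[str],
                 text_sim_min: float = 0.25,
                 image_sim_min: float = 0.6) -> list[str]:
    """Return the image paths judged coherent with the utterance."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=[utterance], images=images,
                       return_tensors="pt", padding=True, truncation=True)

    # Project utterance and images into CLIP's shared embedding space.
    text_emb = model.get_text_features(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"])
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

    # Step 1: cosine similarity between the utterance and each candidate.
    text_sims = (img_emb @ text_emb.T).squeeze(-1)  # shape: (num_images,)
    keep = text_sims >= text_sim_min

    # Step 2: drop images far from the centroid of the kept set, so the
    # multiple images aligned to one utterance remain mutually consistent.
    if keep.any():
        centroid = img_emb[keep].mean(dim=0)
        centroid = centroid / centroid.norm()
        keep &= (img_emb @ centroid) >= image_sim_min

    return [p for p, k in zip(image_paths, keep.tolist()) if k]
```

The two-stage design reflects the abstract's two stated goals: the text-image threshold enforces coherence between each image and the dialogue, while the centroid check enforces consistency among the images attached to the same utterance.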
