DialogCC: An Automated Pipeline for Creating High-Quality Multi-Modal Dialogue Dataset (2212.04119v2)
Abstract: Since sharing images is a crucial part of instant messaging, there has been active research on image-text multi-modal dialogue models. However, training a well-generalized multi-modal dialogue model remains challenging because existing multi-modal dialogue datasets offer low-quality and low-diversity images per dialogue. In this paper, we propose an automated pipeline for constructing a multi-modal dialogue dataset that ensures both dialogue quality and image diversity with minimal human effort. In our pipeline, to guarantee coherence between images and dialogue, we prompt GPT-4 to infer potential image-sharing moments: specifically, the utterance, speaker, rationale, and image description. Furthermore, we leverage CLIP similarity to maintain consistency between each utterance and the multiple images aligned to it. Through this pipeline, we introduce DialogCC, a high-quality and diverse multi-modal dialogue dataset that surpasses existing datasets in quality and diversity under human evaluation. Our comprehensive experiments show that multi-modal dialogue models trained on our dataset achieve significantly better generalization performance on unseen dialogue datasets. We make our source code and dataset publicly available.
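The CLIP-similarity step in the abstract can be illustrated with a short sketch. The snippet below is a minimal, hypothetical illustration rather than the authors' released code: it assumes the GPT-4 stage has already produced an image description for an image-sharing moment, uses the Hugging Face `openai/clip-vit-base-patch32` checkpoint as a stand-in for whichever CLIP variant the paper actually uses, and applies an arbitrary similarity threshold; the function name and the 0.25 cutoff are illustrative assumptions.

```python
# Hypothetical sketch of CLIP-based image filtering for one image-sharing moment.
# Checkpoint, threshold, and function name are assumptions, not from the paper.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"  # stand-in CLIP variant
model = CLIPModel.from_pretrained(MODEL_NAME)
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

def filter_candidate_images(image_description, image_paths, threshold=0.25):
    """Keep candidate images whose CLIP embedding is close to the
    GPT-4-generated image description for an image-sharing moment."""
    images = [Image.open(path).convert("RGB") for path in image_paths]
    inputs = processor(text=[image_description], images=images,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Cosine similarity between the text embedding and each image embedding.
    text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    sims = (image_emb @ text_emb.T).squeeze(-1)
    ranked = sorted(zip(image_paths, sims.tolist()), key=lambda pair: -pair[1])
    return [(path, score) for path, score in ranked if score >= threshold]
```

In this reading, the same scoring can be applied between the utterance and the candidate images, so that only images consistent with both the utterance and the generated description are retained.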