InfoVisDial: An Informative Visual Dialogue Dataset by Bridging Large Multimodal and Language Models (2312.13503v1)

Published 21 Dec 2023 in cs.CV and cs.AI

Abstract: In this paper, we build a visual dialogue dataset, named InfoVisDial, which provides rich informative answers in each round, even drawing on external knowledge related to the visual content. Different from existing datasets, where answers are compact and short, InfoVisDial contains long free-form answers with rich information in each round of dialogue. For effective data collection, the key idea is to bridge a large-scale multimodal model (e.g., GIT) and an LLM (e.g., GPT-3). GIT can describe the image content, including scene text, while GPT-3 can generate informative dialogue based on the image description and appropriate prompting techniques. With such an automatic pipeline, we can readily generate informative visual dialogue data at scale. We then ask human annotators to rate the generated dialogues and filter out low-quality conversations. Human analyses show that InfoVisDial covers informative and diverse dialogue topics: $54.4\%$ of the dialogue rounds are related to image scene texts, and $36.7\%$ require external knowledge. Each round's answer is also long and open-ended: $87.3\%$ of answers are unique, with an average length of $8.9$, compared with $27.37\%$ and $2.9$ in VisDial. Finally, we propose a strong baseline by adapting the GIT model to the visual dialogue task and fine-tuning it on InfoVisDial. We hope our work can motivate more effort in this direction.
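
The two-stage generation pipeline described in the abstract (a GIT-style captioner describes the image, including scene text, and an LLM then writes a multi-round informative dialogue from that description) can be sketched roughly as below. This is a minimal illustration, not the authors' released code: the Hugging Face checkpoint name, the prompt wording, and the use of a chat-style OpenAI model in place of the original GPT-3 completion setup are all assumptions.

```python
# Rough sketch of an InfoVisDial-style generation pipeline:
# (1) a GIT captioner describes the image (including scene text),
# (2) an LLM turns the description into a multi-round informative dialogue.
# Checkpoint names, prompt wording, and the LLM backend are illustrative
# assumptions, not the paper's exact setup.
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM
from openai import OpenAI

# A GIT checkpoint fine-tuned on TextCaps, so scene text tends to appear in captions.
processor = AutoProcessor.from_pretrained("microsoft/git-large-textcaps")
captioner = AutoModelForCausalLM.from_pretrained("microsoft/git-large-textcaps")


def describe_image(path: str) -> str:
    """Generate a free-form caption for the image with GIT."""
    image = Image.open(path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    ids = captioner.generate(pixel_values=pixel_values, max_length=64)
    return processor.batch_decode(ids, skip_special_tokens=True)[0]


def generate_dialogue(description: str, rounds: int = 5) -> str:
    """Prompt an LLM (a stand-in for GPT-3 here) to write an informative
    dialogue grounded in the image description; external knowledge allowed."""
    prompt = (
        "Image description: " + description + "\n\n"
        f"Write a {rounds}-round question-answer dialogue about this image. "
        "Answers should be long, informative, and may use external knowledge."
    )
    client = OpenAI()  # requires OPENAI_API_KEY in the environment
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any instruction-following LLM works in this role
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


if __name__ == "__main__":
    desc = describe_image("example.jpg")
    print(generate_dialogue(desc))
```

In the actual dataset construction, the generated dialogues are additionally rated by human annotators and low-quality conversations are filtered out; a sketch like this would need that filtering as a downstream step.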

Authors (6)
  1. Bingbing Wen (11 papers)
  2. Zhengyuan Yang (86 papers)
  3. Jianfeng Wang (149 papers)
  4. Zhe Gan (135 papers)
  5. Bill Howe (39 papers)
  6. Lijuan Wang (133 papers)