
Read, Look or Listen? What's Needed for Solving a Multimodal Dataset (2307.04532v1)

Published 6 Jul 2023 in cs.CV, cs.AI, cs.CL, and eess.AS

Abstract: The prevalence of large-scale multimodal datasets presents unique challenges in assessing dataset quality. We propose a two-step method to analyze multimodal datasets, which leverages a small seed of human annotation to map each multimodal instance to the modalities required to process it. Our method sheds light on the importance of different modalities in datasets, as well as the relationship between them. We apply our approach to TVQA, a video question-answering dataset, and discover that most questions can be answered using a single modality, without a substantial bias towards any specific modality. Moreover, we find that more than 70% of the questions are solvable using several different single-modality strategies, e.g., by either looking at the video or listening to the audio, highlighting the limited integration of multiple modalities in TVQA. We leverage our annotation and analyze the MERLOT Reserve, finding that it struggles with image-based questions compared to text and audio, but also with auditory speaker identification. Based on our observations, we introduce a new test set that necessitates multiple modalities, observing a dramatic drop in model performance. Our methodology provides valuable insights into multimodal datasets and highlights the need for the development of more robust models.
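To make the abstract's annotation scheme concrete, below is a minimal Python sketch of the statistics it implies: each instance is mapped to the set of single modalities that suffice to answer it, from which one can read off the share of questions solvable by one modality and the share solvable by several different single-modality strategies. The question ids, modality labels, and toy annotations are hypothetical illustrations, not the paper's actual data.

```python
from collections import Counter

# Hypothetical seed annotations in the spirit of the paper's first step:
# each question id maps to the set of single modalities that suffice to
# answer it (an empty set means no single modality is enough).
seed_annotations = {
    "q1": {"text"},                    # subtitles alone suffice
    "q2": {"video", "audio"},          # either looking or listening works
    "q3": {"text", "video", "audio"},  # any single modality works
    "q4": set(),                       # requires combining modalities
}

def modality_stats(annotations):
    """Fractions of questions solvable by one, or by several, single modalities."""
    total = len(annotations)
    single = sum(1 for m in annotations.values() if len(m) >= 1) / total
    several = sum(1 for m in annotations.values() if len(m) >= 2) / total
    per_modality = Counter(m for mods in annotations.values() for m in mods)
    return single, several, per_modality

single, several, per_modality = modality_stats(seed_annotations)
print(f"solvable by a single modality:             {single:.0%}")
print(f"solvable by several single-modality paths: {several:.0%}")
print(f"per-modality counts: {dict(per_modality)}")
```

On TVQA, the paper reports the second figure exceeding 70%, which is the basis for its claim that the dataset only weakly requires integrating multiple modalities.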

Authors (3)
  1. Netta Madvil (2 papers)
  2. Yonatan Bitton (36 papers)
  3. Roy Schwartz (74 papers)