Fully Authentic Visual Question Answering Dataset from Online Communities (2311.15562v4)

Published 27 Nov 2023 in cs.CV

Abstract: Visual Question Answering (VQA) entails answering questions about images. We introduce the first VQA dataset in which all contents originate from an authentic use case. Sourced from online question answering community forums, the dataset is called VQAonline. We characterize it and how it relates to eight mainstream VQA datasets. Observing that answers in our dataset tend to be much longer (a mean of 173 words) and thus incompatible with standard VQA evaluation metrics, we instead use popular metrics for longer-text evaluation to evaluate six state-of-the-art VQA models on VQAonline and report where they struggle most. Finally, we analyze which evaluation metrics align best with human judgments. To facilitate future extensions, we publicly share the dataset at: https://vqaonline.github.io/.
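
To make the evaluation setup concrete, the sketch below shows how two commonly used longer-text metrics, ROUGE-L and BERTScore, could be applied to a long community-style answer of the kind the abstract describes. It is a minimal illustration assuming the `rouge-score` and `bert-score` Python packages; the question domain, reference answer, and model output are invented for the example and are not drawn from VQAonline.

```python
# Minimal sketch: scoring a long-form VQA answer with longer-text metrics.
# Assumes the `rouge-score` and `bert-score` packages are installed; the
# reference answer and model answer below are hypothetical examples.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference_answer = (
    "The plant in the photo is a jade plant (Crassula ovata). The leaf drop "
    "is most likely caused by overwatering; let the soil dry out between waterings."
)
model_answer = (
    "This looks like a jade plant. Yellowing and dropping leaves usually "
    "indicate too much water, so water less frequently."
)

# ROUGE-L: longest-common-subsequence overlap between prediction and reference.
rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = rouge.score(reference_answer, model_answer)["rougeL"].fmeasure

# BERTScore: semantic similarity of tokens via contextual embeddings.
_, _, f1 = bert_score([model_answer], [reference_answer], lang="en", verbose=False)

print(f"ROUGE-L F1:   {rouge_l:.3f}")
print(f"BERTScore F1: {f1.item():.3f}")
```

Reference-based metrics like these reward overlap with a single ground-truth answer, so the paper's analysis of which metrics best align with human judgments is what ultimately determines which scores to trust for long answers.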
