
A Gaze-grounded Visual Question Answering Dataset for Clarifying Ambiguous Japanese Questions (2403.17545v1)

Published 26 Mar 2024 in cs.CL and cs.CV

Abstract: Situated conversations that refer to visual information, as in visual question answering (VQA), often contain ambiguities caused by reliance on directive information. The problem is exacerbated in languages such as Japanese, where subjects and objects are often omitted. In conversational situations, such ambiguities are often resolved by context, for example joint attention with the user or the user's gaze. In this study, we propose the Gaze-grounded VQA dataset (GazeVQA), which clarifies ambiguous questions using gaze information, focusing on a clarification process complemented by gaze. We also propose a method that uses gaze target estimation results to improve accuracy on GazeVQA tasks. Our experimental results show that the proposed method improves the performance of a VQA system on GazeVQA in some cases, and they reveal typical problems of GazeVQA tasks that remain to be addressed.
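
The abstract describes feeding gaze target estimation results into a VQA pipeline. As a rough illustration only, the sketch below shows one plausible way to do this: crop the image to an estimated gaze-target box before querying a VQA model. The GazeTarget type, the confidence threshold, and the vqa_model.predict interface are hypothetical placeholders and are not taken from the paper; the authors' actual method may differ.

```python
# Minimal sketch (not the authors' implementation): grounding an ambiguous
# question on an estimated gaze target before running VQA.

from dataclasses import dataclass
from typing import Tuple


@dataclass
class GazeTarget:
    """Hypothetical output of a gaze target estimator."""
    box: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels
    confidence: float               # estimator confidence in [0, 1]


def answer_with_gaze(image, question: str, gaze: GazeTarget, vqa_model) -> str:
    """Answer a possibly ambiguous question using the gaze-attended region.

    If the gaze estimate is reliable, crop the image to the attended region so
    that an omitted subject/object (e.g. "What colour is it?") is resolved
    against what the user is looking at; otherwise fall back to the full image.
    """
    if gaze.confidence >= 0.5:           # threshold chosen for illustration
        region = image.crop(gaze.box)    # e.g. PIL.Image.crop on a bounding box
    else:
        region = image
    return vqa_model.predict(region, question)  # hypothetical VQA interface
```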

