Ask Questions with Double Hints: Visual Question Generation with Answer-awareness and Region-reference (2407.05100v1)

Published 6 Jul 2024 in cs.CV, cs.CL, and cs.MM

Abstract: The visual question generation (VQG) task aims to generate human-like questions from an image and potentially other side information (e.g. answer type). Previous works on VQG fall short in two aspects: i) they suffer from the one-image-to-many-questions mapping problem, which prevents them from generating referential and meaningful questions from an image; ii) they fail to model the complex implicit relations among the visual objects in an image and also overlook potential interactions between the side information and the image. To address these limitations, we first propose a novel learning paradigm to generate visual questions with answer-awareness and region-reference. Concretely, we aim to ask the right visual questions with Double Hints - textual answers and visual regions of interest - which effectively mitigates the existing one-to-many mapping issue. In particular, we develop a simple methodology to self-learn the visual hints without introducing any additional human annotations. Furthermore, to capture these sophisticated relationships, we propose a new double-hints guided Graph-to-Sequence learning framework, which first models them as a dynamic graph and learns the implicit topology end-to-end, and then utilizes a graph-to-sequence model to generate the questions with double hints. Experimental results demonstrate the superiority of our proposed method.
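As a rough illustration of the pipeline the abstract describes, the sketch below shows one way a double-hints graph-to-sequence model could be wired together: object features are connected through a learned (dynamic) soft adjacency, message passing is weighted by the self-learned region hints, and the textual answer hint conditions a simple decoder. This is a minimal sketch, not the authors' implementation: the module names, dimensionalities, the cosine-similarity graph learner, and the single-step propagation are all illustrative assumptions.

```python
# Hedged sketch of a double-hints graph-to-sequence pipeline (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicGraphLearner(nn.Module):
    """Learns a soft adjacency over detected visual objects end-to-end."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, obj_feats):                  # obj_feats: (N, dim)
        h = F.normalize(self.proj(obj_feats), dim=-1)
        adj = torch.softmax(h @ h.t(), dim=-1)     # dense, differentiable topology
        return adj

class DoubleHintsGraph2Seq(nn.Module):
    """Fuses the region hint and answer hint, runs one message-passing step,
    then decodes a question with a GRU (attention/word embeddings omitted)."""
    def __init__(self, dim, vocab_size):
        super().__init__()
        self.graph_learner = DynamicGraphLearner(dim)
        self.gnn = nn.Linear(dim, dim)             # single propagation step as a stand-in GNN
        self.answer_enc = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRUCell(dim, dim)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, obj_feats, region_mask, answer_emb, max_len=20):
        # region_mask (N,) marks the self-learned regions of interest (visual hint)
        adj = self.graph_learner(obj_feats)
        nodes = torch.relu(self.gnn(adj @ obj_feats))
        nodes = nodes * region_mask.unsqueeze(-1)            # emphasise hinted regions
        _, ans = self.answer_enc(answer_emb.unsqueeze(0))    # encode the answer hint
        state = nodes.mean(dim=0) + ans.squeeze()            # fuse graph and answer signals
        inp = torch.zeros_like(state)
        logits = []
        for _ in range(max_len):
            state = self.decoder(inp.unsqueeze(0), state.unsqueeze(0)).squeeze(0)
            logits.append(self.out(state))
            inp = state                                      # feed state back for brevity
        return torch.stack(logits)                           # (max_len, vocab_size)

# Example usage with random tensors (36 objects, dim=256, 1000-word vocab):
model = DoubleHintsGraph2Seq(dim=256, vocab_size=1000)
question_logits = model(torch.randn(36, 256), torch.ones(36), torch.randn(5, 256))
```

The point the sketch tries to convey is that the adjacency is produced by a differentiable module rather than taken from a fixed scene graph, so the implicit topology can be learned end-to-end together with the question decoder, conditioned on both hints.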

Authors (7)
  1. Kai Shen (29 papers)
  2. Lingfei Wu (135 papers)
  3. Siliang Tang (116 papers)
  4. Fangli Xu (17 papers)
  5. Bo Long (59 papers)
  6. Yueting Zhuang (164 papers)
  7. Jian Pei (104 papers)