
Visual Question Generation in Bengali (2310.08187v1)

Published 12 Oct 2023 in cs.CL

Abstract: The task of Visual Question Generation (VQG) is to generate human-like questions relevant to a given image. As VQG is an emerging research field, existing works tend to focus only on resource-rich languages such as English due to the availability of datasets. In this paper, we propose the first Bengali Visual Question Generation task and develop a novel transformer-based encoder-decoder architecture that generates questions in Bengali when given an image. We propose multiple model variants: (i) image-only, a baseline model that generates questions from images without additional information, and (ii) image-category and image-answer-category, guided VQG variants in which we condition the model to generate questions based on the answer and the category of the expected question. These models are trained and evaluated on the translated VQAv2.0 dataset. Our quantitative and qualitative results establish the first state-of-the-art models for the VQG task in Bengali and demonstrate that our models are capable of generating grammatically correct and relevant questions. Our quantitative results show that the image-cat model achieves a BLEU-1 score of 33.12 and a BLEU-3 score of 7.56, the highest among the three variants. We also perform a human evaluation to assess the quality of the generated questions. Human evaluation suggests that the image-cat model is capable of generating goal-driven and attribute-specific questions that stay relevant to the corresponding image.
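The three variants described in the abstract differ only in what is fed to the encoder alongside the image. The sketch below (not the authors' code) shows one plausible way to implement that conditioning with a standard PyTorch transformer encoder-decoder: pre-extracted CNN image features are projected into the model dimension and, depending on the variant, concatenated with a learned answer-category embedding and/or embedded answer tokens before the Bengali question is decoded. All module names, feature sizes, and hyperparameters here are illustrative assumptions, not values taken from the paper.

import torch
import torch.nn as nn

class GuidedVQG(nn.Module):
    """Illustrative VQG sketch: image-only, image-cat, or image-answer-cat,
    selected by which optional inputs are passed to forward().
    (Positional encodings are omitted for brevity.)"""

    def __init__(self, vocab_size, n_categories, d_model=512, img_dim=2048):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, d_model)            # project CNN grid features
        self.cat_embed = nn.Embedding(n_categories, d_model)   # answer-category embedding
        self.tok_embed = nn.Embedding(vocab_size, d_model)     # shared token embedding
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=4)
        self.out = nn.Linear(d_model, vocab_size)               # per-token vocabulary logits

    def forward(self, img_feats, question_ids, category_ids=None, answer_ids=None):
        # Encoder input depends on the variant:
        #   image-only        -> image features alone
        #   image-cat         -> image features + category embedding
        #   image-answer-cat  -> image features + category embedding + answer tokens
        parts = [self.img_proj(img_feats)]                       # (B, N, d_model)
        if category_ids is not None:
            parts.append(self.cat_embed(category_ids).unsqueeze(1))  # (B, 1, d_model)
        if answer_ids is not None:
            parts.append(self.tok_embed(answer_ids))             # (B, A, d_model)
        memory = self.encoder(torch.cat(parts, dim=1))

        # Teacher-forced decoding of the Bengali question with a causal mask.
        tgt = self.tok_embed(question_ids)                        # (B, T, d_model)
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        return self.out(self.decoder(tgt, memory, tgt_mask=tgt_mask))

At evaluation time, questions decoded from such a model would be compared to reference questions with n-gram metrics such as BLEU-1 and BLEU-3, which are the scores the abstract reports.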

References (43)
Authors (5)
  1. Mahmud Hasan (7 papers)
  2. Labiba Islam (1 paper)
  3. Jannatul Ferdous Ruma (1 paper)
  4. Tasmiah Tahsin Mayeesha (2 papers)
  5. Rashedur M. Rahman (3 papers)
Citations (1)