Top-Down Framework for Weakly-supervised Grounded Image Captioning (2306.07490v3)

Published 13 Jun 2023 in cs.CV

Abstract: Weakly-supervised grounded image captioning (WSGIC) aims to generate a caption and ground (localize) the predicted object words in the input image without bounding-box supervision. Recent two-stage solutions mostly apply a bottom-up pipeline: (1) encode the input image into multiple region features using an object detector; (2) leverage the region features for captioning and grounding. However, relying on independent proposals produced by object detectors tends to make the subsequent grounded captioner overfit when identifying the correct object words, overlook the relations between objects, and select incompatible proposal regions for grounding. To address these issues, we propose a one-stage weakly-supervised grounded captioner that directly takes the RGB image as input and performs captioning and grounding at the top-down image level. Specifically, we encode the image into visual token representations and propose a Recurrent Grounding Module (RGM) in the decoder to obtain precise Visual Language Attention Maps (VLAMs), which recognize the spatial locations of the objects. In addition, we explicitly inject a relation module into our one-stage framework to encourage relation understanding through multi-label classification. These relation semantics serve as contextual information that facilitates the prediction of relation and object words in the caption. We observe that the relation semantics not only assist the grounded captioner in generating a more accurate caption but also improve grounding performance. We validate the effectiveness of our proposed method on two challenging datasets (Flickr30k Entities captioning and MSCOCO captioning). The experimental results demonstrate that our method achieves state-of-the-art grounding performance.
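
To make the described pipeline concrete, the sketch below mirrors the main ideas from the abstract in PyTorch: visual tokens extracted from the raw image, a recurrent grounding step whose cross-attention weights act as the Visual Language Attention Map (VLAM) for each generated word, and a relation head trained by multi-label classification whose pooled feature is injected as context during decoding. All module names, shapes, and hyperparameters here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a one-stage weakly-supervised grounded captioner,
# loosely following the abstract. Names and shapes are assumptions.
import torch
import torch.nn as nn


class RecurrentGroundingModule(nn.Module):
    """Cross-attends the current word query to the visual tokens; the averaged
    attention weights serve as the Visual Language Attention Map (VLAM)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, word_query: torch.Tensor, visual_tokens: torch.Tensor):
        # word_query: (B, 1, D), visual_tokens: (B, N, D)
        ctx, attn = self.cross_attn(word_query, visual_tokens, visual_tokens,
                                    need_weights=True, average_attn_weights=True)
        return ctx, attn.squeeze(1)  # ctx: (B, 1, D), VLAM: (B, N)


class OneStageGroundedCaptioner(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 512, num_relations: int = 100):
        super().__init__()
        # A ViT-style backbone would produce the visual tokens in practice;
        # a linear patch projection stands in for it here.
        self.patch_embed = nn.Linear(3 * 16 * 16, dim)
        self.word_embed = nn.Embedding(vocab_size, dim)
        self.rgm = RecurrentGroundingModule(dim)
        # Relation head: multi-label classification over relation categories.
        self.relation_head = nn.Linear(dim, num_relations)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, patches: torch.Tensor, prev_words: torch.Tensor):
        # patches: (B, N, 3*16*16) flattened RGB patches, prev_words: (B, T)
        vis = self.patch_embed(patches)                   # (B, N, D)
        rel_feat = vis.mean(dim=1)                        # pooled relation context
        rel_logits = self.relation_head(rel_feat)         # (B, num_relations)
        words = self.word_embed(prev_words)               # (B, T, D)

        word_logits, vlams = [], []
        for t in range(words.size(1)):
            # Inject relation semantics as extra context at each decoding step.
            q = words[:, t:t + 1, :] + rel_feat.unsqueeze(1)
            ctx, vlam = self.rgm(q, vis)
            word_logits.append(self.out(ctx.squeeze(1)))  # next-word logits
            vlams.append(vlam)                            # grounding map per word
        return torch.stack(word_logits, 1), torch.stack(vlams, 1), rel_logits


# Example shapes: a 224x224 image split into 196 patches, 10 previous words.
model = OneStageGroundedCaptioner(vocab_size=10000)
logits, vlams, rel_logits = model(torch.randn(2, 196, 768),
                                  torch.randint(0, 10000, (2, 10)))
print(logits.shape, vlams.shape, rel_logits.shape)  # (2,10,10000) (2,10,196) (2,100)
```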

Authors (4)
  1. Chen Cai (24 papers)
  2. Suchen Wang (5 papers)
  3. Yi Wang (1038 papers)
  4. Kim-Hui Yap (28 papers)
Citations (4)
