Semi-Supervised Image Captioning Considering Wasserstein Graph Matching (2403.17995v1)
Abstract: Image captioning automatically generates captions for given images, and the key challenge is to learn a mapping function from visual features to natural language features. Existing approaches are mostly supervised, i.e., each image in the training set has a corresponding sentence. However, since describing images requires substantial manpower, real-world applications usually provide only a limited number of described images (i.e., image-text pairs) alongside a large number of undescribed images. This gives rise to the problem of "Semi-Supervised Image Captioning". To solve this problem, we propose a novel Semi-Supervised Image Captioning method considering Wasserstein Graph Matching (SSIC-WGM), which adopts the raw image inputs to supervise the generated sentences. Unlike traditional single-modal semi-supervised methods, the difficulty of semi-supervised cross-modal learning lies in constructing intermediately comparable information among heterogeneous modalities. SSIC-WGM adopts scene graphs as this intermediate information and constrains the generated sentences from two aspects: 1) inter-modal consistency: SSIC-WGM constructs the scene graphs of the raw image and the generated sentence, respectively, and then employs the Wasserstein distance to better measure the similarity between region embeddings of the two graphs; 2) intra-modal consistency: SSIC-WGM applies data augmentation techniques to the raw images and then constrains the consistency among the augmented images and the generated sentences. Consequently, SSIC-WGM combines cross-modal pseudo supervision with a structure-invariant measure to efficiently exploit the undescribed images, and learns a more reasonable mapping function.
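The inter-modal consistency term hinges on comparing the node (region) embeddings of the image scene graph with those of the generated-sentence scene graph via a Wasserstein (optimal-transport) distance. The sketch below is a minimal, entropy-regularized Sinkhorn approximation in NumPy, assuming uniform node weights and a cosine ground cost; the function name, hyperparameters, and toy data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sinkhorn_wasserstein(X, Y, eps=0.1, n_iters=200):
    """Entropy-regularized Wasserstein cost between two sets of node
    embeddings X (m x d) and Y (n x d), with uniform node weights.
    Illustrative sketch only; not SSIC-WGM's exact formulation."""
    # Ground cost: cosine distance between every image-graph node and
    # every sentence-graph node.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    C = 1.0 - Xn @ Yn.T                      # (m, n) cost matrix

    m, n = C.shape
    a = np.full(m, 1.0 / m)                  # uniform marginal over image regions
    b = np.full(n, 1.0 / n)                  # uniform marginal over sentence nodes

    K = np.exp(-C / eps)                     # Gibbs kernel
    u = np.ones(m)
    for _ in range(n_iters):                 # Sinkhorn fixed-point iterations
        v = b / (K.T @ u)
        u = a / (K @ v)

    P = u[:, None] * K * v[None, :]          # approximate transport plan
    return float(np.sum(P * C))              # approximate Wasserstein cost

# Toy usage: 5 image-region embeddings vs. 4 sentence-node embeddings (dim 8).
rng = np.random.default_rng(0)
img_nodes = rng.normal(size=(5, 8))
txt_nodes = rng.normal(size=(4, 8))
print(sinkhorn_wasserstein(img_nodes, txt_nodes))
```

In a semi-supervised setting, a cost of this kind could serve as the pseudo-supervision signal on undescribed images: a lower transport cost between the two scene graphs indicates a generated sentence whose structure better matches the image.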