Text Data-Centric Image Captioning with Interactive Prompts (2403.19193v1)

Published 28 Mar 2024 in cs.CV

Abstract: Supervised image captioning approaches have made great progress, but it is challenging to collect high-quality human-annotated image-text data. Recently, large-scale vision-language models (e.g., CLIP) and large-scale generative language models (e.g., GPT-2) have shown strong performance on various tasks, and they also offer new solutions for image captioning with web-paired data, unpaired data, or even text-only data. Among them, the mainstream solution is to project image embeddings into the text embedding space with the assistance of consistent representations between image-text pairs from the CLIP model. However, current methods still face several challenges: adapting to diverse data configurations in a unified solution, accurately estimating the image-text embedding bias, and correcting unsatisfactory predictions at inference. This paper proposes a new Text data-centric approach with Interactive Prompts for image Captioning, named TIPCap. 1) We consider four different settings which gradually reduce the dependence on paired data. 2) We construct a mapping module driven by a multivariate Gaussian distribution to mitigate the modality gap, which is applicable to all four settings. 3) We propose a prompt interaction module that can incorporate optional prompt information before generating captions. Extensive experiments show that TIPCap outperforms other weakly supervised or unsupervised image captioning methods and achieves new state-of-the-art performance on two widely used datasets, i.e., MS-COCO and Flickr30K.
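
The abstract only gestures at how a multivariate-Gaussian mapping module can mitigate the modality gap. Below is a minimal sketch of the general idea, assuming a text-data-centric setup in which the image-text embedding bias is estimated from a handful of paired examples and sampled as noise during text-only training; the function names and the plain NumPy implementation are illustrative assumptions, not the paper's actual code.

```python
# Minimal sketch (assumption, not the paper's implementation): model the
# image-text embedding bias as a multivariate Gaussian and use it to map
# CLIP text embeddings toward the image embedding space during training.
import numpy as np

def estimate_bias_gaussian(image_embs: np.ndarray, text_embs: np.ndarray):
    """Fit mean and covariance of (image - text) CLIP embedding differences
    from a small set of paired examples."""
    diff = image_embs - text_embs            # (N, D) bias samples
    mean = diff.mean(axis=0)                 # (D,)
    cov = np.cov(diff, rowvar=False)         # (D, D)
    return mean, cov

def map_text_to_image_space(text_emb, mean, cov, rng):
    """Perturb a text embedding with a sample from the fitted Gaussian so it
    mimics an image embedding during text-only training."""
    noise = rng.multivariate_normal(mean, cov)
    mapped = text_emb + noise
    return mapped / np.linalg.norm(mapped)   # keep embeddings unit-norm

# Toy usage with random stand-ins for CLIP embeddings (D = 8 for brevity).
rng = np.random.default_rng(0)
img = rng.normal(size=(32, 8)); img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(32, 8)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)
mean, cov = estimate_bias_gaussian(img, txt)
pseudo_image_emb = map_text_to_image_space(txt[0], mean, cov, rng)
print(pseudo_image_emb.shape)                # (8,)
```

At inference, the caption decoder would receive the real CLIP image embedding directly, so perturbing text embeddings this way during training narrows the train-test mismatch that a plain text-only setup would otherwise have.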

Authors (5)
  1. Yiyu Wang (15 papers)
  2. Hao Luo (112 papers)
  3. Jungang Xu (9 papers)
  4. Yingfei Sun (29 papers)
  5. Fan Wang (313 papers)

