Text Data-Centric Image Captioning with Interactive Prompts (2403.19193v1)
Abstract: Supervised image captioning approaches have made great progress, but it is challenging to collect high-quality human-annotated image-text data. Recently, large-scale vision and language models (e.g., CLIP) and large-scale generative language models (e.g., GPT-2) have shown strong performance on various tasks, which also offers new solutions for image captioning with web-paired data, unpaired data, or even text-only data. Among them, the mainstream solution is to project image embeddings into the text embedding space, leveraging the consistent representations that the CLIP model learns for image-text pairs. However, current methods still face several challenges: adapting to diverse data configurations within a unified solution, accurately estimating the image-text embedding bias, and correcting unsatisfactory predictions at inference time. This paper proposes a new Text data-centric approach with Interactive Prompts for image Captioning, named TIPCap. 1) We consider four different settings that gradually reduce the dependence on paired data. 2) We construct a mapping module driven by a multivariate Gaussian distribution to mitigate the modality gap, which is applicable to all four settings. 3) We propose a prompt interaction module that can incorporate optional prompt information before generating captions. Extensive experiments show that TIPCap outperforms other weakly supervised or unsupervised image captioning methods and achieves new state-of-the-art performance on two widely used datasets, i.e., MS-COCO and Flickr30K.
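The Gaussian mapping idea in 2) can be illustrated with a short sketch. The following is a minimal, illustrative PyTorch example, not the authors' implementation: it shows how a learnable multivariate Gaussian (here simplified to a diagonal covariance) can perturb a CLIP text embedding toward the image-embedding space during text-only training. The class name `GaussianMappingModule` and the 512-dimensional default are assumptions made for the example.

```python
# Minimal sketch (not the authors' code) of Gaussian-driven text-to-image embedding
# mapping for text-only caption training. A CLIP text embedding is perturbed with a
# sampled bias so it approximates the (unseen) paired image embedding; the result is
# then fed to a caption decoder as a prefix. Names and dimensions are illustrative.
import torch
import torch.nn as nn


class GaussianMappingModule(nn.Module):
    """Maps a CLIP text embedding toward the image-embedding space by adding a
    reparameterized sample from N(mu, diag(sigma^2)) that models the modality gap."""

    def __init__(self, dim: int = 512):
        super().__init__()
        # Learnable mean and (log of the) diagonal standard deviation of the bias.
        self.mu = nn.Parameter(torch.zeros(dim))
        self.log_sigma = nn.Parameter(torch.zeros(dim))

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        # Reparameterized sample: text_emb + mu + sigma * eps, with eps ~ N(0, I).
        eps = torch.randn_like(text_emb)
        pseudo_image_emb = text_emb + self.mu + self.log_sigma.exp() * eps
        # Re-normalize, since CLIP embeddings are typically L2-normalized.
        return pseudo_image_emb / pseudo_image_emb.norm(dim=-1, keepdim=True)


if __name__ == "__main__":
    mapper = GaussianMappingModule(dim=512)
    text_emb = torch.randn(4, 512)                    # stand-in for CLIP text features
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    prefix = mapper(text_emb)                         # would be passed to the decoder
    print(prefix.shape)                               # torch.Size([4, 512])
```

At inference time, the real CLIP image embedding would replace the sampled pseudo-image embedding and be fed to the same caption decoder (e.g., GPT-2) as a prefix, so the Gaussian perturbation only needs to bridge the modality gap during text-only training.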
Authors: Yiyu Wang, Hao Luo, Jungang Xu, Yingfei Sun, Fan Wang