Efficient Token-Guided Image-Text Retrieval with Consistent Multimodal Contrastive Training (2306.08789v1)
Abstract: Image-text retrieval is a central problem for understanding the semantic relationship between vision and language, and serves as the basis for various vision-and-language tasks. Most previous works either simply learn coarse-grained representations of the overall image and text, or elaborately establish correspondences between image regions or pixels and text words. However, the close relations between coarse- and fine-grained representations within each modality, though important for image-text retrieval, have been largely neglected. As a result, such previous works inevitably suffer from low retrieval accuracy or heavy computational cost. In this work, we address image-text retrieval from a novel perspective by combining coarse- and fine-grained representation learning into a unified framework. This framework is consistent with human cognition, as humans attend simultaneously to the entire sample and to its regional elements in order to understand semantic content. To this end, a Token-Guided Dual Transformer (TGDT) architecture, which consists of two homogeneous branches for the image and text modalities respectively, is proposed for image-text retrieval. The TGDT incorporates both coarse- and fine-grained retrieval into a unified framework and leverages the advantages of both approaches. A novel training objective, the Consistent Multimodal Contrastive (CMC) loss, is proposed accordingly to ensure intra- and inter-modal semantic consistency between images and texts in the common embedding space. Equipped with a two-stage inference method based on mixed global and local cross-modal similarity, the proposed method achieves state-of-the-art retrieval performance with extremely low inference time compared with representative recent approaches.
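The two-stage inference described in the abstract can be illustrated with a short, hypothetical sketch: candidates are first ranked by cheap coarse (global) similarity, and only the top-k are re-scored with a mixture of global and fine-grained token-level similarity. The function names, the max-mean token aggregation, and the parameters `k` and `alpha` below are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of two-stage retrieval with mixed global/local similarity.
# Assumed inputs: per-sample global embeddings and token-level features.
import torch
import torch.nn.functional as F

def global_similarity(img_global, txt_global):
    # Cosine similarity between L2-normalized global embeddings.
    # img_global: (N_img, D), txt_global: (N_txt, D) -> (N_img, N_txt)
    return F.normalize(img_global, dim=-1) @ F.normalize(txt_global, dim=-1).T

def local_similarity(img_tokens, txt_tokens):
    # One common fine-grained choice (an assumption here): each word attends to
    # its best-matching image token, and the scores are averaged over words.
    # img_tokens: (R, D), txt_tokens: (T, D) -> scalar
    sim = F.normalize(txt_tokens, dim=-1) @ F.normalize(img_tokens, dim=-1).T  # (T, R)
    return sim.max(dim=1).values.mean()

def two_stage_text_to_image(img_global, img_tokens, txt_global, txt_tokens, k=20, alpha=0.5):
    # Stage 1: rank all images by the cheap global similarity.
    # Stage 2: re-rank only the top-k with a mix of global and token-level scores.
    coarse = global_similarity(img_global, txt_global)             # (N_img, N_txt)
    results = []
    for t in range(txt_global.size(0)):
        shortlist = coarse[:, t].topk(k).indices                   # coarse candidates
        mixed = torch.stack([
            alpha * coarse[i, t]
            + (1 - alpha) * local_similarity(img_tokens[i], txt_tokens[t])
            for i in shortlist.tolist()
        ])
        results.append(shortlist[mixed.argsort(descending=True)])  # re-ranked indices
    return results
```

Restricting the expensive fine-grained matching to a short candidate list is what lets a mixed scheme of this kind retain fine-grained accuracy at close to the cost of global retrieval, which matches the abstract's claim of state-of-the-art accuracy with very low inference time.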
Authors: Chong Liu, Yuqi Zhang, Hongsong Wang, Weihua Chen, Fan Wang, Yan Huang, Yi-Dong Shen, Liang Wang