Efficient Token-Guided Image-Text Retrieval with Consistent Multimodal Contrastive Training (2306.08789v1)

Published 15 Jun 2023 in cs.CV

Abstract: Image-text retrieval is a central problem for understanding the semantic relationship between vision and language, and serves as the basis for various visual and language tasks. Most previous works either simply learn coarse-grained representations of the overall image and text, or elaborately establish the correspondence between image regions or pixels and text words. However, the close relations between coarse- and fine-grained representations for each modality are important for image-text retrieval but almost neglected. As a result, such previous works inevitably suffer from low retrieval accuracy or heavy computational cost. In this work, we address image-text retrieval from a novel perspective by combining coarse- and fine-grained representation learning into a unified framework. This framework is consistent with human cognition, as humans simultaneously attend to the entire sample and to regional elements to understand the semantic content. To this end, a Token-Guided Dual Transformer (TGDT) architecture, which consists of two homogeneous branches for the image and text modalities, respectively, is proposed for image-text retrieval. The TGDT incorporates both coarse- and fine-grained retrieval into a unified framework and beneficially leverages the advantages of both retrieval approaches. A novel training objective called Consistent Multimodal Contrastive (CMC) loss is proposed accordingly to ensure intra- and inter-modal semantic consistency between images and texts in the common embedding space. Equipped with a two-stage inference method based on the mixed global and local cross-modal similarity, the proposed method achieves state-of-the-art retrieval performance with extremely low inference time compared with representative recent approaches.
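
The abstract's two-stage inference idea (a cheap coarse pass over global embeddings, then fine-grained re-ranking of a shortlist using a mix of global and local similarity) can be illustrated with a minimal sketch. This is not the authors' code; the function and variable names are hypothetical, and the max-over-tokens local score is just one plausible choice of fine-grained similarity.

```python
# Minimal sketch of coarse-then-fine cross-modal retrieval for one query.
# Assumes precomputed global (coarse) and token-level (fine) embeddings.
import torch
import torch.nn.functional as F

def coarse_then_fine_retrieval(query_global, query_tokens,
                               gallery_global, gallery_tokens, top_k=20):
    """Rank gallery items for a single query.

    query_global:   (d,)          global embedding of the query
    query_tokens:   (m, d)        token-level embeddings of the query
    gallery_global: (N, d)        global embeddings of all gallery items
    gallery_tokens: list of (n_i, d) token embeddings per gallery item
    """
    # Stage 1: coarse ranking by cosine similarity of global embeddings.
    q = F.normalize(query_global, dim=-1)
    g = F.normalize(gallery_global, dim=-1)
    coarse_sim = g @ q                                     # (N,)
    cand = coarse_sim.topk(min(top_k, g.size(0))).indices  # shortlist

    # Stage 2: fine-grained re-ranking of the shortlist only, using a
    # max-over-tokens alignment score averaged over query tokens.
    qt = F.normalize(query_tokens, dim=-1)                 # (m, d)
    rescored = []
    for idx in cand.tolist():
        gt = F.normalize(gallery_tokens[idx], dim=-1)      # (n_i, d)
        local_sim = (qt @ gt.T).max(dim=1).values.mean()
        # Mix global and local similarity; the 0.5/0.5 weighting is arbitrary.
        score = 0.5 * coarse_sim[idx].item() + 0.5 * local_sim.item()
        rescored.append((idx, score))
    rescored.sort(key=lambda p: p[1], reverse=True)
    return rescored
```

Because the expensive token-level matching runs only on the top-k coarse candidates, inference cost stays close to a plain global-embedding search while the final ranking still benefits from fine-grained alignment.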

Authors (8)
  1. Chong Liu (104 papers)
  2. Yuqi Zhang (54 papers)
  3. Hongsong Wang (25 papers)
  4. Weihua Chen (35 papers)
  5. Fan Wang (312 papers)
  6. Yan Huang (180 papers)
  7. Yi-Dong Shen (12 papers)
  8. Liang Wang (512 papers)
Citations (16)