Enhancing Image-Text Matching with Adaptive Feature Aggregation (2401.09725v1)
Abstract: Image-text matching aims to accurately identify matched cross-modal pairs. While current methods often project cross-modal features into a common embedding space, they frequently suffer from imbalanced feature representations across modalities, leading to unreliable retrieval results. To address these limitations, we introduce a novel Feature Enhancement Module that adaptively aggregates single-modal features for more balanced and robust image-text retrieval. We further propose a new loss function that overcomes shortcomings of the original triplet ranking loss, significantly improving retrieval performance. The proposed model is evaluated on two public datasets and achieves competitive retrieval performance compared with several state-of-the-art models. Implementation code is publicly available.
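The baseline that the abstract's new loss improves on is the hinge-based triplet ranking loss commonly used in visual-semantic embedding models (e.g., VSE++ with hardest negatives). The paper's own loss is not given here, so the following is only a minimal NumPy sketch of that baseline, assuming matched image-text pairs sit on the diagonal of a cosine-similarity matrix; the function name and margin value are illustrative, not from the paper.

```python
import numpy as np

def triplet_ranking_loss(img_emb, txt_emb, margin=0.2):
    """Hinge-based triplet ranking loss with hardest negatives
    (VSE++-style baseline sketch, not the paper's proposed loss).
    Row i of img_emb matches row i of txt_emb."""
    # L2-normalize so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sim = img @ txt.T                      # (B, B) similarity matrix
    pos = np.diag(sim)                     # matched-pair similarities

    # Margin violations: a mismatched pair scoring within `margin`
    # of the matched pair incurs a cost, in both retrieval directions.
    cost_i2t = np.maximum(0.0, margin + sim - pos[:, None])  # image -> text
    cost_t2i = np.maximum(0.0, margin + sim - pos[None, :])  # text -> image
    np.fill_diagonal(cost_i2t, 0.0)        # positives incur no cost
    np.fill_diagonal(cost_t2i, 0.0)

    # Keep only the hardest (largest-violation) negative per query.
    return cost_i2t.max(axis=1).sum() + cost_t2i.max(axis=0).sum()
```

With perfectly separated embeddings the loss is zero; when a negative caption scores above a matched one, the hardest violation dominates the gradient, which is the behavior the paper's loss reportedly refines.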
- Zuhui Wang
- Yunting Yin
- I. V. Ramakrishnan