MVAM: Multi-View Attention Method for Fine-grained Image-Text Matching (2402.17237v2)
Abstract: Existing two-stream models such as CLIP, which encode images and text with independent representations, offer good retrieval performance while keeping inference fast, and have therefore attracted attention from both industry and academia. However, a single representation often struggles to fully capture complex content: such models may overlook fine-grained information during matching, leading to suboptimal retrieval results. To overcome this limitation and improve two-stream models, we propose a Multi-View Attention Method (MVAM) for image-text matching. MVAM uses multiple attention heads, each with a unique view code, to learn several representations for an image or a piece of text, which are then concatenated for matching. We further incorporate a diversity objective that explicitly encourages the attention heads to focus on distinct aspects of the input, capturing complementary fine-grained details. This diversity lets the model represent image-text pairs from multiple perspectives and attend to more of the critical content, yielding a more comprehensive understanding and alignment and, in turn, better matching performance. Experiments on MSCOCO and Flickr30K show improvements over existing models, and case studies confirm that different attention heads focus on distinct content, producing more comprehensive representations.
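The sketch below illustrates the multi-view idea described in the abstract: learnable view codes act as attention queries that pool token features into several view-specific embeddings, which are concatenated for matching, with a diversity penalty discouraging the views from attending to the same content. Module and parameter names (`MultiViewPooling`, `num_views`) and the exact form of the attention and the diversity term are assumptions for illustration; the penalty follows the ||AA^T − I|| style of Lin et al. (2017, cited below) and may differ from the paper's actual objective.

```python
# Minimal sketch of multi-view attention pooling with a diversity penalty.
# Assumed details: names, attention form, and diversity term are illustrative,
# not the paper's reported formulation.
from typing import Optional, Tuple

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiViewPooling(nn.Module):
    """Pools a sequence of token features into K view-specific vectors.

    Each view has a learnable "view code" used as an attention query,
    so different views can attend to different parts of the input.
    """

    def __init__(self, dim: int, num_views: int = 4):
        super().__init__()
        self.view_codes = nn.Parameter(torch.randn(num_views, dim) * 0.02)  # (K, D) queries
        self.scale = dim ** -0.5

    def forward(
        self, tokens: torch.Tensor, mask: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # tokens: (B, L, D) encoder outputs; mask: (B, L) with 1 for valid tokens.
        attn = torch.einsum("kd,bld->bkl", self.view_codes, tokens) * self.scale
        if mask is not None:
            attn = attn.masked_fill(mask[:, None, :] == 0, float("-inf"))
        attn = attn.softmax(dim=-1)                          # (B, K, L) per-view attention
        views = torch.einsum("bkl,bld->bkd", attn, tokens)   # (B, K, D) view embeddings
        pooled = F.normalize(views, dim=-1).flatten(1)       # concatenate views for matching
        return pooled, attn


def diversity_loss(attn: torch.Tensor) -> torch.Tensor:
    """Penalizes overlap between attention maps of different views
    (one plausible instantiation of the diversity objective)."""
    # attn: (B, K, L); push the Gram matrix of attention rows toward the identity,
    # which makes the views both peaky and mutually distinct.
    gram = torch.einsum("bkl,bjl->bkj", attn, attn)
    eye = torch.eye(attn.size(1), device=attn.device)
    return ((gram - eye) ** 2).mean()
```

In use, one such pooling head would sit on top of each encoder of a two-stream model (e.g. the image and text towers), and the diversity loss would be added to the matching loss with a small weight.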
- Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6077–6086.
- UNITER: Learning universal image-text representations.
- ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE.
- BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
- VSE++: Improving visual-semantic embeddings with hard negatives. arXiv preprint arXiv:1707.05612.
- Poly-encoders: Transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring. arXiv preprint arXiv:1905.01969.
- Wenlan: Bridging vision and language by large-scale multi-modal pre-training. arXiv preprint arXiv:2103.06561.
- Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR.
- Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128–3137.
- Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pages 39–48.
- Stacked cross attention for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV), pages 201–216.
- Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 11336–11344.
- Diversity regularized spatiotemporal attention for video-based person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 369–378.
- Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision, pages 121–137. Springer.
- Microsoft COCO: Common objects in context. In European conference on computer vision, pages 740–755. Springer.
- A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130.
- ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems, 32.
- VisualSparta: An embarrassingly simple approach to large-scale text-to-image search with weighted bag-of-words. arXiv preprint arXiv:2101.00265.
- Sparse, dense, and attentional representations for text retrieval. Transactions of the Association for Computational Linguistics, 9:329–345.
- ImageBERT: Cross-modal pre-training with large-scale weak-supervised image-text data. arXiv preprint arXiv:2001.07966.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR.
- Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28.
- Yale Song and Mohammad Soleymani. 2019. Polysemous visual-semantic embedding for cross-modal retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1979–1988.
- LightningDOT: Pre-training visual-semantic embeddings for real-time image-text retrieval. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 982–997.
- CAMP: Cross-modal adaptive message passing for text-image retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5764–5773.
- Visual transformers: Token-based image representation and processing for computer vision.
- From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.
- Context-aware attention network for image-text retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3536–3545.