EntityCLIP: Entity-Centric Image-Text Matching via Multimodal Attentive Contrastive Learning (2410.17810v2)
Abstract: Recent advancements in image-text matching have been notable, yet prevailing models predominantly cater to broad queries and struggle to accommodate fine-grained query intentions. In this paper, we work towards Entity-centric Image-Text Matching (EITM), a task in which both the text and the image involve specific entity-related information. The main challenge of this task lies in the larger semantic gap in entity association modeling, compared with the general image-text matching problem. To narrow the huge semantic gap between entity-centric text and images, we take the foundational CLIP as the backbone and devise a multimodal attentive contrastive learning framework that adapts CLIP to the EITM problem, developing a model named EntityCLIP. The key to our multimodal attentive contrastive learning is to generate interpretive explanation text using LLMs as bridge clues. Specifically, we first extract explanatory text from off-the-shelf LLMs. This explanation text, coupled with the image and text, is then fed into our specially crafted Multimodal Attentive Experts (MMAE) module, which effectively integrates explanation texts to narrow the gap between entity-related text and images in a shared semantic space. Building on the enriched features derived from MMAE, we further design an effective Gated Integrative Image-Text Matching (GI-ITM) strategy. GI-ITM employs an adaptive gating mechanism to aggregate MMAE's features and subsequently applies image-text matching constraints to steer the alignment between text and image. Extensive experiments are conducted on three social media news benchmarks, including N24News, VisualNews, and GoodNews; the results show that our method surpasses competing methods by a clear margin.
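To make the pipeline described above more concrete, the following is a minimal, hedged sketch of the idea: an attentive "expert" lets the image and query-text features attend to LLM-generated explanation tokens, and a gated module mixes the raw and explanation-enriched features before applying contrastive and matching losses. This is not the authors' code; class names such as `MultimodalAttentiveExpert` and `GatedIntegrativeMatcher`, the exact gating formula, and the random stand-in embeddings (the real model uses CLIP encoders and LLM explanations) are illustrative assumptions.

```python
# Hedged sketch of explanation-bridged, gated image-text matching (not the official EntityCLIP code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalAttentiveExpert(nn.Module):
    """Lets a query (image or text) feature attend to explanation tokens."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feat, explanation_tokens):
        # query_feat: (B, 1, D); explanation_tokens: (B, L, D)
        attended, _ = self.attn(query_feat, explanation_tokens, explanation_tokens)
        return self.norm(query_feat + attended)

class GatedIntegrativeMatcher(nn.Module):
    """Adaptive gate mixing raw and explanation-enriched features, plus a pair score head."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.score = nn.Linear(2 * dim, 1)

    def fuse(self, raw, enriched):
        g = self.gate(torch.cat([raw, enriched], dim=-1))
        return g * enriched + (1.0 - g) * raw

    def forward(self, img_feat, txt_feat):
        # Matching score for an (image, text) pair, to be supervised with a matching loss.
        return self.score(torch.cat([img_feat, txt_feat], dim=-1)).squeeze(-1)

def contrastive_loss(img, txt, temperature: float = 0.07):
    """Symmetric in-batch InfoNCE over the image-text similarity matrix."""
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    logits = img @ txt.t() / temperature
    labels = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

if __name__ == "__main__":
    B, L, D = 4, 16, 512
    img = torch.randn(B, 1, D)    # stand-in for CLIP image features
    txt = torch.randn(B, 1, D)    # stand-in for CLIP text features of the query
    expl = torch.randn(B, L, D)   # stand-in for encoded LLM explanation tokens

    expert = MultimodalAttentiveExpert(D)
    matcher = GatedIntegrativeMatcher(D)

    # Explanation tokens act as the bridge between entity-centric text and the image.
    txt_final = matcher.fuse(txt.squeeze(1), expert(txt, expl).squeeze(1))
    img_final = matcher.fuse(img.squeeze(1), expert(img, expl).squeeze(1))

    loss = contrastive_loss(img_final, txt_final)
    pair_scores = matcher(img_final, txt_final)
    print(loss.item(), pair_scores.shape)
```

One design choice worth noting in this sketch: the sigmoid gate interpolates between the original and explanation-enriched features, so when the explanation is uninformative the model can fall back to the plain CLIP-style representation rather than being forced through the bridge text.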
- Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities. CoRR abs/2308.12966 (2023). https://doi.org/10.48550/ARXIV.2308.12966 arXiv:2308.12966
- Good News, Everyone! Context Driven Entity-Aware Captioning for News Images. In IEEE Conference on Computer Vision and Pattern Recognition. 12466–12475. https://doi.org/10.1109/CVPR.2019.01275
- Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (Eds.). https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In 9th International Conference on Learning Representations. https://openreview.net/forum?id=YicbFdNTTy
- VSE++: Improving Visual-Semantic Embeddings with Hard Negatives. In British Machine Vision Conference. 12. http://bmvc2018.org/contents/papers/0344.pdf
- Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition. 770–778. https://doi.org/10.1109/CVPR.2016.90
- ChatGLM-RLHF: Practices of Aligning Large Language Models with Human Feedback. CoRR abs/2404.00934 (2024). https://doi.org/10.48550/ARXIV.2404.00934 arXiv:2404.00934
- Mistral 7B. CoRR abs/2310.06825 (2023). https://doi.org/10.48550/ARXIV.2310.06825 arXiv:2310.06825
- Stacked Cross Attention for Image-Text Matching. In Computer Vision - ECCV 2018 - 15th European Conference (Lecture Notes in Computer Science, Vol. 11208), Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss (Eds.). 212–228. https://doi.org/10.1007/978-3-030-01225-0_13
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 202), Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (Eds.). 19730–19742. https://proceedings.mlr.press/v202/li23q.html
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 162), Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato (Eds.). 12888–12900. https://proceedings.mlr.press/v162/li22n.html
- Align before Fuse: Vision and Language Representation Learning with Momentum Distillation. In Advances in Neural Information Processing Systems, Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan (Eds.). 9694–9705. https://proceedings.neurips.cc/paper/2021/hash/505259756244493872b7709a8a01b536-Abstract.html
- Microsoft COCO: Common Objects in Context. In Computer Vision - ECCV 2014 - 13th European Conference (Lecture Notes in Computer Science, Vol. 8693), David J. Fleet, Tomás Pajdla, Bernt Schiele, and Tinne Tuytelaars (Eds.). 740–755. https://doi.org/10.1007/978-3-319-10602-1_48
- Visual News: Benchmark and Challenges in News Image Captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (Eds.). 6761–6771. https://doi.org/10.18653/V1/2021.EMNLP-MAIN.542
- EDIS: Entity-Driven Image Search over Multimodal Web Content. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). 4877–4894. https://doi.org/10.18653/V1/2023.EMNLP-MAIN.297
- Summary of ChatGPT/GPT-4 Research and Perspective Towards the Future of Large Language Models. CoRR abs/2304.01852 (2023). https://doi.org/10.48550/ARXIV.2304.01852 arXiv:2304.01852
- Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). 8748–8763. http://proceedings.mlr.press/v139/radford21a.html
- Improving language understanding by generative pre-training. (2018).
- Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
- LLaMA: Open and Efficient Foundation Language Models. CoRR abs/2302.13971 (2023). https://doi.org/10.48550/ARXIV.2302.13971 arXiv:2302.13971
- Llama 2: Open Foundation and Fine-Tuned Chat Models. CoRR abs/2307.09288 (2023). https://doi.org/10.48550/ARXIV.2307.09288 arXiv:2307.09288
- Attention is All you Need. In Advances in Neural Information Processing Systems, Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (Eds.). 5998–6008. https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
- PFAN++: Bi-Directional Image-Text Retrieval With Position Focused Attention Network. IEEE Trans. Multim. 23 (2021), 3362–3376. https://doi.org/10.1109/TMM.2020.3024822
- Position Focused Attention Network for Image-Text Matching. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, Sarit Kraus (Ed.). 3792–3798. https://doi.org/10.24963/IJCAI.2019/526
- N24News: A New Dataset for Multimodal News Classification. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, and Stelios Piperidis (Eds.). 6768–6775. https://aclanthology.org/2022.lrec-1.729
- From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Linguistics 2 (2014), 67–78. https://doi.org/10.1162/TACL_A_00166
- Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts. In International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 162), Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato (Eds.). 25994–26009. https://proceedings.mlr.press/v162/zeng22c.html
- X$^{2}$-VLM: All-in-One Pre-Trained Model for Vision-Language Tasks. IEEE Trans. Pattern Anal. Mach. Intell. 46, 5 (2024), 3156–3168. https://doi.org/10.1109/TPAMI.2023.3339661
- Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks. In Findings of the Association for Computational Linguistics, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). 551–568. https://doi.org/10.18653/V1/2023.FINDINGS-EMNLP.40
- LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report. CoRR abs/2405.00732 (2024). https://doi.org/10.48550/ARXIV.2405.00732 arXiv:2405.00732
- Dual-path Convolutional Image-Text Embeddings with Instance Loss. ACM Trans. Multim. Comput. Commun. Appl. 16, 2 (2020), 51:1–51:23. https://doi.org/10.1145/3383184