EntityCLIP: Entity-Centric Image-Text Matching via Multimodal Attentive Contrastive Learning (2410.17810v2)

Published 23 Oct 2024 in cs.CV

Abstract: Recent advancements in image-text matching have been notable, yet prevailing models predominantly cater to broad queries and struggle to accommodate fine-grained query intentions. In this paper, we work towards Entity-centric Image-Text Matching (EITM), a task in which the text and image involve specific entity-related information. The challenge of this task mainly lies in the larger semantic gap in entity association modeling compared with the general image-text matching problem. To narrow the large semantic gap between entity-centric text and images, we take the foundational CLIP as the backbone and devise a multimodal attentive contrastive learning framework to tame CLIP for the EITM problem, yielding a model named EntityCLIP. The key to our multimodal attentive contrastive learning is generating interpretive explanation text with LLMs to serve as bridge clues. Specifically, we first extract explanatory text from off-the-shelf LLMs. This explanation text, together with the image and text, is then fed into our specially crafted Multimodal Attentive Experts (MMAE) module, which effectively integrates explanation texts to narrow the gap between entity-related text and images in a shared semantic space. Building on the enriched features derived from MMAE, we further design an effective Gated Integrative Image-Text Matching (GI-ITM) strategy. GI-ITM employs an adaptive gating mechanism to aggregate MMAE's features and then applies image-text matching constraints to steer the alignment between the text and the image. Extensive experiments on three social media news benchmarks, N24News, VisualNews, and GoodNews, show that our method surpasses competing methods by a clear margin.
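
The abstract alone does not specify the architecture, so the following is a minimal PyTorch sketch of the pipeline it describes: experts that fuse each modality with LLM-generated explanation text (the MMAE idea), a learned gate that aggregates the expert features (the GI-ITM idea), and a CLIP-style symmetric contrastive constraint. All concrete choices here (module shapes, the softmax gate, the InfoNCE loss form, dimensions) are illustrative assumptions, not the authors' implementation; random tensors stand in for CLIP and LLM features.

```python
# Hypothetical sketch of the EntityCLIP ideas described in the abstract.
# Names and forms beyond "MMAE" and "GI-ITM" are assumptions, not the
# authors' code; random tensors stand in for a CLIP backbone and LLM output.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertBlock(nn.Module):
    """One 'expert': cross-attention from modality tokens onto the explanation."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries, explanation):
        # queries: (B, Nq, D) image or text tokens; explanation: (B, Ne, D)
        attended, _ = self.attn(queries, explanation, explanation)
        return self.norm(queries + attended)


class MultimodalAttentiveExperts(nn.Module):
    """MMAE, as read from the abstract: each expert integrates the LLM
    explanation with the input tokens, yielding enriched features."""
    def __init__(self, dim: int, num_experts: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(ExpertBlock(dim) for _ in range(num_experts))

    def forward(self, tokens, explanation):
        # Pool each expert's output into one vector per sample: list of (B, D).
        return [expert(tokens, explanation).mean(dim=1) for expert in self.experts]


class GatedIntegration(nn.Module):
    """GI-ITM gating (assumed form): a softmax gate over expert features."""
    def __init__(self, dim: int, num_experts: int = 2):
        super().__init__()
        self.gate = nn.Linear(dim * num_experts, num_experts)

    def forward(self, expert_feats):
        stacked = torch.stack(expert_feats, dim=1)            # (B, E, D)
        weights = self.gate(stacked.flatten(1)).softmax(-1)   # (B, E)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)   # (B, D)


def contrastive_itm_loss(img_vec, txt_vec, temperature: float = 0.07):
    """Symmetric InfoNCE over the fused features, CLIP-style."""
    img = F.normalize(img_vec, dim=-1)
    txt = F.normalize(txt_vec, dim=-1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0), device=img.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2


if __name__ == "__main__":
    B, N, D = 4, 16, 512  # batch size, tokens per sequence, embedding dim
    img_tokens = torch.randn(B, N, D)  # stand-in for CLIP image tokens
    txt_tokens = torch.randn(B, N, D)  # stand-in for CLIP text tokens
    exp_tokens = torch.randn(B, N, D)  # stand-in for LLM explanation features

    mmae = MultimodalAttentiveExperts(D)
    fuse = GatedIntegration(D)

    img_vec = fuse(mmae(img_tokens, exp_tokens))
    txt_vec = fuse(mmae(txt_tokens, exp_tokens))
    print(contrastive_itm_loss(img_vec, txt_vec).item())
```

Under this reading, the gate lets each sample decide how much each explanation-conditioned view contributes before the standard contrastive image-text matching constraint is applied, which matches the abstract's description of adaptively aggregating MMAE's features.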

References (31)
  1. Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities. CoRR abs/2308.12966 (2023). https://doi.org/10.48550/ARXIV.2308.12966 arXiv:2308.12966
  2. Good News, Everyone! Context Driven Entity-Aware Captioning for News Images. In IEEE Conference on Computer Vision and Pattern Recognition. 12466–12475. https://doi.org/10.1109/CVPR.2019.01275
  3. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (Eds.). https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html
  4. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In 9th International Conference on Learning Representations. https://openreview.net/forum?id=YicbFdNTTy
  5. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives. In British Machine Vision Conference. 12. http://bmvc2018.org/contents/papers/0344.pdf
  6. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition. 770–778. https://doi.org/10.1109/CVPR.2016.90
  7. ChatGLM-RLHF: Practices of Aligning Large Language Models with Human Feedback. CoRR abs/2404.00934 (2024). https://doi.org/10.48550/ARXIV.2404.00934 arXiv:2404.00934
  8. Mistral 7B. CoRR abs/2310.06825 (2023). https://doi.org/10.48550/ARXIV.2310.06825 arXiv:2310.06825
  9. Stacked Cross Attention for Image-Text Matching. In Computer Vision - ECCV 2018 - 15th European Conference (Lecture Notes in Computer Science, Vol. 11208), Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss (Eds.). 212–228. https://doi.org/10.1007/978-3-030-01225-0_13
  10. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 202), Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (Eds.). 19730–19742. https://proceedings.mlr.press/v202/li23q.html
  11. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 162), Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato (Eds.). 12888–12900. https://proceedings.mlr.press/v162/li22n.html
  12. Align before Fuse: Vision and Language Representation Learning with Momentum Distillation. In Advances in Neural Information Processing Systems, Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan (Eds.). 9694–9705. https://proceedings.neurips.cc/paper/2021/hash/505259756244493872b7709a8a01b536-Abstract.html
  13. Microsoft COCO: Common Objects in Context. In Computer Vision - ECCV 2014 - 13th European Conference (Lecture Notes in Computer Science, Vol. 8693), David J. Fleet, Tomás Pajdla, Bernt Schiele, and Tinne Tuytelaars (Eds.). 740–755. https://doi.org/10.1007/978-3-319-10602-1_48
  14. Visual News: Benchmark and Challenges in News Image Captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (Eds.). 6761–6771. https://doi.org/10.18653/V1/2021.EMNLP-MAIN.542
  15. EDIS: Entity-Driven Image Search over Multimodal Web Content. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). 4877–4894. https://doi.org/10.18653/V1/2023.EMNLP-MAIN.297
  16. Summary of ChatGPT/GPT-4 Research and Perspective Towards the Future of Large Language Models. CoRR abs/2304.01852 (2023). https://doi.org/10.48550/ARXIV.2304.01852 arXiv:2304.01852
  17. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). 8748–8763. http://proceedings.mlr.press/v139/radford21a.html
  18. Improving Language Understanding by Generative Pre-Training. OpenAI Technical Report (2018).
  19. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
  20. LLaMA: Open and Efficient Foundation Language Models. CoRR abs/2302.13971 (2023). https://doi.org/10.48550/ARXIV.2302.13971 arXiv:2302.13971
  21. Llama 2: Open Foundation and Fine-Tuned Chat Models. CoRR abs/2307.09288 (2023). https://doi.org/10.48550/ARXIV.2307.09288 arXiv:2307.09288
  22. Attention is All you Need. In Advances in Neural Information Processing Systems, Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (Eds.). 5998–6008. https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
  23. PFAN++: Bi-Directional Image-Text Retrieval With Position Focused Attention Network. IEEE Trans. Multim. 23 (2021), 3362–3376. https://doi.org/10.1109/TMM.2020.3024822
  24. Position Focused Attention Network for Image-Text Matching. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, Sarit Kraus (Ed.). 3792–3798. https://doi.org/10.24963/IJCAI.2019/526
  25. N24News: A New Dataset for Multimodal News Classification. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, and Stelios Piperidis (Eds.). 6768–6775. https://aclanthology.org/2022.lrec-1.729
  26. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Linguistics 2 (2014), 67–78. https://doi.org/10.1162/TACL_A_00166
  27. Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts. In International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 162), Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato (Eds.). 25994–26009. https://proceedings.mlr.press/v162/zeng22c.html
  28. X²-VLM: All-in-One Pre-Trained Model for Vision-Language Tasks. IEEE Trans. Pattern Anal. Mach. Intell. 46, 5 (2024), 3156–3168. https://doi.org/10.1109/TPAMI.2023.3339661
  29. Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks. In Findings of the Association for Computational Linguistics, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). 551–568. https://doi.org/10.18653/V1/2023.FINDINGS-EMNLP.40
  30. LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report. CoRR abs/2405.00732 (2024). https://doi.org/10.48550/ARXIV.2405.00732 arXiv:2405.00732
  31. Dual-path Convolutional Image-Text Embeddings with Instance Loss. ACM Trans. Multim. Comput. Commun. Appl. 16, 2 (2020), 51:1–51:23. https://doi.org/10.1145/3383184
