ContextBLIP: Doubly Contextual Alignment for Contrastive Image Retrieval from Linguistically Complex Descriptions (2405.19226v1)

Published 29 May 2024 in cs.CV and cs.MM

Abstract: Image retrieval from contextual descriptions (IRCD) aims to identify an image within a set of minimally contrastive candidates based on linguistically complex text. Despite the success of vision-language models (VLMs), they still lag significantly behind human performance on IRCD. The main challenge lies in aligning key contextual cues across the two modalities, since these subtle cues are concealed in small regions of multiple contrastive images and within the complex linguistics of the textual descriptions. This motivates us to propose ContextBLIP, a simple yet effective method that relies on a doubly contextual alignment scheme for challenging IRCD. Specifically, 1) our model comprises a multi-scale adapter, a matching loss, and a text-guided masking loss. The adapter learns to capture fine-grained visual cues. The two losses enable iterative supervision for the adapter, gradually aligning the focal patches of a single image with the key textual cues. We refer to this step as intra-contextual alignment. 2) ContextBLIP then employs an inter-context encoder to learn dependencies among candidates, facilitating alignment between the text and multiple images. We refer to this step as inter-contextual alignment. Consequently, the nuanced cues concealed in each modality can be effectively aligned. Experiments on two benchmarks demonstrate the superiority of our method. We observe that ContextBLIP yields results comparable to GPT-4V, despite using about 7,500 times fewer parameters.
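To make the two-stage design concrete, the sketch below shows one way the doubly contextual alignment could be wired together: a multi-scale adapter refines frozen patch features of each candidate (intra-contextual), and an inter-context encoder attends across all candidates before scoring which image the description refers to (inter-contextual). This is a minimal PyTorch-style illustration only; the module names, dimensions, pooling scales, transformer configuration, and the omission of the text-guided masking loss and BLIP text encoder are assumptions for exposition, not the authors' implementation.

```python
# Illustrative sketch of ContextBLIP-style doubly contextual alignment.
# All names, dimensions, and hyperparameters are assumed, not taken from the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleAdapter(nn.Module):
    """Lightweight adapter that fuses patch features at several scales so
    fine-grained visual cues are preserved (assumed design)."""

    def __init__(self, dim: int = 768, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.proj = nn.ModuleList(nn.Linear(dim, dim) for _ in scales)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, dim)
        outs = []
        for s, proj in zip(self.scales, self.proj):
            pooled = F.avg_pool1d(patch_feats.transpose(1, 2), kernel_size=s, stride=s)
            up = F.interpolate(pooled, size=patch_feats.size(1), mode="nearest")
            outs.append(proj(up.transpose(1, 2)))
        # Residual multi-scale fusion over the original patch features.
        return patch_feats + sum(outs) / len(outs)


class InterContextEncoder(nn.Module):
    """Transformer over per-candidate embeddings so the model can compare
    minimally contrastive images before scoring (assumed design)."""

    def __init__(self, dim: int = 768, depth: int = 2, heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.score = nn.Linear(dim, 1)

    def forward(self, cand_embeds: torch.Tensor) -> torch.Tensor:
        # cand_embeds: (batch, num_candidates, dim) -> (batch, num_candidates) logits
        return self.score(self.encoder(cand_embeds)).squeeze(-1)


def ircd_matching_loss(logits: torch.Tensor, target_idx: torch.Tensor) -> torch.Tensor:
    """Matching objective over the candidate set: select the single image
    that the contextual description refers to."""
    return F.cross_entropy(logits, target_idx)


if __name__ == "__main__":
    batch, num_cands, num_patches, dim = 2, 10, 196, 768
    adapter = MultiScaleAdapter(dim)
    inter_ctx = InterContextEncoder(dim)

    # Placeholder for frozen BLIP patch features of every candidate image.
    patches = torch.randn(batch * num_cands, num_patches, dim)
    fused = adapter(patches).mean(dim=1)                    # intra-contextual step (sketch)
    logits = inter_ctx(fused.view(batch, num_cands, dim))   # inter-contextual step
    loss = ircd_matching_loss(logits, torch.zeros(batch, dtype=torch.long))
    print(loss.item())
```

In this sketch the text-guided masking loss described in the abstract would add a second supervision signal on the adapter, iteratively reconstructing patches indicated by key textual cues; it is left out here to keep the example self-contained.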

