
Is CLIP the main roadblock for fine-grained open-world perception? (2404.03539v1)

Published 4 Apr 2024 in cs.CV

Abstract: Modern applications increasingly demand flexible computer vision models that adapt to novel concepts not encountered during training. This need is pivotal in emerging domains like extended reality, robotics, and autonomous driving, which must respond to open-world stimuli. A key ingredient is the ability to identify objects based on free-form textual queries defined at inference time, a task known as open-vocabulary object detection. Multimodal backbones like CLIP are the main enabling technology for current open-world perception solutions. Despite performing well on generic queries, recent studies have highlighted limitations in their fine-grained recognition capabilities in open-vocabulary settings, i.e., in distinguishing subtle object features like color, shape, and material. In this paper, we perform a detailed examination of these open-vocabulary object recognition limitations to find the root cause. We evaluate the performance of CLIP, the most commonly used vision-language backbone, against a fine-grained object-matching benchmark, revealing interesting analogies between the limitations of open-vocabulary object detectors and their backbones. Experiments suggest that the lack of fine-grained understanding is caused by the poor separability of object characteristics in the CLIP latent space. We therefore investigate whether fine-grained knowledge is present in CLIP embeddings but not exploited at inference time, for example because the cosine similarity matching function discards important object characteristics. Our preliminary experiments show that simple CLIP latent-space re-projections help separate fine-grained concepts, paving the way towards backbones inherently able to process fine-grained details. The code for reproducing these experiments is available at https://github.com/lorebianchi98/FG-CLIP.
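
The abstract names two technical ingredients: matching free-form queries against images via cosine similarity in the CLIP latent space, and re-projecting that latent space to better separate fine-grained concepts. The sketch below illustrates both steps with the openai/clip package; it is a minimal illustration under stated assumptions, not the authors' implementation (see the linked repository for that). The query strings, the image path dog.jpg, and the untrained linear map W are illustrative placeholders.

```python
# Minimal sketch: score fine-grained free-form queries with CLIP cosine
# similarity, then apply a simple latent-space re-projection before matching.
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Queries that differ only in a fine-grained attribute (color).
queries = ["a light brown dog", "a dark brown dog", "a black dog"]
image = preprocess(Image.open("dog.jpg")).unsqueeze(0).to(device)  # placeholder image
text = clip.tokenize(queries).to(device)

with torch.no_grad():
    img_emb = model.encode_image(image).float()
    txt_emb = model.encode_text(text).float()

# Standard CLIP matching: cosine similarity in the shared latent space.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
print("cosine scores:", (img_emb @ txt_emb.T).squeeze(0).tolist())

# Illustrative latent-space re-projection: a linear map W applied to both
# modalities before matching. Here W is untrained; in practice it would be
# learned (e.g., with a contrastive loss on attribute-labelled pairs) so that
# embeddings differing only in fine-grained attributes become more separable.
dim = img_emb.shape[-1]
W = torch.nn.Linear(dim, dim, bias=False).to(device)
proj_img = torch.nn.functional.normalize(W(img_emb), dim=-1)
proj_txt = torch.nn.functional.normalize(W(txt_emb), dim=-1)
print("re-projected scores:", (proj_img @ proj_txt.T).squeeze(0).tolist())
```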

Authors (4)
  1. Lorenzo Bianchi (40 papers)
  2. Fabio Carrara (16 papers)
  3. Nicola Messina (23 papers)
  4. Fabrizio Falchi (58 papers)
Citations (3)