
CLoVe: Encoding Compositional Language in Contrastive Vision-Language Models (2402.15021v2)

Published 22 Feb 2024 in cs.CV and cs.CL

Abstract: Recent years have witnessed a significant increase in the performance of Vision and Language tasks. Foundational Vision-Language Models (VLMs), such as CLIP, have been leveraged in multiple settings and demonstrated remarkable performance across several tasks. Such models excel at object-centric recognition yet learn text representations that seem invariant to word order, failing to compose known concepts in novel ways. However, no evidence exists that any VLM, including large-scale single-stream models such as GPT-4V, identifies compositions successfully. In this paper, we introduce a framework to significantly improve the ability of existing models to encode compositional language, with over 10% absolute improvement on compositionality benchmarks, while maintaining or improving the performance on standard object-recognition and retrieval benchmarks. Our code and pre-trained models are publicly available at https://github.com/netflix/clove.

Enhancing Compositionality in Contrastive Vision-Language Models with CLoVe

Introduction to CLoVe Framework

Vision-Language Models (VLMs) have achieved notable advances on tasks that require understanding both textual and visual inputs. Models like CLIP excel at object recognition but struggle to handle compositional language, that is, to interpret complex concepts by composing simpler ones. The paper introduces CLoVe, a framework that significantly enhances the compositional language encoding of existing contrastive VLMs without compromising their performance on standard benchmarks.

Examining the Challenge of Compositionality

Various benchmarks have established that even highly capable models such as GPT-4V fail to grasp compositional nuances. Previous attempts to give VLMs compositional understanding (e.g., NegCLIP and REPLACE) have come at the cost of object-recognition accuracy. CLoVe addresses this trade-off with a three-part approach, combining data curation with synthetic captions, training with hard negatives, and model patching, and achieves over 10% absolute improvement on compositionality benchmarks.

CLoVe Framework Detailed

Synthetic Captions

The CLoVe framework enriches training data with high-quality synthetic captions generated over a vast image corpus, balancing data volume against annotation quality. This counters the drawbacks of smaller, high-quality datasets like COCO, which cover only a limited range of objects and actions.
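The paper's exact captioning pipeline is not reproduced here; the sketch below only illustrates the general recipe of re-captioning web images with an off-the-shelf captioner, assuming BLIP via HuggingFace transformers as a stand-in.

```python
# Hypothetical sketch: generating a synthetic caption for a web image with an
# off-the-shelf captioner (BLIP via HuggingFace transformers). This is a
# stand-in for whatever captioner produced the synthetic-caption corpus, not
# the paper's exact pipeline.
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def synthetic_caption(image_url: str) -> str:
    """Return a clean, descriptive caption that can replace noisy web alt-text."""
    image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = captioner.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```

Pairing each image with its generated caption yields contrastive training pairs whose text side is more descriptive and grammatical than raw alt-text, which is the quality-versus-volume balance the section describes.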

Hard Negatives

By integrating hard negative texts during training, CLoVe sharpens the model's understanding of language composition. Carefully crafted hard negatives force the model to discern subtle differences in word order and contextual usage, substantially improving its compositional skills.
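CLoVe's exact rules for constructing hard negatives are not detailed here; a minimal sketch of one common strategy, swapping two words so the negative shares the original caption's bag of words, in the spirit of NegCLIP, is shown below.

```python
# Hypothetical sketch: a word-swap hard negative in the spirit of NegCLIP.
# The negative caption contains exactly the same words as the original, so a
# model that ignores word order cannot distinguish the two; that is the failure
# mode training with hard negatives is meant to penalize.
import random

def swap_word_hard_negative(caption: str, seed: int = 0) -> str:
    """Return the caption with two randomly chosen words swapped."""
    rng = random.Random(seed)
    words = caption.split()
    if len(words) < 2:
        return caption
    i, j = rng.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)

print(swap_word_hard_negative("a dog chasing a cat across the yard"))
# e.g. "a cat chasing a dog across the yard" (same words, different meaning)
```

During training, such negatives are appended as extra text candidates in the contrastive loss, so the model is explicitly penalized for scoring them as highly as the true caption.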

Model Patching

A key component of CLoVe is model patching, designed to retain the pre-trained model's original performance on standard benchmarks while integrating the enhanced compositionality. This step combines the strengths of the fine-tuned model with the foundational capabilities of the original model, addressing the trade-off observed in previous methods.
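The patching step is in the spirit of weight interpolation for open-vocabulary models (reference 25). The sketch below is a hedged illustration, assuming PyTorch state dicts, two checkpoints with identical architectures, and a hypothetical mixing coefficient alpha chosen on validation data.

```python
# Hedged sketch: patching by parameter-wise linear interpolation between the
# original (zero-shot) weights and the compositionality fine-tuned weights.
# alpha = 0.5 is a hypothetical default; in practice it is tuned on validation data.
import torch

def patch_weights(pretrained_state: dict, finetuned_state: dict, alpha: float = 0.5) -> dict:
    """Blend two state dicts: alpha = 0 keeps the original model, alpha = 1 the fine-tuned one."""
    assert pretrained_state.keys() == finetuned_state.keys()
    patched = {}
    for name, w_pre in pretrained_state.items():
        w_ft = finetuned_state[name]
        if torch.is_floating_point(w_pre):
            patched[name] = (1.0 - alpha) * w_pre + alpha * w_ft
        else:
            # Integer buffers (e.g., position ids) are copied rather than interpolated.
            patched[name] = w_pre
    return patched

# Usage (assuming two CLIP checkpoints that share an architecture):
# model.load_state_dict(patch_weights(clip_zero_shot.state_dict(), clip_finetuned.state_dict()))
```

Sweeping alpha traces out the trade-off the section describes: larger values favor compositionality gains from fine-tuning, smaller values favor the original object-recognition performance.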

Empirical Validation

The efficacy of the CLoVe framework was demonstrated through a comprehensive evaluation involving a series of ablation studies and comparisons against baseline models. The use of synthetic captions, the inclusion of hard negatives, and strategic model patching collectively contributed to noteworthy improvements across both compositionality and standard benchmarks. For instance, applying CLoVe to CLIP not only improved its compositional understanding as measured by benchmarks like SugarCrepe but also maintained its proficiency in object recognition tasks, such as ImageNet.

Looking Forward

While CLoVe marks a significant step towards improving compositionality in VLMs, the journey towards models that fully comprehend and generate compositional language continues. Future efforts could refine synthetic caption generation, address potential biases in model performance across demographics, and extend these techniques to single-tower models. The release of code and pre-trained models opens avenues for further research and application, fostering advances in vision-language modeling.

Concluding Thoughts

In summary, the CLoVe framework represents a substantial advancement in encoding compositional language within contrastive VLMs. By overcoming the existing trade-offs between compositionality and object-centric recognition accuracy, CLoVe sets a new precedent for future developments in the integration of vision and language understanding in AI models.

References (76)
  1. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems, volume 35, pages 23716–23736. Curran Associates, Inc.
  2. Localizing moments in video with natural language. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  3. Improving image generation with better captions.
  4. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O’Reilly Media, Inc.
  5. On the opportunities and risks of foundation models. ArXiv.
  6. COYO-700M: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset.
  7. Santiago Castro and Fabian Caba. 2022. Fitclip: Refining large-scale pretrained image-text models for zero-shot video understanding tasks. In 33rd British Machine Vision Conference 2022, BMVC 2022, London, UK, November 21-24, 2022. BMVA Press.
  8. Scalable performance analysis for vision-language models. In Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023), pages 284–294, Toronto, Canada. Association for Computational Linguistics.
  9. Microsoft COCO Captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.
  10. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818–2829.
  11. Describing textures in the wild. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).
  12. tqdm: A fast, Extensible Progress Bar for Python and CLI.
  13. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255.
  14. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  15. Why is winoground hard? investigating failures in visuolinguistic compositionality. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2236–2250, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  16. Christiane Fellbaum. 2010. Theory and Applications of Ontology: Computer Applications, chapter WordNet. Springer Netherlands, Dordrecht.
  17. Compositionality in visual perception. Behavioral and Brain Sciences, 46:e277.
  18. Array programming with NumPy. Nature, 585(7825):357–362.
  19. Introducing eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. In IGARSS 2018-2018 IEEE International Geoscience and Remote Sensing Symposium, pages 204–207. IEEE.
  20. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing.
  21. Lisa Anne Hendricks and Aida Nematzadeh. 2021. Probing image-language transformers for verb understanding. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3635–3644, Online. Association for Computational Linguistics.
  22. spaCy: Industrial-strength Natural Language Processing in Python.
  23. SugarCrepe: Fixing hackable benchmarks for vision-language compositionality. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
  24. John D Hunter. 2007. Matplotlib: A 2D graphics environment. Computing in science & engineering, 9(03):90–95.
  25. Patching open-vocabulary models by interpolating weights. In Advances in Neural Information Processing Systems, volume 35, pages 29262–29277. Curran Associates, Inc.
  26. Openclip. If you use this software, please cite it as below.
  27. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 4904–4916. PMLR.
  28. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  29. Jupyter Notebooks – a publishing format for reproducible computational workflows. In Positioning and Power in Academic Publishing: Players, Agents and Agendas, pages 87–90, Netherlands. IOS Press.
  30. 3d object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV) Workshops.
  31. Alex Krizhevsky. 2009. Learning multiple layers of features from tiny images. Technical report, University of Toronto.
  32. HMDB: A large video database for human motion recognition. In 2011 International Conference on Computer Vision, pages 2556–2563.
  33. OBELICS: An open web-scale filtered dataset of interleaved image-text documents. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
  34. The MNIST database of handwritten digits.
  35. Datasets: A community library for natural language processing. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 175–184, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  36. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In Fortieth International Conference on Machine Learning.
  37. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 12888–12900. PMLR.
  38. Microsoft COCO: Common objects in context. In Computer Vision – ECCV 2014, pages 740–755, Cham. Springer International Publishing.
  39. Ilya Loshchilov and Frank Hutter. 2017. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations.
  40. Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations.
  41. Crepe: Can vision-language foundation models reason compositionally? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10910–10921.
  42. TorchVision: PyTorch’s computer vision library. https://github.com/pytorch/vision.
  43. RareAct: A video dataset of unusual interactions. arXiv preprint arXiv:2008.01018.
  44. Maria-Elena Nilsback and Andrew Zisserman. 2008. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729.
  45. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
  46. OpenAI. 2023. GPT-4V(ision) System Card. Technical report, OpenAI.
  47. VALSE: A task-independent benchmark for vision and language models centered on linguistic phenomena. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8253–8280, Dublin, Ireland. Association for Computational Linguistics.
  48. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.
  49. Fernando Pérez and Brian E. Granger. 2007. IPython: a system for interactive scientific computing. Computing in Science and Engineering, 9(3):21–29.
  50. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR.
  51. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125.
  52. Zero-shot text-to-image generation. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8821–8831. PMLR.
  53. Cola: A benchmark for compositional text-to-image retrieval. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
  54. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695.
  55. LAION-5B: An open large-scale dataset for training next generation image-text models. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
  56. LAION COCO: 600M synthetic captions from LAION2B-EN.
  57. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114.
  58. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, Melbourne, Australia. Association for Computational Linguistics.
  59. FOIL it! find one mismatch between image and language caption. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 255–265, Vancouver, Canada. Association for Computational Linguistics.
  60. UCF101: A dataset of 101 human actions classes from videos in the wild. CRCV-TR-12-01.
  61. Robyn Speer. 2019. ftfy. Zenodo. Version 5.5.
  62. Ole Tange. 2011. GNU Parallel - the command-line power tool. ;login: The USENIX Magazine, 36(1):42–47.
  63. The Pandas development team. 2023. pandas-dev/pandas: Pandas.
  64. Winoground: Probing vision and language models for visio-linguistic compositionality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5238–5248.
  65. Image captioners are scalable vision learners too. In Thirty-seventh Conference on Neural Information Processing Systems.
  66. Captioning images with diverse objects. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  67. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature methods, 17(3):261–272.
  68. Michael L. Waskom. 2021. seaborn: statistical data visualization. Journal of Open Source Software, 6(60):3021.
  69. Ross Wightman. 2019. PyTorch image models. https://github.com/rwightman/pytorch-image-models.
  70. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
  71. MSR-VTT: A large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  72. Omry Yadan. 2019. Hydra – a framework for elegantly configuring complex applications. Github.
  73. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.
  74. When and why vision-language models behave like bags-of-words, and what to do about it? In The Eleventh International Conference on Learning Representations.
  75. VL-CheckList: Evaluating pre-trained vision-language models with objects, attributes and relations. arXiv preprint arXiv:2207.00221.
  76. Towards automatic learning of procedures from web instructional videos. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1).
Authors (5)
  1. Santiago Castro (14 papers)
  2. Amir Ziai (11 papers)
  3. Avneesh Saluja (7 papers)
  4. Zhuoning Yuan (14 papers)
  5. Rada Mihalcea (131 papers)