Probabilistic Language-Image Pre-Training (2410.18857v2)

Published 24 Oct 2024 in cs.CV and cs.LG

Abstract: Vision-language models (VLMs) embed aligned image-text pairs into a joint space but often rely on deterministic embeddings, assuming a one-to-one correspondence between images and texts. This oversimplifies real-world relationships, which are inherently many-to-many, with multiple captions describing a single image and vice versa. We introduce Probabilistic Language-Image Pre-training (ProLIP), the first probabilistic VLM pre-trained on a billion-scale image-text dataset using only probabilistic objectives, achieving a strong zero-shot capability (e.g., 74.6% ImageNet zero-shot accuracy with ViT-B/16). ProLIP efficiently estimates uncertainty by an "uncertainty token" without extra parameters. We also introduce a novel inclusion loss that enforces distributional inclusion relationships between image-text pairs and between original and masked inputs. Experiments demonstrate that, by leveraging uncertainty estimates, ProLIP benefits downstream tasks and aligns with intuitive notions of uncertainty, e.g., shorter texts being more uncertain and more general inputs including specific ones. Utilizing text uncertainties, we further improve ImageNet accuracy from 74.6% to 75.8% (under a few-shot setting), supporting the practical advantages of our probabilistic approach. The code is available at https://github.com/naver-ai/prolip

Authors (4)
  1. Sanghyuk Chun (49 papers)
  2. Wonjae Kim (25 papers)
  3. Song Park (12 papers)
  4. Sangdoo Yun (71 papers)

Summary

Overview of Probabilistic Language-Image Pre-Training (ProLIP)

Probabilistic Language-Image Pre-Training (ProLIP) is an approach to handling the inherent ambiguity of image-text data in vision-language models (VLMs). Traditional VLMs, such as CLIP, rely on deterministic embeddings that map images and their text descriptions into a joint latent space. In real-world data, however, the relationship between images and texts is inherently many-to-many: multiple descriptions can correspond to a single image and vice versa. This paper introduces ProLIP, the first probabilistic VLM pre-trained solely with probabilistic objectives on a billion-scale image-text dataset, demonstrating strong zero-shot capabilities.

ProLIP diverges from previous deterministic models by using probabilistic embeddings, which better represent the uncertainty and variability present in image-text pairs. It estimates uncertainty with an "uncertainty token" ([UNC]) that requires no extra model parameters, a distinct advantage over earlier probabilistic models that needed additional modules for uncertainty estimation.
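
To make this mechanism concrete, the sketch below shows one way an uncertainty token can be wired into a transformer encoder so that it emits a mean and a per-dimension log-variance, i.e., a diagonal Gaussian embedding. It is a minimal illustration under stated assumptions: the class name ProbEmbeddingHead, the separate mean/log-variance projections, and reading the mean from the first token while reading the variance from the appended [UNC] position are choices made here for clarity, not ProLIP's actual implementation.

```python
import torch
import torch.nn as nn

class ProbEmbeddingHead(nn.Module):
    """Illustrative wrapper: append a learnable [UNC] token to the input sequence
    and read a diagonal-Gaussian embedding (mean + log-variance) from the encoder
    output. Names and layout are assumptions, not ProLIP's exact architecture."""

    def __init__(self, encoder: nn.Module, width: int, embed_dim: int):
        super().__init__()
        self.encoder = encoder                                   # any transformer mapping (B, L, width) -> (B, L, width)
        self.unc_token = nn.Parameter(torch.zeros(1, 1, width))  # the extra [UNC] token
        self.mean_proj = nn.Linear(width, embed_dim)
        self.logvar_proj = nn.Linear(width, embed_dim)

    def forward(self, tokens: torch.Tensor):
        b = tokens.size(0)
        unc = self.unc_token.expand(b, -1, -1)
        h = self.encoder(torch.cat([tokens, unc], dim=1))        # (B, L + 1, width)
        mu = self.mean_proj(h[:, 0])                             # mean read from the usual [CLS]/first position
        logvar = self.logvar_proj(h[:, -1])                      # per-dimension log-variance read from the [UNC] position
        return mu, logvar
```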

Key Contributions

ProLIP offers several key contributions:

  1. Probabilistic Embeddings with [UNC] Token: An [UNC] token appended to the input lets the encoder estimate uncertainty as the variances of a diagonal Gaussian, with minimal computational overhead compared to deterministic approaches.
  2. Inclusion Loss: The paper introduces a novel inclusion loss that enforces distributional inclusion relationships between image-text pairs and between original and masked inputs, so that a more general distribution (e.g., a masked or less specific input) includes the more specific one. This yields embeddings that align with hierarchical, human-intuitive notions of inclusion.
  3. Probabilistic Pairwise Contrastive Loss (PPCL): This loss uses a simplified log-sigmoid formulation that stabilizes training, avoiding the instabilities of the probabilistic matching losses used in prior work such as PCME++ (see the sketch after this list).
  4. Efficient Training and Scalability: ProLIP can be trained from scratch on billion-scale data and achieves competitive zero-shot performance without fine-tuning.
  5. Zero-shot and Few-shot Performance: ProLIP achieves strong zero-shot performance, 74.6% ImageNet accuracy with ViT-B/16, which rises to 75.8% when text uncertainties are used to re-weight prompts in a few-shot setting.
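
As a concrete companion to item 3, the sketch below implements a pairwise log-sigmoid contrastive loss over a closed-form distance between diagonal Gaussian embeddings, in the spirit of sigmoid-based contrastive objectives. The particular distance (squared mean difference plus summed variances), the learnable scale `a` and bias `b`, and the +1/-1 labeling of matched versus unmatched pairs are assumptions for illustration; the paper defines the exact PPCL formulation.

```python
import torch
import torch.nn.functional as F

def gaussian_pair_distance(mu_a, logvar_a, mu_b, logvar_b):
    """All-pairs squared distance between diagonal Gaussians:
    ||mu_a - mu_b||^2 plus the summed variances (illustrative choice)."""
    sq = ((mu_a[:, None, :] - mu_b[None, :, :]) ** 2).sum(-1)               # (B, B)
    var = logvar_a.exp().sum(-1)[:, None] + logvar_b.exp().sum(-1)[None, :] # (B, B)
    return sq + var

def ppcl_sketch(mu_img, logvar_img, mu_txt, logvar_txt, a, b):
    """Log-sigmoid pairwise loss: matched (diagonal) pairs are pushed toward
    small distances, unmatched pairs toward large ones; `a` and `b` are
    learnable scalars, as in sigmoid-based contrastive losses."""
    d = gaussian_pair_distance(mu_img, logvar_img, mu_txt, logvar_txt)
    labels = 2 * torch.eye(d.size(0), device=d.device) - 1                  # +1 on-diagonal, -1 elsewhere
    return -F.logsigmoid(labels * (-a * d + b)).mean()
```

Because every image-text pair contributes an independent binary term, this style of loss avoids the batch-wide softmax normalization of standard contrastive losses, which is one reason log-sigmoid formulations tend to be easier to stabilize.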

Practical and Theoretical Implications

From a theoretical standpoint, ProLIP advances our understanding of probabilistic representation learning by demonstrating how probabilistic embeddings can be effectively integrated into VLMs to better capture data ambiguity. This framework aligns closely with real-world scenarios where multiple valid interpretations exist.

Practically, the model's ability to efficiently handle uncertainty provides tangible benefits in downstream tasks, such as improved zero-shot classification. The architecture is designed to integrate seamlessly into existing computational setups due to its efficiency, promising adoption in large-scale applications where both computation and storage are at a premium.
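
As an example of how uncertainty estimates could feed into zero-shot classification, the sketch below re-weights prompt templates by their estimated text uncertainty when building a class embedding: prompts whose embeddings have lower total variance contribute more. The function name, the inverse-variance softmax weighting, and the temperature parameter are hypothetical choices for illustration, not ProLIP's exact re-weighting procedure.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def uncertainty_weighted_class_embedding(mu_prompts, logvar_prompts, temperature=1.0):
    """Combine per-prompt text embeddings for one class into a single class
    embedding, down-weighting uncertain prompts.
    mu_prompts: (P, D) means for P prompt templates; logvar_prompts: (P, D) log-variances.
    The inverse-variance softmax weighting is an illustrative assumption."""
    total_var = logvar_prompts.exp().sum(dim=-1)              # (P,) total uncertainty per prompt
    weights = torch.softmax(-total_var / temperature, dim=0)  # more certain prompts get larger weight
    class_embed = (weights[:, None] * mu_prompts).sum(dim=0)  # (D,)
    return F.normalize(class_embed, dim=-1)
```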

Future Directions

The probabilistic nature of ProLIP opens up several avenues for future exploration:

  • Exploration of Alternative Probability Distributions: While Gaussian distributions are effective, exploring alternative families such as von Mises-Fisher distributions could yield insights into more compact representations.
  • Broader Applications: Extending ProLIP to other domains where many-to-many associations exist, such as genomics or multimodal healthcare data, could provide significant breakthroughs.
  • Continual Learning Frameworks: ProLIP could be integrated with continual learning paradigms, with uncertainty estimates guiding the selection of relevant data for model updates.

In conclusion, ProLIP represents a significant step forward in the evolution of VLMs, aligning probabilistic modeling with the many-to-many variability found in real-world image-text data. The methodology sets a new benchmark for handling uncertainty in machine learning models, with implications that extend beyond the immediate scope of vision-language tasks.