Probabilistic Language-Image Pre-Training (2410.18857v2)
Abstract: Vision-language models (VLMs) embed aligned image-text pairs into a joint space but often rely on deterministic embeddings, assuming a one-to-one correspondence between images and texts. This oversimplifies real-world relationships, which are inherently many-to-many: multiple captions can describe a single image, and vice versa. We introduce Probabilistic Language-Image Pre-training (ProLIP), the first probabilistic VLM pre-trained on a billion-scale image-text dataset using only probabilistic objectives, achieving strong zero-shot capability (e.g., 74.6% ImageNet zero-shot accuracy with ViT-B/16). ProLIP efficiently estimates uncertainty via an "uncertainty token" without extra parameters. We also introduce a novel inclusion loss that enforces distributional inclusion relationships between image-text pairs and between original and masked inputs. Experiments demonstrate that, by leveraging uncertainty estimates, ProLIP benefits downstream tasks and aligns with intuitive notions of uncertainty, e.g., shorter texts being more uncertain and more general inputs including specific ones. Utilizing text uncertainties, we further improve ImageNet accuracy from 74.6% to 75.8% (under a few-shot setting), supporting the practical advantages of our probabilistic approach. The code is available at https://github.com/naver-ai/prolip
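To make the idea of probabilistic embeddings and asymmetric "inclusion" concrete, the toy sketch below models an image or text embedding as a diagonal Gaussian (mean plus per-dimension variance) and uses the asymmetry of KL divergence as a stand-in for an inclusion relationship: a broad ("general") distribution can cover a narrow ("specific") one cheaply, but not the reverse. This is an illustrative proxy only; the function name `kl_diag_gauss` and the example distributions are hypothetical, and ProLIP's actual inclusion loss is defined in the paper, not reproduced here.

```python
import numpy as np

def kl_diag_gauss(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for diagonal Gaussians. Asymmetric, so it can serve as
    a toy proxy for 'q includes p' (illustrative only, not ProLIP's loss)."""
    return 0.5 * np.sum(
        var_p / var_q
        + (mu_q - mu_p) ** 2 / var_q
        - 1.0
        + np.log(var_q / var_p)
    )

# A specific input (low variance) vs. a general one (high variance),
# sharing the same mean for simplicity.
mu_specific, var_specific = np.zeros(4), np.full(4, 0.1)
mu_general, var_general = np.zeros(4), np.full(4, 1.0)

# KL(specific || general) is small: the broad distribution covers the narrow one.
# KL(general || specific) is large: the narrow one cannot cover the broad one.
forward = kl_diag_gauss(mu_specific, var_specific, mu_general, var_general)
backward = kl_diag_gauss(mu_general, var_general, mu_specific, var_specific)
print(forward < backward)  # the asymmetry that motivates an inclusion-style objective
```

The same asymmetry underlies the paper's intuition that shorter, more general texts should have higher uncertainty and "include" more specific inputs.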
Authors: Sanghyuk Chun, Wonjae Kim, Song Park, Sangdoo Yun