Improved Probabilistic Image-Text Representations (2305.18171v5)
Abstract: The Image-Text Matching (ITM) task, a fundamental vision-language (VL) task, suffers from inherent ambiguity arising from multiplicity and imperfect annotations. Deterministic functions are not sufficiently powerful to capture this ambiguity, which has prompted the exploration of probabilistic embeddings. However, the existing probabilistic ITM approach has two key shortcomings: the heavy computational burden of the Monte Carlo approximation, and loss saturation in the face of abundant false negatives. To overcome these issues, this paper presents improved Probabilistic Cross-Modal Embeddings (named PCME++) by introducing a new probabilistic distance with a closed-form solution. In addition, two optimization techniques are proposed to enhance PCME++ further: first, the incorporation of pseudo-positives to prevent the negative effect of massive false negatives; second, mixed sample data augmentation for probabilistic matching. Experimental results on MS-COCO Caption and two extended benchmarks, CxC and ECCV Caption, demonstrate the effectiveness of PCME++ compared to state-of-the-art ITM methods. The robustness of PCME++ is also evaluated under noisy image-text correspondences. In addition, the potential applicability of PCME++ to automatic prompt filtering for zero-shot classification is shown. The code is available at https://github.com/naver-ai/pcmepp
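To illustrate why a closed-form probabilistic distance removes the need for Monte Carlo sampling, the following is a minimal sketch, not the authors' implementation. It assumes each embedding is a Gaussian with diagonal covariance and that the two embeddings are independent; under those assumptions the expected squared Euclidean distance equals the squared distance between the means plus the sum of all variances. The function names and the toy inputs are hypothetical, and the abstract itself does not state which distance PCME++ uses.

```python
import math
import random


def closed_form_expected_sq_distance(mu_a, var_a, mu_b, var_b):
    """E||Za - Zb||^2 for independent diagonal Gaussians (assumed setting)."""
    mean_term = sum((ma - mb) ** 2 for ma, mb in zip(mu_a, mu_b))
    var_term = sum(va + vb for va, vb in zip(var_a, var_b))
    return mean_term + var_term


def monte_carlo_expected_sq_distance(mu_a, var_a, mu_b, var_b,
                                     n_samples=50_000, seed=0):
    """Sampling-based estimate of the same quantity, for comparison only."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        for ma, va, mb, vb in zip(mu_a, var_a, mu_b, var_b):
            za = rng.gauss(ma, math.sqrt(va))  # sample one coordinate of Za
            zb = rng.gauss(mb, math.sqrt(vb))  # sample one coordinate of Zb
            total += (za - zb) ** 2
    return total / n_samples


if __name__ == "__main__":
    # Toy 2-D embeddings (hypothetical values).
    mu_img, var_img = [0.0, 0.0], [0.5, 0.5]
    mu_txt, var_txt = [1.0, 2.0], [1.0, 1.0]
    exact = closed_form_expected_sq_distance(mu_img, var_img, mu_txt, var_txt)
    approx = monte_carlo_expected_sq_distance(mu_img, var_img, mu_txt, var_txt)
    print(f"closed form: {exact:.3f}, Monte Carlo estimate: {approx:.3f}")
```

The closed-form version costs one pass over the embedding dimensions per pair, whereas the sampling version scales with the number of samples and only approximates the same value.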