Improved Probabilistic Image-Text Representations (2305.18171v5)

Published 29 May 2023 in cs.CV and cs.LG

Abstract: Image-Text Matching (ITM) task, a fundamental vision-language (VL) task, suffers from the inherent ambiguity arising from multiplicity and imperfect annotations. Deterministic functions are not sufficiently powerful to capture ambiguity, prompting the exploration of probabilistic embeddings to tackle the challenge. However, the existing probabilistic ITM approach encounters two key shortcomings; the burden of heavy computations due to the Monte Carlo approximation, and the loss saturation issue in the face of abundant false negatives. To overcome the issues, this paper presents an improved Probabilistic Cross-Modal Embeddings (named PCME++) by introducing a new probabilistic distance with a closed-form solution. In addition, two optimization techniques are proposed to enhance PCME++ further: first, the incorporation of pseudo-positives to prevent the negative effect under massive false negatives; second, mixed sample data augmentation for probabilistic matching. Experimental results on MS-COCO Caption and two extended benchmarks, CxC and ECCV Caption, demonstrate the effectiveness of PCME++ compared to state-of-the-art ITM methods. The robustness of PCME++ is also evaluated under noisy image-text correspondences. In addition, the potential applicability of PCME++ in automatic prompt-filtering for zero-shot classification is shown. The code is available at https://github.com/naver-ai/pcmepp


Summary

  • The paper introduces PCME++, a probabilistic embedding method that outperforms state-of-the-art deterministic and probabilistic approaches in image-text matching.
  • It addresses loss saturation and false negatives by incorporating Pseudo-Positives and Mixed Sample Data Augmentation.
  • It achieves scalable efficiency with a closed-form sampled distance, enabling integration with large-scale ANN search systems.

Overview of Improved Probabilistic Image-Text Representations

The paper "Improved Probabilistic Image-Text Representations" addresses the challenges inherent in Image-Text Matching (ITM) tasks, which stem from the ambiguity introduced by many-to-many correspondences and imperfect annotations. The deterministic methods traditionally employed in ITM fall short due to their inability to appropriately capture such ambiguities. Thus, this research explores the use of probabilistic embeddings to enhance cross-modal representations addressing computational and loss saturation challenges faced by previous probabilistic ITM approaches.

Key Contributions

  1. Probabilistic Cross-Modal Embeddings (PCME++): The paper introduces PCME++, an improved probabilistic embedding method for ITM. It employs a new probabilistic distance with a closed-form solution, avoiding the expensive Monte Carlo approximation required by prior probabilistic approaches while improving accuracy.
  2. Handling Loss Saturation and False Negatives:
    • Pseudo-Positives (PP): This technique supplies additional positive supervision to counteract the loss saturation caused by the abundant false negatives in the dataset.
    • Mixed Sample Data Augmentation (MSDA): Mixup- and CutMix-style augmentations are adapted to the probabilistic matching objective, further strengthening the resilience of PCME++ to dataset imperfections.
  3. Efficiency and Scalability: By introducing a closed-form sampled distance (CSD), PCME++ markedly reduces the computational burden associated with probabilistic embeddings, enabling scalability to larger datasets. This makes it applicable to large-scale image-text datasets and facilitates straightforward integration with existing approximate nearest neighbor (ANN) search systems like FAISS.
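The closed-form distance can be illustrated for diagonal Gaussian embeddings: if z_v ~ N(μ_v, diag(σ_v²)) and z_t ~ N(μ_t, diag(σ_t²)) are independent, the expected squared Euclidean distance decomposes as E‖z_v − z_t‖² = ‖μ_v − μ_t‖² + Σᵢ(σ_{v,i}² + σ_{t,i}²), requiring no sampling. A minimal NumPy sketch (variable names are illustrative, not taken from the paper's code) compares this closed form against a Monte Carlo estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8

# Hypothetical image/text embeddings as diagonal Gaussians N(mu, diag(sigma^2)).
mu_v, sigma_v = rng.normal(size=D), rng.uniform(0.5, 1.5, size=D)
mu_t, sigma_t = rng.normal(size=D), rng.uniform(0.5, 1.5, size=D)

def csd(mu_a, sigma_a, mu_b, sigma_b):
    """Closed-form expected squared distance between independent diagonal
    Gaussians: ||mu_a - mu_b||^2 + sum(sigma_a^2 + sigma_b^2)."""
    return np.sum((mu_a - mu_b) ** 2) + np.sum(sigma_a ** 2 + sigma_b ** 2)

# Monte Carlo estimate of E||z_v - z_t||^2 for comparison.
n = 200_000
z_v = mu_v + sigma_v * rng.normal(size=(n, D))
z_t = mu_t + sigma_t * rng.normal(size=(n, D))
mc = np.mean(np.sum((z_v - z_t) ** 2, axis=1))

print(csd(mu_v, sigma_v, mu_t, sigma_t), mc)  # the two values agree closely
```

Because the closed form splits into mean and variance terms, distances of this shape can be served by standard vector indexes, which is what makes the ANN integration mentioned above straightforward.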
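How mixed sample data augmentation interacts with matching labels can be sketched in a simplified form; the function name, Beta parameter, and soft-label rule below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(1)

def mixup_batch(images, match_labels, alpha=0.2):
    """Mix each image with a shuffled partner; the binary match label becomes
    a soft label weighted by the mixing ratio (hypothetical sketch)."""
    lam = rng.beta(alpha, alpha)          # mixing ratio drawn from Beta(alpha, alpha)
    perm = rng.permutation(len(images))   # random partner for each sample
    mixed = lam * images + (1 - lam) * images[perm]
    soft = lam * match_labels + (1 - lam) * match_labels[perm]
    return mixed, soft

imgs = rng.normal(size=(4, 3, 8, 8))          # toy batch of 4 images
labels = np.array([1.0, 0.0, 1.0, 0.0])       # binary image-text match labels
mixed, soft = mixup_batch(imgs, labels)
```

The soft labels fall between 0 and 1, which is why a probabilistic matching objective, unlike a hard triplet loss, can consume them directly.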

Evaluation and Results

The experimental evaluation demonstrates that PCME++ consistently outperforms state-of-the-art ITM methods on MS-COCO Caption and its extended benchmarks, CxC and ECCV Caption. Notably, PCME++ remains robust under noisy image-text correspondences, a critical property for real-world applications. The paper also shows that PCME++ can improve zero-shot classification via automatic prompt-filtering, demonstrating applicability beyond traditional ITM tasks.
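A hedged sketch of uncertainty-based prompt-filtering: assuming the text encoder predicts a variance per prompt embedding, prompts with high total variance can be discarded before building zero-shot classifiers. The prompt templates, scoring rule, and keep-half threshold below are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical prompt templates and stand-in per-dimension variances that a
# probabilistic text encoder might predict for each template.
prompts = ["a photo of a {}.", "a blurry photo of a {}.",
           "a sketch of a {}.", "a bad photo of the {}."]
sigma_sq = rng.uniform(0.1, 2.0, size=(len(prompts), 16))

# Uncertainty score per prompt: total predicted variance.
scores = sigma_sq.sum(axis=1)

# Keep the half of the prompts with the lowest uncertainty.
keep = np.argsort(scores)[: len(prompts) // 2]
filtered = [prompts[i] for i in keep]
print(filtered)
```

The filtered templates would then be used to build the zero-shot class embeddings in the usual way, with the noisier templates excluded.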

Theoretical Implications

Theoretically, this research contributes to the understanding and application of probabilistic embeddings within the domains of vision and language, specifically regarding how uncertainty can be leveraged to address dataset ambiguity and false negatives. This aligns with broader efforts to transcend deterministic methodologies in AI with probabilistic approaches that inherently account for uncertainty and variability in data.

Practical Implications and Future Directions

The practical implications of this research are substantial, both in terms of improved image-text retrieval performance and in facilitating scalability to large datasets. The potential for PCME++ to enhance zero-shot learning through automatic prompt selection further highlights its versatility. Future developments may explore extending this probabilistic methodology to other types of data representations and embeddings, as well as investigating alternative probabilistic distributions or densities that could provide even greater efficacy.

In conclusion, this paper provides a comprehensive solution to longstanding issues in ITM tasks, making it a significant contribution to the field of vision-language research. By transitioning from deterministic to probabilistic frameworks, PCME++ sets a precedent for future explorations in probabilistic AI methodologies.
