
GeneCIS: A Benchmark for General Conditional Image Similarity (2306.07969v1)

Published 13 Jun 2023 in cs.CV, cs.AI, cs.LG, and cs.MM

Abstract: We argue that there are many notions of 'similarity' and that models, like humans, should be able to adapt to these dynamically. This contrasts with most representation learning methods, supervised or self-supervised, which learn a fixed embedding function and hence implicitly assume a single notion of similarity. For instance, models trained on ImageNet are biased towards object categories, while a user might prefer the model to focus on colors, textures or specific elements in the scene. In this paper, we propose the GeneCIS ('genesis') benchmark, which measures models' ability to adapt to a range of similarity conditions. Extending prior work, our benchmark is designed for zero-shot evaluation only, and hence considers an open set of similarity conditions. We find that baselines from powerful CLIP models struggle on GeneCIS and that performance on the benchmark is only weakly correlated with ImageNet accuracy, suggesting that simply scaling existing methods is not fruitful. We further propose a simple, scalable solution based on automatically mining information from existing image-caption datasets. We find our method offers a substantial boost over the baselines on GeneCIS, and further improves zero-shot performance on related image retrieval benchmarks. In fact, though evaluated zero-shot, our model surpasses state-of-the-art supervised models on MIT-States. Project page at https://sgvaze.github.io/genecis/.
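
To make the idea of conditional similarity concrete, here is a minimal toy sketch in Python. It is not the paper's method or evaluation code. It assumes that a query is formed by averaging a reference-image embedding with a condition embedding (a hypothetical baseline combiner), that gallery images are ranked by cosine similarity, and that retrieval is scored with Recall@K, a common retrieval metric. All function names and the three-dimensional embeddings are invented for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def combine(image_emb, condition_emb):
    """Hypothetical combiner: average the normalized image and condition embeddings."""
    img, cond = normalize(image_emb), normalize(condition_emb)
    return normalize([i + c for i, c in zip(img, cond)])


def rank_gallery(query, gallery):
    """Return gallery indices sorted from most to least similar to the query."""
    scores = [cosine(query, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)


def recall_at_k(rankings, targets, k):
    """Fraction of queries whose target index appears in the top-k ranking."""
    hits = sum(1 for r, t in zip(rankings, targets) if t in r[:k])
    return hits / len(targets)


# Toy data (made up): axis 0 ~ "object identity", axis 1 ~ "conditioned attribute".
reference_image = [1.0, 0.0, 0.0]
condition = [0.0, 1.0, 0.0]
gallery = [
    [1.0, 0.0, 0.0],  # same object, ignores the condition
    [0.7, 0.7, 0.0],  # same object and matches the condition (the target)
    [0.0, 0.0, 1.0],  # unrelated image
]
target = 1

image_only_ranking = rank_gallery(reference_image, gallery)
conditioned_ranking = rank_gallery(combine(reference_image, condition), gallery)

print("image-only Recall@1: ", recall_at_k([image_only_ranking], [target], 1))
print("conditioned Recall@1:", recall_at_k([conditioned_ranking], [target], 1))
```

In this toy setup the fixed, image-only embedding ranks the unconditioned look-alike first, while the conditioned query retrieves the target. That is the gap between a single notion of similarity and a condition-dependent one that the abstract describes.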
