
FreestyleRet: Retrieving Images from Style-Diversified Queries (2312.02428v2)

Published 5 Dec 2023 in cs.CV and cs.IR

Abstract: Image retrieval aims to retrieve corresponding images based on a given query. In application scenarios, users intend to express their retrieval intent through various query styles. However, current retrieval tasks predominantly focus on text-query retrieval, leading to limited query options and potential ambiguity or bias in user intention. In this paper, we propose the Style-Diversified Query-Based Image Retrieval task, which enables retrieval based on various query styles. To facilitate this novel setting, we propose the first Diverse-Style Retrieval dataset, encompassing diverse query styles including text, sketch, low-resolution, and art. We also propose a lightweight style-diversified retrieval framework. For various query style inputs, we apply the Gram matrix to extract the query's textural features and cluster them into a style space with style-specific bases. Then we employ the style-init prompt tuning module to enable the visual encoder to comprehend the texture and style information of the query. Experiments demonstrate that our model, employing the style-init prompt tuning strategy, outperforms existing retrieval models on the style-diversified retrieval task. Moreover, style-diversified queries (sketch+text, art+text, etc.) can be combined for retrieval in our model. The auxiliary information from other queries enhances the retrieval performance within the respective query.
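
The Gram-matrix step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `gram_matrix` and `nearest_style_basis`, the `(C, H, W)` feature layout, the normalization by `H * W`, and the Euclidean nearest-basis assignment are all assumptions made for illustration. It only shows how channel correlations of a feature map can serve as a texture descriptor that is then matched against a set of style bases.

```python
import numpy as np


def gram_matrix(features: np.ndarray) -> np.ndarray:
    """Return the (C, C) Gram matrix of a (C, H, W) feature map.

    Each entry is the spatially averaged product of two channels,
    a common texture/style descriptor (normalization is an assumption).
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T / (h * w)


def nearest_style_basis(gram: np.ndarray, bases: np.ndarray) -> int:
    """Return the index of the basis closest to the flattened Gram matrix.

    `bases` is a hypothetical (K, C*C) array of style-specific bases,
    for example cluster centers obtained beforehand.
    """
    dists = np.linalg.norm(bases - gram.reshape(1, -1), axis=1)
    return int(np.argmin(dists))


# Toy usage: a constant 2-channel feature map and two candidate bases.
features = np.ones((2, 3, 3))
gram = gram_matrix(features)
bases = np.stack([np.zeros(4), np.ones(4)])
style_index = nearest_style_basis(gram, bases)
```

In the toy usage, the Gram matrix of the constant feature map has every entry equal to 1, so it matches the second basis exactly.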
