FreestyleRet: Retrieving Images from Style-Diversified Queries (2312.02428v2)
Abstract: Image retrieval aims to retrieve the images that correspond to a given query. In application scenarios, users want to express their retrieval intent through various query styles. However, current retrieval work focuses predominantly on text queries, which limits the available query options and can introduce ambiguity or bias in conveying user intent. In this paper, we propose the Style-Diversified Query-Based Image Retrieval task, which enables retrieval based on various query styles. To facilitate this novel setting, we propose the first Diverse-Style Retrieval dataset, encompassing diverse query styles including text, sketch, low-resolution, and art. We also propose a lightweight style-diversified retrieval framework. For query inputs of various styles, we apply the Gram matrix to extract the query's textural features and cluster them into a style space with style-specific bases. We then employ the style-init prompt tuning module to enable the visual encoder to comprehend the texture and style information of the query. Experiments demonstrate that our model, employing the style-init prompt tuning strategy, outperforms existing retrieval models on the style-diversified retrieval task. Moreover, our model can retrieve with several query styles simultaneously (sketch+text, art+text, etc.). The auxiliary information from the other queries enhances retrieval performance for each individual query.
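The Gram-matrix step mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the `(C, H, W)` feature layout, the normalisation by `H * W`, the Euclidean nearest-basis rule, and the function names `gram_matrix` and `nearest_style_basis` are all assumptions made for illustration. The abstract only states that Gram matrices capture textural features, which are then clustered into a style space with style-specific bases.

```python
import numpy as np


def gram_matrix(features: np.ndarray) -> np.ndarray:
    """Channel-by-channel Gram matrix of a (C, H, W) feature map.

    Assumption: normalise by the number of spatial positions (H * W).
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w)   # one row per channel
    return flat @ flat.T / (h * w)      # (C, C) channel correlations


def nearest_style_basis(gram: np.ndarray, bases: np.ndarray) -> int:
    """Index of the closest style basis (hypothetical assignment rule).

    `bases` has shape (K, C, C); distance is Euclidean on flattened matrices.
    """
    diffs = bases.reshape(len(bases), -1) - gram.reshape(1, -1)
    return int(np.argmin(np.linalg.norm(diffs, axis=1)))


# Toy usage with random features standing in for encoder activations.
rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 8, 8))
g = gram_matrix(feat)
bases = np.stack([g + 1.0, g])          # second basis equals the query's Gram matrix
idx = nearest_style_basis(g, bases)
```

In this toy setup the bases are built by hand. In practice they would come from clustering Gram features over a training set, for example with k-means.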