You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval (2403.07222v2)
Abstract: Two primary input modalities prevail in image retrieval: sketch and text. While text is widely used for inter-category retrieval tasks, sketches have been established as the sole preferred modality for fine-grained image retrieval due to their ability to capture intricate visual details. In this paper, we question the reliance on sketches alone for fine-grained image retrieval by simultaneously exploring the fine-grained representation capabilities of both sketch and text, orchestrating a duet between the two. The end result enables precise retrievals previously unattainable, allowing users to pose ever-finer queries and incorporate attributes like colour and contextual cues from text. For this purpose, we introduce a novel compositionality framework, effectively combining sketches and text using pre-trained CLIP models, while eliminating the need for extensive fine-grained textual descriptions. Last but not least, our system extends to novel applications in composed image retrieval, domain attribute transfer, and fine-grained generation, providing solutions for various real-world scenarios.
Authors:
- Subhadeep Koley
- Ayan Kumar Bhunia
- Aneeshan Sain
- Pinaki Nath Chowdhury
- Tao Xiang
- Yi-Zhe Song