
You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval (2403.07222v2)

Published 12 Mar 2024 in cs.CV

Abstract: Two primary input modalities prevail in image retrieval: sketch and text. While text is widely used for inter-category retrieval tasks, sketches have been established as the sole preferred modality for fine-grained image retrieval due to their ability to capture intricate visual details. In this paper, we question the reliance on sketches alone for fine-grained image retrieval by simultaneously exploring the fine-grained representation capabilities of both sketch and text, orchestrating a duet between the two. The end result enables precise retrievals previously unattainable, allowing users to pose ever-finer queries and incorporate attributes like colour and contextual cues from text. For this purpose, we introduce a novel compositionality framework, effectively combining sketches and text using pre-trained CLIP models, while eliminating the need for extensive fine-grained textual descriptions. Last but not least, our system extends to novel applications in composed image retrieval, domain attribute transfer, and fine-grained generation, providing solutions for various real-world scenarios.

Authors (6)
  1. Subhadeep Koley
  2. Ayan Kumar Bhunia
  3. Aneeshan Sain
  4. Pinaki Nath Chowdhury
  5. Tao Xiang
  6. Yi-Zhe Song