- The paper introduces a novel prompt learning adaptation for CLIP, enabling robust zero-shot sketch-based retrieval across unseen categories.
- It employs regularization loss and patch shuffling to align sketches with photos at the instance level, enhancing fine-grained retrieval.
- Empirical results show significant improvements, with gains of approximately 24.8% for category-level and 26.9% for fine-grained tasks.
An Overview of "CLIP for All Things Zero-Shot Sketch-Based Image Retrieval, Fine-Grained or Not"
This paper investigates the application of CLIP, a prominent vision-language pre-trained model, to the domain of Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR), extending its utility to both category-level and fine-grained settings. Distinct from traditional SBIR methods, which tend to falter due to data scarcity and limited generalization, the authors leverage CLIP's inherent semantic understanding and generalization capabilities to enhance performance significantly in ZS-SBIR tasks.
Key Contributions
The core contribution of the paper is a novel adaptation of CLIP to ZS-SBIR via prompt learning. This adaptation harnesses CLIP's ability to model a rich semantic latent space, providing a robust foundation for sketch-photo retrieval across unseen categories. The work addresses both category-level and fine-grained ZS-SBIR challenges:
- Category-level ZS-SBIR: The authors present a prompt learning setup that enables CLIP to recognize category-specific traits in sketches and photos. By introducing sketch-specific prompts, they improve over existing methods by a considerable margin. The approach requires only minimal fine-tuning of CLIP's encoders, retaining the model's broad generalization abilities.
- Fine-grained ZS-SBIR: Recognizing the increased complexity of fine-grained retrieval, the authors innovate with two solutions to address the task effectively:
- Regularization Loss: To ensure a consistent separation between sketches and photos across varied categories, they introduce a regularization term that keeps the relative sketch-photo distance consistent throughout the shared embedding space.
- Patch Shuffling: This technique supports instance-level matching by shuffling the patches of a sketch and its paired photo in a controlled way, forcing the model to learn structural correspondences between the two.
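To make the two fine-grained solutions concrete, here is a minimal PyTorch sketch under stated assumptions: the grid size, function names, and the exact loss form are illustrative, not the paper's precise design. Shuffling a sketch and its paired photo with the same patch permutation preserves their patch-wise correspondence, and a simple variance penalty keeps paired sketch-photo distances uniform across a batch.

```python
import torch

def patch_shuffle(images, grid=4, perm=None):
    """Split images into a grid of non-overlapping patches and shuffle them.
    Applying the SAME permutation to a sketch and its paired photo keeps
    their patch-wise correspondence intact, which is what supports
    instance-level matching. Grid size and API are illustrative assumptions."""
    b, c, h, w = images.shape
    ph, pw = h // grid, w // grid
    # (B, C, H, W) -> (B, grid*grid, C, ph, pw): one entry per patch.
    patches = (images.reshape(b, c, grid, ph, grid, pw)
                     .permute(0, 2, 4, 1, 3, 5)
                     .reshape(b, grid * grid, c, ph, pw))
    if perm is None:
        perm = torch.randperm(grid * grid)
    patches = patches[:, perm]
    # Reassemble the shuffled patches into an image of the original size.
    out = (patches.reshape(b, grid, grid, c, ph, pw)
                  .permute(0, 3, 1, 4, 2, 5)
                  .reshape(b, c, h, w))
    return out, perm

def distance_consistency_loss(sk_feat, ph_feat):
    """Penalize the variance of paired sketch-photo distances in a batch,
    encouraging a uniform relative separation across categories (one
    illustrative reading of the regularization, not the paper's exact loss)."""
    d = (sk_feat - ph_feat).norm(dim=1)
    return ((d - d.mean()) ** 2).mean()

# Shuffle a sketch batch and its photo batch with a shared permutation.
sketch = torch.randn(2, 3, 224, 224)
photo = torch.randn(2, 3, 224, 224)
shuffled_sketch, perm = patch_shuffle(sketch)
shuffled_photo, _ = patch_shuffle(photo, perm=perm)
```

A quick sanity check on the patch bookkeeping: passing the identity permutation reconstructs the input image exactly.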
Numerical Results and Implications
The paper reports substantial gains over current ZS-SBIR state-of-the-art baselines: approximately 24.8% for category-level and 26.9% for fine-grained retrieval. Such results underscore the efficacy of combining CLIP's broad semantic knowledge with tailored prompt learning.
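To make the prompt-learning adaptation concrete, here is a minimal CoOp-style sketch: learnable context vectors are prepended to frozen class-token embeddings before the text encoder. The dimensions, module name, and initialization are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PromptLearner(nn.Module):
    """Learnable context vectors shared across classes and prepended to
    frozen class-token embeddings (a CoOp-style sketch; all sizes here
    are illustrative assumptions)."""
    def __init__(self, n_classes=10, n_ctx=8, dim=512):
        super().__init__()
        # Only these context vectors receive gradients; CLIP's encoders
        # stay frozen, preserving the model's broad generalization.
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        # Stand-in for CLIP's frozen token embeddings of the class names.
        self.register_buffer("cls_emb", torch.randn(n_classes, 1, dim))

    def forward(self):
        n_classes = self.cls_emb.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_classes, -1, -1)
        # Prompt for class i: [ctx_1 ... ctx_n][class_i token].
        return torch.cat([ctx, self.cls_emb], dim=1)

prompts = PromptLearner()()
print(prompts.shape)  # torch.Size([10, 9, 512])
```

In training, these prompts would pass through CLIP's frozen text encoder; only the context vectors (and, per the paper, a sketch-specific variant of them) are updated.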
Practical and Theoretical Implications
This work solidifies the potential of foundation models like CLIP in handling domain-specific tasks such as ZS-SBIR, particularly in the context of tackling challenges posed by data scarcity in sketch datasets. From a practical perspective, the introduction of prompt learning as a bridge to adapt and apply large-scale pre-trained models to narrower domains presents a paradigm shift in how future SBIR systems might be designed. Theoretically, it opens avenues for further exploration into enhancing cross-modal retrieval tasks by integrating powerful vision-language models.
Future Developments
The successful demonstration of CLIP's capabilities in ZS-SBIR hints at broader applications in other sketch-related fields and tasks exhibiting data paucity. Future research could focus on refining prompt learning techniques, exploring alternative model architectures, and extending these methodologies to broader datasets and tasks. As foundational models evolve, leveraging their full potential in diverse niche applications will likely emerge as a vital research direction in artificial intelligence.
In conclusion, the paper provides a crucial step toward integrating large-scale pre-trained models into specialized tasks, particularly within the burgeoning field of sketch-based image retrieval. It illustrates clear pathways for both extending state-of-the-art capabilities and addressing long-standing challenges in ZS-SBIR.