
Sketch-Based Image Retrieval

Updated 8 August 2025
  • SBIR is a cross-modal retrieval task that matches abstract hand-drawn sketches to detailed photographs, addressing geometric and semantic discrepancies.
  • Deep neural models, including tri-branch CNNs and domain-aware attention mechanisms, enable efficient embedding and robust cross-modal feature alignment.
  • Generative and zero-shot methods leverage techniques like CVAE and semantic embedding to improve retrieval performance on unseen categories.

Sketch-Based Image Retrieval (SBIR) is a cross-modal retrieval task focused on matching hand-drawn sketches to natural images. The domain has evolved rapidly, driven by advances in deep learning, cross-modal embedding, zero-shot learning, hashing, and meta-learning. SBIR is characterized by significant modality and semantic gaps between abstract, variable sketches and dense, detailed photographs, making the development of robust, efficient, and generalizable retrieval systems a central research challenge.

1. Core Challenges in SBIR

SBIR is inherently difficult due to two central issues: the geometric and semantic discrepancy between sketches and photographs, and the efficiency requirements of large-scale retrieval. Free-hand sketches usually emphasize object contours, suffer from geometric distortion (in terms of scale, rotation, and abstraction), and exhibit high intra-class variation. As a result, direct feature comparison is nontrivial, and traditional continuous-valued feature matching (often with $O(Nd)$ complexity) is computationally prohibitive for large datasets (Liu et al., 2017). Furthermore, most early SBIR models rely on training with class labels, tending to learn class-specific associations without truly generalizing to new, unseen categories (Yelamarthi et al., 2018). In fine-grained or zero-shot SBIR (ZS-SBIR), models must extrapolate to novel category instances or object poses, exacerbating the cross-modal gap.

2. Deep Architectures and Feature Alignment

To bridge the domain gap, modern SBIR frameworks leverage deep neural networks to jointly embed sketches and images into spaces amenable to efficient, discriminative retrieval. A representative example is Deep Sketch Hashing (DSH) (Liu et al., 2017), which employs a semi-heterogeneous tri-branch CNN: one branch each for (a) natural images, (b) auxiliary “sketch-token” edge maps (serving as an intermediate domain with contour-level abstractions), and (c) free-hand sketches. By learning end-to-end hash functions that map all modalities into compact binary codes, DSH enables highly efficient, Hamming distance–based large-scale search. Crucially, the use of auxiliary representations (sketch-tokens) mitigates the geometric distortion between sketches and photos, while binary embedding brings computational gains in both time and memory.
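
To make the tri-branch design concrete, the following is a minimal sketch of a semi-heterogeneous three-branch encoder with tanh-relaxed binary codes. It is illustrative only: the layer sizes, the fully separate trunks, and the 3-channel inputs for all modalities are assumptions, not DSH's published architecture.

```python
import torch
import torch.nn as nn

class TriBranchHashNet(nn.Module):
    """Illustrative tri-branch encoder in the spirit of DSH (not the paper's
    exact layers). One branch embeds natural images, one embeds auxiliary
    "sketch-token" edge maps, and one embeds free-hand sketches."""

    def __init__(self, code_len: int = 64):
        super().__init__()
        def conv_branch() -> nn.Sequential:
            # Toy CNN trunk; DSH uses deeper, partially shared stacks.
            return nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, code_len),
            )
        self.image_branch = conv_branch()
        self.edgemap_branch = conv_branch()   # intermediate contour domain
        self.sketch_branch = conv_branch()

    def forward(self, images, edgemaps, sketches):
        # tanh is the usual continuous relaxation of sign; at retrieval time
        # the outputs are binarized with torch.sign for Hamming-space search.
        b_img = torch.tanh(self.image_branch(images))
        b_edge = torch.tanh(self.edgemap_branch(edgemaps))
        b_skt = torch.tanh(self.sketch_branch(sketches))
        return b_img, b_edge, b_skt
```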

Other architectures expand on this idea with domain-aware attention mechanisms, such as the Domain-Aware Squeeze-and-Excitation (DASE) network (Lu et al., 2018), which incorporates explicit binary domain codes (photo vs. sketch) into channel-wise attention recalibration. This enables the network to emphasize modality-relevant channels, adaptively extracting features conducive to cross-domain matching.
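
A minimal sketch of such domain-conditioned channel attention is below; the layer sizes and the way the binary domain code is concatenated with the pooled features are assumptions rather than the exact DASE design.

```python
import torch
import torch.nn as nn

class DomainAwareSE(nn.Module):
    """Squeeze-and-excitation gate conditioned on a binary domain code,
    loosely following the DASE idea (layer sizes are assumptions)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # The extra input unit carries the domain indicator (0 = photo, 1 = sketch).
        self.fc = nn.Sequential(
            nn.Linear(channels + 1, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor, domain: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); domain: (B, 1) with entries in {0., 1.}
        squeezed = x.mean(dim=(2, 3))                 # global average pool ("squeeze")
        gate = self.fc(torch.cat([squeezed, domain], dim=1))
        return x * gate.unsqueeze(-1).unsqueeze(-1)   # channel-wise recalibration
```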

3. Loss Functions for Cross-Modal Discrimination

SBIR models typically enforce cross-modal similarity using loss functions engineered for intra-class compactness and inter-class discrimination. Pairwise and triplet losses are common, but enhanced approaches have emerged. DSH (Liu et al., 2017) introduces a cross-view pairwise loss:

$$\min_{B^I, B^S} \left\| mW - (B^I)^\top B^S \right\|^2 \quad \text{s.t. } B^I, B^S \in \{-1, +1\}^{m \times n}$$

where $W$ encodes semantic similarity between sketch–image pairs and $m$ is the binary code length. Additional semantic factorization losses constrain the binary codes to reflect relationships in an external semantic space, e.g., word2vec.
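
A continuously relaxed version of this objective can be written in a few lines. In the sketch below the binary constraint is handled by tanh-relaxed codes rather than DSH's discrete optimization, which is a simplification.

```python
import torch

def cross_view_pairwise_loss(b_img: torch.Tensor,
                             b_skt: torch.Tensor,
                             W: torch.Tensor) -> torch.Tensor:
    """Relaxed form of the cross-view pairwise term above.

    b_img, b_skt: (n, m) tanh-relaxed codes in (-1, 1); W: (n, n) semantic
    similarity matrix. Scaling W by the code length m matches the range of
    the code inner products, which lie in [-m, m].
    """
    m = b_img.shape[1]                 # code length
    inner = b_img @ b_skt.t()          # (n, n) cross-modal code inner products
    return ((m * W - inner) ** 2).mean()
```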

The Multiplicative Euclidean Margin Softmax (MEMS) loss (Lu et al., 2018) tightens the feature space by enforcing a margin in Euclidean distance: intra-class distances are forced to be $m$ times smaller than inter-class distances,

$$\mathcal{L}_{\text{mems}} = \frac{1}{N} \sum_{i=1}^{N} \left[ -\log \frac{\exp\!\left(-m^2 \|x_i - c_{y_i}\|^2\right)}{\exp\!\left(-m^2 \|x_i - c_{y_i}\|^2\right) + \sum_{j \ne y_i} \exp\!\left(-\|x_i - c_j\|^2\right)} \right]$$

where $x_i$ is an embedding and $c_{y_i}$ the center of class $y_i$. Theoretical analysis shows $m \geq 2 + \sqrt{3}$ is necessary for strict separation.
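
The loss translates directly into code. The sketch below is a straightforward implementation of the formula above; the class centers are assumed to be given (e.g., maintained as learnable parameters alongside the network).

```python
import torch
import torch.nn.functional as F

def mems_loss(x: torch.Tensor, centers: torch.Tensor,
              labels: torch.Tensor, m: float = 2 + 3 ** 0.5) -> torch.Tensor:
    """MEMS loss as written above.

    x: (N, D) embeddings; centers: (K, D) class centers; labels: (N,) ints.
    The default m = 2 + sqrt(3) is the smallest margin giving strict separation.
    """
    d2 = torch.cdist(x, centers) ** 2        # (N, K) squared Euclidean distances
    logits = -d2                              # exp(-||x_i - c_j||^2) terms as logits
    idx = torch.arange(x.shape[0], device=x.device)
    # Apply the multiplicative margin m^2 to the true-class distance only.
    logits[idx, labels] = -(m ** 2) * d2[idx, labels]
    return F.cross_entropy(logits, labels)   # -log softmax matches the formula
```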

4. Generative and Zero-Shot Methods

A critical limitation of early SBIR models is their incapacity to generalize to unseen categories. Recent works address this with generative models and zero-shot learning frameworks (Yelamarthi et al., 2018, Verma et al., 2019). Instead of discriminatively associating sketches with classes, Conditional Variational Autoencoders (CVAE), Conditional Adversarial Autoencoders (CAAE), and inverse autoregressive flow–based VAEs are used to generate plausible image features from sketches. These approaches "hallucinate" image features given a query sketch, allowing retrieval via standard image–image similarity, and are essential for scenarios where test categories do not overlap with training data.

For example, the variational lower bound for a CVAE applied to SBIR is:

$$\mathcal{L}(\phi, \theta; x_{\text{img}}, x_{\text{ske}}) = -\mathrm{D}_{\mathrm{KL}}\!\left(q_\phi(z \mid x_{\text{img}}, x_{\text{ske}}) \,\big\|\, p_\theta(z \mid x_{\text{ske}})\right) + \mathbb{E}\!\left[\log p_\theta(x_{\text{img}} \mid z, x_{\text{ske}})\right]$$

Such stochastic generation is regularized to ensure that reconstructed features retain correspondence to the original sketch (e.g., via reconstruction losses).
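
A minimal sketch of such a sketch-conditioned CVAE over pre-extracted features follows. The feature and latent dimensions, the single-linear-layer encoder/decoder, and the MSE reconstruction surrogate are illustrative assumptions, not the papers' exact designs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SketchCVAE(nn.Module):
    """Minimal CVAE that "hallucinates" an image feature from a sketch
    feature, minimizing the negative of the lower bound above."""

    def __init__(self, feat_dim: int = 512, z_dim: int = 64):
        super().__init__()
        self.enc = nn.Linear(2 * feat_dim, 2 * z_dim)      # q_phi(z | x_img, x_ske)
        self.prior = nn.Linear(feat_dim, 2 * z_dim)        # p_theta(z | x_ske)
        self.dec = nn.Linear(feat_dim + z_dim, feat_dim)   # p_theta(x_img | z, x_ske)

    def forward(self, x_img, x_ske):
        mu_q, logvar_q = self.enc(torch.cat([x_img, x_ske], -1)).chunk(2, -1)
        mu_p, logvar_p = self.prior(x_ske).chunk(2, -1)
        z = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()  # reparameterize
        recon = self.dec(torch.cat([x_ske, z], -1))
        # KL between the diagonal-Gaussian posterior and sketch-conditioned prior.
        kl = 0.5 * (logvar_p - logvar_q
                    + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                    - 1).sum(-1).mean()
        recon_loss = F.mse_loss(recon, x_img)  # stands in for -E[log p(x_img|z,x_ske)]
        return recon_loss + kl
```

At retrieval time one would sample $z$ from the sketch-conditioned prior, decode a hallucinated image feature, and rank gallery images by standard image–image feature similarity.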

Empirical benchmarks show that generative ZS-SBIR methods dramatically outperform discriminative baselines (e.g., CVAE achieves Precision@200 up to 0.333 versus 0.106 for VGG baselines in zero-shot settings (Yelamarthi et al., 2018)). These results indicate that generative approaches can meaningfully infer correspondences beyond the label space encoded in the training set.

5. Semantic Embedding and Side-Information Integration

To enhance generalization and overcome domain gaps, SBIR models increasingly integrate semantic priors from external information sources. SEM-PCYC (Dutta et al., 2019) proposes adversarial learning branches that project both sketch and image features into a shared semantic space informed by side information, such as word embeddings or hierarchical relationships from ontologies (e.g., WordNet). A feature selection auto-encoder compresses and discriminates side information to guide adversarial training, and a cycle-consistency constraint ensures that semantic mappings are invertible without requiring aligned sketch-image pairs.
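
The cycle-consistency idea can be isolated in a short sketch: mapping a visual feature into the semantic space and back should reconstruct it, which is what removes the need for aligned sketch-image pairs. The mappers and dimensions below are hypothetical, and SEM-PCYC's adversarial alignment terms are omitted.

```python
import torch
import torch.nn as nn

# Hypothetical mappers; SEM-PCYC's actual generators/discriminators differ.
to_sem = nn.Linear(512, 300)   # visual feature -> semantic space (e.g., word2vec dim)
to_vis = nn.Linear(300, 512)   # semantic -> visual: the inverse mapping

def cycle_consistency_loss(feat: torch.Tensor) -> torch.Tensor:
    """Cycle term only: a round trip through the semantic space should
    reconstruct the input feature, so the mapping stays invertible without
    paired supervision. Adversarial losses would be added on top."""
    return (to_vis(to_sem(feat)) - feat).abs().mean()
```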

Similarly, knowledge preservation techniques such as SAKE (Liu et al., 2019) extend the teacher–student paradigm to SBIR by fine-tuning ImageNet-trained models for the target benchmark while explicitly regularizing with semantic constraints so as to retain rich features learned in the source domain. This mitigates catastrophic forgetting and improves zero-shot generalization.
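
A common way to instantiate such a knowledge-preservation regularizer is temperature-scaled distillation against the frozen ImageNet teacher. The sketch below assumes this standard form, which may differ from SAKE's exact semantic constraints.

```python
import torch
import torch.nn.functional as F

def knowledge_preservation_loss(student_logits: torch.Tensor,
                                teacher_logits: torch.Tensor,
                                T: float = 4.0) -> torch.Tensor:
    """Teacher-student regularizer in the spirit of SAKE: the fine-tuned
    student is pulled toward the frozen ImageNet teacher's soft predictions,
    mitigating catastrophic forgetting. T is an assumed temperature."""
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    # T^2 rescales gradients to be comparable across temperatures (standard KD).
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T
```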

6. Applications, Efficiency, and System Adaptations

Practical deployment of SBIR in large-scale or real-time scenarios hinges on efficiency and memory constraints. Binary coding frameworks such as DSH (Liu et al., 2017) and UMAP-based reductions (Torres et al., 2021) support extreme compression (e.g., down to 16 bytes per image) without catastrophic loss in performance, which is essential for mobile or web-scale repositories. Interactive and adaptive systems—such as LiveSketch (Collomosse et al., 2019), which iteratively perturbs user queries in the latent space based on relevance feedback—enable users to disambiguate search intent and converge more efficiently on the desired target. Furthermore, compositional SBIR (Black et al., 2021) extends retrieval to multi-object queries with spatial layouts, aggregating per-object embeddings into descriptors that encode composition and applying advanced similarity metrics.
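
As a concrete illustration of why compact binary codes matter at this scale, the following sketch packs 128-bit codes into 16 bytes (matching the compression figure above) and ranks a database by Hamming distance using XOR and a popcount table; the database here is random, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
db_codes = rng.integers(0, 256, size=(1_000_000, 16), dtype=np.uint8)  # 16 B/image
query = rng.integers(0, 256, size=(16,), dtype=np.uint8)               # 128-bit code

# Hamming distance via XOR + popcount over packed bytes: one table lookup per
# byte, instead of a d-dimensional float dot product per database item.
popcount = np.unpackbits(np.arange(256, dtype=np.uint8)[:, None], axis=1).sum(1)
dists = popcount[np.bitwise_xor(db_codes, query)].sum(axis=1)
top10 = np.argsort(dists)[:10]   # indices of the nearest items in Hamming space
```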

Transformers have recently emerged as superior models for SBIR, outperforming CNNs and even human annotators on standard benchmarks (e.g., ViT-based models achieve Recall@1 of 62.25% vs. 54.27% for humans on Sketchy (Seddati et al., 2022)). This advantage is attributed to improved global context representation and mitigation of undesirable flip invariance.

Ongoing directions in SBIR research include:

  • Enhanced domain alignment via multi-modal prompting and adaptive scaling (e.g., DP-CLIP (Gao et al., 29 Apr 2024)), where category-adaptive prompts and channel scaling facilitate zero-shot and fine-grained retrieval by modulating feature maps with both visual and textual cues.
  • Advanced meta-learning solutions such as the Relation-Aware Meta-Learning Network (RAMLN) (Liu et al., 28 Nov 2024), which adaptively learns margin parameters in quadruplet loss formulations using external memory, leveraging bidirectional GRUs to improve cross-modal separation and generalization (a minimal quadruplet-loss sketch follows this list).
  • Explicit disentanglement of structure and appearance to decouple shared semantics from style or domain-specific cues (e.g., STRAD (Li et al., 2019) and StyleMeUp (Sain et al., 2021)), supporting robust retrieval in the presence of significant sketching style variation and unobserved classes.
  • Early and incremental retrieval from incomplete sketches, enabling real-time, user-friendly interfaces (e.g., MGAL (Dai et al., 2022)).
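
As referenced above, a generic quadruplet loss with externally supplied margins might look like the sketch below. RAMLN's memory-augmented meta-learner that predicts those margins is not reproduced, so this formulation is an assumption rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def quadruplet_loss(anchor, pos, neg1, neg2,
                    margin1, margin2) -> torch.Tensor:
    """Quadruplet loss with supplied margins. In RAMLN the margins are
    predicted adaptively by a memory-augmented meta-learner; here they are
    simply passed in as scalars or per-sample tensors."""
    d_ap = F.pairwise_distance(anchor, pos)
    d_an = F.pairwise_distance(anchor, neg1)
    d_nn = F.pairwise_distance(neg1, neg2)   # distance between two negatives
    # Push negatives beyond the positive by margin1, and keep negative pairs
    # farther apart than the anchor-positive pair by margin2.
    return (F.relu(d_ap - d_an + margin1) + F.relu(d_ap - d_nn + margin2)).mean()
```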

A plausible implication is that future SBIR systems will integrate meta-learned adaptive margins, modality-aware prompts, and disentanglement techniques, while leveraging advances in network architectures (transformers) and learned or data-driven semantic priors. Emphasis on efficient representations and interactive learning will further promote deployment in real-world, user-centric environments.


Overall, SBIR research illustrates the importance of tailored architectures, semantic-aware losses, and generative and adaptive strategies for bridging heterogeneous modalities and supporting robust cross-domain, cross-category retrieval under resource and generalization constraints.