Scene-Level SBIR: Challenges & Techniques
- Scene-level SBIR is the task of retrieving semantically complex scenes from free-hand sketch queries, addressing challenges such as abstraction, spatial relationships, and partially drawn content.
- Key methodologies include pixel-to-concept aggregation, set-based matching with optimal transport, and shared-encoder Siamese architectures for robust cross-modal alignment.
- Recent approaches improve retrieval accuracy and scalability using contrastive learning, efficient product quantization, and adaptive loss functions to tolerate sketch noise and ambiguity.
Scene-Level Sketch-Based Image Retrieval (SBIR) comprises methods and systems that retrieve natural or semantically complex images matching the content, composition, and spatial arrangement depicted in a free-hand sketch. Unlike traditional SBIR focused on isolated objects, scene-level SBIR targets holistic scenes, requiring robust handling of sketch abstraction, noise, partial depiction, cross-modal semantic gaps, and complex spatial relationships. The field combines advances in cross-modal deep metric learning, spatial encoding, set-to-set matching, and vector retrieval at scale.
1. Problem Definition and Scene-Level Challenges
Scene-Level SBIR involves retrieving natural images from large galleries using holistic, hand-drawn sketches containing multiple objects and meaningful spatial configurations. This scenario introduces several critical challenges:
- Ambiguity and Abstraction: Free-hand scene sketches are inherently noisy and ambiguous, with significant variability in stroke detail, object abstraction, and user interpretation. Human drawings often omit objects, distort spatial relations, or introduce semantic mismatches with their photographic counterparts. Pilot studies on datasets such as SketchyCOCO report that scene sketches cover on average only 13% of photo area in foreground strokes and represent just 49.7% of ground-truth objects present in the corresponding image (Chowdhury et al., 2022).
- Partialness: Two forms are prominent:
- Local partialness: Large portions of the canvas are left blank; users may draw only a subset of objects.
- Holistic partialness: Users selectively omit entire objects, leading to sketches that only partially represent the scene content.
- Spatial and Compositional Complexity: Unlike single-object sketches, scene-level queries demand models that can capture and align both the appearance of multiple entities and their relative spatial positions (Black et al., 2021).
- Semantic Gap and Retrieval Ambiguity: Even at the ground truth level, the mapping from sketch to intended image is subject to high inter-annotator variability. Empirical user studies find humans sometimes prefer retrieved images over the “ground-truth” image, underlining the subjective ambiguity of the sketch-to-photo match (Demić et al., 8 Sep 2025).
These complexities motivate frameworks that robustly model abstraction, partialness, and spatial composition for cross-modal retrieval.
2. Methods for Scene-Level Sketch Encoding and Cross-Modal Alignment
Direct Pixelwise and Semantic Abstraction
- Pixel-to-Concept Aggregation: “Query by Semantic Sketch” encodes user-sketched concept maps into compact vectors by partitioning the canvas into a regular grid, assigning each cell its majority semantic label, and representing these labels in a continuous semantic embedding space (Word2vec + t-SNE) (Rossetto et al., 2019). This enables retrieval based not just on pixel similarity, but on mapped semantic and spatial relationships; a minimal sketch of this aggregation follows this list.
- Compositional Appearance and Layout: “Compositional Sketch Search” decomposes multi-object sketches into object crops, encodes their appearance via a CNN, reinserts object features into a canvas, and aggregates spatial relationships into a unified tensor, further processed by a small spatial encoder (Black et al., 2021). Metric learning integrates both object recognition and scene layout.
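A minimal sketch of the pixel-to-concept aggregation idea above, assuming a rasterized concept map whose pixels hold integer class labels and a pre-computed word-embedding lookup (the `label_embeddings` dict, grid size, and embedding dimensionality here are illustrative, not the configuration of Rossetto et al., 2019):

```python
import numpy as np

def aggregate_concept_map(concept_map, label_embeddings, grid=8):
    """Partition an (H, W) integer concept map into a grid x grid layout,
    assign each cell its majority label, and concatenate the labels'
    semantic embeddings into a single compact descriptor."""
    h, w = concept_map.shape
    cell_vectors = []
    for gy in range(grid):
        for gx in range(grid):
            cell = concept_map[gy * h // grid:(gy + 1) * h // grid,
                               gx * w // grid:(gx + 1) * w // grid]
            labels, counts = np.unique(cell, return_counts=True)
            majority = int(labels[np.argmax(counts)])      # majority semantic label
            cell_vectors.append(label_embeddings[majority])
    # Concatenation preserves the grid layout, so spatial position is encoded
    # implicitly by the index of each cell's embedding.
    return np.concatenate(cell_vectors)

# Usage with a toy 3-class map and random 50-d "word vectors":
# cmap = np.random.randint(0, 3, size=(256, 256))
# emb = {c: np.random.rand(50) for c in range(3)}
# vec = aggregate_concept_map(cmap, emb)   # shape: (8 * 8 * 50,)
```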
Region and Set-Based Matching
- Optimal Transport for Partialness: To address partial sketches, a set-based approach formulates cross-modal region associativity as a classic optimal transport (OT) problem. Sketches and photos are represented as sets of local CNN-derived region features $\{s_i\}$ and $\{p_j\}$, and the minimum cost of aligning sketch regions to photo regions, $d_{\mathrm{OT}} = \min_{T \geq 0} \sum_{i,j} T_{ij} C_{ij}$ with transport plan $T$ and pairwise costs $C_{ij}$, gives a flexible, permutation-invariant matching score robust to missing content (Chowdhury et al., 2022). A minimal numerical sketch of this matching appears after this list.
- Adjacency Matrices for Holistic Layout: Intra-modal self-similarity matrices capture the spatial structure within each modality. A cross-modal weighted comparison of these adjacency matrices, masking out unaligned regions, enforces global scene layout consistency without reliance on explicit detection or grid-box proposals.
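A minimal sketch of set-based region matching under these assumptions: region features are already extracted and L2-normalized, the cost is one minus cosine similarity, and an entropically regularized Sinkhorn iteration stands in for the exact QP solver used by Chowdhury et al. (2022):

```python
import torch

def ot_matching_distance(sketch_feats, photo_feats, eps=0.1, iters=50):
    """Approximate optimal-transport distance between two region sets.

    sketch_feats: (n, d) L2-normalized region features from the sketch.
    photo_feats:  (m, d) L2-normalized region features from the photo.
    Returns the total transport cost under uniform region masses.
    """
    cost = 1.0 - sketch_feats @ photo_feats.T          # (n, m) cosine cost
    n, m = cost.shape
    a = torch.full((n,), 1.0 / n)                      # uniform sketch mass
    b = torch.full((m,), 1.0 / m)                      # uniform photo mass
    K = torch.exp(-cost / eps)                         # Gibbs kernel
    u = torch.ones_like(a)
    for _ in range(iters):                             # Sinkhorn updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    T = u[:, None] * K * v[None, :]                    # soft transport plan
    return (T * cost).sum()
```

Because the plan distributes mass rather than forcing one-to-one assignments, blank sketch regions contribute little cost, which is what makes the score tolerant to partial sketches.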
Siamese and Shared-Encoder Architectures
- Unified Feature Spaces: State-of-the-art approaches eschew separate encoders for images and sketches in favor of a shared-encoder Siamese CNN backbone (e.g., ConvNeXt-Base), enabling true cross-modal embedding and alignment (Demić et al., 8 Sep 2025). This design avoids learning modality-specific “tricks” that are detrimental to robust semantic alignment.
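A minimal PyTorch sketch of the shared-encoder idea, assuming an arbitrary image backbone (the `backbone` argument and embedding sizes are placeholders, not necessarily the ConvNeXt-Base configuration of Demić et al.): both modalities pass through the same weights and are projected into one L2-normalized embedding space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEncoder(nn.Module):
    """One backbone and one projection head, shared by sketches and photos."""

    def __init__(self, backbone: nn.Module, feat_dim: int, embed_dim: int = 512):
        super().__init__()
        self.backbone = backbone                  # e.g. a ConvNeXt trunk
        self.proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.proj(self.backbone(x))           # same weights for both modalities
        return F.normalize(z, dim=-1)             # cosine-ready embeddings

# Usage: embed a sketch batch and a photo batch with the *same* encoder.
# encoder = SharedEncoder(backbone=my_trunk, feat_dim=1024)
# sketch_emb, photo_emb = encoder(sketches), encoder(photos)
# similarity = sketch_emb @ photo_emb.T
```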
3. Training Objectives, Loss Formulations, and Optimization
Contrastive Learning with Soft Supervision
- De-biased Contrastive Loss (“ICon”): Standard triplet or InfoNCE objectives can be brittle due to over-reliance on sparse, hard negatives. Instead, a KL-divergence–based batchwise loss is adopted, where the supervision target interpolates between a one-hot and a uniform distribution, with a mixing parameter $\lambda$ down-weighting accidental hard negatives:
$$\mathcal{L}_{\mathrm{ICon}} = \mathrm{KL}\!\left(q \,\|\, p\right), \qquad q_j = (1-\lambda)\,\mathbf{1}[j = j^{+}] + \frac{\lambda}{N},$$
with $p$ a temperature-scaled softmax over the cosine similarities of the $N$ in-batch sketch–photo pairs and $j^{+}$ the index of the true match (Demić et al., 8 Sep 2025). This provides built-in tolerance to sketch noise and intra-batch ambiguity.
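A minimal PyTorch sketch of such a soft-target batchwise loss under the assumptions above (the exact formulation, temperature, and λ schedule of the ICon loss may differ): each sketch's target distribution over the batch of photos mixes a one-hot indicator with a uniform component.

```python
import torch
import torch.nn.functional as F

def soft_contrastive_loss(sketch_emb, photo_emb, lam=0.1, tau=0.07):
    """KL(q || p) between a softened target q and the similarity softmax p.

    sketch_emb, photo_emb: (N, d) L2-normalized embeddings of paired batches,
    where row i of each tensor corresponds to the same scene.
    """
    n = sketch_emb.size(0)
    logits = sketch_emb @ photo_emb.T / tau            # (N, N) cosine / temperature
    log_p = F.log_softmax(logits, dim=1)
    # Target: (1 - lam) on the true pair, lam spread uniformly over the batch.
    q = torch.full((n, n), lam / n, device=logits.device)
    q += (1.0 - lam) * torch.eye(n, device=logits.device)
    return F.kl_div(log_p, q, reduction="batchmean")
```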
Set-based and Region-Aligned Metric Losses
- Region Matching (OT) Loss: The cross-modal region matching distance $d_{\mathrm{OT}}$, solved via a differentiable QP solver (QPTH), is integrated into a triplet-style margin loss for local structure alignment, $\mathcal{L}_{\mathrm{OT}} = \max\left(0,\; m + d_{\mathrm{OT}}(s, p^{+}) - d_{\mathrm{OT}}(s, p^{-})\right)$; global scene structure is regularized with an analogously margined loss over the cross-modal adjacency-matrix distance (Chowdhury et al., 2022).
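A minimal sketch of such a margin loss over the OT distance, reusing the `ot_matching_distance` helper sketched in Section 2 and an illustrative margin; the adjacency regularizer is analogous but compares intra-modal self-similarity matrices.

```python
import torch

def ot_triplet_loss(sketch_regions, pos_regions, neg_regions, margin=0.3):
    """Pull the matching photo's regions closer (in OT distance) than a non-match.

    All inputs are (num_regions, d) L2-normalized region feature sets; the
    OT distance is computed by the Sinkhorn-based helper defined earlier.
    """
    d_pos = ot_matching_distance(sketch_regions, pos_regions)
    d_neg = ot_matching_distance(sketch_regions, neg_regions)
    return torch.clamp(margin + d_pos - d_neg, min=0.0)
```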
Auxiliary and Regularization Terms
- Classification and Similarity Losses: Auxiliary object-class classification and similarity losses (cosine-based) are incorporated to stabilize metric learning in deep scene representation pipelines (Black et al., 2021).
4. Experimental Protocols, Evaluation Metrics, and Benchmark Results
Datasets
| Dataset | Characteristics | Size |
|---|---|---|
| FS-COCO | 10,000 human-drawn scene sketches, paired with COCO images | 2 splits (normal / unseen) |
| SketchyCOCO | 14,081 synthetic scene sketches, multi-object composition | 1,225 train |
| SketchyScene | Scene-level, avg. 16 instances per scene | 2,472 train / 252 test |
| QMUL-Shoe-V2 | Object-level for generalization, with partial masking | ~2,000 pairs |
| OI-Test-LQ/SQ | Synthesized multi-object queries from OpenImages | >11,000 |
| Stock4.5M | 4.5M stock photos, compositional sketch queries | Large-scale |
Metrics
- Recall@K (R@1, R@5, R@10): Percentage of queries for which the ground-truth image appears among the top K retrieved results; a minimal computation is sketched after this list.
- mAP@200, NDCG@200, Precision@20: Used for large-scale and multi-instance benchmarks; relevance defined by object-level semantic and spatial overlap (see (Black et al., 2021)).
- Acc.@K: Top-K accuracy, especially for evaluation under progressive object masking (partialness protocol).
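A minimal sketch of Recall@K under the single-ground-truth protocol, assuming one relevant image per query and a precomputed sketch-to-photo similarity matrix; names and shapes are illustrative.

```python
import numpy as np

def recall_at_k(similarity, gt_index, k):
    """similarity: (num_queries, num_gallery) sketch-to-photo scores.
    gt_index:   (num_queries,) index of each query's ground-truth image.
    Returns the fraction of queries whose ground truth ranks in the top k."""
    # Rank gallery images by descending similarity for every query.
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    hits = (top_k == gt_index[:, None]).any(axis=1)
    return float(hits.mean())

# Example: R@1, R@5, R@10 over a toy 3-query, 100-image gallery.
# sim = np.random.rand(3, 100); gt = np.array([5, 17, 42])
# print([recall_at_k(sim, gt, k) for k in (1, 5, 10)])
```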
Representative Results
| Method | FS-COCO R@1 (Normal / Unseen) | SketchyCOCO R@1 | SketchyScene Acc.@1 (pₘ=0/0.5) | SketchyCOCO Acc.@1 (pₘ=0/0.5) |
|---|---|---|---|---|
| Siam-VGG | 23.3 / 10.6 | 37.6 | 4.5% | 6.2% / <0.1% |
| SceneTrilogy | 24.1 / — | 38.2 | — | — |
| Partially Does It (OT) | — | 34.5 | 35.7% / 10.6% | 34.5% / 19.2% |
| SceneSketcherV2 | — | 68.1 | — | — |
| ICon+ConvNeXt (SOTA) | 61.9 / 60.0 | 70.0 | — | — |
Scene-level retrieval accuracy increases substantially with robust cross-modal pre-training (e.g., CLIP), region- and set-aware losses, and architectures expressly designed for tolerance to cross-modal abstraction (Chowdhury et al., 2022; Demić et al., 8 Sep 2025).
5. Robustness to Abstraction and Partialness
Existing scene-level SBIR systems degrade rapidly under increasing sketch partialness. When masking 50% of objects in scene sketches, baseline Acc.@1 collapses to near zero for triplet-based methods, but set-based OT and adjacency alignment methods retain higher accuracy (e.g., Acc.@1 of 19.2% on SketchyCOCO at pₘ = 0.5) (Chowdhury et al., 2022). The use of soft batchwise supervision (ICon loss) further improves tolerance to ambiguities arising from user abstraction and missing content (Demić et al., 8 Sep 2025). Quantitative studies and human evaluations confirm that a portion of “failure” cases are due to subjective ground-truth assignment rather than representational inadequacy.
6. System Scalability, Efficiency, and Practical Considerations
Embedding Compression and Efficient Retrieval
- Product Quantization (PQ): To enable million-scale retrieval, high-dimensional CNN scene embeddings are compressed via two-stage PQ with a learned rotation (Optimized Product Quantization, OPQ). Database vectors are stored as 16-byte codes, supporting sub-linear retrieval latency (Black et al., 2021); a minimal indexing sketch follows this list.
- Storage vs. Retrieval Trade-offs: Semantic-sketch feature vectors can be tuned for dimensionality and word-embedding granularity, allowing control of storage requirements and retrieval fidelity (reported as low as 4.2% of the full bit budget in (Rossetto et al., 2019)).
- Runtime Performance: Linear scan over 1M+ keyframes achieves ~1 second per query on standard CPU hardware; PQ, HNSW, and related ANN structures are supported for real-world deployments (Rossetto et al., 2019).
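A minimal indexing sketch along these lines, assuming the faiss library; the dimensionality, number of sub-quantizers (16 one-byte codes per vector), and random data are illustrative rather than the exact configuration of Black et al. (2021).

```python
import faiss
import numpy as np

d = 256                                                  # embedding dim (illustrative)
train = np.random.rand(50_000, d).astype("float32")      # stand-in training embeddings
gallery = np.random.rand(100_000, d).astype("float32")   # scale to millions in practice

# OPQ rotation followed by product quantization: 16 sub-quantizers x 8 bits
# yields 16-byte codes per database vector.
opq = faiss.OPQMatrix(d, 16)
pq = faiss.IndexPQ(d, 16, 8)
index = faiss.IndexPreTransform(opq, pq)

index.train(train)                         # learn rotation + codebooks offline
index.add(gallery)                         # compress and store the gallery

query = np.random.rand(1, d).astype("float32")           # a sketch embedding
distances, ids = index.search(query, 10)   # approximate top-10 retrieval
```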
Architectural Simplicity
Recent findings indicate that model complexity (e.g., multi-branch pipelines, object detectors) is often unnecessary: with robust pre-training, shared-encoder design, and appropriately “softened” contrastive losses, SOTA performance is achieved without architectural augmentation (Demić et al., 8 Sep 2025). Data augmentation, batch mining, and inference optimizations (AMP, OpenCV OT solver) further enhance scalability and efficiency.
7. Open Problems and Future Directions
Despite advances, several challenges persist:
- Extreme Partialness: All existing methods degrade as sketch completeness drops below 30%; better object proposal or segmentation is needed to recover context from severely sparse sketches (Chowdhury et al., 2022).
- Semantic and Layout Ambiguity: Ground truth for scene-level SBIR is inherently ambiguous; alternative evaluation regimes that include human preference and non-unique answers are advocated (Demić et al., 8 Sep 2025).
- Spatial Reasoning Beyond Uniform Grids: Uniform region grids may not map well to hand-drawn object boundaries; learned objectness proposals and graph-based representations are proposed as future enhancements (Chowdhury et al., 2022).
- Exploiting Multimodal Inputs: Integration of vision-language models (e.g., CLIP) and the combination of sketch with text input (as in “A Sketch Is Worth a Thousand Words” (Sangkloy et al., 2022)) is an emerging direction expected to yield further gains.
- Efficient Differentiable OT: Replacing classical QP solvers with GPU-friendly Sinkhorn distance may enable scalable end-to-end learning for large region-sets and image archives (Chowdhury et al., 2022).
- Dataset Expansion and Realism: The collection of larger, more diverse benchmarks with multiple annotations per sketch, richer preference modeling, and systematic coverage of scene layout variation remains a priority (Demić et al., 8 Sep 2025).
A plausible implication is that advances in set-based metric learning, region-aware cross-modal alignment, and semantically grounded representations will drive the next generation of scene-level SBIR systems, significantly broadening the applicability of sketch-based retrieval for complex real-world scenarios.