- The paper pairs a Siamese encoder with a contrastive training objective designed to handle the inherent ambiguity of free-hand sketches.
- The loss minimizes the KL divergence between a learned and a supervised similarity distribution, doubling FS-COCO retrieval performance with an R@1 of 61.9%.
- Leveraging pre-trained backbones such as ConvNeXt, the method improves cross-modal alignment and sets a new baseline for scene-level SBIR.
Rethinking Scene-Level Sketch-Based Image Retrieval
Introduction
The paper "Back To The Drawing Board: Rethinking Scene-Level Sketch-Based Image Retrieval" (2509.06566) addresses the challenge of Scene-Level Sketch-Based Image Retrieval (SBIR), which involves retrieving natural images that match the semantic and spatial layout of a free-hand sketch. Unlike previous approaches that concentrated on architectural enhancements of retrieval models, this work emphasizes the inherent ambiguity and noise in sketches. By focusing on a robust training objective that accommodates sketch variability, the authors demonstrate that state-of-the-art performance can be achieved without additional complexity.
Methodology
Model Design
The authors propose a Siamese encoder that processes both images and sketches with shared weights. During training, embeddings are matched at the batch level by aligning the resulting similarity matrix with a target matrix; at inference, sketches and images are ranked by cosine distance. The approach builds on pre-trained vision backbones such as ConvNeXt, chosen for their convolutional inductive bias towards local features, which is crucial for capturing scene-level layouts.
Figure 1: Overview of the proposed method. Our encoder model is trained in a Siamese manner, accepting both images and sketches.
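To make the setup concrete, here is a minimal sketch of such a weight-shared encoder, assuming a timm-style ConvNeXt backbone; the class and variable names are illustrative and not taken from the authors' code.

```python
# Minimal sketch of a weight-shared (Siamese) encoder for scene-level SBIR.
# Assumes a timm ConvNeXt backbone; names are illustrative, not the paper's code.
import timm
import torch
import torch.nn.functional as F


class SiameseEncoder(torch.nn.Module):
    def __init__(self, backbone: str = "convnext_base", embed_dim: int = 512):
        super().__init__()
        # num_classes=0 returns pooled features instead of classification logits.
        self.backbone = timm.create_model(backbone, pretrained=True, num_classes=0)
        self.proj = torch.nn.Linear(self.backbone.num_features, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The same weights encode both sketches and images.
        feats = self.backbone(x)
        return F.normalize(self.proj(feats), dim=-1)


encoder = SiameseEncoder()
sketches = torch.randn(4, 3, 224, 224)   # sketch batch, rendered as RGB
images = torch.randn(4, 3, 224, 224)     # paired image batch

z_s, z_i = encoder(sketches), encoder(images)
# Batch-level similarity matrix; inference ranks gallery images by the same cosine scores.
sim = z_s @ z_i.T
```

Sharing one encoder for both modalities keeps the model compact and lets the pre-trained backbone transfer its visual priors to sketches directly.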
Training Loss
The training process employs a contrastive learning setup that minimizes the Kullback-Leibler (KL) divergence between a learned and a supervised distribution, explicitly accounting for sketch ambiguity. Framed within the Information Contrastive (I-Con) framework, this objective addresses the limitations of the triplet loss and makes retrieval more robust to noisy and ambiguous sketch queries.
Experimental Evaluation
The methodology was evaluated on the challenging FS-COCO and SketchyCOCO datasets. The proposed approach significantly outperformed existing methods, doubling retrieval performance on FS-COCO with an R@1 of 61.9%, a result that also held on the unseen split, which better reflects real-world deployment.
Figure 2: Qualitative results on the test split of the FS-COCO dataset.
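For reference, retrieval metrics such as R@1 are typically computed by ranking gallery images per sketch query. The following minimal sketch assumes L2-normalized embeddings in which row i of each tensor forms a ground-truth pair; the paper's exact evaluation protocol (gallery construction, splits) may differ.

```python
import torch


def recall_at_k(sketch_emb: torch.Tensor,
                image_emb: torch.Tensor,
                ks=(1, 5, 10)) -> dict:
    """Standard retrieval Recall@K, assuming row i of each tensor is a matched pair.

    Embeddings are expected to be L2-normalized, so the dot product equals cosine similarity.
    """
    sim = sketch_emb @ image_emb.T                  # (N, N) sketch-to-image scores
    ranking = sim.argsort(dim=-1, descending=True)  # ranked gallery indices per sketch
    targets = torch.arange(sim.size(0)).unsqueeze(1)
    # Position of the ground-truth image within each ranked list.
    ranks = (ranking == targets).nonzero()[:, 1]
    return {f"R@{k}": (ranks < k).float().mean().item() for k in ks}
```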
Discussion
The advantages of the proposed method lie in its efficiency, its use of pre-trained models for better cross-modal alignment, and its robustness to sketch variability. While the model focuses on sketch-to-image retrieval, it provides a foundation for future work that could incorporate richer multi-modal narratives and handle partial or incomplete sketches. However, the results also expose limitations of current datasets with respect to semantic and scene-level ambiguity, underscoring the need for better benchmarks.
Conclusion
This paper presents a straightforward yet effective framework for scene-level SBIR, achieving substantial improvements over existing state-of-the-art methods. By focusing on pre-training, architectural choices, and a training loss that accommodates sketch ambiguity, the work sets a strong baseline for future cross-modal retrieval research and encourages dataset construction that better serves realistic and diverse scene retrieval tasks.