Sketch and Text Synergy: Fusing Structural Contours and Descriptive Attributes for Fine-Grained Image Retrieval

Published 17 Apr 2026 in cs.CV and cs.AI | (2604.15735v1)

Abstract: Fine-grained image retrieval via hand-drawn sketches or textual descriptions remains a critical challenge due to inherent modality gaps. While hand-drawn sketches capture complex structural contours, they lack color and texture, which text effectively provides despite omitting spatial contours. Motivated by the complementary nature of these modalities, we propose the Sketch and Text Based Image Retrieval (STBIR) framework. By synergizing the rich color and texture cues from text with the structural outlines provided by sketches, STBIR achieves superior fine-grained retrieval performance. First, a curriculum learning driven robustness enhancement module is proposed to enhance the model's robustness when handling queries of varying quality. Second, we introduce a category-knowledge-based feature space optimization module, thereby significantly boosting the model's representational power. Finally, we design a multi-stage cross-modal feature alignment mechanism to effectively mitigate the challenges of cross modal feature alignment. Furthermore, we curate the fine-grained STBIR benchmark dataset to rigorously validate the efficacy of our proposed framework and to provide data support as a reference for subsequent related research. Extensive experiments demonstrate that the proposed STBIR framework significantly outperforms state of the art methods.


Authors (8)

Summary

The paper presents a tri-modal framework that fuses hand-drawn sketches with descriptive text to enhance fine-grained image retrieval.
Its methodology uses ResNet and CLIP-based encoders with multi-stage optimization and curriculum learning to address cross-modal disparities.
Experimental results on the STBIR dataset demonstrate state-of-the-art retrieval performance across varied object categories.

Fusing Structural and Semantic Modalities for Fine-Grained Image Retrieval

Introduction

The paper "Sketch and Text Synergy: Fusing Structural Contours and Descriptive Attributes for Fine-Grained Image Retrieval" (2604.15735) introduces a unified tri-modal image retrieval framework that synthesizes hand-drawn sketch contours and descriptive natural language attributes for fine-grained instance-level retrieval. This approach addresses critical limitations in single-modal queries: the inability of sketches to encode chromatic and textural information, and the inadequacy of text in capturing geometric and spatial structure. The work is distinguished by both substantive methodological innovations and the introduction of a dedicated, large-scale tri-modal dataset for the task.

Background and Challenges

While prior research on fine-grained sketch-based image retrieval (FG-SBIR) and fine-grained text-based image retrieval (FG-TBIR) has demonstrated modality-specific strengths, both paradigms are inherently limited by what a single input modality can express. Recent attempts at fine-grained sketch-and-text-based image retrieval (FG-STBIR) have largely failed to bridge distribution mismatches between modalities, resulting in unstable joint training dynamics and suboptimal cross-modal alignment. Furthermore, available datasets are exceedingly small, rely on pseudo or template-based sketches, or lack strict instance-level tri-modal pairing, all of which impede rigorous benchmarking.

Dataset Contributions

This paper introduces the STBIR dataset, partitioned into STBIR-S (shoes), STBIR-C (chairs), and STBIR-D (diverse daily objects). Each subset consists of tightly aligned triplets: an image, a human-drawn sketch, and an LLM-generated textual description. Text annotations are schema-guided and human-verified to achieve attribute-level fidelity. STBIR-S and STBIR-C are designed for controlled, single-category retrieval, whereas STBIR-D tests retrieval across a large and diverse set of object classes. The STBIR collection is notable for genuine, instance-aligned sketches rather than synthetic or category-level surrogates, providing a high-quality tri-modal resource not seen in previous works.

Framework and Methodology

Modular Architecture

The STBIR framework is structured to directly confront the key limitations of prior approaches:

Sketch Feature Encoding: A ResNet-based encoder extracts global geometric features from sketches, capitalizing on proven convolutional architectures for visual abstraction.
Text and Image Encoding: Both text and image data are projected using CLIP, leveraging its large-scale pretraining for robust cross-modal mapping.
Fusion and Retrieval: Sketch and text features are fused via element-wise addition. Retrieval proceeds by computing the cosine similarity between the fused query embedding and each image embedding in the gallery, then ranking the gallery by that score.
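The fusion and retrieval step can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names (fuse_query, rank_gallery) and the toy 3-dimensional features are hypothetical, and the summary does not say whether features are normalized before the addition, so here normalization happens only inside the cosine similarity.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def fuse_query(sketch_feat, text_feat):
    """Fuse sketch and text features by element-wise addition."""
    if len(sketch_feat) != len(text_feat):
        raise ValueError("sketch and text features must have the same dimension")
    return [s + t for s, t in zip(sketch_feat, text_feat)]

def cosine_similarity(a, b):
    """Cosine similarity: dot product of the two unit-normalized vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))

def rank_gallery(sketch_feat, text_feat, gallery):
    """Return gallery indices ordered from most to least similar to the fused query."""
    query = fuse_query(sketch_feat, text_feat)
    scores = [cosine_similarity(query, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)

# Toy features, made up purely for illustration.
sketch = [1.0, 0.0, 0.0]
text = [0.0, 1.0, 0.0]
gallery = [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
ranking = rank_gallery(sketch, text, gallery)
```

In this toy case the gallery item that matches both the sketch and the text direction ranks first, ahead of the item that matches the sketch alone.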

Curriculum-Informed Robustness

The Curriculum Learning Driven Robustness Enhancement (CLDRE) module incrementally increases feature-space noise during training. This simulates degraded sketch and/or text input, thereby regularizing the framework for robustness against realistic low-quality queries.
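One way to picture such a curriculum is a noise scale that grows with the training epoch. The sketch below is only an assumption-laden illustration: the summary does not specify the noise distribution or the schedule, so zero-mean Gaussian noise, a linear ramp, and the names noise_std, perturb, and max_std are all hypothetical choices.

```python
import random

def noise_std(epoch, total_epochs, max_std=0.1):
    """Linearly ramp the noise scale from 0 (first epoch) to max_std (last epoch)."""
    if total_epochs <= 1:
        return max_std
    progress = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return max_std * progress

def perturb(features, std, rng):
    """Add zero-mean Gaussian noise with the given scale to every feature component."""
    return [x + rng.gauss(0.0, std) for x in features]

rng = random.Random(0)           # fixed seed so the example is reproducible
feat = [0.5, -0.2, 0.1]          # stand-in for a sketch or text feature vector
schedule = []
for epoch in range(5):
    std = noise_std(epoch, total_epochs=5)
    noisy = perturb(feat, std, rng)  # early epochs: nearly clean; later epochs: noisier
    schedule.append(std)
```

Starting from clean features and raising the perturbation gradually lets the model first learn the easy alignment before it is asked to tolerate degraded queries.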

Category-Knowledge Feature Space Optimization

In addition to standard contrastive and triplet losses, the Category-Knowledge-Based Feature Space Optimization (CKFSO) head employs an angular margin-based loss, enforcing intra-class compactness and inter-class separation conditioned on category priors. This is critical for boosting discriminative power, a core requirement in fine-grained settings where intra-class variance can be minimal.
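The summary does not give the exact form of the angular margin loss. As a hedged illustration, the sketch below uses the widely known additive angular margin (ArcFace-style) formulation for a single sample; the choice of formulation, the per-category weight vectors class_weights, and the scale and margin values are all assumptions rather than details from the paper.

```python
import math

def angular_margin_loss(feature, class_weights, label, scale=30.0, margin=0.5):
    """Additive angular margin softmax loss for one sample (ArcFace-style sketch)."""
    def unit(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    f = unit(feature)
    logits = []
    for k, w in enumerate(class_weights):
        cos_theta = sum(a * b for a, b in zip(f, unit(w)))
        cos_theta = max(-1.0, min(1.0, cos_theta))  # guard acos against rounding error
        theta = math.acos(cos_theta)
        if k == label:
            # Widen the angle to the true category, so the feature must sit closer
            # to its own category vector. Clamping at pi is a simplification.
            theta = min(theta + margin, math.pi)
        logits.append(scale * math.cos(theta))
    # Numerically stable cross-entropy over the scaled cosine logits.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

# Toy example: a feature equidistant from two category vectors.
weights = [[1.0, 0.0], [0.0, 1.0]]
loss_plain = angular_margin_loss([1.0, 1.0], weights, label=0, margin=0.0)
loss_margin = angular_margin_loss([1.0, 1.0], weights, label=0, margin=0.5)
```

With the margin enabled, the same feature incurs a larger loss, which is what pushes features of one category together and away from other categories during training.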

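Multi-Stage Cross-Modal Feature Alignment

The Multi-stage Cross-modal Feature Alignment (MCFA) mechanism is a staged optimization protocol that counteracts parameter divergence arising from cross-modal gradient and distributional imbalance. It trains one branch at a time, in three stages: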

Sketch Feature Mapping: Image and text encoders are frozen; the sketch encoder learns to map into the pretrained CLIP space.
Image Feature Refinement: Sketch and text branches are frozen; the image encoder is fine-tuned to preserve structural correspondence.
Textual Representation Integration: Only the text branch is trainable, enforcing fine-grained attribute alignment within the fused embedding space.

This sequenced optimization outperforms synchronous, monolithic training, as demonstrated by ablation.
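The three stages above amount to a schedule of which branch is trainable at each step. The following sketch encodes only that schedule; the stage identifiers, the trainable_flags helper, and the train_stage callback are hypothetical names, and whether the fusion layer or loss heads are also updated in each stage is not stated in the summary.

```python
# Stage order and the single trainable branch per stage, as described above.
STAGES = [
    ("sketch_feature_mapping", {"sketch"}),
    ("image_feature_refinement", {"image"}),
    ("textual_representation_integration", {"text"}),
]
BRANCHES = ("sketch", "image", "text")

def trainable_flags(stage_index):
    """Map each branch to True (trainable) or False (frozen) for the given stage."""
    _, trainable = STAGES[stage_index]
    return {branch: branch in trainable for branch in BRANCHES}

def run_schedule(train_stage):
    """Call train_stage(name, flags) once per stage, in order."""
    for i, (name, _) in enumerate(STAGES):
        train_stage(name, trainable_flags(i))

# Example: record the schedule instead of actually training.
log = []
run_schedule(lambda name, flags: log.append((name, flags)))
```

In a PyTorch setting, each flag would typically be applied by calling requires_grad_(flag) on the corresponding encoder module before building that stage's optimizer.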

Experimental Evaluation

Benchmarks and Baselines

Experiments span all three STBIR subsets. Baselines cover both unimodal and multimodal retrieval networks, including CLIP, DINO, SEARLE, TASKformer, and Pic2Word. Performance is reported as Recall@K (R@K) for K = 1, 5, and 10, the standard metric for retrieval accuracy: the percentage of queries whose target image appears among the top K results.
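For reference, Recall@K can be computed as below. This is a generic sketch that assumes each query has exactly one ground-truth gallery image, as is usual in instance-level retrieval; the function name and the toy rankings are illustrative and not taken from the paper.

```python
def recall_at_k(rankings, targets, k):
    """Percentage of queries whose target index appears in the top-k ranked results."""
    if len(rankings) != len(targets):
        raise ValueError("need exactly one target per query")
    if not rankings:
        raise ValueError("need at least one query")
    hits = sum(1 for ranked, target in zip(rankings, targets) if target in ranked[:k])
    return 100.0 * hits / len(rankings)

# Three toy queries over a three-image gallery (indices are made up).
rankings = [[2, 0, 1], [1, 2, 0], [0, 1, 2]]  # gallery indices, best match first
targets = [2, 0, 1]                            # ground-truth index for each query

r1 = recall_at_k(rankings, targets, 1)  # only the first query hits at rank 1
r2 = recall_at_k(rankings, targets, 2)  # the third query's target is at rank 2
r3 = recall_at_k(rankings, targets, 3)  # every target is within the top 3
```

Because the top-K list only grows with K, R@K can never decrease as K increases, which is why R@10 is always at least as high as R@1.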

Results

Superior R@1/R@5: On STBIR-S (shoes), STBIR achieves an R@1 of 51.80, narrowly surpassing the strongest baseline (Pic2Word, 51.72). On STBIR-C (chairs), STBIR reaches an R@1 of 57.88, exceeding Pic2Word (53.38) by 4.50 points. On STBIR-D, the most challenging subset, STBIR attains an R@1 of 62.85 and an R@5 of 93.44, both the best reported; only its R@10 falls marginally short of SEARLE.
Ablation: Removing either query modality from the fusion, or disabling the multi-stage alignment (MCFA), CKFSO, or CLDRE, leads to clear performance degradation. Aligning the sketch branch first (before the image and text branches) empirically outperforms alternative update schedules.
Qualitative Analysis: Visualizations show that the framework retrieves targets with high contour, color, and texture fidelity relative to the compound sketch+text queries; failure cases often stem from ambiguity in the input modalities, not from network misalignment.

Implications

Practical

This work facilitates practical, user-oriented image retrieval workflows that can accommodate noisy, ambiguous, or incomplete queries built from freehand sketches and lightweight textual descriptions. Such queries are typical in creative, design, e-commerce, and forensic domains.

Theoretical

The multi-stage alignment mechanism offers a principled paradigm for tackling gradient and feature space imbalance in tri-modal joint training, a challenge recurrent in multi-modal AI but under-addressed in prior literature. Furthermore, the formulation harmonizing curriculum-driven robustness, margin-based discriminability, and staged alignment is easily extensible to other compositional retrieval tasks.

Future Directions

Future research can focus on learning more granular alignments between input sketches/texts and image subregions, incorporating localized or deformable attention. Further innovations are needed to bridge the input ambiguity gap, potentially by leveraging richer, interactive, or dialog-based annotation at query time.

Conclusion

This paper makes significant advances in tri-modal fine-grained image retrieval, systematically bridging geometric and semantic gaps by synergizing hand-drawn sketches and descriptive text attributes. The proposed STBIR dataset sets a new standard for benchmarking, and the multi-stage alignment framework demonstrates robust, state-of-the-art retrieval performance. The methodological and dataset contributions are expected to catalyze further research in robust, cross-modal visual understanding and retrieval (2604.15735).
