- The paper presents a tri-modal framework that fuses hand-drawn sketches with descriptive text to enhance fine-grained image retrieval.
- Its methodology uses ResNet and CLIP-based encoders with multi-stage optimization and curriculum learning to address cross-modal disparities.
- Experimental results on the STBIR dataset demonstrate state-of-the-art retrieval performance across varied object categories.
Fusing Structural and Semantic Modalities for Fine-Grained Image Retrieval
Introduction
The paper "Sketch and Text Synergy: Fusing Structural Contours and Descriptive Attributes for Fine-Grained Image Retrieval" (2604.15735) introduces a unified tri-modal image retrieval framework that synthesizes hand-drawn sketch contours and descriptive natural language attributes for fine-grained instance-level retrieval. This approach addresses critical limitations in single-modal queries: the inability of sketches to encode chromatic and textural information, and the inadequacy of text in capturing geometric and spatial structure. The work is distinguished by both substantive methodological innovations and the introduction of a dedicated, large-scale tri-modal dataset for the task.
Background and Challenges
While prior research on fine-grained sketch-based image retrieval (FG-SBIR) and fine-grained text-based image retrieval (FG-TBIR) has demonstrated modality-specific strengths, both paradigms are inherently limited by the semantic gap of their single input representations. Recent attempts targeting FG-STBIR (Sketch and Text Based Image Retrieval) have largely failed to bridge distribution mismatches between modalities, resulting in unstable joint training dynamics and suboptimal cross-modal alignment. Furthermore, available datasets are either exceedingly small, rely on pseudo or template-based sketches, or lack strict instance-level tri-modal pairing, impeding rigorous benchmarking.
Dataset Contributions
This paper introduces the STBIR dataset, partitioned into STBIR-S (shoes), STBIR-C (chairs), and STBIR-D (diverse daily objects). Each subset consists of tightly aligned triplets: an image, a human-drawn sketch, and an LLM-generated textual description. Text annotations are schema-guided and human-verified to achieve attribute-level fidelity. STBIR-S and STBIR-C are designed for controlled, single-category retrieval, whereas STBIR-D tests retrieval under large-scale class diversity. The STBIR collection is notable for genuine, instance-aligned sketches rather than synthetic or category-level surrogates, providing a high-quality tri-modal resource not seen in previous works.
Framework and Methodology
Modular Architecture
The STBIR framework is structured to directly confront the key limitations of prior approaches:
- Sketch Feature Encoding: A ResNet-based encoder extracts global geometric features from sketches, capitalizing on proven convolutional architectures for visual abstraction.
- Text and Image Encoding: Both text and image data are projected using CLIP, leveraging its large-scale pretraining for robust cross-modal mapping.
- Fusion and Retrieval: Sketch and text features are fused via element-wise addition. Retrieval proceeds by calculating cosine similarity between the query fusion and image gallery embeddings.
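The fusion and retrieval step described in the list above can be sketched in a few lines. This is a minimal illustration using plain Python lists, not the authors' code: the helper names (`fuse`, `cosine`, `rank_gallery`) and the toy 3-dimensional embeddings are assumptions made for the example, and real embeddings would come from the ResNet and CLIP encoders.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def fuse(sketch_emb, text_emb):
    """Element-wise addition of sketch and text embeddings."""
    if len(sketch_emb) != len(text_emb):
        raise ValueError("embeddings must share a dimension")
    return [s + t for s, t in zip(sketch_emb, text_emb)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def rank_gallery(query, gallery):
    """Return gallery indices sorted by descending cosine similarity."""
    scores = [cosine(query, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)

# Toy, made-up embeddings (hypothetical values for illustration only).
sketch = [1.0, 0.0, 0.0]
text = [0.0, 1.0, 0.0]
gallery = [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]

ranking = rank_gallery(fuse(sketch, text), gallery)
print(ranking)
```

In this toy case the gallery item that matches both the sketch and the text direction ranks first, ahead of the item that matches the sketch alone.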
The Curriculum Learning Driven Robustness Enhancement (CLDRE) module incrementally increases feature-space noise during training. This simulates degraded sketch and/or text input, thereby regularizing the framework for robustness against realistic low-quality queries.
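A rough sketch of the curriculum idea follows. The summary does not state the noise distribution or the schedule, so the linear ramp, the Gaussian noise, and the names `noise_scale`, `perturb`, and `max_sigma` are all assumptions chosen for illustration.

```python
import random

def noise_scale(epoch, total_epochs, max_sigma=0.1):
    """Assumed linear ramp: sigma grows from 0 at the first epoch to max_sigma at the last."""
    if total_epochs <= 1:
        return max_sigma
    frac = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return max_sigma * frac

def perturb(features, sigma, rng):
    """Add zero-mean Gaussian noise (an assumed noise model) to each feature."""
    return [x + rng.gauss(0.0, sigma) for x in features]

rng = random.Random(0)  # fixed seed so the example is reproducible
schedule = [noise_scale(e, 5) for e in range(5)]
noisy = perturb([0.5, -0.2, 0.1], schedule[-1], rng)
print(schedule)
```

Early epochs see clean or nearly clean features, and later epochs see progressively harder, noisier ones, which is the easy-to-hard ordering that curriculum learning relies on.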
Category-Knowledge Feature Space Optimization
In addition to standard contrastive and triplet losses, the Category-Knowledge-Based Feature Space Optimization (CKFSO) head employs an angular margin-based loss, enforcing intra-class compactness and inter-class separation conditioned on category priors. This is critical for boosting discriminative power, a core requirement in fine-grained settings where intra-class variance can be minimal.
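To make the angular-margin idea concrete, here is a small ArcFace-style sketch. The summary only says the loss is "angular margin-based", so the additive-margin form, the `margin` and `scale` values, and the use of per-category center vectors (`class_centers`) are assumptions, not the paper's exact formulation.

```python
import math

def angular_margin_loss(embedding, class_centers, label, margin=0.3, scale=16.0):
    """Cross-entropy over scaled cosines, with an additive angular margin on the true class.

    Assumed ArcFace-style form; this simple sketch ignores the edge case
    where angle + margin exceeds pi.
    """
    def unit(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    e = unit(embedding)
    logits = []
    for k, center in enumerate(class_centers):
        cos_t = sum(a * b for a, b in zip(e, unit(center)))
        cos_t = max(-1.0, min(1.0, cos_t))  # guard acos against rounding error
        if k == label:
            # Penalize the true class by widening its angle to the center.
            cos_t = math.cos(math.acos(cos_t) + margin)
        logits.append(scale * cos_t)

    # Numerically stable log-sum-exp.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

# Hypothetical 2-d embedding and two category centers.
emb = [1.0, 0.1]
centers = [[1.0, 0.0], [0.0, 1.0]]
loss_with_margin = angular_margin_loss(emb, centers, label=0)
loss_no_margin = angular_margin_loss(emb, centers, label=0, margin=0.0)
print(loss_with_margin, loss_no_margin)
```

Because the margin makes the true class harder to satisfy, the loss stays higher for the same embedding, which pushes embeddings closer to their own category center and further from the others.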
Multi-stage Cross-Modal Feature Alignment
The framework introduces a multi-stage cross-modal feature alignment (MCFA) protocol to counteract parameter divergence arising from cross-modal gradient and distributional imbalance:
- Sketch Feature Mapping: Image and text encoders are frozen; the sketch encoder learns to map into the pretrained CLIP space.
- Image Feature Refinement: Sketch and text branches are frozen; the image encoder is fine-tuned to preserve structural correspondence.
- Textual Representation Integration: Only the text branch is trainable, enforcing fine-grained attribute alignment within the fused embedding space.
This sequenced optimization outperforms synchronous, monolithic training, as demonstrated by ablation.
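The three stages amount to a freeze/unfreeze schedule over the encoder branches, which can be written down as a small table. The stage identifiers, the `BRANCHES` tuple, and the `trainable_flags` helper below are hypothetical names introduced for illustration; only the choice of which branch trains in each stage comes from the description above.

```python
# Each stage names the single branch that is trainable; the others stay frozen.
STAGES = [
    ("sketch_feature_mapping", {"sketch"}),
    ("image_feature_refinement", {"image"}),
    ("textual_representation_integration", {"text"}),
]
BRANCHES = ("sketch", "image", "text")

def trainable_flags(stage_index):
    """Return the stage name and a branch -> is_trainable mapping."""
    name, trainable = STAGES[stage_index]
    return name, {branch: (branch in trainable) for branch in BRANCHES}

for i in range(len(STAGES)):
    name, flags = trainable_flags(i)
    print(name, flags)
```

In a PyTorch implementation, flags like these would typically be applied by setting `requires_grad` on each branch's parameters before a stage begins, though the summary does not say how the authors implement the freezing.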
Experimental Evaluation
Benchmarks and Baselines
Experiments span all three STBIR subsets. Baselines cover both unimodal and multimodal retrieval networks, including CLIP, DINO, SEARLE, TASKformer, and Pic2Word. Performance is reported as Recall@K (R@K) for K = 1, 5, and 10, the standard metric for instance-level retrieval.
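For reference, Recall@K is the percentage of queries whose ground-truth image appears among the top K retrieved results. A minimal sketch, assuming each query has exactly one target and that rankings are lists of gallery indices (the function name and toy data are illustrative):

```python
def recall_at_k(rankings, targets, k):
    """Percentage of queries whose target index is within the top-k ranked items."""
    if len(rankings) != len(targets) or not targets:
        raise ValueError("need one non-empty ranking per target")
    hits = sum(1 for ranked, target in zip(rankings, targets) if target in ranked[:k])
    return 100.0 * hits / len(targets)

# Three toy queries over a three-item gallery (made-up rankings).
rankings = [[2, 0, 1], [1, 2, 0], [0, 1, 2]]
targets = [2, 0, 1]

for k in (1, 2, 3):
    print(k, round(recall_at_k(rankings, targets, k), 2))
```

Recall@K can only stay equal or rise as K grows, so gaps between methods usually narrow from R@1 to R@10.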
Results
- Superior R@1/R@5: On STBIR-S (shoes), the framework achieves an R@1 of 51.80, narrowly ahead of the strongest baseline (Pic2Word, 51.72; a 0.08-point margin). On STBIR-C (chairs), it reaches 57.88 R@1, exceeding Pic2Word (53.38) by 4.50 points. On STBIR-D, the most challenging subset, it attains 62.85 R@1 and 93.44 R@5, both state-of-the-art; the only shortfall is a marginal R@10 deficit versus SEARLE.
- Ablation: Removing fusion modalities or disabling MCFA, CKFSO, or CLDRE leads to clear performance degradation. Sequentially prioritizing sketch alignment (before images or text) empirically outperforms alternate update schedules.
- Qualitative Analysis: Visualizations show that the framework retrieves targets with high contour, color, and texture fidelity relative to the compound sketch+text queries; failure cases often stem from ambiguity in the input modalities, not from network misalignment.
Implications
Practical
This work facilitates practical, user-oriented image retrieval workflows that can accommodate noisy, ambiguous, or incomplete queries from both freehand sketches and lightweight textual descriptions. Such queries are typical in creative, design, e-commerce, and forensic domains.
Theoretical
The multi-stage alignment mechanism offers a principled paradigm for tackling gradient and feature space imbalance in tri-modal joint training, a challenge recurrent in multi-modal AI but under-addressed in prior literature. Furthermore, the formulation harmonizing curriculum-driven robustness, margin-based discriminability, and staged alignment is easily extensible to other compositional retrieval tasks.
Future Directions
Future research can focus on learning more granular alignments between input sketches/texts and image subregions, incorporating localized or deformable attention. Further innovations are needed to bridge the input ambiguity gap, potentially by leveraging richer, interactive, or dialog-based annotation at query time.
Conclusion
This paper makes significant advances in tri-modal fine-grained image retrieval, systematically bridging geometric and semantic gaps by synergizing hand-drawn sketches and descriptive text attributes. The proposed STBIR dataset sets a new standard for benchmarking, and the multi-stage alignment framework demonstrates robust, state-of-the-art retrieval performance. The methodological and dataset contributions are expected to catalyze further research in robust, cross-modal visual understanding and retrieval (2604.15735).