O3SLM: Open Sketch-Language Model

Updated 25 November 2025
  • O3SLM is a large vision-language model defined by its open weights, open data, and open vocabulary, robustly bridging photorealistic images with hand-drawn sketches.
  • It employs a two-stage training methodology using a novel SketchVCL dataset and integrates CLIP ViT-L/336 with Vicuna-based language models to process multimodal inputs.
  • Empirical evaluations reveal that O3SLM achieves 65% SBIR accuracy and up to 44% accuracy on counting, setting new baselines for sketch-based reasoning.

O3SLM (Open Weight, Open Data, and Open Vocabulary Sketch-LLM) is a large vision-language model (LVLM) designed to address the domain gap between photorealistic images and hand-drawn sketches in multimodal AI systems. By introducing a large-scale dataset of image–sketch–instruction triplets and releasing both open-source model weights (7B, 13B) and data, O3SLM establishes a new framework for open-vocabulary, sketch-centric vision-language reasoning. The approach achieves state-of-the-art results across multiple sketch-based tasks, supporting robust sketch comprehension, spatial reasoning, and fine-grained retrieval (Gupta et al., 18 Nov 2025).

1. Motivation and Foundations

O3SLM responds to a critical limitation of existing LVLMs, such as LLaVA, Qwen-VL, and Molmo, which excel on photorealistic images but fail to interpret abstract, artist-dependent hand-drawn sketches. The key technical bottleneck identified is the absence of a large-scale dataset tightly coupling sketches, images, and language instructions. Because sketches lack color and texture and exhibit substantial abstraction and stylistic variability, traditional LVLMs often generate nonsensical or ungrounded outputs for sketch inputs. O3SLM targets these limitations by enabling expressive, language-agnostic query modalities grounded in sketches, facilitating complex spatial and object-centric reasoning beyond fixed-category, closed-vocabulary settings.

The model is anchored in the "O3" principles:

  • Open Weight: All trained model weights (7B and 13B) are publicly released for inspection, reproducibility, and further development.
  • Open Data: The new SketchVCL dataset, along with the scripts for its generation, is freely available.
  • Open Vocabulary: By harmonizing sketch classes with Object365’s and OpenImages’ extensive taxonomies, O3SLM supports open-vocabulary queries, removing the constraints of small, pre-set class lists.

2. Dataset Construction and Integration

Central to O3SLM is the creation and open release of SketchVCL, an image–sketch–instruction triplet dataset engineered for both pretraining and instruction tuning.

  • Stage I (Pretraining): Compiles 600K triplets drawn from 250K Object365 images, 250K OpenImages images, and 50K tail-class samples for a balanced distribution. Each image's target object is segmented via SAM2, masked, and transformed into a detailed sketch using a Pix2Pix-based Photo2Sketch pipeline, enhanced with morphological-gradient edge extraction (a minimal sketch of this edge-extraction step follows this list). Instructions are synthesized using DeepSeek-VL2 and refined with LLaMA-3-8B Instruct to fit a task-driven template, e.g., specifying bounding box coordinates and descriptive context.
  • Stage II (Instruction Tuning): Encompasses 215K triplets, spanning four tasks—object localization (110K COCO pairs, "BBOX" prefix), VQA (50K; mix of sketch-based and image-only, "VQA" prefix), counting (30K from PixMo-Count, "COUNT" prefix), and sketch-based image retrieval (25K Sketchy positives/negatives, "SBIR" prefix).
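
The Stage I conversion referenced above can be illustrated as follows. This is a minimal sketch, assuming the SAM2 mask and the Pix2Pix-based Photo2Sketch stylization are produced elsewhere; only the morphological-gradient edge-extraction component is shown, and all function and variable names are illustrative rather than taken from the O3SLM codebase.

```python
# Illustrative edge-extraction step: morphological gradient on a masked object crop.
import cv2
import numpy as np

def morphological_gradient_edges(image_bgr: np.ndarray,
                                 object_mask: np.ndarray,
                                 kernel_size: int = 3) -> np.ndarray:
    """Return a binary edge map of the masked object region."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    # Restrict attention to the segmented object so background clutter is ignored.
    masked = cv2.bitwise_and(gray, gray, mask=object_mask.astype(np.uint8))
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE,
                                       (kernel_size, kernel_size))
    # Morphological gradient = dilation - erosion, giving a thin contour response.
    grad = cv2.morphologyEx(masked, cv2.MORPH_GRADIENT, kernel)
    _, edges = cv2.threshold(grad, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return edges

if __name__ == "__main__":
    # Synthetic example: a gray square on a dark background with a matching mask.
    img = np.zeros((128, 128, 3), dtype=np.uint8)
    cv2.rectangle(img, (32, 32), (96, 96), (180, 180, 180), thickness=-1)
    mask = np.zeros((128, 128), dtype=np.uint8)
    cv2.rectangle(mask, (28, 28), (100, 100), 255, thickness=-1)
    print(morphological_gradient_edges(img, mask).shape)  # (128, 128)
```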

To ensure robustness and generalization, Stage II incorporates existing datasets:

  • SketchMIX draws from Sketchy, QuickDraw!, and generated sketches on COCO/Object365.
  • TU Berlin is reserved exclusively as an unseen-style test set.
  • Class mapping aligns all data to the Object365 taxonomy using CLIP-based class-name alignment, maintaining balanced class representation (≥200 sketches per class); a minimal sketch of this alignment step follows the list.
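
The alignment step can be sketched as follows, assuming class-name lists are supplied by the caller; the CLIP checkpoint identifier and prompt template are illustrative choices, not necessarily those used by O3SLM.

```python
# Illustrative CLIP-based class-name alignment: map source class names to the
# nearest class in a target taxonomy via text-embedding cosine similarity.
import torch
from transformers import CLIPModel, CLIPTokenizer

MODEL_ID = "openai/clip-vit-large-patch14-336"  # assumed checkpoint choice
tokenizer = CLIPTokenizer.from_pretrained(MODEL_ID)
model = CLIPModel.from_pretrained(MODEL_ID).eval()

@torch.no_grad()
def embed_class_names(names: list[str]) -> torch.Tensor:
    prompts = [f"a sketch of a {n}" for n in names]   # illustrative prompt template
    inputs = tokenizer(prompts, padding=True, return_tensors="pt")
    feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def align_to_taxonomy(source_classes: list[str],
                      object365_classes: list[str]) -> dict[str, str]:
    """Map each source class to its nearest target class by cosine similarity."""
    src = embed_class_names(source_classes)
    tgt = embed_class_names(object365_classes)
    nearest = (src @ tgt.T).argmax(dim=-1)
    return {s: object365_classes[i] for s, i in zip(source_classes, nearest.tolist())}

# Example: map two Sketchy-style class names onto a tiny taxonomy subset.
print(align_to_taxonomy(["sea turtle", "pickup truck"],
                        ["turtle", "truck", "car", "dog"]))
```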

3. Model Architecture and Training Protocol

O3SLM’s architecture applies a modular and efficient paradigm:

  • Visual Backbone: CLIP ViT-L/336 encodes sketches (S) and images (I) into 1 × 1024-dimensional feature vectors.
  • Multimodal Connector: A two-layer MLP with LoRA adapters transforms visual embeddings into the Vicuna LLM’s token-embedding space.
  • LLM: The backbone is Vicuna v1.5, initialized from LLaVA-1.5; it consumes the concatenated projected sketch, image, and tokenized instruction sequences without supplemental cross-attention layers, relying on emergent multimodal alignment from joint training (a minimal sketch of the connector and input assembly follows this list).
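
A minimal sketch of the connector and input assembly, assuming 1024-dimensional CLIP features and 4096-dimensional Vicuna-7B token embeddings; LoRA adapters and the language model itself are omitted, and all module names are illustrative.

```python
import torch
import torch.nn as nn

class MultimodalConnector(nn.Module):
    """Two-layer MLP projecting visual features into the LLM token-embedding space."""
    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vis_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(visual_feats)

connector = MultimodalConnector()
sketch_feat = torch.randn(1, 1, 1024)   # 1 x 1024 CLIP feature for the sketch S
image_feat = torch.randn(1, 1, 1024)    # 1 x 1024 CLIP feature for the image I
text_embeds = torch.randn(1, 32, 4096)  # embedded instruction tokens (placeholder)

# The LLM consumes the concatenated sequence directly; no cross-attention layers are added.
llm_inputs = torch.cat(
    [connector(sketch_feat), connector(image_feat), text_embeds], dim=1
)
print(llm_inputs.shape)  # torch.Size([1, 34, 4096])
```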

Pretraining employs a symmetric InfoNCE contrastive objective:

\mathcal{L}_{\text{contrast}} = -\frac{1}{2N}\sum_{i=1}^{N} \Bigl[ \log \frac{\exp(\text{sim}(S_i, I_i)/\tau)}{\sum_{j=1}^{N} \exp(\text{sim}(S_i, I_j)/\tau)} + \log \frac{\exp(\text{sim}(I_i, S_i)/\tau)}{\sum_{j=1}^{N} \exp(\text{sim}(I_i, S_j)/\tau)} \Bigr]
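
The same objective in code, as a minimal PyTorch sketch: each direction of the loss reduces to a cross-entropy over the batch similarity matrix with matching pairs on the diagonal. The temperature value and feature dimensionality are placeholders.

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(sketch_feats: torch.Tensor,
                      image_feats: torch.Tensor,
                      tau: float = 0.07) -> torch.Tensor:
    # sim(S_i, I_j) for every pair in the batch (cosine similarity, assuming
    # the features are already L2-normalized).
    logits = sketch_feats @ image_feats.T / tau          # N x N
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_s2i = F.cross_entropy(logits, targets)          # sketch -> image term
    loss_i2s = F.cross_entropy(logits.T, targets)        # image -> sketch term
    return 0.5 * (loss_s2i + loss_i2s)

# Toy usage with random normalized features.
s = F.normalize(torch.randn(8, 1024), dim=-1)
i = F.normalize(torch.randn(8, 1024), dim=-1)
print(symmetric_infonce(s, i).item())
```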

Instruction tuning uses standard cross-entropy loss over next-token prediction: \mathcal{L}_{\text{inst}} = \mathbb{E}_{(X, y) \sim D}\left[ \ell_{\mathrm{CE}}(f_\theta(X), y) \right], where X = (I, S, \text{Prompt}) and y is the task-specific output (e.g., bounding box, integer count, free-form textual response).
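
A minimal sketch of this objective: standard next-token cross-entropy computed over shifted logits, with prompt positions maskable via the usual ignore_index convention. Shapes and the vocabulary size are placeholders.

```python
import torch
import torch.nn.functional as F

vocab, seq_len = 32_000, 16
logits = torch.randn(1, seq_len, vocab)          # f_theta(X): per-position token logits
labels = torch.randint(0, vocab, (1, seq_len))   # y: tokenized target (e.g. a "BBOX ..." string)

# Shift so position t predicts token t+1; in practice, prompt positions are
# excluded by setting their labels to -100 (the ignore_index convention).
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab),
    labels[:, 1:].reshape(-1),
    ignore_index=-100,
)
print(loss.item())
```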

Training is performed on 2× NVIDIA H100 GPUs with AdamW (weight decay 0), learning rate 2×10⁻⁵ (cosine decay with 3% warm-up), batch size 24, LoRA rank 64, and a single pass over the pretraining and instruction-tuning splits.
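
A hedged sketch of this optimizer and schedule configuration in PyTorch; the stand-in module, step count, and placeholder loss are illustrative, and the LoRA wiring itself is omitted.

```python
import torch
from torch.optim import AdamW
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(1024, 4096)   # stand-in for the trainable parameters
total_steps = 10_000                  # one pass over the split (placeholder count)

optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.0)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.03 * total_steps),  # 3% warm-up
    num_training_steps=total_steps,            # cosine decay over the remaining steps
)

for step in range(3):  # tiny demo loop; real training makes one pass over the split
    loss = model(torch.randn(24, 1024)).pow(2).mean()  # placeholder loss, batch size 24
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
print(scheduler.get_last_lr())
```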

4. Evaluation Schema and Task Suite

O3SLM is benchmarked on four core sketch-based multimodal tasks:

  • Object Localization: COCO val2017, plus Sketchy, QuickDraw!, TU Berlin, and SketchVCL-C sketches; metric is accuracy at IoU ≥ 0.5 (a metric sketch follows this list).
  • Sketch-Based Counting: PixMo-Count (single class) and COCO val (multi-class); metric is strict exact-match accuracy.
  • Sketch-Based Image Retrieval (SBIR & FG-SBIR): Sketchy (20×5×5→100-gallery per class); metric is top-K accuracy (Acc@K) over "yes" responses; FG-SBIR adds a fine-grained textual clause.
  • Sketch-Aware VQA: COCO + ShareGPT4V; 25K sketch-based QA pairs, evaluated by standard VQA accuracy on exact or normalized match.
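
A minimal sketch of the localization and counting metrics referenced above; the box format (x1, y1, x2, y2) and helper names are assumptions, while the IoU ≥ 0.5 threshold and strict exact-match criterion follow the text.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def localization_accuracy(pred_boxes, gt_boxes, thresh=0.5):
    """Fraction of predictions whose IoU with the ground truth meets the threshold."""
    hits = sum(iou(p, g) >= thresh for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(gt_boxes)

def counting_accuracy(pred_counts, gt_counts):
    """Strict exact-match accuracy over predicted integer counts."""
    return sum(p == g for p, g in zip(pred_counts, gt_counts)) / len(gt_counts)

print(localization_accuracy([(10, 10, 50, 50)], [(12, 12, 52, 52)]))  # 1.0
print(counting_accuracy([3, 2, 5], [3, 2, 4]))                        # ~0.667
```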

5. Quantitative Benchmarks and Analytical Findings

O3SLM establishes new baselines for sketch-centric vision-language reasoning, summarized as follows:

| Task | O3SLM-7B | O3SLM-13B | Best Baseline (Open / Closed) |
|---|---|---|---|
| Sketch-Based Counting (avg) | 43.5% | 44.0% | 17.7% (Qwen2.5-VL2) / 33.6% (GPT-4o) |
| TU Berlin (unseen, counting) | 50.6% | – | 24.7% (GPT-4o) |
| Object Detection (avg Acc) | 12% | 17% | <4% (LLaVA-7B, Molmo-7B-D); 5.5% (LLaVA-13B) |
| SBIR Acc@1 (Sketchy) | 65% | 55% | 11% (LLaVA-7B) / 10% (LLaVA-13B) |
| SBIR Acc@5 (Sketchy) | 59.2% | 46.4% | 14.4% (LLaVA-7B) / 9.2% (LLaVA-13B) |

Ablation experiments show that contrastive pretraining (Stage I) yields a ~40% relative gain on SBIR, while instruction tuning is critical for emergent fine-grained sketch-text reasoning. Fine-tuning the multimodal connector, versus freezing it, yields an additional 5–6% detection accuracy at IoU ≥ 0.5, at a modest parameter cost (<10M trainable parameters).

Qualitative results indicate O3SLM’s capacity to obey rich compositional sketch+text prompts (e.g., FG-SBIR), even without explicit fine-grained training, attributed to the VQA-oriented instruction schedule.

6. Open-Source Distribution, Limitations, and Trajectories

O3SLM’s public artifacts—pretrained weights (7B, 13B), codebase, and SketchVCL dataset—are hosted by the Vision & Computation Lab/Indian Institute of Science (https://vcl-iisc.github.io/O3SLM/).

Identified limitations include:

  • Duplication of bounding box outputs for heavily overlapping objects (amenable to classic non-maximum suppression; a minimal NMS sketch follows this list).
  • A modest (<5%) drop on standard VQA and detection benchmarks in image-only settings, reflecting a trade-off for enhanced sketch generalization.
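
A minimal sketch of the classic non-maximum suppression remedy mentioned above, which greedily keeps the highest-scoring box and drops near-duplicates; the box format, scores, and overlap threshold are illustrative.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the near-duplicate second box is suppressed
```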

Planned research directions encompass support for multi-object conversational reasoning, improved vector-sketch synthesis, and iterative sketch refinement.

O3SLM constitutes the first open-weight LVLM demonstrating robust, open-vocabulary sketch understanding across localization, counting, image retrieval, and visual question answering, thereby catalyzing further research at the sketch–image interface in multimodal AI systems (Gupta et al., 18 Nov 2025).

References

  • Gupta et al. (18 Nov 2025). O3SLM: Open Weight, Open Data, and Open Vocabulary Sketch-LLM.