
SketchVCL and LVLMs: Tri-modal Integration

Updated 25 December 2025
  • The paper introduces SketchVCL, a tri-modal dataset and methodology that boosts LVLM performance on sketch understanding tasks through robust architecture and curriculum learning.
  • It employs SAM2, Pix2Pix, and a two-stage curriculum of pretraining followed by instruction tuning to achieve state-of-the-art results in object localization, SBIR, counting, and VQA.
  • The model, O3SLM, encodes sketches and images with a shared CLIP encoder and fuses them through a multimodal connector for effective cross-modal reasoning, demonstrating strong zero-shot generalization on unseen sketch styles.

Large Vision-Language Models (LVLMs) have historically exhibited limited performance in understanding and reasoning over hand-drawn sketches, a key visual modality for expressing abstract concepts not easily captured in text or photorealistic imagery. The SketchVCL framework addresses this limitation with a tri-modal dataset and methodology that enables robust sketch comprehension, object localization, sketch-based image retrieval (SBIR), counting, and visual question answering (VQA) within LVLMs. The introduction of SketchVCL, along with the O3SLM model, demonstrates state-of-the-art results in these domains, particularly when evaluated on both seen and unseen sketch styles (Gupta et al., 18 Nov 2025).

1. Foundations and Motivation

The core challenge in sketch–LVLM alignment is the absence of large-scale, jointly annotated resources capturing the relationships among sketches, images, and language instructions. Existing LVLMs—optimized mainly for natural images and text—fail to generalize to abstracted inputs such as hand-drawn sketches, as confirmed by empirical comparisons on comprehensive benchmark suites. The need for a tri-modal corpus and model architecture supporting open-vocabulary reasoning motivates the development of SketchVCL and O3SLM (Gupta et al., 18 Nov 2025).

2. SketchVCL Dataset: Construction and Properties

SketchVCL is a large-scale tri-modal dataset comprising hand-drawn sketches ($S$), photorealistic images ($I$), and natural language instructions ($T$). Data collection follows a two-stage process:

  • Stage I (Pretraining): 600,000 image–sketch–instruction triplets (300,000 from OpenImages, 300,000 from Objects365), pairing photorealistic images and automated, instance-level sketches with descriptions and bounding boxes.
  • Stage II (Instruction Tuning): ~215,000 samples spanning four task types: object localization (110,000), VQA (50,000), counting (30,000), and SBIR (25,000 positive, 25,000 negative), incorporating sketches from SketchVCL and established repositories (Sketchy, QuickDraw!). TU-Berlin is reserved for zero-shot evaluation.

The data generation pipeline uses SAM2 for instance segmentation, Pix2Pix for photo-to-sketch synthesis, morphological edge enhancement, and LLaMA-3-8B plus DeepSeek-VL2 for instruction and caption generation. Task prompts are prefixed (COUNT, BBOX, VQA, SBIR) and diversified for robustness. The curriculum incorporates a "SketchMIX" pool—diverse sketch sources to promote generalization.
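
As an illustration, the minimal sketch below assembles image–sketch–instruction triplets for a single photo. The callables `segment_instances`, `photo_to_sketch`, and `caption_model` are hypothetical stand-ins for the SAM2, Pix2Pix, and LLaMA-3-8B/DeepSeek-VL2 stages (their real interfaces are not specified in this summary); only the morphological edge enhancement uses concrete OpenCV operations.

```python
import cv2
import numpy as np

def enhance_sketch_edges(sketch: np.ndarray, kernel_size: int = 3) -> np.ndarray:
    """Morphological edge enhancement: thicken strokes, then close small gaps."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    dilated = cv2.dilate(sketch, kernel, iterations=1)
    return cv2.morphologyEx(dilated, cv2.MORPH_CLOSE, kernel)

def build_triplets(image: np.ndarray, segment_instances, photo_to_sketch, caption_model):
    """Assemble image-sketch-instruction triplets for one photo.

    `segment_instances`, `photo_to_sketch`, and `caption_model` are placeholder
    callables standing in for the SAM2 segmenter, the Pix2Pix sketch generator,
    and the LLaMA-3-8B / DeepSeek-VL2 captioning stage described above.
    """
    triplets = []
    for mask, bbox in segment_instances(image):            # instance segmentation (SAM2 stage)
        crop = image * mask[..., None]                      # isolate one instance
        sketch = enhance_sketch_edges(photo_to_sketch(crop))  # photo-to-sketch synthesis + edge cleanup
        instruction = caption_model(crop, bbox)             # instruction / caption generation
        # Task prompts are prefixed (here BBOX) and paraphrased for robustness.
        triplets.append({"image": image, "sketch": sketch,
                         "instruction": f"BBOX: {instruction}", "bbox": bbox})
    return triplets
```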

Table: SketchVCL Dataset Splits and Sources

| Stage     | Sample Count | Content/Source                     |
|-----------|--------------|------------------------------------|
| Pretrain  | 600,000      | SketchVCL-Objects365, OpenImages   |
| Tune      | ~215,000     | Sketchy, QuickDraw!, COCO          |
| Zero-shot | –            | TU-Berlin (not seen in train/tune) |

3. Training Methodologies and Objectives

O3SLM training employs a two-stage curriculum:

  • Stage I: Pretraining (Sketch Alignment):

    $$L_{\text{NLL}}(\theta) = -\sum_{t=1}^{T} \log p_\theta(w_t \mid w_{<t}, x)$$

    Tasks: semantic sketch recognition, region alignment, open-vocabulary language modeling.

  • Stage II: Instruction Tuning:

    • Task-specific losses atop the foundational NLL:
    • SBIR: binary cross-entropy over the model's $p(\langle\text{yes}\rangle \mid X)$ for image–sketch pairs.
    • Counting/Detection: NLL over numeric or coordinate tokens.
    • Prompt randomization and SketchMIX data augmentation improve task robustness and mitigate overfitting (a minimal sketch of these losses follows this list).
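
As a minimal illustration of these objectives (standard PyTorch, with tensor layouts assumed for clarity; this is not the paper's code), the Stage I NLL and the SBIR binary cross-entropy can be written as:

```python
import torch
import torch.nn.functional as F

def nll_loss(logits: torch.Tensor, targets: torch.Tensor, ignore_index: int = -100) -> torch.Tensor:
    """Autoregressive NLL: -sum_t log p(w_t | w_{<t}, x).

    logits:  (B, T, V) next-token distributions from the LVLM.
    targets: (B, T) ground-truth token ids, with prompt positions set to ignore_index.
    """
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), ignore_index=ignore_index)

def sbir_bce_loss(answer_logits: torch.Tensor, match: torch.Tensor, yes_token_id: int) -> torch.Tensor:
    """SBIR: binary cross-entropy over p(<yes> | X) for an image-sketch pair.

    answer_logits: (B, V) logits at the answer position.
    match:         (B,) 1.0 if the sketch matches the image, else 0.0.
    yes_token_id:  vocabulary index of the <yes> token (an assumed detail).
    """
    p_yes = answer_logits.softmax(dim=-1)[:, yes_token_id]  # model's probability of answering <yes>
    return F.binary_cross_entropy(p_yes, match)
```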

Instruction templates and prompted questions are synthesized automatically to ensure comprehensive coverage and generalization.

4. Model Architecture: Adaptations for Sketch Comprehension

O3SLM incorporates several architectural refinements to the standard LVLM paradigm to support robust tri-modal fusion:

  • Visual Encoder: Shared CLIP ViT-L/336 is used for both sketches and images, yielding 384-dimensional embeddings at 336×336 resolution for spatial sensitivity.
  • Multimodal Connector: A two-layer MLP projects concatenated CLIP features into the LLM embedding space:

$$e = W_2\,\mathrm{ReLU}\big(W_1 [f_s; f_v] + b_1\big) + b_2$$

This yields the joint token sequence [E_sketch; E_image; E_text].

  • Cross-Modal Fusion: Self-attention over the joint modality sequence, with attention weights:

$$A = \operatorname{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right), \qquad \operatorname{Attention}(Q, V) = AV$$

No explicit cross-attention modules are used; fusion is learned implicitly through standard transformer self-attention.

This design enables the model to align and reason over sketches, images, and text in a unified framework, supporting open-vocabulary generation.
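
A minimal PyTorch sketch of the connector and the implicit fusion, mirroring the two formulas above, is given below; the class name, feature shapes, and the separate Q/K/V linear maps are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalConnector(nn.Module):
    """Two-layer MLP projecting concatenated sketch/image CLIP features into the
    LLM embedding space: e = W2 ReLU(W1 [f_s; f_v] + b1) + b2."""

    def __init__(self, clip_dim: int, llm_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(2 * clip_dim, llm_dim)  # W1, b1 applied to [f_s; f_v]
        self.fc2 = nn.Linear(llm_dim, llm_dim)       # W2, b2

    def forward(self, f_s: torch.Tensor, f_v: torch.Tensor) -> torch.Tensor:
        # f_s, f_v: (B, N, clip_dim) sketch and image tokens from the shared CLIP encoder
        return self.fc2(F.relu(self.fc1(torch.cat([f_s, f_v], dim=-1))))

def self_attention_fusion(tokens: torch.Tensor, w_q: nn.Linear, w_k: nn.Linear, w_v: nn.Linear) -> torch.Tensor:
    """Scaled dot-product self-attention over the joint [E_sketch; E_image; E_text]
    sequence; fusion is implicit, with no dedicated cross-attention module."""
    q, k, v = w_q(tokens), w_k(tokens), w_v(tokens)
    attn = torch.softmax(q @ k.transpose(-2, -1) / k.size(-1) ** 0.5, dim=-1)  # A = softmax(QK^T / sqrt(d))
    return attn @ v                                                            # Attention(Q, V) = AV
```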

5. Benchmarking, Metrics, and Quantitative Results

O3SLM is assessed on sketch-based counting, object localization, SBIR, and VQA, using accuracy, mAP, and top-K retrieval metrics:

Table: Sketch-Based Counting Accuracy (%)

| Model                  | Sketchy | QuickDraw! | TU-Berlin† | SketchVCL-C | Avg  |
|------------------------|---------|------------|------------|-------------|------|
| O3SLM-7B               | 41.8    | 33.0       | 50.6       | 48.6        | 43.5 |
| Best open-weight (7B)  | 40.5    | 36.6       | 32.7       | 19.3        | 30.3 |

†: TU-Berlin is zero-shot (unseen during training).

Table: Sketch-Based Object Detection mAP@0.5 (%)

| Model         | Sketchy | QuickDraw! | TU-Berlin† | SketchVCL-C |
|---------------|---------|------------|------------|-------------|
| O3SLM-13B     | 23.7    | 28.1       | 19.2       | 24.8        |
| Best baseline | 11.4    | 10.3       | 11.4       | 11.3        |
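
For reference, mAP@0.5 counts a predicted box as correct when its intersection-over-union (IoU) with a ground-truth box of the sketched category is at least 0.5. A generic IoU helper (not the paper's evaluation code) looks like:

```python
def iou(box_a, box_b) -> float:
    """IoU of two boxes in (x1, y1, x2, y2) format; a detection is a true
    positive at the 0.5 threshold when iou(pred, gt) >= 0.5."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)                 # overlap area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)               # intersection / union
```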

Table: SBIR Top-K Accuracy on Sketchy

| Model    | Acc@1 | Acc@5 | Acc@10 |
|----------|-------|-------|--------|
| LLaVA-7B | 11.0  | 14.4  | 13.0   |
| O3SLM-7B | 65.0  | 59.2  | 39.4   |

O3SLM outperforms current open-weight LVLMs (LLaVA-7B and 13B, Qwen-VL, DeepSeek-VL2, Molmo) across all tasks. On zero-shot TU-Berlin sketches, O3SLM maintains robust generalization, reaching 50.6% counting accuracy versus 32.7% for the best open-weight alternative. Fine-grained SBIR that leverages both sketch and textual cues emerges without explicit training, enabling the model to refine retrieval based on complex attributes.
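
The top-K retrieval accuracies above can be computed with a generic routine like the one below; the score-matrix layout is an assumption for illustration, not the paper's evaluation code.

```python
import numpy as np

def top_k_accuracy(scores: np.ndarray, gt_index: np.ndarray, k: int) -> float:
    """scores:   (num_queries, gallery_size) relevance score of each gallery image per sketch query.
    gt_index: (num_queries,) index of the matching gallery image for each query.
    Returns the fraction of queries whose true match appears among the top-k ranked images."""
    topk = np.argsort(-scores, axis=1)[:, :k]          # indices of the k highest-scoring gallery images
    hits = (topk == gt_index[:, None]).any(axis=1)     # did the ground-truth image make the top-k?
    return float(hits.mean())
```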

6. Significance, Implications, and Outlook

SketchVCL addresses a key gap in open-vocabulary, multi-modal learning for LVLMs by providing a large-scale, systematically constructed dataset featuring hand-drawn sketches, aligned images, and rich language instructions. Automated sketch synthesis and prompting enable scalable data production, while explicit curriculum learning yields robust and generalizable sketch reasoning. The demonstrated gains in counting, localization, SBIR, and VQA, together with strong zero-shot transfer, suggest that the tri-modal learning paradigm embodied by SketchVCL and O3SLM is essential for next-generation LVLM applications requiring abstract concept interpretation, and they set a clear benchmark for future research (Gupta et al., 18 Nov 2025).
