
SketchVCL and LVLMs: Tri-modal Integration

Updated 25 December 2025
  • The paper introduces SketchVCL, a tri-modal dataset and methodology that boosts LVLM performance on sketch understanding tasks through robust architecture and curriculum learning.
  • It employs SAM2, Pix2Pix, and a two-stage curriculum of pretraining followed by instruction tuning to achieve state-of-the-art results in object localization, SBIR, counting, and VQA.
  • The model, O3SLM, encodes sketches and images with a shared CLIP encoder and fuses them through a multimodal connector for effective cross-modal reasoning, demonstrating strong zero-shot generalization on unseen sketch styles.

Large Vision-Language Models (LVLMs) have historically exhibited limited performance in understanding and reasoning over hand-drawn sketches, a key visual modality for expressing abstract concepts not easily captured in text or photorealistic imagery. The SketchVCL framework addresses this limitation with a tri-modal dataset and methodology that enables robust sketch comprehension, object localization, sketch-based image retrieval (SBIR), counting, and visual question answering (VQA) within LVLMs. The introduction of SketchVCL, along with the O3SLM model, demonstrates state-of-the-art results in these domains, particularly when evaluated on both seen and unseen sketch styles (Gupta et al., 18 Nov 2025).

1. Foundations and Motivation

The core challenge in sketch–LVLM alignment is the absence of large-scale, jointly annotated resources capturing the relationships among sketches, images, and language instructions. Existing LVLMs—optimized mainly for natural images and text—fail to generalize to abstracted inputs such as hand-drawn sketches, as confirmed by empirical comparisons on comprehensive benchmark suites. The need for a tri-modal corpus and model architecture supporting open-vocabulary reasoning motivates the development of SketchVCL and O3SLM (Gupta et al., 18 Nov 2025).

2. SketchVCL Dataset: Construction and Properties

SketchVCL is a large-scale tri-modal dataset comprising hand-drawn sketches ($S$), photorealistic images ($I$), and natural language instructions ($T$). Data collection follows a two-stage process:

  • Stage I (Pretraining): 600,000 image–sketch–instruction triplets (300,000 from OpenImages, 300,000 from Objects365), pairing photorealistic images and automated, instance-level sketches with descriptions and bounding boxes.
  • Stage II (Instruction Tuning): ~215,000 samples spanning four task types: object localization (110,000), VQA (50,000), counting (30,000), and SBIR (25,000 positive, 25,000 negative), incorporating sketches from SketchVCL and established repositories (Sketchy, QuickDraw!). TU-Berlin is reserved for zero-shot evaluation.

The data generation pipeline uses SAM2 for instance segmentation, Pix2Pix for photo-to-sketch synthesis, morphological edge enhancement, and LLaMA-3-8B plus DeepSeek-VL2 for instruction and caption generation. Task prompts are prefixed (COUNT, BBOX, VQA, SBIR) and diversified for robustness. The curriculum incorporates a "SketchMIX" pool—diverse sketch sources to promote generalization.
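
As an illustration, the minimal sketch below assembles image–sketch–instruction triplets for a single photo. The callables `segment_instances`, `photo_to_sketch`, and `caption_model` are hypothetical stand-ins for the SAM2, Pix2Pix, and LLaMA-3-8B/DeepSeek-VL2 stages (their real interfaces are not specified in this summary); only the morphological edge enhancement uses concrete OpenCV operations.

```python
import cv2
import numpy as np

def enhance_sketch_edges(sketch: np.ndarray, kernel_size: int = 3) -> np.ndarray:
    """Morphological edge enhancement: thicken strokes, then close small gaps."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    dilated = cv2.dilate(sketch, kernel, iterations=1)
    return cv2.morphologyEx(dilated, cv2.MORPH_CLOSE, kernel)

def build_triplets(image: np.ndarray, segment_instances, photo_to_sketch, caption_model):
    """Assemble image-sketch-instruction triplets for one photo.

    `segment_instances`, `photo_to_sketch`, and `caption_model` are placeholder
    callables standing in for the SAM2 segmenter, the Pix2Pix sketch generator,
    and the LLaMA-3-8B / DeepSeek-VL2 captioning stage described above.
    """
    triplets = []
    for mask, bbox in segment_instances(image):            # instance segmentation (SAM2 stage)
        crop = image * mask[..., None]                      # isolate one instance
        sketch = enhance_sketch_edges(photo_to_sketch(crop))  # photo-to-sketch synthesis + edge cleanup
        instruction = caption_model(crop, bbox)             # instruction / caption generation
        # Task prompts are prefixed (here BBOX) and paraphrased for robustness.
        triplets.append({"image": image, "sketch": sketch,
                         "instruction": f"BBOX: {instruction}", "bbox": bbox})
    return triplets
```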

Table: SketchVCL Dataset Splits and Sources

| Stage     | Sample Count | Content/Source                     |
|-----------|--------------|------------------------------------|
| Pretrain  | 600,000      | SketchVCL-Objects365, OpenImages   |
| Tune      | ~215,000     | Sketchy, QuickDraw!, COCO          |
| Zero-shot | –            | TU-Berlin (not seen in train/tune) |

3. Training Methodologies and Objectives

O3SLM training employs a two-stage curriculum:

  • Stage I: Pretraining (Sketch Alignment):

    $$L_{\text{NLL}}(\theta) = -\sum_{t=1}^{T} \log p_\theta(w_t \mid w_{<t}, x)$$

    Tasks: semantic sketch recognition, region alignment, open-vocabulary language modeling.

  • Stage II: Instruction Tuning:

    • Task-specific losses atop the foundational NLL:
    • SBIR: binary cross-entropy over the model's $p(\langle\text{yes}\rangle \mid X)$ for image–sketch pairs.
    • Counting/Detection: NLL over numeric or coordinate tokens.
    • Prompt randomization and SketchMIX data augmentation improve task robustness and mitigate overfitting (a minimal sketch of these losses follows this list).
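
As a minimal illustration of these objectives (standard PyTorch, with tensor layouts assumed for clarity; this is not the paper's code), the Stage I NLL and the SBIR binary cross-entropy can be written as:

```python
import torch
import torch.nn.functional as F

def nll_loss(logits: torch.Tensor, targets: torch.Tensor, ignore_index: int = -100) -> torch.Tensor:
    """Autoregressive NLL: -sum_t log p(w_t | w_{<t}, x).

    logits:  (B, T, V) next-token distributions from the LVLM.
    targets: (B, T) ground-truth token ids, with prompt positions set to ignore_index.
    """
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), ignore_index=ignore_index)

def sbir_bce_loss(answer_logits: torch.Tensor, match: torch.Tensor, yes_token_id: int) -> torch.Tensor:
    """SBIR: binary cross-entropy over p(<yes> | X) for an image-sketch pair.

    answer_logits: (B, V) logits at the answer position.
    match:         (B,) 1.0 if the sketch matches the image, else 0.0.
    yes_token_id:  vocabulary index of the <yes> token (an assumed detail).
    """
    p_yes = answer_logits.softmax(dim=-1)[:, yes_token_id]  # model's probability of answering <yes>
    return F.binary_cross_entropy(p_yes, match)
```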

Instruction templates and prompted questions are synthesized automatically to ensure comprehensive coverage and generalization.

4. Model Architecture: Adaptations for Sketch Comprehension

O3SLM incorporates several architectural refinements to the standard LVLM paradigm to support robust tri-modal fusion:

  • Visual Encoder: Shared CLIP ViT-L/336 is used for both sketches and images, yielding 384-dimensional embeddings at 336×336 resolution for spatial sensitivity.
  • Multimodal Connector: A two-layer MLP projects concatenated CLIP features into the LLM embedding space:

$$e = W_2\,\mathrm{ReLU}\big(W_1 [f_s; f_v] + b_1\big) + b_2$$

This yields the joint token sequence [E_sketch; E_image; E_text].

  • Cross-Modal Fusion: Self-attention over the joint modality sequence, with attention weights:

$$A = \operatorname{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right), \qquad \operatorname{Attention}(Q, V) = AV$$

No explicit cross-attention modules are used; fusion is learned implicitly through standard transformer self-attention.

This design enables the model to align and reason over sketches, images, and text in a unified framework, supporting open-vocabulary generation.
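
A minimal PyTorch sketch of the connector and the implicit fusion, mirroring the two formulas above, is given below; the class name, feature shapes, and the separate Q/K/V linear maps are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalConnector(nn.Module):
    """Two-layer MLP projecting concatenated sketch/image CLIP features into the
    LLM embedding space: e = W2 ReLU(W1 [f_s; f_v] + b1) + b2."""

    def __init__(self, clip_dim: int, llm_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(2 * clip_dim, llm_dim)  # W1, b1 applied to [f_s; f_v]
        self.fc2 = nn.Linear(llm_dim, llm_dim)       # W2, b2

    def forward(self, f_s: torch.Tensor, f_v: torch.Tensor) -> torch.Tensor:
        # f_s, f_v: (B, N, clip_dim) sketch and image tokens from the shared CLIP encoder
        return self.fc2(F.relu(self.fc1(torch.cat([f_s, f_v], dim=-1))))

def self_attention_fusion(tokens: torch.Tensor, w_q: nn.Linear, w_k: nn.Linear, w_v: nn.Linear) -> torch.Tensor:
    """Scaled dot-product self-attention over the joint [E_sketch; E_image; E_text]
    sequence; fusion is implicit, with no dedicated cross-attention module."""
    q, k, v = w_q(tokens), w_k(tokens), w_v(tokens)
    attn = torch.softmax(q @ k.transpose(-2, -1) / k.size(-1) ** 0.5, dim=-1)  # A = softmax(QK^T / sqrt(d))
    return attn @ v                                                            # Attention(Q, V) = AV
```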

5. Benchmarking, Metrics, and Quantitative Results

O3SLM is assessed on sketch-based counting, object localization, SBIR, and VQA, using accuracy, mAP, and top-K retrieval metrics:

Table: Sketch-Based Counting Accuracy (%)

| Model                  | Sketchy | QuickDraw! | TU-Berlin† | SketchVCL-C | Avg  |
|------------------------|---------|------------|------------|-------------|------|
| O3SLM-7B               | 41.8    | 33.0       | 50.6       | 48.6        | 43.5 |
| Best open-weight (7B)  | 40.5    | 36.6       | 32.7       | 19.3        | 30.3 |

†: TU-Berlin is zero-shot (unseen during training).

Table: Sketch-Based Object Detection mAP@0.5 (%)

| Model         | Sketchy | QuickDraw! | TU-Berlin† | SketchVCL-C |
|---------------|---------|------------|------------|-------------|
| O3SLM-13B     | 23.7    | 28.1       | 19.2       | 24.8        |
| Best baseline | 11.4    | 10.3       | 11.4       | 11.3        |
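
For reference, mAP@0.5 counts a predicted box as correct when its intersection-over-union (IoU) with a ground-truth box of the sketched category is at least 0.5. A generic IoU helper (not the paper's evaluation code) looks like:

```python
def iou(box_a, box_b) -> float:
    """IoU of two boxes in (x1, y1, x2, y2) format; a detection is a true
    positive at the 0.5 threshold when iou(pred, gt) >= 0.5."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)                 # overlap area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)               # intersection / union
```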

Table: SBIR Top-K Accuracy on Sketchy

| Model    | Acc@1 | Acc@5 | Acc@10 |
|----------|-------|-------|--------|
| LLaVA-7B | 11.0  | 14.4  | 13.0   |
| O3SLM-7B | 65.0  | 59.2  | 39.4   |

O3SLM outperforms current open-weight LVLMs (LLaVA-7B and 13B, Qwen-VL, DeepSeek-VL2, Molmo) across all tasks. On zero-shot TU-Berlin sketches, O3SLM maintains robust generalization, reaching 50.6% counting accuracy versus 32.7% for the best open-weight alternative. Fine-grained SBIR that leverages both sketch and textual cues emerges without explicit training, enabling the model to refine retrieval based on complex attributes.
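
The top-K retrieval accuracies above can be computed with a generic routine like the one below; the score-matrix layout is an assumption for illustration, not the paper's evaluation code.

```python
import numpy as np

def top_k_accuracy(scores: np.ndarray, gt_index: np.ndarray, k: int) -> float:
    """scores:   (num_queries, gallery_size) relevance score of each gallery image per sketch query.
    gt_index: (num_queries,) index of the matching gallery image for each query.
    Returns the fraction of queries whose true match appears among the top-k ranked images."""
    topk = np.argsort(-scores, axis=1)[:, :k]          # indices of the k highest-scoring gallery images
    hits = (topk == gt_index[:, None]).any(axis=1)     # did the ground-truth image make the top-k?
    return float(hits.mean())
```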

6. Significance, Implications, and Outlook

SketchVCL addresses a key gap in open-vocabulary, multi-modal learning for LVLMs by providing a large-scale, systematically constructed dataset featuring hand-drawn sketches, aligned images, and rich language instructions. Automated sketch synthesis and prompting enable scalable data production, while explicit curriculum learning yields robust and generalizable sketch reasoning. The demonstrated gains in counting, localization, SBIR, and VQA, together with strong zero-shot transfer, suggest that the tri-modal learning paradigm embodied by SketchVCL and O3SLM is essential for next-generation LVLM applications requiring abstract concept interpretation, and they set a clear benchmark for future research (Gupta et al., 18 Nov 2025).
