
SketchVCL: Multi-View & LVLM Dataset

Updated 25 November 2025
  • SketchVCL is a dual-mode dataset offering detailed multi-view sketch correspondences and large-scale sketch–image–instruction triplets for advanced vision-language modeling.
  • It provides 6,852 rendered sketches from 587 3D models with dense pixel-level annotations and 650,000 triplets covering 965 object categories for varied tasks.
  • Automated pipelines ensure consistent quality in semantic correspondence, object detection, counting, and sketch-based retrieval, facilitating robust LVLM training.

SketchVCL refers to two distinct datasets created for large-scale research in sketch understanding, vision-language modeling, and multi-view geometric correspondence. The earlier instance, "SketchVCL Multi-View" (a multi-view line-drawing correspondence dataset), was introduced in SketchDesc (Yu et al., 2020) and targets pixel-level semantic correspondence across multi-view sketches rendered from 3D objects. The name was subsequently reused for a massively larger multimodal collection of sketch–image–instruction triplets for pretraining, instruction tuning, and evaluation of Large Vision-Language Models (LVLMs), as exemplified by O3SLM (Gupta et al., 18 Nov 2025). Both datasets address the lack of large-scale, standardized resources linking the sketch modality (whether synthetic, rendered from 3D, or hand-drawn) to concrete semantic, geometric, and language supervision signals.

1. Dataset Design and Scope

The multi-view SketchVCL is constructed for benchmarking multi-view sketch correspondence learning. It consists of 6,852 raster sketches generated as line drawings from 587 3D models drawn from three public sources: Structure-Recovery [Shen et al. 2012], the Princeton Segmentation Benchmark (PSB) [Chen et al. 2009], and ShapeNet [Yi et al. 2016]. Each shape is rendered from 11 or 12 viewpoints sampled over the upper hemisphere, producing high-resolution binary edge maps (480×480 px) with 5–10% line-pixel density. Semantic correspondences are annotated at the pixel level by projecting visible mesh vertices across views, yielding 28,000–60,000 paired correspondences per shape, or roughly 20–30 million correspondences in total. The annotations provide dense, metric-learning-ready supervision for patch-level descriptor training.

The modern "SketchVCL" comprises approximately 650,000 image–sketch–instruction triplets built over 3.7 million images and 32 million instance-level sketch generation outputs. Covering 965 object categories (600 from OpenImages, 365 from Object365), its pretraining split is balanced across Object365 and OpenImages (300,000 triplets each), with an additional ≈50,000 COCO-based triplets for instruction tuning. Task-specific splits exist for object detection (110,000), visual question answering (50,000), counting (30,000), and sketch-based image retrieval (25,000), with an instruction-tuning pool totaling about 215,000. This dataset unifies synthetic pipeline sketches and hand-drawn exemplars (from Sketchy, QuickDraw!, and Tu-Berlin), enforcing balanced coverage via CLIP-embeddings for taxonomic alignment.

2. Data Generation and Annotation Protocols

Multi-View Sketch Generation

Each 3D mesh is first normalized and oriented upright. Camera viewpoints are distributed on the upper hemisphere at elevations of 15°–45° and azimuths in 15° or 30° intervals to create uniformly spaced sketches. For each viewpoint, a normal map is rendered, followed by Canny edge detection and hidden-line removal; resulting binary edge maps are standardized at 480×480 pixels.
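As a rough illustration of the viewpoint sampling and edge-map step, a minimal sketch follows. The exact renderer, Canny thresholds, and hidden-line-removal details are not specified in the source; `render_normal_map` would stand in for whatever mesh renderer is used, and a single 30°-elevation ring of 12 azimuths is assumed purely for illustration.

```python
# Hypothetical sketch of the viewpoint sampling and edge-map extraction described above.
import cv2
import numpy as np

def ring_viewpoints(elevation_deg: float = 30.0, azimuth_step_deg: int = 30):
    """One ring of upper-hemisphere viewpoints: 12 views at a 30-degree azimuth step (illustrative)."""
    return [(elevation_deg, az) for az in range(0, 360, azimuth_step_deg)]

def normal_map_to_sketch(normal_map: np.ndarray, size: int = 480) -> np.ndarray:
    """Turn a rendered normal map (BGR uint8) into a 480x480 binary edge map via Canny."""
    gray = cv2.cvtColor(normal_map, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)                                   # thresholds are illustrative
    edges = cv2.resize(edges, (size, size), interpolation=cv2.INTER_NEAREST)
    return (edges > 0).astype(np.uint8) * 255                          # binary sketch: white lines on black
```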

For pixel-level annotation, mesh vertices are sampled and projected to 2D only if visible (passing the depth test) in at least two views. For every ordered view pair (i, j), lists of correspondences Corrᵢ⟶ⱼ = { (uᵢ, vᵢ, uⱼ, vⱼ) } are recorded. Data is stored in per-shape, per-view-pair CSVs adhering to a conventional (u, v) coordinate system.
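A minimal sketch of how such per-view-pair CSVs could be produced is shown below; `project_uv` and `is_visible` are placeholders for the camera projection and depth-test routines, which the source does not spell out.

```python
# Illustrative writer for cross-view correspondence records (u_i, v_i, u_j, v_j).
import csv

def write_correspondences(vertices, cam_i, cam_j, out_csv, project_uv, is_visible):
    """Record pixel correspondences for vertices visible (depth test) in both views i and j."""
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        for v in vertices:
            if is_visible(v, cam_i) and is_visible(v, cam_j):
                u_i, v_i = project_uv(v, cam_i)   # 2D location in view i
                u_j, v_j = project_uv(v, cam_j)   # 2D location in view j
                writer.writerow([u_i, v_i, u_j, v_j])
```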

Sketch–Image–Instruction Triplet Construction

The pipeline for O3SLM's SketchVCL begins by masking out backgrounds using SAM2 for any instance-segmented object in a natural image, then generating a coarse sketch representation via a Pix2Pix-based Photo2Sketch model, and finally adding high-frequency edge detail with morphological gradients. Approximately 33 million sketches were generated over all splits (19 million for Object365, 14 million for OpenImages).
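A hedged sketch of the final high-frequency detail step is given below, assuming OpenCV-style morphological gradients; the SAM2 masking and the Pix2Pix-based Photo2Sketch model are treated as upstream black boxes, and the threshold is illustrative.

```python
# Minimal sketch: add morphological-gradient edge detail to a coarse Photo2Sketch output.
import cv2
import numpy as np

def add_edge_detail(masked_image: np.ndarray, coarse_sketch: np.ndarray,
                    kernel_size: int = 3) -> np.ndarray:
    """Overlay a high-frequency edge map onto a coarse sketch of the same size (both uint8)."""
    gray = cv2.cvtColor(masked_image, cv2.COLOR_BGR2GRAY)
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    gradient = cv2.morphologyEx(gray, cv2.MORPH_GRADIENT, kernel)   # dilation minus erosion
    detail = (gradient > 40).astype(np.uint8) * 255                 # illustrative binarization
    return np.maximum(coarse_sketch, detail)                        # union of coarse and fine strokes
```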

Textual instructions are model-generated: captions from DeepSeek-VL2 are templated with LLaMA-3-8B, specifying sketch identification, object description, scene context, and normalized bounding box coordinates. For instruction-tuning, different prompt pools (object detection, counting, VQA, SBIR) are generated using DeepSeek-VL2, ShareGPT4V, and hand-curated templates, sampled with paraphrastic diversity to mitigate prompt overfitting. No manual annotation is applied; all quality control is automated via segmentation and statistical balancing (extra sampling for tail classes with <5,000 instances).
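The tail-class balancing rule mentioned above (extra sampling for classes with fewer than 5,000 instances) could look roughly like the following; the specific strategy (uniform oversampling with replacement) is an assumption, not the documented procedure.

```python
# Hedged sketch of statistical balancing: oversample tail classes up to a per-class floor.
import random
from collections import defaultdict

def balance_tail_classes(records, min_per_class=5000, seed=0):
    """records: iterable of dicts with a 'category' key; returns a class-balanced list."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for r in records:
        by_class[r["category"]].append(r)
    balanced = []
    for cat, items in by_class.items():
        balanced.extend(items)
        deficit = min_per_class - len(items)
        if deficit > 0:                                  # tail class: sample extra copies
            balanced.extend(rng.choices(items, k=deficit))
    return balanced
```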

3. Structure, Formats, and Access Patterns

The multi-view SketchVCL is organized as follows:

Folder/File | Contents | Format/Notes
images/ | 6,852 per-view binary edge maps | 480×480 PNG, organized per shape
correspondences/ | Dense per-view-pair pixel correspondences | CSV: (u₁, v₁, u₂, v₂)
splits/ | Train/val/test partitions (no shape/view leakage) | Text file lists
meta.json, README | Metadata and documentation | Category names, view info
LICENSE | License terms | CC-BY-NC 4.0 (dataset), MIT (code)
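An illustrative loader for this layout is sketched below, assuming the CSV columns are (u₁, v₁, u₂, v₂) as listed in the table; the exact file-naming convention should be checked against the released README.

```python
# Minimal sketch: read one view-pair correspondence CSV and one rendered sketch.
import csv
import cv2

def load_correspondences(csv_path):
    """Return a list of (u1, v1, u2, v2) integer tuples from a per-view-pair CSV."""
    with open(csv_path, newline="") as f:
        return [tuple(int(float(x)) for x in row) for row in csv.reader(f)]

def load_sketch(png_path):
    """Load one 480x480 binary edge map as a grayscale array."""
    return cv2.imread(png_path, cv2.IMREAD_GRAYSCALE)
```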

The LVLM-era SketchVCL consists of flat triplet records, each linking a sketch, a photorealistic instance image, and an accompanying instruction prompt. SketchMIX provides a balanced pool per class (≥200 exemplars per class), combining synthetic and hand-drawn sources.

Data partitioning includes a pretraining split, tuning splits for four downstream tasks, and tailored pools per task. Metadata (category/taxonomy, bounding box, mask) is encoded to enable seamless integration into LVLM workflows.
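A hypothetical record structure reflecting the fields described above (sketch, instance image, instruction, plus category/bounding-box/mask metadata) is sketched here; the field names are illustrative, not the released schema.

```python
# Illustrative triplet record for the LVLM-era SketchVCL (field names are assumptions).
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SketchVCLTriplet:
    sketch_path: str                 # instance-level sketch
    image_path: str                  # photorealistic source image
    instruction: str                 # templated language prompt
    category: str                    # taxonomy label
    bbox: List[float]                # normalized bounding box coordinates
    mask_path: Optional[str] = None  # instance segmentation mask, if present
```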

4. Benchmark Protocols and Evaluation

For multi-view correspondence (Yu et al., 2020), the annotated correspondences enable training of local patch descriptors, e.g., with networks enforcing a margin $m = 1$ in the embedding space such that $D_m(\mathbf{p}_a, \mathbf{p}_+) + m \leq D_m(\mathbf{p}_a, \mathbf{p}_-)$, where $\mathbf{p}$ denotes patch embeddings. Standard evaluation applies metric-learning tasks and cross-view matching.
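The margin constraint above corresponds to a standard triplet hinge loss; a minimal sketch, assuming L2 distances between patch embeddings, is given here (a generic formulation, not necessarily the exact SketchDesc objective).

```python
# Minimal triplet margin loss over batches of patch embeddings (B, D).
import torch

def triplet_margin_loss(anchor, positive, negative, margin: float = 1.0):
    """Penalize violations of D(p_a, p_+) + m <= D(p_a, p_-)."""
    d_pos = torch.norm(anchor - positive, dim=1)   # distance to matching patch
    d_neg = torch.norm(anchor - negative, dim=1)   # distance to non-matching patch
    return torch.clamp(d_pos + margin - d_neg, min=0.0).mean()
```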

The LVLM-oriented SketchVCL establishes evaluation protocols for multiple sketch-centric tasks:

  • Object Detection: Acc, AP@0.5, mAP, and AR@0.5, with category-wise and size-based breakdowns.
  • Counting: Accuracy, computed as $\mathrm{Accuracy} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}[\hat{c}_i = c_i]$ (see the sketch after this list).
  • SBIR: Binary cross-entropy for yes/no sketch-to-image retrieval; argmax retrieval for ranking.
  • VQA: Multi-turn comprehension via ShareGPT4V and LLaMA-3-driven prompts.
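The counting metric from the list above reduces to exact-match accuracy over predicted counts; a minimal sketch:

```python
# Counting accuracy: fraction of examples where the predicted count equals the ground truth.
def counting_accuracy(pred_counts, true_counts):
    """pred_counts, true_counts: parallel sequences of integer counts."""
    assert len(pred_counts) == len(true_counts) and len(true_counts) > 0
    hits = sum(int(p == t) for p, t in zip(pred_counts, true_counts))
    return hits / len(true_counts)
```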

Recommended training practice is to pretrain on the full Stage I split to align all modalities, with subsequent downstream fine-tuning on both the multimodal projector (e.g., LoRA on CLIP→LLM) and LLM head. Prompt template randomization is encouraged for robustness.
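The prompt-template randomization mentioned above could be implemented along the following lines; the template strings are invented placeholders, not the released prompt pools.

```python
# Hedged sketch of per-example prompt-template randomization for instruction tuning.
import random

DETECTION_TEMPLATES = [
    "Locate every {category} instance that matches this sketch.",
    "Given the sketch of a {category}, return bounding boxes for all matching objects.",
    "Find the objects drawn in the sketch ({category}) and report their coordinates.",
]

def sample_prompt(category: str, templates=DETECTION_TEMPLATES, rng=None) -> str:
    """Pick a paraphrase at random so the model does not overfit to one wording."""
    rng = rng or random.Random()
    return rng.choice(templates).format(category=category)
```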

5. Comparative Significance and Contributions

SketchVCL's multi-view correspondence benchmark supersedes prior efforts by providing exhaustive dense correspondences over rendered sketches rather than schematic silhouettes or sparse keypoint matches. The LVLM-era SketchVCL far exceeds prior hand-drawn sketch datasets in scale, such as TU-Berlin (20,000 sketches), Sketchy (75,000), and QuickDraw! (50 million drawings without paired images), by offering over 650,000 triplets, 32 million instance-level sketches, and paired photorealistic imagery with rich, automated language annotations. Unique properties include:

  • Instance-level, not just class-level, sketch-image pairs
  • Paired segmentation masks (not provided in most prior benchmarks)
  • Multi-format instructions for flexible downstream training
  • Automated, scalable pipeline—no crowdsourcing or manual curation

This enables unified pretraining and instruction-tuning across object detection, counting, SBIR, and multimodal VQA, facilitating the development and unbiased evaluation of sketch-aware LVLMs (Gupta et al., 18 Nov 2025).

6. Access, License, and Use Recommendations

Dataset access is provided through project repositories and academic homepages. The original SketchVCL (multi-view) is available via https://cgm.cityu.edu.hk/projects/SketchDesc/SketchVCL.zip and is released under CC-BY-NC 4.0 (dataset) and MIT (code). The LVLM-era SketchVCL is distributed with full curation code and typically under a comparable open license for non-commercial research.

Preprocessing is minimal: all images are 480×480 (or as standard in the synthetic pipeline). For descriptor training, patches are multi-scale extracted and normalized (zero-mean, unit-variance, as in L2-Net). Two patch sampling modes—AND-sampling and OR-sampling—are defined for training flexibility.
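A minimal sketch of the per-patch normalization mentioned above (zero mean, unit variance, as popularized by L2-Net) is given here; multi-scale extraction and the AND-/OR-sampling modes are out of scope.

```python
# Standardize one grayscale patch to zero mean and unit variance before descriptor training.
import numpy as np

def normalize_patch(patch: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Return the zero-mean, unit-variance version of a single patch."""
    patch = patch.astype(np.float32)
    return (patch - patch.mean()) / (patch.std() + eps)
```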

Recommended practice is to obey the provided data splits to avoid shape and view leakage, fine-tune both multimodal projectors and LLM heads, and randomize prompts across training epochs. Tail-class balancing and class-wise minimum coverage are systematically enforced.

SketchVCL thus constitutes a cornerstone resource for research in sketch-based correspondence, retrieval, and vision-language modeling (Yu et al., 2020, Gupta et al., 18 Nov 2025).
