
SketchVL: Multimodal Sketch-Based Reasoning

Updated 16 January 2026
  • SketchVL is a family of multimodal models that leverage hand-drawn sketches to externalize intermediate visual reasoning, enabling improved cross-modal alignment.
  • It employs architectures like SVANet that bridge the gap between schematic queries and photorealistic video through cross-modal transformers and token-based matching.
  • In chart understanding, SketchVL uses fine-grained reinforcement learning with visible sketch actions for precise credit assignment, leading to notable accuracy gains.

SketchVL refers to a family of multimodal models and frameworks that leverage sketch-based or sketch-assisted reasoning for complex visual tasks. The term has emerged in independent streams of research addressing two main domains: (i) sketch-based video object localization (Woo et al., 2023), and (ii) fine-grained credit assignment in chart understanding via reinforcement learning (Huang et al., 9 Jan 2026). These works converge on the principle of externalizing intermediate visual reasoning steps as sketches or annotations, fostering enhanced alignment between modalities and improved reasoning performance.

1. Motivation and Problem Domains

Two principal lines of inquiry define SketchVL. In video object localization, the Sketch-based Video Object Localization (SVOL) task seeks to localize spatio-temporal object boxes in video using a free-hand sketch as the query, rather than a conventional class label or textual description. The central challenge is bridging the considerable domain gap between schematic, style-varying hand-drawn sketches and photorealistic video content, while simultaneously managing temporal dynamics and multi-object scenarios (Woo et al., 2023).

Conversely, in chart and diagram reasoning, existing multimodal LLMs (MLLMs) exhibit brittle performance, largely attributable to coarse-grained reinforcement learning signals that fail to assign credit or blame to specific token-level reasoning steps. This is especially problematic in high-density visual reasoning settings, such as chart question answering, where a sequence of sub-decisions is required and a single mistake can have cascading effects (Huang et al., 9 Jan 2026). SketchVL addresses this by introducing visible sketch actions at each step and by optimizing for fine-grained credit assignment.

2. Methodological Frameworks

2.1 Sketch-based Video Object Localization

Let $s \in \mathbb{R}^{H_s \times W_s}$ denote a sketch query, and $V = \{f_t\}_{t=1}^T$, $f_t \in \mathbb{R}^{3 \times H \times W}$, a $T$-frame RGB video. At each frame $t$, $K_t$ ground-truth bounding boxes $B_t = \{b_t^j\}_{j=1}^{K_t}$ are specified, with $b_t^j$ parameterized in normalized coordinates. The objective is to predict, for every frame $t$, a set of $M$ proposal boxes $\hat{B}_t = \{\hat{b}_t^i\}_{i=1}^M$ and associated objectness scores $\hat{o}_t^i \in [0,1]$, such that for each ground-truth box there exists at least one predicted box with $\mathrm{IoU}(b_t^j, \hat{b}_t^i) \ge \mu$. The evaluation metrics are Recall@$k$ at $\mathrm{IoU} \ge \mu$ and mean IoU (mIoU) across matched pairs (Woo et al., 2023).
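The matching criterion and metrics above reduce to a few lines of box arithmetic. The following is a minimal sketch; the function names are illustrative, not from the paper:

```python
# Minimal sketch of the SVOL matching criterion and Recall@k metric described
# above. Boxes are (x1, y1, x2, y2) in normalized coordinates; names are
# illustrative, not from the paper.

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def recall_at_k(gt_boxes, pred_boxes, scores, k, mu=0.5):
    """Fraction of ground-truth boxes hit (IoU >= mu) by the k highest-scoring
    predictions in one frame."""
    ranked = [b for _, b in sorted(zip(scores, pred_boxes), reverse=True)[:k]]
    hits = sum(any(iou(g, p) >= mu for p in ranked) for g in gt_boxes)
    return hits / len(gt_boxes) if gt_boxes else 1.0
```

mIoU is then the mean of `iou` over matched prediction/ground-truth pairs across all frames.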

2.2 Fine-Grained Chart Understanding with RL

Chart understanding models like SketchVL use a sketch-on-image interface within a multi-step reasoning process. At each step, the model outputs (i) an intent—a textual description of what to mark on an image, and (ii) an action—a drawing primitive with normalized coordinates (such as point, rectangle, or line). The model then renders the annotation on the image, which is recursively fed back to itself for subsequent reasoning steps. This process externalizes the reasoning trajectory as a visible chain of sketches, culminating in the final answer (Huang et al., 9 Jan 2026).
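The loop described above can be sketched as a simple control structure. `model` and `render` below are hypothetical stand-ins for the MLLM and the annotation renderer; the paper's actual interfaces are not specified here:

```python
# Illustrative skeleton of the sketch-on-image reasoning loop. `model` and
# `render` are hypothetical stand-ins for the MLLM and overlay renderer.

def reasoning_loop(model, render, image, question, max_steps=8):
    """Iterate intent/action emission until the model produces a final answer."""
    trajectory = []
    for _ in range(max_steps):
        intent, action, answer = model(image, question, trajectory)
        if answer is not None:           # the model chose to answer and stop
            return answer, trajectory
        image = render(image, action)    # draw the point/rectangle/line overlay
        trajectory.append((intent, action))
    return None, trajectory              # no answer within the step budget
```

The key design point is that `image` is mutated each turn, so later reasoning steps see the accumulated sketch chain rather than only the raw chart.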

3. Architectural Innovations

3.1 SVANet for SVOL

The Sketch-Video Attention Network (SVANet) architecture comprises:

  • Video Encoder: ResNet-50, producing per-frame feature maps $v^{(0)} \in \mathbb{R}^{T \times C \times H' \times W'}$.
  • Sketch Encoder: ResNet-18 with global pooling, yielding the sketch embedding $s \in \mathbb{R}^C$.
  • Object Tokens: $N = T \cdot M$ learnable tokens $R^{(0)} \in \mathbb{R}^{N \times C}$, partitioned per frame.
  • Cross-Modal Transformer (CMT): Two stacked layers, each with four multi-head attention blocks and MLPs, mediating information exchange via cross-modal and self-attention, as well as content-token cross-attention (CTCA) and sketch-video cross-attention (SVCA). The multi-head attention mechanism follows standard QKV operations with learnable projections and LayerNorm.
  • Detection Heads: $R^{(2)}$ is linearly projected to produce bounding box coordinates $\hat{b}$ and objectness logits $\hat{o}$.
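The object-token design above amounts to a flat pool of $N = T \cdot M$ tokens grouped per frame, which is plain index arithmetic. A toy helper, illustrative only:

```python
# Toy index bookkeeping for the flat pool of N = T * M object tokens,
# partitioned so frame t owns tokens [t*M, (t+1)*M). Illustrative only.

def partition_tokens(T, M):
    """Per-frame index lists into the flat token set of size T * M."""
    return [list(range(t * M, (t + 1) * M)) for t in range(T)]
```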

The per-frame set matching and training loss use the Hungarian algorithm to align the $M$ predicted tokens with the $K_t$ ground-truth boxes (padding with dummy background), employing a composite loss over negative log objectness and a combination of $\ell_1$ and $1 - \mathrm{IoU}$ box regression terms (Woo et al., 2023).
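The per-frame assignment can be sketched with a brute-force search in place of the Hungarian algorithm (same optimum, only practical for small $M$). The cost weights below are illustrative, not the paper's:

```python
# Toy per-frame set matching: brute force over assignments stands in for the
# Hungarian algorithm (same optimum, feasible only for small M). The cost
# combines l1 box distance and (1 - IoU); the weights are illustrative.
from itertools import permutations

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def match_frame(preds, gts, iou_fn, w_l1=1.0, w_iou=1.0):
    """Return the prediction index assigned to each ground-truth box,
    minimizing the composite box-regression cost."""
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(
            w_l1 * l1(preds[i], gts[j]) + w_iou * (1.0 - iou_fn(preds[i], gts[j]))
            for j, i in enumerate(perm)
        )
        if cost < best_cost:
            best, best_cost = perm, cost
    return best, best_cost
```

In practice a polynomial-time solver (e.g. `scipy.optimize.linear_sum_assignment`) replaces the exhaustive search; the objective is the same.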

3.2 SketchVL with FinePO for Chart QA

The SketchVL workflow integrates a standard MLLM backbone with:

  • Sketch-on-Image Module: Accepts and renders intent/action pairs as image overlays.
  • Iterative Multistep Loop: Each turn updates the image state, which is then re-ingested for further steps.
  • FinePO Algorithm: Policy optimization procedure that uses a Fine-Grained Process Reward Model (FinePRM) to assign discrete step-wise scores $p_j \in \{1, 2, 3, 4\}$ (Unacceptable to Excellent) for each reasoning step. Trajectory-level advantages $A(y_i)$ are redistributed to the step level $A(s_j)$ by computing intra-trajectory deviations and scaling to match the reward magnitude. The resulting policy-gradient update is applied per action and regularized toward the base model with a KL term, preserving action diversity (Huang et al., 9 Jan 2026).
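One way to read the step-level redistribution is as spreading the trajectory advantage by each step's deviation from the trajectory's mean process score. The paper's exact scaling rule may differ, so this is a hedged sketch:

```python
# Hedged sketch of step-level credit redistribution in the spirit of FinePO:
# the trajectory advantage A(y) is shifted per step by that step's deviation
# from the trajectory's mean FinePRM score. The paper's exact scaling rule
# may differ; this only illustrates the idea.

def redistribute(traj_advantage, step_scores, scale=1.0):
    """step_scores: FinePRM grades p_j in {1, 2, 3, 4}, one per reasoning step."""
    mean = sum(step_scores) / len(step_scores)
    return [traj_advantage + scale * (p - mean) for p in step_scores]
```

By construction the step advantages average back to the trajectory advantage, so well-graded steps are pushed above it and poorly graded steps below it, which is what gives the fine-grained learning signal.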

4. Training Procedures and Implementation

Both approaches rely on a two-phase training paradigm.

SVANet (SVOL):

  • Trained on videos from ImageNet-VID aligned with sketch queries from Sketchy, TU-Berlin, and QuickDraw datasets, focusing on overlapping categories.
  • Key architectural hyperparameters established via ablation: $M = 10$ tokens per frame, $T = 32$ frames, two CMT layers.

SketchVL (FinePO):

  • Supervised cold-start pretraining on 50K samples (distilled from EvoChart, GQA, ChartQA-Train).
  • RL fine-tuning with FinePO on 9K prompts from ChartQA and related datasets, with FinePRM trained on ~473K intent–action pairs, labeled by a blend of distillation, segmentation, and noise perturbation.
  • Core RL hyperparameters include $k = 24$ candidates per prompt, $\lambda_{\mathrm{KL}} = 0.1$, $\gamma = 0.5$, $\alpha = 0.2$, $\beta = 2.0$, and a learning rate of $10^{-6}$.

The FinePO/RL phase is performed on 16×NVIDIA A800 GPUs using the ms-swift RL framework. The process reward regularization prevents collapse to trivial policies.

5. Empirical Results and Analysis

5.1 Quantitative Performance

SVANet on SVOL (mIoU, %):

| Method | Sketchy | TU-Berlin | QuickDraw |
|--------|---------|-----------|-----------|
| CMA [ECCV'20] | 19.8 | 18.6 | 21.5 |
| Sketch-DETR | 26.1 | 26.2 | 28.6 |
| SVANet | 33.8 | 30.9 | 33.4 |

SVANet reports an absolute mIoU gain of +7.7 (Sketchy), +4.7 (TU-Berlin), and +4.8 (QuickDraw) over the best prior. On combined datasets, mIoU is 30.6 (SVANet) versus 24.3 (Sketch-DETR) (Woo et al., 2023).

Ablations show that per-frame Hungarian matching confers a +11.4% mIoU benefit, that content and token self-attention layers each add 1–2%, and that tuned values of $M$ and $T$ yield further improvements.

SketchVL (FinePO) on Chart Benchmarks:

  • Absolute accuracy gains of +7.23% on average over Qwen2.5VL base model across chart and vision-language testbeds.
  • Notable metrics: EvoChart-QA (54.80 → 58.64), ChartQA (82.00 → 83.96), ChartBench (64.78 → 65.11), MathVista (61.40 → 63.50), MMStar (56.67 → 57.13).
  • The 3B parameter variant demonstrates even larger relative improvements on ChartQA (61.88 → 77.20).
  • PlotQA performance is an outlier (drop from 63.44 to 55.84), attributed to recall tolerance differences (Huang et al., 9 Jan 2026).

Ablation studies confirm the necessity of RL, fine-grained credit assignment, KL regularization, and sketch action supervision. Disabling sketches ("zero GRPO") causes a catastrophic –20 point drop.

5.2 Qualitative Insights

Visualization reveals that FinePRM accurately assesses localized sketch actions, with heatmaps confirming correct region attribution aligned to model intent. The visible, step-wise annotation chain improves both interpretability and the learning signal (Huang et al., 9 Jan 2026).

6. Transferability, Limitations, and Outlook

Domain Transfer: SVANet demonstrates robust transfer to new sketch styles and categories without explicit retraining:

  • Style transfer (trained on Sketchy, tested on TU-Berlin/QD): 49.0% mIoU (TU), 49.7% (QD) for SVANet vs 36.8/40.2% for Sketch-DETR.
  • Category transfer (zero-shot): SVANet attains 29.9% mIoU on unseen classes, outperforming Sketch-DETR (22.5%) (Woo et al., 2023).

Algorithmic Strengths: The visible reasoning process (sketch chain) in SketchVL yields interpretable, step-resolved feedback. FinePO provides low-noise, sharply targeted reinforcement, stabilizing policy learning and mitigating reward hacking (Huang et al., 9 Jan 2026).

Limitations: Reliance on large, high-quality manually or distillation-labeled datasets for training FinePRM constrains scalability. There is no standard benchmark for evaluating process-level reward models, complicating cross-study comparison.

Future Directions: The extension of fine-grained RL to broader multimodal settings—including diagram VQA and arbitrary image reasoning—forms a central research trajectory. Unsupervised or weakly supervised learning of process reward models and dynamic, contextually adaptive action spaces are identified as promising directions (Huang et al., 9 Jan 2026).

7. Contextualization in the Multimodal Literature

SketchVL (in both SVOL and chart QA settings) exemplifies a shift toward explicit, visually-grounded intermediate representations, either via cross-modal attention tokens or via iterative, sketch-based Markov decision processes. Both approaches eschew class-level supervision in favor of set-based or sequential policies, considerably expanding the expressivity and alignment of multimodal reasoning models. This signals a growing recognition that externalized, interpretable sub-decisions, coupled with fine-grained supervision and matching, are critical for robust, generalizable visual reasoning.
