Sketch-in-Latents (SkiLa) in Multimodal AI
- Sketch-in-Latents (SkiLa) denotes a family of architectures that unify discrete textual tokens and continuous latent sketch tokens for integrated visual and language reasoning.
- The framework uses autoregressive transformers to alternate seamlessly between a text mode and a sketch mode, giving generated outputs explicit spatial structure.
- Empirical results demonstrate improved spatial reasoning and visual fidelity, with consistent gains over conventional multimodal methods on reasoning and generation benchmarks.
Sketch-in-Latents (SkiLa) refers to a family of architectures and training protocols that enable artificial intelligence models, particularly multimodal LLMs (MLLMs) and diffusion generative models, to natively integrate and reason with both textual and visual "sketch" information in the latent space. This paradigm eliminates the need for explicit external visual toolkits or pixel-domain sketches during inference. Instead, SkiLa frameworks internally generate or align high-dimensional continuous latent features to serve as "visual thoughts," providing precise spatial structure for reasoning or generation. Contemporary works under this umbrella include unified reasoning in MLLMs (Tong et al., 18 Dec 2025), autoregressive visual sketchpads for spatial reasoning (Zhang et al., 28 Oct 2025), and training-free sketch-conditioned diffusion pipelines (Ding et al., 31 Aug 2024).
1. Foundational Paradigm and Motivation
SkiLa was introduced in response to the limitations of conventional MLLMs and T2I diffusion models in scenarios requiring visually grounded reasoning or user-driven structure control. While MLLMs have demonstrated proficiency in visual recognition and text-driven reasoning, they frequently struggle with visual imagination, spatial planning, and making stepwise structural thinking explicit. Conversely, diffusion-based T2I models, despite high-fidelity synthesis, historically lack interfaces for precise structural control beyond textual prompts.
SkiLa addresses these gaps via two central principles:
- Incorporating explicit visual sketch latents directly into the reasoning or generative loop, alternating seamlessly with text tokens or textual prompts.
- Binding these latents to human-interpretable semantics via reconstruction or cross-attention alignment objectives, ensuring the resulting internal sketches are structurally grounded.
This paradigm allows models to emulate human-like visual thinking: imagining, sketching, and planning in the latent space without reliance on external rendering or third-party toolkit integration (Tong et al., 18 Dec 2025, Zhang et al., 28 Oct 2025).
2. General Model Architecture and Tokenization
The core innovation of SkiLa frameworks is the unification of discrete and continuous token streams within autoregressive transformer-based architectures. The primary token types are:
- Textual think tokens: $t_i \in \mathcal{V}$, drawn from the standard model vocabulary and embedded via the usual embedding matrix $E$.
- Latent sketch tokens: $s_j \in \mathbb{R}^d$, continuous high-dimensional vectors generated directly by the transformer, with no discrete codebook.
- Special control tokens: markers such as `<|sketch_start|>` and `<|sketch_end|>` orchestrate mode switching between textual and sketch token production.
The autoregressive process yields mixed sequences $y_{1:N} = (y_1, \dots, y_N)$, where each $y_i$ may be sampled from the vocabulary in text mode or deterministically output as the hidden state $h_i$ in sketch mode, with $h_i = f_\theta(y_{<i})$ denoting the transformer's state evolution (Tong et al., 18 Dec 2025).
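To make the two generation mechanisms concrete, the following is a minimal PyTorch sketch of a single decoding step; the LM-head name `W_out` and the 1-D tensor shapes are illustrative assumptions, not the reference implementation.

```python
import torch

def decode_step(h_last: torch.Tensor, mode: str, W_out: torch.Tensor):
    """One autoregressive step: `h_last` is the transformer's final hidden
    state, shape (d_model,); `W_out` is a hypothetical LM head, (V, d_model)."""
    if mode == "text":
        # Text mode: project to vocabulary logits and sample a discrete token id.
        probs = torch.softmax(h_last @ W_out.T, dim=-1)
        return torch.multinomial(probs, num_samples=1).item()
    # Sketch mode: the hidden state itself is emitted as the continuous
    # latent sketch token -- no codebook, no sampling.
    return h_last
```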
Table 1: SkiLa Token Types
| Token Type | Domain | Generation Mechanism |
|---|---|---|
| Text token ($t_i$) | Discrete vocabulary $\mathcal{V}$ | Softmax + sampling |
| Sketch token ($s_j$) | Continuous space $\mathbb{R}^d$ | Deterministic hidden-state mapping |
This direct coupling of text and continuous sketch sequences enables unified inference, where the model can alternate between verbal reasoning and internal visual sketching within a single latent space (Zhang et al., 28 Oct 2025, Tong et al., 18 Dec 2025).
3. Latent Sketch Token Generation and Visual Semantics Reconstruction
Unlike discrete image tokenization or explicit pixel-space rendering, SkiLa produces visual thoughts as sequences of continuous high-dimensional vectors. During sketch mode, the model emits $K$ consecutive hidden states, each serving as a "latent sketch token," which collectively encode the internal visual plan or structure.
Semantic grounding is enforced via a latent visual semantics reconstruction loss: given a ground-truth sketch (either as a user input or from data), a frozen encoder and task-specific MLP extract target visual tokens $\hat{s}_1, \dots, \hat{s}_K$; the model then minimizes

$$\mathcal{L}_{\text{rec}} = \frac{1}{K} \sum_{j=1}^{K} \left\lVert s_j - \hat{s}_j \right\rVert_2^2,$$

either interleaved in the full generative process or in parallel with the language cross-entropy:

$$\mathcal{L} = \mathcal{L}_{\text{text}} + \lambda\, \mathcal{L}_{\text{rec}},$$

where $\lambda$ controls reconstruction strength (typically 0.5 for latent tokens) (Tong et al., 18 Dec 2025).
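The combined objective can be sketched as follows; the MSE form of the reconstruction term and the helper names (`frozen_encoder`, `proj_mlp`) are assumptions for illustration rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def skila_loss(text_logits, text_targets, sketch_latents, gt_sketch,
               frozen_encoder, proj_mlp, lam: float = 0.5):
    """Language cross-entropy plus latent sketch reconstruction.
    `frozen_encoder` (e.g., SigLIP2) and `proj_mlp` are hypothetical handles."""
    # Targets: visual tokens extracted from the ground-truth sketch.
    with torch.no_grad():
        targets = frozen_encoder(gt_sketch)    # (K, d_enc), encoder is frozen
    targets = proj_mlp(targets)                # (K, d_model), trainable MLP

    # L_rec: distance between generated latents and encoder targets
    # (MSE shown as one plausible choice of reconstruction distance).
    l_rec = F.mse_loss(sketch_latents, targets)

    # L_text: standard next-token cross-entropy over the text positions.
    l_text = F.cross_entropy(text_logits, text_targets)

    return l_text + lam * l_rec                # lam ≈ 0.5 per the paper
```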
For diffusion-guided generation, visual sketches guide cross-attention maps at each diffusion step. An explicit symmetric KL loss aligns the current step's cross-attention map $A_t$ for the class token with the "target" map $A_t^{\ast}$ extracted during the sketch inversion phase:

$$\mathcal{L}_{\text{attn}} = D_{\mathrm{KL}}\!\left(A_t \,\Vert\, A_t^{\ast}\right) + D_{\mathrm{KL}}\!\left(A_t^{\ast} \,\Vert\, A_t\right).$$

Optimization proceeds by normalized gradient descent on the latent to match the denoised trajectory in latent space (Ding et al., 31 Aug 2024).
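The alignment objective and its normalized update can be written compactly as below; the tensor shapes, the flattening over spatial positions, and the epsilon smoothing are assumptions made for this sketch.

```python
import torch

def symmetric_kl(A: torch.Tensor, A_star: torch.Tensor, eps: float = 1e-8):
    """Symmetric KL between the current cross-attention map A and the target
    map A* from sketch inversion; both normalized to sum to one."""
    p = A.flatten() / (A.sum() + eps)
    q = A_star.flatten() / (A_star.sum() + eps)
    kl = lambda a, b: (a * ((a + eps) / (b + eps)).log()).sum()
    return kl(p, q) + kl(q, p)

def normalized_grad_step(z_t: torch.Tensor, loss: torch.Tensor, step: float):
    """One normalized gradient-descent update on the latent z_t
    (`step` is an assumed hyperparameter)."""
    (grad,) = torch.autograd.grad(loss, z_t)
    return z_t - step * grad / (grad.norm() + 1e-8)
```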
4. Unified Inference and Reasoning Workflow
SkiLa’s inference loop dynamically alternates between textual and visual reasoning. This is implemented as:
- Text mode: Model samples from the standard vocabulary via softmax.
- Sketch mode: Model propagates hidden-state transformations, directly outputting latent tokens until an end marker is sampled or a fixed token budget is reached (the $2K$ cap in the pseudocode below).
- Switching mechanism: Discrete special tokens signal transitions between modes.
Pseudocode excerpt:
```
State ← "text"
while not end_of_answer:
    if State == "text":
        # Text mode: project the last hidden state and sample a discrete token.
        next ← sample_text_token(P_softmax(W_out · H_last))
        if next == "<|sketch_start|>":
            State ← "sketch"
    else if State == "sketch":
        # Sketch mode: the new hidden state is itself the emitted latent token.
        next_hidden ← TransformerBlock(H_last, context=Prefix)
        emit latent_token(next_hidden)
        if length_sketch_tokens ≥ 2K or sample_end_marker():
            emit "<|sketch_end|>"
            State ← "text"
    Prefix.append(next or next_hidden)   # whichever was produced this step
    H_last ← last hidden state
```
For diffusion-based pipelines, as in (Ding et al., 31 Aug 2024), the generative phase is split into inversion and generation. During inversion, the sketch is encoded and cross-attention maps are extracted at all denoising steps; during generation, the latent is iteratively optimized toward cross-attention map agreement, then denoised.
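Putting the two phases together, a hedged outline of the loop follows; `ddim_invert`, `cross_attention_maps`, and `denoise_step` are hypothetical stand-ins for the inversion, attention-extraction, and denoising components (not a real API), while `symmetric_kl` and `normalized_grad_step` are reused from the sketch in Section 3.

```python
def sketch_guided_generation(sketch_image, prompt, unet, T, step_size):
    # Phase 1 -- inversion: invert the sketch and record the class-token
    # cross-attention map A*_t at every denoising step t.
    z, targets = ddim_invert(sketch_image, prompt, unet, T)

    # Phase 2 -- generation: at each step, nudge the latent so its current
    # cross-attention map agrees with the stored target, then denoise.
    for t in reversed(range(T)):
        z = z.detach().requires_grad_(True)
        A_t = cross_attention_maps(unet, z, prompt, t)  # current map at step t
        loss = symmetric_kl(A_t, targets[t])            # alignment objective
        z = normalized_grad_step(z, loss, step_size)    # align the latent
        z = denoise_step(unet, z, prompt, t)            # standard denoising
    return z  # decode with the VAE to obtain the final image
```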
5. Empirical Results and Analysis
Evaluation across visual reasoning and generative tasks demonstrates:
- Vision-centric and general multimodal tasks: On 8 benchmarks (including MMVP, MMStar, BLINK, HR-4K/8K, and MME-Lite), SkiLa yields significant improvements; for example, the Qwen2.5-VL 7B baseline scores 47.5% overall versus 54.1% overall for SkiLa (unified), a gain of 6.6 points (Tong et al., 18 Dec 2025).
- Ablations: Performance is maximized at specific choices of the sketch-token count and the reconstruction weight $\lambda$; using SigLIP2 as the sketch encoder outperforms CLIP or QwenViT (Tong et al., 18 Dec 2025).
- MazePlanning dataset (spatial reasoning): Gemma3 SR (Success Rate) improves from 70.0% (text-only) to 72.2% (+2.2 points) with SkiLa; GPT-4o PR (Path Rate) improves from 30.7% to 39.8% (+9.1 points) (Zhang et al., 28 Oct 2025).
- Cross-attention-aligned generation: SkiLa preserves arbitrary sketch styles and robust object layout, outperforming plug-and-play, prompt-to-prompt, and trained adapters in structure preservation and visual plausibility (Ding et al., 31 Aug 2024).
6. Limitations and Failure Modes
Several limitations are documented:
- Interpretation of abstract or incomplete sketches: In diffusion-guided versions, inability to “close” open sketches or infer missing structure leads to hallucinated or misaligned outputs (Ding et al., 31 Aug 2024).
- Generalization to out-of-distribution transformations: Rotated or flipped sketches exceeding the model’s training distribution can degrade cross-attention alignment (Ding et al., 31 Aug 2024).
- Long-sequence degradation: Over many reasoning steps, sketches may degrade in visual fidelity or semantic alignment, particularly in models with high-dimensional latents (e.g., Qwen2.5-VL) (Zhang et al., 28 Oct 2025).
- Comparisons with discrete-token approaches: Alternatives relying on discrete image tokens, such as Liquid, exhibit more frequent layout instability and semantic drift (Zhang et al., 28 Oct 2025).
No explicit numerical boundary adherence metrics are reported for generative settings, though IoU and user studies are mentioned as viable (Ding et al., 31 Aug 2024).
7. Future Directions and Extensions
Potential extensions for SkiLa include:
- Partitioned sketch guidance: Multi-token prompts could guide spatially distinct sketch regions in diffusion, facilitating compositional generation (Ding et al., 31 Aug 2024).
- Joint latent and self-attention optimization: Incorporating exemplar-based color/style transfer and learned lightweight adapters may combine training-free and fine-tuned approaches (Ding et al., 31 Aug 2024).
- Robotic/visual planning applications: Latent sketches can encode internal waypoints for autonomous agents, amortizing planning without expensive pixel synthesis (Tong et al., 18 Dec 2025).
- Hybrid tool integration: Generated special tokens may dynamically request external tool invocations (e.g., detectors), thereby enriching the internal visual context (Tong et al., 18 Dec 2025).
- Interactive sketch-based interfaces: Users can interject hand-drawn strokes, which are projected into the latent space for continued reasoning and incremental generation (Tong et al., 18 Dec 2025).
This suggests SkiLa fundamentally reframes multimodal reasoning and generation by tightly coupling latent visual imagination with textual inference in a single space, providing an intrinsic "visual scratchpad" for neural models and setting the stage for further advances in interpretable, structured, and controlled AI systems (Tong et al., 18 Dec 2025, Zhang et al., 28 Oct 2025, Ding et al., 31 Aug 2024).