Visual Reasoning Tokens in Multimodal Models
- Visual Reasoning Tokens are discrete or continuous representations that explicitly encode mid-level visual cues (e.g., depth, segmentation) to enhance reasoning in multimodal architectures.
- Their integration improves cross-modal generalization and performance on spatial, geometric, and counting tasks by providing task-adaptive visual abstractions.
- Implementation leverages techniques like VQ-VAE quantization, slot-based abstractions, latent bottlenecking, and pointer mechanisms for efficient, interpretable visual processing.
Visual Reasoning Tokens are discrete or continuous representations within multimodal transformer models that serve as explicit intermediates for perception and reasoning over visual information. These tokens may encode mid-level abstractions such as depth, bounding boxes, or segmentation, or they may function as latent, task-adaptive summaries of visual input. By making visual abstraction an explicit part of the model’s chain-of-thought, visual reasoning tokens address a long-standing limitation of multimodal models, in which reasoning is otherwise confined to language tokens, coarse global image features, or black-box visual encoders. Their inclusion has enabled sharper cross-modal generalization, end-to-end multi-task tuning, and performance improvements in fine-grained spatial, geometric, counting, and 3D reasoning across a diverse set of benchmarks (Bigverdi et al., 4 Dec 2024, Li et al., 29 Sep 2025, Li et al., 24 Dec 2025, Qin et al., 24 Nov 2025, Chen et al., 5 Jun 2025).
1. Formal Definitions and Taxonomy
Visual reasoning tokens can be categorized along several axes:
- Discrete vs. Continuous: Discrete tokens are quantized (e.g., via VQ-VAE or clustering), enabling compact symbol-like reasoning traces (Bigverdi et al., 4 Dec 2024). Continuous tokens are d-dimensional latent embeddings produced by the model, aligned to feature maps or expert predictions (Qin et al., 24 Nov 2025, Li et al., 29 Sep 2025).
- Intrinsic vs. Extrinsic: Some tokens are produced as intrinsic outputs of the model during autoregressive decoding (“visual CoT” tokens, registered as part of the vocabulary), while others are extracted or injected by external modules (e.g., segmentation experts, semantic proposal generators).
- Semantics: Tokens may represent mid/low-level cues (depth, edge, bounding box, mask), object-centric abstractions (“slots”), or even pointer operations over patch embeddings (“point-and-copy”) (Chung et al., 24 May 2025).
- Explicit vs. Latent: Some frameworks supervise tokens to reconstruct external signals (depth maps, expert masks) (Qin et al., 24 Nov 2025), whereas “latent visual reasoning” and “latent implicit visual reasoning” (Editor’s term) omit direct supervision and train tokens end-to-end as bottlenecks or scratchpads for task-adaptive abstraction (Li et al., 29 Sep 2025, Li et al., 24 Dec 2025, Ray et al., 11 Dec 2025).
2. Key Architectures and Implementation Strategies
Discrete Perception Tokens via VQ-VAE (AURORA)
Aurora extends the multimodal LLM vocabulary from the original text vocabulary to an augmented vocabulary whose additional tokens represent quantized depth, pixel positions, or bounding boxes. Depth maps are encoded, vector-quantized against a learned codebook, and serialized as token sequences. These sequences are generated as an explicit “visual reasoning trace,” and the final answer is then produced conditioned on them. The frozen vision backbone (e.g., CLIP ViT) is augmented with embedding and output heads for the new token types, with LoRA adapters fine-tuned on these components (Bigverdi et al., 4 Dec 2024).
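As an illustration of how such a trace can be produced, the following minimal sketch quantizes a depth map against a VQ-VAE codebook and offsets the resulting code indices into the extended vocabulary; the function and argument names (`depth_encoder`, `codebook`, `base_vocab_size`) are assumptions for this example, not AURORA's actual interfaces.

```python
import torch

def depth_to_perception_tokens(depth_map, depth_encoder, codebook, base_vocab_size):
    """Quantize a depth map into discrete perception-token ids (illustrative sketch).

    depth_map:       (1, 1, H, W) tensor
    depth_encoder:   maps the depth map to a latent grid of shape (1, D, h, w)
    codebook:        (K, D) learned VQ-VAE embedding table
    base_vocab_size: size of the original text vocabulary; perception-token ids
                     are placed after it in the extended vocabulary
    """
    latents = depth_encoder(depth_map)                    # (1, D, h, w)
    flat = latents.flatten(2).transpose(1, 2).squeeze(0)  # (h*w, D)
    # Standard VQ step: assign each latent to its nearest codebook entry.
    codes = torch.cdist(flat, codebook).argmin(dim=-1)    # (h*w,)
    # Offset into the extended vocabulary so the LLM can emit these ids as an
    # explicit visual reasoning trace before producing the final answer.
    return codes + base_vocab_size
```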
Slot and Register-Based Abstractions
Transformer-based models such as SAViR-T and Victor use attention bottlenecks to compress image patch embeddings into a small set of slot or register tokens. In SAViR-T, spatio-visual tokens are globally self-attended, then grouped and fused via small MLPs to extract rule embeddings relevant to combinatorial reasoning (e.g., Raven’s Progressive Matrices) (Sahu et al., 2022). Victor introduces learnable registers that absorb the full set of visual tokens via cross-attention in early transformer layers; after summarization, only the registers are retained, yielding substantial computational savings at the cost of a modest loss in accuracy (Wen et al., 17 Oct 2024).
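A minimal register-summarization module in the spirit of Victor might look as follows; the dimensions and the single cross-attention layer are simplifying assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class RegisterSummarizer(nn.Module):
    """Compress N visual patch embeddings into R learnable register tokens via
    cross-attention; downstream layers see only the registers."""

    def __init__(self, dim=1024, num_registers=16, num_heads=8):
        super().__init__()
        self.registers = nn.Parameter(torch.randn(num_registers, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_tokens):  # patch_tokens: (B, N, dim)
        batch = patch_tokens.size(0)
        regs = self.registers.unsqueeze(0).expand(batch, -1, -1)  # (B, R, dim)
        # Registers query the full set of patch tokens and absorb their content.
        summary, _ = self.attn(query=regs, key=patch_tokens, value=patch_tokens)
        return self.norm(regs + summary)  # (B, R, dim) replaces the N patches
```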
Latent Implicit and Explicit Bottlenecking
Recent work introduces task-agnostic latent tokens as globally-attending, trainable intermediates between visual features and language outputs. In Latent Implicit Visual Reasoning (LIVR), learnable latent tokens are appended to the prompt, and the attention mask is structured so all answer tokens must pass through these latent tokens—establishing a visual information bottleneck. The entire system is then trained on end-task loss, causing the latents to absorb whatever visual abstraction is most useful per task (Li et al., 24 Dec 2025).
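The bottleneck can be expressed purely through the attention mask. The sketch below builds such a mask under the simplifying assumption of a flat sequence layout [visual | text | latent | answer]; it is illustrative, not the exact LIVR masking scheme.

```python
import torch

def bottleneck_attention_mask(n_vis, n_txt, n_latent, n_ans):
    """Boolean attention mask (True = may attend) in which answer tokens never
    attend to raw visual tokens directly, so visual information must flow
    through the latent tokens."""
    total = n_vis + n_txt + n_latent + n_ans
    mask = torch.tril(torch.ones(total, total, dtype=torch.bool))  # causal base
    ans_start = n_vis + n_txt + n_latent
    # Block direct answer -> visual attention; the latents (which do attend to
    # the visual tokens) become the only path from image to answer.
    mask[ans_start:, :n_vis] = False
    return mask
```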
Continuous Visual Tokens with Expert Distillation (CoVT, LVR, Mirage)
The Chain-of-Visual-Thoughts (CoVT) framework injects 20 continuous visual tokens after the <image> marker into VLMs (e.g., Qwen2.5-VL), where each token is trained via reconstruction loss to encode segmentation, depth, edge, or DINO features. At inference, these tokens serve as an efficient latent "workspace" supporting interpretable or non-interpretable visual reasoning (Qin et al., 24 Nov 2025). LVR permits interleaving of language decoding and latent visual token generation, with explicit reconstruction loss on visual embedding targets (Li et al., 29 Sep 2025). Mirage (Machine Mental Imagery) implements pure latent visual tokens: the model autoregressively generates d-dimensional “visual” tokens at reserved positions, training them first to match pooled vision encoder features and subsequently relaxing to text-only objectives to maximize answer likelihood (Yang et al., 20 Jun 2025).
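A schematic of the continuous-token pattern shared by these frameworks is sketched below: a fixed set of learnable tokens is appended after the image embeddings, and a light projection head distills frozen expert features into the hidden states at those positions. Dimensions and names (`to_expert`, `expert_feats`) are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContinuousVisualTokens(nn.Module):
    """Learnable continuous visual tokens with an expert-distillation head."""

    def __init__(self, hidden_dim=3584, num_tokens=20, expert_dim=768):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, hidden_dim) * 0.02)
        self.to_expert = nn.Linear(hidden_dim, expert_dim)  # reconstruction head

    def insert(self, image_embeds):  # image_embeds: (B, N_img, hidden_dim)
        batch = image_embeds.size(0)
        vis = self.tokens.unsqueeze(0).expand(batch, -1, -1)
        # Appended directly after the image embeddings (i.e., after <image>).
        return torch.cat([image_embeds, vis], dim=1)

    def reconstruction_loss(self, token_states, expert_feats):
        # token_states: LLM hidden states at the visual-token positions (B, T, hidden_dim)
        # expert_feats: targets from a frozen expert (depth/seg/DINO), (B, T, expert_dim)
        return F.mse_loss(self.to_expert(token_states), expert_feats)
```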
Pointer/Copy Mechanisms
In v1, “visual reasoning tokens” are semantically pointers: the model can emit special pointer tokens that copy an in-sequence patch embedding from the input image into the inference context. Downstream attention then reuses the exact patch representation for subsequent reasoning steps, maintaining perceptual grounding throughout extended chains of thought (Chung et al., 24 May 2025).
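Conceptually, the copy step can be implemented as an embedding substitution at decode time. The sketch below assumes a hypothetical mapping from pointer-token ids to patch indices; it illustrates the mechanism rather than v1's actual decoding loop.

```python
import torch

def apply_point_and_copy(generated_ids, position_embeds, patch_embeds, pointer_to_patch):
    """Replace the embedding at each pointer-token position with the referenced
    patch embedding so later attention reuses the exact visual representation.

    generated_ids:    (T,) token ids decoded so far
    position_embeds:  (T, D) embeddings fed back to the decoder at those positions
    patch_embeds:     (P, D) patch embeddings of the input image
    pointer_to_patch: dict mapping pointer-token id -> patch index (hypothetical)
    """
    embeds = position_embeds.clone()
    for pos, tok in enumerate(generated_ids.tolist()):
        if tok in pointer_to_patch:
            embeds[pos] = patch_embeds[pointer_to_patch[tok]]
    return embeds
```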
3. Training Objectives and Optimization
Perception and visual reasoning tokens are optimized under several composite objectives:
| Loss Component | Description | Typical Weight/Setting |
|---|---|---|
| Answer cross-entropy | Main end-task objective: next-token prediction on the final answer | Primary loss |
| Expert distillation | Matching reasoning-token states to features or tokens from a frozen vision specialist | Scalar weight (often 1) |
| Auxiliary reconstruction | Reconstruction of intermediate signals (e.g., depth maps) from the reasoning tokens | Framework-dependent scalar weight |
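A minimal sketch of how the three components in the table might be combined is given below; the weights and tensor shapes are illustrative defaults, not values taken from any of the cited papers.

```python
import torch.nn.functional as F

def composite_loss(answer_logits, answer_targets,
                   token_states, expert_feats,
                   recon_pred, recon_target,
                   w_distill=1.0, w_recon=1.0):
    """Answer cross-entropy + expert distillation + auxiliary reconstruction."""
    # Main task: next-token cross-entropy over the answer span.
    l_ans = F.cross_entropy(answer_logits.flatten(0, 1), answer_targets.flatten())
    # Distillation: match reasoning-token states to a frozen vision specialist.
    l_distill = F.mse_loss(token_states, expert_feats)
    # Auxiliary reconstruction of an intermediate signal such as a depth map.
    l_recon = F.mse_loss(recon_pred, recon_target)
    return l_ans + w_distill * l_distill + w_recon * l_recon
```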
Multi-task and curriculum learning strategies are frequently employed, ordering tasks by difficulty and adjusting sampling and loss objectives adaptively (Bigverdi et al., 4 Dec 2024). For latent-token bottlenecking (LIVR), the only supervision is end-task answer likelihood. Several frameworks utilize reinforcement learning (e.g., Group Relative Policy Optimization) to further adapt the token generation strategy, especially when rewards can be formulated based on physical consistency or end-task performance (Lin et al., 22 Apr 2025, Li et al., 29 Sep 2025, Ray et al., 11 Dec 2025).
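For reference, the group-relative advantage at the core of GRPO-style fine-tuning reduces to normalizing each sampled trajectory's reward against its group; the sketch below assumes scalar per-trajectory rewards (e.g., answer correctness) and omits the clipped policy-ratio objective itself.

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages for G sampled reasoning traces of one prompt.

    rewards: (G,) scalar rewards, e.g., 1.0 for a correct final answer else 0.0.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```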
4. Applications and Empirical Outcomes
Visual reasoning tokens have proven critical for benchmarks requiring spatial, geometric, and counting capabilities beyond language-domain CoT:
- 3D Relative Depth and 2D Counting: Aurora with perception tokens yields +10.8% (BLINK), +11.3% (CVBench), +8.3% (SEED-Bench) over fine-tuned baselines, with substantial cross-task generalization (Bigverdi et al., 4 Dec 2024).
- Spatial-Relation and Puzzle Solving: Continuous and latent reasoning tokens in CoVT, LVR, Mirage, and Mull-Tokens yield consistent +3–16% absolute gains across detailed spatial, planning, and jigsaw tasks (Qin et al., 24 Nov 2025, Li et al., 29 Sep 2025, Yang et al., 20 Jun 2025, Ray et al., 11 Dec 2025).
- Mathematical Visual Reasoning: Interleaved visual tokens with dynamic region selection (MINT-CoT) yield accuracy improvements of up to +34.1% in figure-grounded mathematics compared to a text-only baseline (Chen et al., 5 Jun 2025).
- Visual Grounding and Pointing: Attention-supervised visual reasoning tokens paired with a KL-divergence loss improve interpretability and raise performance on geometric tasks (e.g., line-tracing, patch-pointing, referring expressions) by up to 20–30% (Esmaeilkhani et al., 16 Nov 2025).
- Efficiency: Register summarization (Victor) and token fusion (ToFu) compress visual sequences by 60% without retraining, enabling efficient handling of long visual contexts and improved multi-image reasoning (Wen et al., 17 Oct 2024, Pippi et al., 6 Mar 2025).
5. Practical Engineering Guidelines
- Vocabulary Integration: New visual reasoning tokens (discrete or continuous) expand both the input embedding table and the output head of the decoder. Discrete tokens require symbol-embedding mappings, whereas continuous tokens bypass the embedding lookup and are injected directly as LLM hidden states (a minimal sketch of the discrete case appears after this list).
- Backbone Selection: For slot-style or bottleneck token frameworks, convolutional-backbone encoders (e.g., ResNet) encourage more precise object-centric decomposition, which empirically benefits reasoning over standard ViT backbones (Luo et al., 2023).
- Token Budgeting: Most effective frameworks use a compact budget (10–40 tokens per example), with evidence that both too few (underrepresentation) and too many (attention diffusion, overfitting) degrade reasoning performance (Bigverdi et al., 4 Dec 2024, Qin et al., 24 Nov 2025).
- Compression and Summarization: Fusion of redundant tokens or register summarization can be applied as a post-encoder, pre-LM procedure and is highly compatible with high-resolution or multi-image settings (Wen et al., 17 Oct 2024, Pippi et al., 6 Mar 2025).
- Implicit Supervision: Latent token bottlenecking (as in LIVR) requires no explicit annotation of helpful visual abstractions—tokens specialize per-task based on gradient signals from final answer losses (Li et al., 24 Dec 2025).
- Dynamic Reasoning and Inference Loop: Certain frameworks (VTS-V) cast visual token scaling and multi-step reasoning as a Markov Decision Process, coupling planning, visual module selection, and verifier-based trajectory termination (Bai et al., 8 Jun 2025).
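For the discrete case referenced under Vocabulary Integration, a minimal sketch using the Hugging Face transformers API is shown below; the model identifier and token strings are placeholders chosen for illustration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder base model; any causal LM backbone works the same way.
model_name = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Register K discrete perception tokens (e.g., a 256-entry depth codebook).
new_tokens = [f"<depth_{i}>" for i in range(256)]
tokenizer.add_tokens(new_tokens, special_tokens=False)

# Resizing grows both the input embedding table and the (tied) output head;
# in setups like AURORA, only these new components plus LoRA adapters are tuned.
model.resize_token_embeddings(len(tokenizer))
```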
6. Interpretability, Limitations, and Future Work
- Interpretability: Explicit perception tokens (e.g., depth, segmentation) provide transparent intermediate traces, optionally decodable to images for debugging or human inspection (Qin et al., 24 Nov 2025). Purely latent or implicit tokens are less interpretable, although attention visualizations reveal correspondence with salient image regions (Li et al., 24 Dec 2025).
- Generalizability: Implicitly supervised latents (LIVR) offer stronger cross-task and cross-domain generalization than explicit, heuristically chosen abstractions.
- Efficiency and Scaling: Register- and fusion-based approaches enable transformers to process multi-image inputs and maintain tractable runtime and memory, removing a bottleneck in real-world deployment (Wen et al., 17 Oct 2024, Pippi et al., 6 Mar 2025).
- Limitations: Extraction of semantically meaningful tokens (object masks, relationships) from pre-trained segmenters is dependent on the quality and domain coverage of these off-the-shelf models (Kalibhat et al., 26 May 2024, Zhong et al., 7 Oct 2025). Latent-token interpretability and tuning to tasks beyond spatial/structural reasoning remain open challenges.
Future directions include dynamic reasoning loops for iterative image–text interaction (Bai et al., 8 Jun 2025), unsupervised token adaptation via reinforcement learning (Ray et al., 11 Dec 2025), dynamic register or token allocation, and generalization to video, audio, or 3D world modeling (Qin et al., 24 Nov 2025, Lin et al., 22 Apr 2025).
References
- "Perception Tokens Enhance Visual Reasoning in Multimodal LLMs" (Bigverdi et al., 4 Dec 2024)
- "Latent Visual Reasoning" (Li et al., 29 Sep 2025)
- "Latent Implicit Visual Reasoning" (Li et al., 24 Dec 2025)
- "Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens" (Qin et al., 24 Nov 2025)
- "MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning" (Chen et al., 5 Jun 2025)
- "Direct Visual Grounding by Directing Attention of Visual Tokens" (Esmaeilkhani et al., 16 Nov 2025)
- "v1: Learning to Point Visual Tokens for Multimodal Grounded Reasoning" (Chung et al., 24 May 2025)
- "Efficient Vision-LLMs by Summarizing Visual Tokens into Compact Registers" (Wen et al., 17 Oct 2024)
- "ToFu: Visual Tokens Reduction via Fusion for Multi-modal, Multi-patch, Multi-image Task" (Pippi et al., 6 Mar 2025)
- "Mull-Tokens: Modality-Agnostic Latent Thinking" (Ray et al., 11 Dec 2025)