Chain-of-Visual-Thought

Updated 6 May 2026

Chain-of-Visual-Thought is a paradigm that embeds visual representations as intermediate reasoning tokens to enhance multimodal AI performance.
It leverages explicit visual goal prediction, latent token distillation, and region selection to generate human-interpretable reasoning traces.
Empirical evaluations demonstrate improved accuracy and efficiency in robotics, medical imaging, and spatial reasoning benchmarks.

Chain-of-Visual-Thought (CoVT), also known as Visual Chain-of-Thought (Visual CoT), designates a class of model architectures and learning paradigms in which visual representations (explicit images, latent perceptual tokens, spatial selections, or synthesized artifacts) are generated, manipulated, or referenced as intermediate steps in a complex reasoning process. CoVT generalizes the textual Chain-of-Thought strategies of LLMs to the multimodal, vision-language domain, equipping large vision-language models (VLMs), multimodal LLMs (MLLMs), and vision-language-action (VLA) architectures with interpretable, structured, and spatially grounded reasoning capabilities. Work in this area reports that, compared to purely textual intermediate steps, visual chains of thought more faithfully support compositional, spatial, and causal inference, improve performance and sample efficiency in perception-rich tasks, and yield stepwise, human-interpretable rationales.

1. Formal Definitions and Paradigm Variants

The central idea behind Chain-of-Visual-Thought is that reasoning should not proceed solely in text, nor simply by mapping from images to answers, but should pass through one or more forms of visual intermediate step. The instantiations of this paradigm can be summarized as follows:

Explicit Visual Goal Prediction: Models autoregressively predict a future image or subgoal frame before generating a corresponding sequence of actions or next-step decisions. For instance, CoT-VLA, designed for vision-language-action models, first predicts a subgoal image $\hat{s}_{t+n}$ and then computes a sequence of actions to achieve this state, conditioned on current state and instruction (Zhao et al., 27 Mar 2025).
Hybrid or Implicit Visual Reasoning States: Large vision-language models can be instructed or fine-tuned to follow “visual reasoning steps” internally, either by outputting images in alternation with text (MVoT (Li et al., 13 Jan 2025)), by synthesizing key intermediate visual states (VChain (Huang et al., 6 Oct 2025)), or by driving the evolution of model hidden states via region selection and saliency (SSV-CoT (Guo et al., 21 Mar 2026)).
Grounded Reasoning Traces: Visualization-of-thought includes explicit spatial grounding (bounding boxes, segmentation masks, or flow fields), as in VisReason (Li et al., 21 Nov 2025), S-Chain for medical imaging (Le-Duc et al., 26 Oct 2025), and numerical coordinate policies (NV-CoT (Zhao et al., 27 Feb 2026)). Chains of reasoning are anchored in selected or predicted visual regions throughout multi-step inference.
Latent and Continuous Visual Tokens: Rather than outputting explicit images, models may generate compact continuous representations (e.g., distilled depth, segmentation, edge, or learned latent features), which act as “visual thoughts” interleaved with textual or control tokens (CoVT (Qin et al., 24 Nov 2025), CoCoVa (Ma et al., 4 Nov 2025)). These latent tokens are optimized to reconstruct dense supervision signals and serve as compact, interpretable caches for visual information.

The precise mathematical structure of a CoVT pipeline varies but generally follows the template

$(x_\mathrm{input}, Q) \xrightarrow{\mathrm{visual\ reasoning}} \mathrm{VT}_1 \to \mathrm{VT}_2 \to \cdots \to \mathrm{VT}_T \xrightarrow{\mathrm{downstream}} A$

where $x_\mathrm{input}$ is the visual input, $Q$ is the accompanying question or instruction, $\mathrm{VT}_i$ denotes one or more forms of visual thought (images, tokens, region selections), and $A$ is the final answer, action, or output.
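The template can be read as a loop that accumulates visual thoughts before a downstream head produces the answer. The following is a minimal sketch of that control flow only; `VisualThought`, `run_covt`, and the two toy callables are hypothetical names introduced for illustration, and the "image" is a one-dimensional list of intensities rather than a real picture.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class VisualThought:
    kind: str     # e.g. "image", "latent_tokens", "region" (illustrative labels)
    payload: Any  # the visual content of this reasoning step


def run_covt(x_input: Any, question: str,
             step_fn: Callable[[Any, str, List[VisualThought]], VisualThought],
             answer_fn: Callable[[Any, str, List[VisualThought]], Any],
             num_steps: int) -> Tuple[Any, List[VisualThought]]:
    """(x_input, Q) -> VT_1 -> ... -> VT_T -> A, with T = num_steps."""
    thoughts: List[VisualThought] = []
    for _ in range(num_steps):
        # Each step may condition on the input, the question, and earlier thoughts.
        thoughts.append(step_fn(x_input, question, thoughts))
    return answer_fn(x_input, question, thoughts), thoughts


def halve_toward_mass(x, question, thoughts):
    """Toy visual thought: keep the half of the current window with more intensity."""
    lo, hi = thoughts[-1].payload if thoughts else (0, len(x))
    mid = (lo + hi) // 2
    left, right = sum(x[lo:mid]), sum(x[mid:hi])
    return VisualThought("region", (lo, mid) if left >= right else (mid, hi))


def argmax_in_region(x, question, thoughts):
    """Toy downstream head: answer from the last selected region only."""
    lo, hi = thoughts[-1].payload
    return max(range(lo, hi), key=lambda i: x[i])


signal = [0, 1, 0, 0, 9, 2, 0, 0]
answer, trace = run_covt(signal, "where is the brightest pixel?",
                         halve_toward_mass, argmax_in_region, num_steps=2)
```

The returned `trace` is the interpretable part: it records which region the toy model committed to at each step before answering.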

2. Model Architectures and Mechanistic Implementations

Visual Chain-of-Thought in Autoregressive Transformers

A dominant class of CoVT models comprises autoregressive transformers that unify text, image, and (for VLA settings) action token streams. For example, CoT-VLA (Zhao et al., 27 Mar 2025) uses the VILA-U backbone, in which 256×256 images are quantized to 16×16×4 visual tokens (via a residual-quantization VQ-VAE) and interleaved with text and action tokens, and decoding mixes causal and full (bidirectional) attention. Subgoal images are generated autoregressively using causal attention, then short action sequences (chunks) are generated in parallel using full attention.
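To make the token budget and the two attention regimes concrete, here is a small sketch. It assumes that each residual code counts as one token and that the sequence is laid out as a causally decoded prefix (instruction, current image, subgoal image) followed by one action chunk; both the layout and the function name `hybrid_attention_mask` are illustrative assumptions, not the CoT-VLA implementation.

```python
# Token budget for one 256x256 image under the stated 16x16x4 quantization,
# assuming every residual code is counted as a separate token.
SPATIAL_H, SPATIAL_W, RESIDUAL_DEPTH = 16, 16, 4
TOKENS_PER_IMAGE = SPATIAL_H * SPATIAL_W * RESIDUAL_DEPTH  # 1024


def hybrid_attention_mask(n_causal: int, n_action: int):
    """Return mask[i][j] == True when position i may attend to position j.

    Positions [0, n_causal) are decoded causally (text and image tokens).
    Positions [n_causal, n_causal + n_action) form one action chunk that
    sees the whole prefix and every token inside the chunk (full attention).
    """
    n = n_causal + n_action
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < n_causal:
                mask[i][j] = j <= i   # causal: no access to later positions
            else:
                mask[i][j] = True     # action chunk: prefix plus whole chunk
    return mask


example_mask = hybrid_attention_mask(n_causal=3, n_action=2)
```

Because no action token depends on a later decoding step inside the chunk, all action positions can be predicted in one forward pass, which is what makes parallel chunk generation possible.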

In SSV-CoT (Guo et al., 21 Mar 2026), the base MLLM is extended with a policy head and associated projectors, enabling sequential selection of region embeddings derived from a question-conditioned saliency map. After generating each CoT step, the model injects the selected region's embedding, supporting curriculum-like “from primary to secondary cues” progression in visual attention.
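The "primary to secondary cues" progression can be illustrated with a greedy stand-in for the learned policy head: regions are visited in descending saliency, and one region embedding is injected after each reasoning step. The greedy rule, the function names, and the string placeholders for embeddings are all assumptions made for this sketch; the paper's policy is learned rather than sorted.

```python
from typing import Any, List, Sequence, Tuple


def select_regions_in_order(saliency: Sequence[float], k: int) -> List[int]:
    """Indices of the k most salient regions, most salient first.

    Python's sort is stable, so ties keep their original index order.
    """
    return sorted(range(len(saliency)), key=lambda i: -saliency[i])[:k]


def interleave_steps(cot_steps: Sequence[str],
                     region_embeddings: Sequence[Any],
                     saliency: Sequence[float]) -> List[Tuple]:
    """Build a sequence in which each text step is followed by one region."""
    order = select_regions_in_order(saliency, len(cot_steps))
    sequence: List[Tuple] = []
    for step, region_idx in zip(cot_steps, order):
        sequence.append(("text", step))
        # The selected region's embedding is injected after the step it follows.
        sequence.append(("region", region_idx, region_embeddings[region_idx]))
    return sequence


# Hypothetical saliency scores for three candidate regions.
demo_sequence = interleave_steps(["s1", "s2"], ["e0", "e1", "e2"], [0.1, 0.7, 0.2])
```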

Other paradigms utilize explicit image generation capabilities, e.g., Chameleon-7B with token discrepancy loss for MVoT (Li et al., 13 Jan 2025), or the unified image-text generation backbone (Bagel) in Uni-CoT (Qin et al., 7 Aug 2025) and GVCoT (Yin et al., 2 Mar 2026). In these models, the backbone interleaves image understanding and synthesis, with macro-level planning steps producing intermediate visual outputs and micro-level branches iterating edits/refinements.

Visual Tokenization and Distillation

Several models, such as CoVT (Qin et al., 24 Nov 2025), distill knowledge from perception experts (SAM for segmentation, DepthAnything for depth, PIDINet for edge, DINO for semantic layout) into compact sets of continuous tokens. This “visual tokenization” enables downstream reasoning to operate efficiently in a low-dimensional, information-rich visual latent space, supporting both interpretability and dense supervision.

Grounding and Numerical Coordination

Numerical Visual Chain-of-Thought (NV-CoT (Zhao et al., 27 Feb 2026)) breaks from discrete token output: region selection is parameterized as a continuous stochastic policy in $\mathbb{R}^4$. The model predicts bounding-box coordinates directly, with the policy modeled as a Gaussian (or Laplace) distribution, which keeps training compatible with standard policy-gradient RL.
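A continuous box policy of this kind can be sketched with a diagonal Gaussian over four coordinates and a REINFORCE-style gradient with respect to the predicted mean. This is a generic policy-gradient illustration under assumed conventions (independent coordinates, fixed standard deviations, a scalar reward); the function names are hypothetical and it is not the NV-CoT training code.

```python
import math
import random
from typing import List, Sequence


def sample_box(mean: Sequence[float], std: Sequence[float],
               rng: random.Random) -> List[float]:
    """Draw one box (x1, y1, x2, y2) from a diagonal Gaussian policy."""
    return [rng.gauss(m, s) for m, s in zip(mean, std)]


def gaussian_log_prob(box: Sequence[float], mean: Sequence[float],
                      std: Sequence[float]) -> float:
    """Log-density of the box under independent Gaussians per coordinate."""
    return sum(
        -0.5 * ((b - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2 * math.pi)
        for b, m, s in zip(box, mean, std)
    )


def reinforce_grad_mean(box: Sequence[float], mean: Sequence[float],
                        std: Sequence[float], reward: float,
                        baseline: float = 0.0) -> List[float]:
    """Score-function gradient of expected reward with respect to the mean.

    Uses d/d(mean) log N(box; mean, std) = (box - mean) / std**2.
    """
    advantage = reward - baseline
    return [advantage * (b - m) / s ** 2 for b, m, s in zip(box, mean, std)]


rng = random.Random(0)
sampled = sample_box([0.2, 0.2, 0.8, 0.8], [0.05] * 4, rng)
```

A sampled box that earns a reward above the baseline pulls the mean toward itself; swapping the Gaussian for a Laplace density would change only the log-probability and its gradient.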

Multimodal Fusion and Latent Reasoning

CoCoVa's LQ-Former (Ma et al., 4 Nov 2025) iteratively fuses a selected subset of visual tokens (from CLIP-ViT) with the LLM hidden state to generate a chain of latent vectors $z_k$, dynamically focusing attention and appending the “visual thought” prefix prior to decoding. The entire chain remains continuous throughout, with alignment enforced via a contrastive InfoNCE loss and diffusion-based reconstruction.
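For readers unfamiliar with the contrastive term, the sketch below computes a standard InfoNCE loss in which the i-th latent vector should match the i-th target embedding and all other targets in the batch act as negatives. The cosine similarity, the temperature value, and the one-directional form are assumptions of this illustration rather than details taken from the paper; vectors are assumed to be non-zero.

```python
import math
from typing import List, Sequence


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def _normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(_dot(v, v))  # assumes v is not the zero vector
    return [x / norm for x in v]


def info_nce(queries: Sequence[Sequence[float]],
             keys: Sequence[Sequence[float]],
             temperature: float = 0.07) -> float:
    """Mean InfoNCE loss; queries[i] and keys[i] form the positive pair."""
    q = [_normalize(v) for v in queries]
    k = [_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [_dot(qi, kj) / temperature for kj in k]
        # Numerically stable log-sum-exp over all keys in the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(q)


aligned_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each latent is closest to its own target and grows when the pairing is wrong, which is the pressure that keeps the latent chain aligned with visual and textual embeddings.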

3. Training Strategies and Objective Functions

CoVT frameworks commonly combine supervised pretraining and reinforcement learning, structured to optimize both visual chain generation and downstream task performance:

Supervised Fine-Tuning: For CoT-VLA (Zhao et al., 27 Mar 2025), training utilizes a combined loss

$\mathcal{L} = \mathcal{L}_\mathrm{visual} + \mathcal{L}_\mathrm{action}$

where $\mathcal{L}_\mathrm{visual}$ is a next-token cross-entropy over visual subgoal tokens, and $\mathcal{L}_\mathrm{action}$ is a cross-entropy loss over (discretized) robot action tokens.

Curriculum and Multi-Tasking: Chart summarization with V-CoT (Choi et al., 24 Feb 2025) benefits from data augmentation and curriculum progression from simple to complex visual patterns. In GVCoT (Yin et al., 2 Mar 2026), two-stage fine-tuning targets localization first (mask/region generation for visual thoughts), then conditionally generates edits, with both phases optimized by a flow-matching loss.
RL and Margin-Aware Preference Optimization: For grounding in medical images, ClinCoT (Liu et al., 1 Mar 2026) employs margin-aware direct preference optimization, ranking region-aware reasoning chains using multiple Med-LLMs as evaluators, and including a score-gap margin in the loss.
Contrastive and Reconstruction Losses: Latent visual thought spaces (CoCoVa (Ma et al., 4 Nov 2025), CoVT (Qin et al., 24 Nov 2025)) employ contrastive InfoNCE alignment with both visual and textual embeddings, and diffusion losses for reconstructing input images from final latent states.
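The combined supervised objective $\mathcal{L} = \mathcal{L}_\mathrm{visual} + \mathcal{L}_\mathrm{action}$ described for CoT-VLA above is a sum of two token-level cross-entropies. The sketch below shows that arithmetic on raw logits; averaging over tokens within each stream and weighting the two terms equally are assumptions of this example, and the function names are hypothetical.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-softmax probability of the target class."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


def sequence_ce(logit_rows: Sequence[Sequence[float]],
                targets: Sequence[int]) -> float:
    """Mean cross-entropy over a token sequence (mean reduction is assumed)."""
    return sum(cross_entropy(l, t) for l, t in zip(logit_rows, targets)) / len(targets)


def combined_loss(visual_logits, visual_targets, action_logits, action_targets) -> float:
    """L = L_visual + L_action with equal weights."""
    l_visual = sequence_ce(visual_logits, visual_targets)   # subgoal image tokens
    l_action = sequence_ce(action_logits, action_targets)   # discretized action tokens
    return l_visual + l_action


# One visual token over a 2-entry codebook and one action token over 4 bins,
# both with uniform (uninformative) logits.
uniform_loss = combined_loss([[0.0, 0.0]], [1], [[0.0, 0.0, 0.0, 0.0]], [2])
```

With uniform logits the two terms are log 2 and log 4, so the total equals log 8; this gives a quick sanity check for an untrained model.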

4. Empirical Evaluations and Benchmarks

Quantitative results across a range of benchmarks establish the empirical utility of Chain-of-Visual-Thought:

Robotic Manipulation: CoT-VLA achieves an absolute +17% improvement over OpenVLA in real-world robot success rate and +6.1% in LIBERO simulation (Zhao et al., 27 Mar 2025).
Chart Summarization: V-CoT improves CIDEr score from 2.10 (no CoT) to 2.55 (+0.45 absolute, roughly +21% relative to the baseline), with similar boosts in BLEU, BLEURT, and human-judged logic (Choi et al., 24 Feb 2025).
Spatial and Causal Reasoning: MVoT achieves 85.6% accuracy on the hardest FrozenLake spatial task, a +24% gain over text-only CoT (Li et al., 13 Jan 2025). SSV-CoT shows consistent gains (+2–4%) on M3CoT, ScienceQA, and math reasoning benchmarks (Guo et al., 21 Mar 2026).
Grounded Medical Reasoning: SV-CoT supervision improves both accuracy (+10–15 points) and mIoU of region localization (4→25 for ExGra-Med) on S-Chain (Le-Duc et al., 26 Oct 2025). ClinCoT matches or exceeds state-of-the-art on multiple VQA/report tasks (Liu et al., 1 Mar 2026).
Image Editing: GVCoT outperforms mask/tool-based VCoT approaches in semantic consistency and perceptual quality on SREdit-Bench, with average improvements of +0.85 and +0.86 (Yin et al., 2 Mar 2026).
Video Reasoning: VChain demonstrates a jump in causality and temporal smoothness scores on VBench, and ablations show that both the “visual thought” stage and sparse LoRA tuning are essential (Huang et al., 6 Oct 2025).
Chain Verification: MM-CoT probes the ability to select visually/causally consistent reasoning chains, with top VLMs achieving 40–44% (images) and ~20% (videos) compared to humans at ~80% (Zhang et al., 9 Dec 2025).
Conciseness and Generalization: On maze-solving, concise Grounding-CoT outperforms verbose visual/image-heavy CoT in both sample efficiency and generalization, formalizing the “short is long” principle (Du et al., 27 Nov 2025).
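Several of the grounded results above are reported as mIoU of region localization. As a reminder of what that metric measures, here is a minimal sketch for axis-aligned boxes in `(x1, y1, x2, y2)` form; the box convention and function names are assumptions, and the result is a fraction in [0, 1], so it would be multiplied by 100 to compare with figures quoted on a 0 to 100 scale.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) with x1 <= x2, y1 <= y2


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(predictions: Sequence[Box], ground_truth: Sequence[Box]) -> float:
    """Average IoU over paired predicted and annotated regions."""
    scores = [box_iou(p, g) for p, g in zip(predictions, ground_truth)]
    return sum(scores) / len(scores)


# One perfect localization and one complete miss.
demo_miou = mean_iou([(0, 0, 2, 2), (0, 0, 1, 1)], [(0, 0, 2, 2), (2, 2, 3, 3)])
```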

5. Interpretability, Mechanistic Insights, and Human-Like Reasoning

CoVT methods provide multiple levels of transparency in visual reasoning:

Stepwise Interpretability: The explicit generation of intermediate visual states (images, spatial token masks, or visual-region selection pointers) offers a readable trace of what the model “attends” to or hypothesizes at each reasoning step, matching human cognitive strategies in complex spatial or causal tasks (Li et al., 13 Jan 2025, Li et al., 21 Nov 2025, Huang et al., 6 Oct 2025, Zhou et al., 4 Nov 2025).
Attention Probing: Attention analysis in transformer layers reveals that visual-thought tokens act as caches for visual information. Once generated, downstream textual reasoning steps attend primarily to these tokens, bypassing the need to repeatedly extract features from the raw image (Cheng et al., 21 May 2025).
Continuous and Discrete Visual Thought: Findings indicate that clarity and conciseness in visual-thought generation relate strongly to accuracy (Spearman’s $\rho>0.8$, with a Pearson correlation also reported between clarity/conciseness and task performance) (Cheng et al., 21 May 2025). Compact continuous representations (20–30 tokens) can sufficiently ground dense visual reasoning while avoiding the inefficiencies of large textual or image token streams (Qin et al., 24 Nov 2025).
Grounded Explanations: Medical datasets with structured visual CoT, such as S-Chain, demonstrate that explicit linkage of reasoning steps to ROIs increases faithfulness and reduces hallucination (Le-Duc et al., 26 Oct 2025).
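The correlation statistics cited above for clarity, conciseness, and accuracy can be reproduced on any set of paired scores. The sketch below implements Pearson's coefficient and Spearman's rank correlation (with average ranks for ties); the example values are invented purely to exercise the functions and are not data from any cited paper.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho is the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented scores: accuracy rises monotonically but not linearly with clarity.
clarity = [1.0, 2.0, 3.0, 4.0]
accuracy = [1.0, 4.0, 9.0, 16.0]
rho = spearman(clarity, accuracy)
r = pearson(clarity, accuracy)
```

Spearman's rho captures any monotonic relationship, while Pearson's coefficient measures linear association, so reporting both indicates whether the link between clarity and accuracy is merely ordered or also roughly linear.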

6. Limitations, Open Challenges, and Future Directions

Despite demonstrated progress, Chain-of-Visual-Thought methods face several limitations:

Runtime Cost: Autoregressive image token generation and multi-step visual reasoning introduce significant latency (e.g., CoT-VLA is ~7× slower than direct action-only pipelines) (Zhao et al., 27 Mar 2025).
Visual Fidelity: Subgoal images in CoT-VLA and MVoT may fall below diffusion model baselines in fine detail (Zhao et al., 27 Mar 2025, Li et al., 13 Jan 2025); keyframe interpolation in VChain can oversmooth or miss motion nuance (Huang et al., 6 Oct 2025).
Data and Annotation Cost: Fully supervised and richly grounded chain-of-visual-thought datasets require significant annotation; attempts to scale benefit from synthetic annotation pipelines (VisReason (Li et al., 21 Nov 2025), S-Chain (Le-Duc et al., 26 Oct 2025), GVCoT-Edit-Instruct (Yin et al., 2 Mar 2026)).
Generalization: For vision-centric tasks, minimal, well-structured chains (concise coordinate sequences or region picks) generalize best, with longer or more verbose visual chains risking overfitting to pattern-specific artifacts (Du et al., 27 Nov 2025).
Tool Integration: The ability of MLLMs to autonomously generate, manipulate, and interpret free-form diagrams or visual sketches in open domains remains limited. Benchmarks such as MIRA (Zhou et al., 4 Nov 2025) reveal that current models cannot yet self-generate effective intermediate diagrams for complex reasoning tasks.

Potential extensions identified across papers include leveraging faster image/token generation (consistency models, speculative decoding), integrating higher-fidelity or continuous image priors, developing free-form sketch integrations, and scaling curriculum-based pretraining. Benchmarks that test faithfulness, visual grounding, and logical coherence (MM-CoT (Zhang et al., 9 Dec 2025)) are expected to drive further architectural and algorithmic advances, so that models reason not only plausibly but with verifiable stepwise fidelity.

7. Representative Table: Variants and Applications

Model / Framework	Visual Thought Form	Domain / Task
CoT-VLA (Zhao et al., 27 Mar 2025)	Autoregressive subgoal images	Vision-language-action, robotics
MVoT (Li et al., 13 Jan 2025)	Interleaved text/image	Spatial/grid reasoning
CoVT (Qin et al., 24 Nov 2025)	Continuous visual tokens	General VLM, dense perception
SSV-CoT (Guo et al., 21 Mar 2026)	Saliency-guided region selection	Visual reasoning, MLLMs
NV-CoT (Zhao et al., 27 Feb 2026)	Numerical coordinates	Region grounding, MLLMs
VChain (Huang et al., 6 Oct 2025)	Sparse keyframe images	Video generation
VisReason (Li et al., 21 Nov 2025)	Crop + box selection + rationale	Multimodal VQA, reasoning
S-Chain (Le-Duc et al., 26 Oct 2025)	Bbox-anchored rationales	Medical VQA
GVCoT (Yin et al., 2 Mar 2026)	End-to-end mask images	Image editing, region localization

This taxonomy illustrates the breadth of CoVT implementations, spanning explicit image synthesis, patch/region selection, continuous compact tokens, and hybrid latent chains, across domains ranging from embodied action and robotics to document analysis, spatial inference, and medical imaging.

Chain-of-Visual-Thought frameworks mark a shift in multimodal reasoning in which explicit, intermediate visual representations become first-class tokens in the reasoning pipeline. By grounding inference steps in interpretable, information-dense visual cues (images, regions, or latent states), these methods improve sample efficiency and task performance and bring multimodal models closer to human-like, systematic, and explainable reasoning. Key empirical findings and architectural innovations from the past two years have set the stage for ongoing advances in visual reasoning, cross-modal learning, and robust, trusted AI systems.