
Intermediate Visual Thoughts: IVT Paradigm

Updated 9 February 2026
  • Intermediate Visual Thoughts (IVTs) are explicit visual artifacts, such as provisional sketches, diagrams, or latent embeddings, that models create and manipulate during iterative reasoning.
  • They enable dynamic spatial reasoning, constraint satisfaction, and self-correction by integrating visual and textual planning in a closed feedback loop.
  • Recent methodologies employ diffusion models, programmatic rendering, and latent embedding editing to improve accuracy and mitigate compounded hallucinations.

Intermediate visual thoughts (IVTs) are explicit, manipulable visual representations generated or maintained by intelligent models during the multi-step reasoning process. Unlike static image encodings used as a one-pass context, IVTs function as provisional “sketches,” “diagrams,” “abstracts,” or “intermediate images” in the model’s cognitive workspace, anchoring and guiding subsequent reasoning steps. IVTs may be pixel-space images, symbolic diagrams, sketches, feature maps, latent embeddings, or other structured visualizations. They serve as dynamic intermediaries between language-like planning and high-dimensional visual generation, enabling stepwise spatial reasoning, constraint satisfaction, and the correction of hallucinations. IVTs are now central to a new paradigm—often called thinking with images—in which visual artifacts are not only model outputs but core mediators of intelligent deliberation.

1. Foundational Principles and Definitions

Intermediate visual thoughts (IVTs) are formally defined as model-generated visual artifacts that are created, manipulated, or updated at intermediate steps within a composite reasoning process. In contrast to traditional vision–language pipelines, which encode an input image $I$ into a static vector $v = \Phi_V(I)$ and perform all reasoning within the language domain, $x_t \sim P(x_t \mid x_{<t}, v, Q)$ (Su et al., 30 Jun 2025), the IVT paradigm interleaves visual artifacts $z_t \in \mathcal{T}_{\text{text}} \cup \mathcal{I}_{\text{vis}}$ such that

$$z_t \sim P(z_t \mid S_t, I, Q)$$

where $S_t$ is the multimodal reasoning state. IVTs may include provisional pixel blueprints from diffusion models (Yuan et al., 2 Feb 2026), explicit 2D or 3D diagrams (Borazjanizadeh et al., 14 Mar 2025, Saha et al., 21 Jan 2026), sketches and abstracts (Liu et al., 26 May 2025, Hu et al., 2024), or feature-level latent states (Cheng et al., 21 May 2025, Zhou et al., 21 Jul 2025). Their essential characteristics are:

  • Explicitness: IVTs are directly generated and used by the model, not hidden internal features.
  • Intermediate Position: IVTs occur at non-terminal steps within a chained or iterative pipeline, not merely as final outputs.
  • Manipulability: IVTs can be critiqued, refined, or composed—often closing a feedback loop with planning modules.
  • Functionality: The presence of IVTs enables grounding of logical constraints, supports self-correction, and enhances fidelity in spatial or compositional reasoning.
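The four characteristics above can be captured in a minimal data model. This is an illustrative sketch only; the names `IVT` and `ReasoningState` are hypothetical and do not come from any cited framework:

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class IVT:
    """An explicit, intermediate, manipulable visual artifact."""
    kind: str        # e.g. "sketch", "diagram", "latent"
    payload: object  # pixels, SVG source, or an embedding
    step: int        # non-terminal position in the reasoning chain

@dataclass
class ReasoningState:
    """Multimodal state: interleaved textual and visual thoughts."""
    question: str
    thoughts: List[Union[str, IVT]] = field(default_factory=list)

    def add_text(self, t: str) -> None:
        self.thoughts.append(t)

    def add_ivt(self, kind: str, payload: object) -> IVT:
        ivt = IVT(kind, payload, step=len(self.thoughts))
        self.thoughts.append(ivt)
        return ivt  # explicit handle: can be critiqued or refined later

state = ReasoningState("How many regions after two cuts?")
state.add_text("Plan: cut the square twice.")
sketch = state.add_ivt("sketch", "[][][]")
sketch.payload = "[][][][]"  # manipulability: refine the artifact in place
```

Explicitness and manipulability show up as the returned handle: later steps can inspect and rewrite the artifact rather than treating it as an opaque final output.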

2. Architectures and Generation Methodologies

IVT-centric systems span a rich spectrum of architectural design, which can be organized following the three-stage cognitive autonomy schema (Su et al., 30 Jun 2025):

| Stage | Mechanism | Representative Methods |
| --- | --- | --- |
| Tool-driven exploration | Calls to external vision tools, code APIs | Sketchpad (Hu et al., 2024), VisuoThink (Wang et al., 12 Apr 2025) |
| Programmatic manipulation | Generation of code to draw or edit images | Visual Sketchpad (Hu et al., 2024), 3D Scratchpad (Saha et al., 21 Jan 2026) |
| Intrinsic imagination | Native interleaved vision–language generation | MVoT (Li et al., 13 Jan 2025), DeepSketcher (Zhang et al., 30 Sep 2025), SoT (Huo et al., 28 Jan 2026), Thinking with Generated Images (Chern et al., 28 May 2025) |

  • Collaborative Loops: Autoregressive (language) and diffusion (visual) models can operate in a closed, simulate–criticize–refine cycle, with an LLM generating constraints, a diffusion simulator instantiating an IVT $R_t$, and a Critic (often a vision–LLM) verifying satisfaction against spatial or physical requirements (Yuan et al., 2 Feb 2026).
  • Explicit Rendering Chains: Systems such as Visual Sketchpad expose APIs for direct drawing: lines, boxes, masks, markers, invoking specialist detectors or segmenters as needed (Hu et al., 2024).
  • Visual Abstracts and Sketches: Visual Abstract Thinking prompts models with sketches or edge maps that prune irrelevant detail, yielding expressive yet compact intermediate visual representations (Liu et al., 26 May 2025).
  • 3D Spatial Scratchpads: 3D workspaces allow models or agentic reasoning pipelines to plan placement, orientation, and camera view, with per-step renders providing visual feedback at each reasoning turn (Saha et al., 21 Jan 2026).
  • Latent Embedding Editors: Internal manipulation of visual-state embeddings—without requiring external rendering—yields highly aligned, compositional visual thoughts and supports differentiable end-to-end training (Zhang et al., 30 Sep 2025).

3. Mathematical and Algorithmic Formalizations

Many IVT frameworks encode the reasoning/feedback cycle via a recurrent state update $\mathcal{S}_t = \{P_t, R_t, F_t\}$, where $P_t$ is the planning prompt, $R_t$ is the generated IVT (image), and $F_t$ is the critic's (vision-LLM's) feedback (Yuan et al., 2 Feb 2026). The interaction unfolds as:

  1. Planner: $P_t = \mathcal{M}_{\text{plan}}(Q, F_{t-1}, H_{t-1})$
  2. Simulator: $R_t \sim p_{\text{world}}(x \mid P_t, \mathcal{C}_t)$
  3. Critic: $(v_t, F_t) = \mathcal{M}_{\text{critic}}(R_t, Q)$
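The planner–simulator–critic cycle can be sketched as a toy loop. All three components below are trivial stand-ins (a real system would use an LLM planner, a diffusion simulator, and a vision–LLM critic); the task is a cutting puzzle in the spirit of the region-counting example:

```python
# Toy closed loop: Planner -> Simulator -> Critic, iterated until
# the critic accepts the rendered IVT.

def planner(question, feedback, history):
    # Propose one more cut whenever the critic said "too few regions".
    return history[-1] + 1 if history else 1

def simulator(plan):
    # "Renders" the IVT; here the image is reduced to its region count.
    return plan + 1  # n parallel cuts -> n + 1 regions

def critic(ivt, target):
    score = 1.0 if ivt == target else 0.0
    feedback = "accept" if score == 1.0 else "too few regions"
    return score, feedback

target_regions, history, feedback = 4, [], None
for t in range(10):
    plan = planner("cut task", feedback, history)
    ivt = simulator(plan)
    score, feedback = critic(ivt, target_regions)
    history.append(plan)
    if score == 1.0:
        break  # critic verified the IVT against the constraint
```

The key design point is that the critic's feedback re-enters the planner before the next IVT is produced, which is what localizes errors instead of letting them compound.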

In programmatic systems, code is generated (e.g., Matplotlib snippets), executed, and the resulting image is ingested as the next-step context along with textual state and the action chain (Borazjanizadeh et al., 14 Mar 2025).

For autoregressive transformers supporting joint text-vision token streams (Chern et al., 28 May 2025, Li et al., 13 Jan 2025), the model's prediction at each step alternates between

$$\hat{z}_i = P_{\theta}(z_i \mid x, \hat{z}_1, \hat{v}_1, \ldots, \hat{z}_{i-1}, \hat{v}_{i-1})$$

$$\hat{v}_i = P_{\theta}(v_i \mid x, \hat{z}_1, \hat{v}_1, \ldots, \hat{z}_i, \hat{v}_{i-1})$$

mirroring dual-coding theory.

Visualization-of-Thought (VoT) for text-only LLMs employs prompt engineering to directly elicit text-based ASCII or emoji grid depictions (“mental imagery”), which function as IVTs for spatial tasks (Wu et al., 2024).
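A minimal illustration of VoT-style text "mental imagery": after each move in a toy grid-navigation task, the model would emit an ASCII grid as its intermediate visual thought, making the spatial state explicit and checkable. The renderer below is an illustrative sketch, not the VoT prompt itself:

```python
# One ASCII-grid IVT per reasoning step for a toy navigation task.
# "A" marks the agent, "G" the goal.

def render(width, height, pos, goal):
    rows = []
    for y in range(height):
        row = ""
        for x in range(width):
            if (x, y) == pos:
                row += "A"
            elif (x, y) == goal:
                row += "G"
            else:
                row += "."
        rows.append(row)
    return "\n".join(rows)

moves = {"right": (1, 0), "down": (0, 1)}
pos, goal = (0, 0), (2, 1)
thoughts = [render(3, 2, pos, goal)]
for m in ["right", "right", "down"]:
    dx, dy = moves[m]
    pos = (pos[0] + dx, pos[1] + dy)
    thoughts.append(render(3, 2, pos, goal))  # one IVT per step
```

Each entry in `thoughts` is a depiction the model (or a critic) can re-read to verify that the spatial state matches the plan so far.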

4. Functional Role: Verification, Grounding, and Self-Reflection

IVTs serve several core algorithmic purposes:

  • Grounding Symbolic Constraints: IVTs embed symbolic constraints into pixel space, enabling structured planners to “see” subgoal outcomes (e.g., geometric constructions, piece arrangements) (Yuan et al., 2 Feb 2026, Huo et al., 28 Jan 2026).
  • Incremental Error Correction: Feedback from a Critic (scored $v_t \in [0,1]$, with accompanying rationale $F_t$) guides planner refinement, effectively localizing error propagation before it accumulates (Yuan et al., 2 Feb 2026).
  • Compositional Integration: Chain-of-Images (CoI) and Visual Chain-of-Thought (Visual-CoT) pipelines insert IVTs at each reasoning step, transforming complex symbolic operation sequences into concrete pattern recognition and verification (Meng et al., 2023, Zhou et al., 4 Nov 2025).
  • Mitigation of Hallucinations: Extraction of “visual factual knowledge” at intermediate transformer layers, as in EVA, allows re-injection of image-derived evidence, counteracting language-prior hallucinations (Zhou et al., 21 Jul 2025).
  • Planning and Search: Tree-structured and graph-of-thought frameworks represent each candidate world state as a pair (textual description, rendered diagram), enabling visual self-checks, path-pruning, and backtracking (Wang et al., 12 Apr 2025, Borazjanizadeh et al., 14 Mar 2025).
  • Internal Manipulation: DeepSketcher demonstrates continuous-state IVT modification through an “embedding editor,” moving visual tokens according to natural-language edit actions (Zhang et al., 30 Sep 2025).
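A toy latent-space edit in the spirit of an "embedding editor": visual tokens are vectors, and a natural-language edit action maps to a transform applied directly to them, with no pixel-space re-rendering. This is purely illustrative; `EDIT_OPS` and its two actions are invented for the sketch:

```python
# Hypothetical mapping from edit actions to latent-token transforms.
EDIT_OPS = {
    "shift right": lambda v: [v[0] + 1.0, v[1]],
    "shift up":    lambda v: [v[0], v[1] + 1.0],
}

def edit_visual_tokens(tokens, action):
    """Apply a named edit to every visual-state token."""
    op = EDIT_OPS[action]
    return [op(v) for v in tokens]

tokens = [[0.0, 0.0], [1.0, 0.5]]   # stand-in visual-state embeddings
edited = edit_visual_tokens(tokens, "shift right")
```

Because the edit is a function of the embeddings rather than of rendered pixels, a differentiable version of the same idea supports end-to-end training, as the DeepSketcher line of work emphasizes.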

5. Empirical Impact and Benchmarking

A consistent empirical finding is that IVTs, even when simulated with gold sketches or diagrams, yield substantial accuracy gains in complex spatial, mathematical, and planning tasks relative to text-only chain-of-thought baselines. Key quantitative results:

  • Collaborative Thoughts achieves 100% correct region counts versus <60% for both AR-Only and Diffusion-Only on cutting tasks, and reduces angle task token cost by four orders of magnitude (Yuan et al., 2 Feb 2026).
  • 3D Scratchpad improves text alignment on GenAI-Bench by 32% (0.83 vs 0.63), with pronounced benefits in spatial and compositional attributes (Saha et al., 21 Jan 2026).
  • MIRA benchmarks report a mean relative accuracy gain of 33.7% when gold IVTs are provided to models; text-only CoT offers only minor or negative improvement on spatial tasks (Zhou et al., 4 Nov 2025).
  • DeepSketcher, with tool-free embedding editing, improves geometry and logic benchmarks by 3-8 points over the state of the art, and attention maps from IVT updates closely track programmatic intent (Zhang et al., 30 Sep 2025).
  • On geometry, graph, and chess benchmarks, Visual Sketchpad yields mean accuracy improvements of 12.7% (math) and 8.6% (vision) (Hu et al., 2024).
  • In programmatic planning domains, use of conceptual diagrams triples or quadruples plan success rates compared to textual-only or naive search baselines (Borazjanizadeh et al., 14 Mar 2025).
  • Visual Abstract Thinking (sketches as prompts) delivers +15.4% over no-prompt and outperforms textual CoT and tool-based approaches in visual reasoning (Liu et al., 26 May 2025).

6. Analysis of Limitations and Challenges

Despite the strong foundational case for IVTs, several limitations have emerged:

  • Compounding Errors: Unified multimodal models often accumulate generation errors in their IVT sequences, resulting in hallucinations, loss of alignment, or semantically inconsistent visuals. Benchmarking with Mentis Oculi demonstrates that compounded IVT errors frequently degrade overall multi-step performance; even when correct intermediate visuals are injected, models may fail to leverage them in decision making (Zeller et al., 2 Feb 2026).
  • Interpretation vs. Generation Gap: Models often fail not in generating plausible visuals but in utilizing them for downstream action selection. This reflects architectural deficits in deep visual–textual grounding (Zeller et al., 2 Feb 2026).
  • Computational Overhead: Each visual thought imposes non-trivial costs in both token budget and runtime. See (Su et al., 30 Jun 2025) for explicit discussion of token explosion and the need for efficient visual representation schemes.
  • Generalization Rigor: Many empirical wins arise when models are provided—or simulate—ground-truth or high-fidelity IVTs. Full autonomy in generating necessary, optimal visual intermediates remains a challenge (Zhou et al., 4 Nov 2025, Huo et al., 28 Jan 2026).

Selected ablation studies show that omitting IVTs or their schemas typically results in steep accuracy declines (from 90.2% to 58% on Blocksworld) (Borazjanizadeh et al., 14 Mar 2025). Qualitative analyses confirm that performance correlates with the clarity and conciseness of the IVT chosen (Cheng et al., 21 May 2025).

7. Future Directions and Open Problems

Contemporary research identifies several open problems: reducing the token and runtime overhead that each visual thought imposes, closing the gap between generating plausible IVTs and actually exploiting them for downstream decisions, and achieving full autonomy in producing the necessary visual intermediates without gold sketches (Su et al., 30 Jun 2025, Zeller et al., 2 Feb 2026, Zhou et al., 4 Nov 2025).

Summary Table: Major Paradigms and Representative Methods

| Paradigm | Mechanism / Key Idea | Example Paper / Framework |
| --- | --- | --- |
| Tool-augmented (Stage 1) | Externally drawn, interpreted IVTs | Visual Sketchpad (Hu et al., 2024), VisuoThink (Wang et al., 12 Apr 2025) |
| Programmatic (Stage 2) | Code-based diagram generation | Visual Sketchpad (Hu et al., 2024), 3D Scratchpad (Saha et al., 21 Jan 2026), Visualizing Thought (Borazjanizadeh et al., 14 Mar 2025) |
| Intrinsic imagination (Stage 3) | End-to-end chain with visual tokens | MVoT (Li et al., 13 Jan 2025), DeepSketcher (Zhang et al., 30 Sep 2025), SoT (Huo et al., 28 Jan 2026), Thinking with Generated Images (Chern et al., 28 May 2025) |

IVTs have fundamentally expanded the computational horizon of multimodal AI, providing both a target for evaluation and a vehicle for compositional, grounded intelligence. Nonetheless, unifying robust, efficient, and fully-autonomous IVT-centric reasoning remains an open frontier.
