Interleaved Multi-modal Chain-of-Thought
- iMCoT is a reasoning paradigm that interleaves text and image tokens for unified, dynamic multimodal problem solving.
- It utilizes transformer architectures with integrated cross-modal attention to dynamically select and fuse visual and textual information.
- Empirical evaluations show significant accuracy gains and improved interpretability compared to decoupled or text-only chain-of-thought methods.
Interleaved Multi-modal Chain-of-Thought (iMCoT) is a family of reasoning architectures and methods for vision-language models (VLMs) in which textual and visual “thoughts” are alternated or tightly interleaved during the stepwise solution of a given multimodal task. In contrast to unimodal or decoupled multimodal reasoning, iMCoT processes maintain direct, dynamic, and complementary coordination between language and visual modalities at each intermediate step. The approach has shown systematic improvements in accuracy, interpretability, and visual grounding on complex vision-language benchmarks, and has given rise to several representative frameworks including ThinkMorph, Interleaved-modal CoT (ICoT), AIMCoT, Simple o3, and OmniDrive-R1 (Gu et al., 30 Oct 2025, Gao et al., 29 Nov 2024, Li et al., 30 Sep 2025, Wang et al., 16 Aug 2025, Zhang et al., 16 Dec 2025).
1. Conceptual Foundations and Motivation
The iMCoT paradigm is motivated by the limitations of both text-only chain-of-thought (CoT) and coarsely decoupled multimodal CoT prompting in vision-language reasoning. While pure textual CoT decomposes complex questions into interpretable pseudo-cognitive steps, it often fails to fully leverage or communicate the fine-grained information present in images. Classical multimodal CoT pipelines typically treat perception (visual feature extraction, region localization) and reasoning (textual analysis) as separate stages, using externally extracted region descriptions or attention maps as static inputs to an LLM. This decoupling impedes end-to-end optimization and leads to risks such as object hallucination, poor cross-modal alignment, and brittle heuristics for integrating visual evidence (Zhang et al., 16 Dec 2025, Gu et al., 30 Oct 2025, Gao et al., 29 Nov 2024).
iMCoT instead operationalizes the reasoning process as a unified sequence in which each intermediate step may be either:
- a textual rationale or hypothesis (“text thought”), or
- a concrete visual operation, patch insertion, image transformation, or region focus (“image thought”).
This iterative alternation enables VLMs to jointly select, update, and condition on those regions of the visual input that are most relevant to the evolving symbolic reasoning chain (Gu et al., 30 Oct 2025, Li et al., 30 Sep 2025, Wang et al., 16 Aug 2025).
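Concretely, an interleaved trace can be pictured as an ordered list of typed steps. The following minimal data model is a sketch for illustration only; the step types, field names, and example values are assumptions rather than the schema of any cited framework.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass
class TextThought:
    """A textual rationale or hypothesis produced at one reasoning step."""
    text: str

@dataclass
class ImageThought:
    """A visual operation at one reasoning step (e.g., crop or overlay),
    plus the resulting visual tokens re-injected into the sequence."""
    operation: str                      # e.g., "crop", "zoom", "overlay"
    region: Tuple[int, int, int, int]   # (x1, y1, x2, y2) in pixel coordinates
    tokens: List[int]                   # placeholder visual token ids

# An iMCoT trace is an ordered alternation of both step types.
Trace = List[Union[TextThought, ImageThought]]

example: Trace = [
    TextThought("The question asks about the leftmost traffic sign."),
    ImageThought(operation="crop", region=(12, 40, 180, 210), tokens=[101, 102, 103]),
    TextThought("The cropped region shows a yield sign, so the answer is 'yield'."),
]
```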
2. Model Architectures and Integration Mechanisms
Core iMCoT frameworks are built as single (often autoregressive) transformer models that natively process and emit both text and image tokens in interleaved, variable-length sequences. There are two dominant integration mechanisms:
- Visual patch/token interleaving: At each step, selected visual region embeddings (e.g., ViT patches) or edited/generated image tokens are intercalated into the input stream.
- Dynamic visual tool invocation: The reasoning model can call image transformation tools (e.g., crop, magnify, overlay) conditioned on the current reasoning context, producing transformed observations fed as next-step inputs (Wang et al., 16 Aug 2025, Zhang et al., 16 Dec 2025).
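The second mechanism can be pictured as a small registry of image operations that the reasoning loop invokes by name; the interface below is a purely illustrative sketch (no cited system exposes exactly these functions).

```python
from PIL import Image

def crop(image: Image.Image, box: tuple) -> Image.Image:
    """Return the sub-image inside box = (left, upper, right, lower)."""
    return image.crop(box)

def magnify(image: Image.Image, box: tuple, factor: int = 2) -> Image.Image:
    """Crop a region and upsample it, approximating a 'zoom-in' visual thought."""
    region = image.crop(box)
    return region.resize((region.width * factor, region.height * factor))

# Hypothetical registry: the model emits a tool name plus arguments, the
# controller executes the call and re-encodes the result as image tokens.
VISUAL_TOOLS = {"crop": crop, "magnify": magnify}
```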
In all cases, modality-specific hidden states are exchanged and fused at every transformer layer via cross-modal attention or gated mixture modules. A canonical gated fusion block is:

$$
h^{(l)} = W_f^{(l)}\!\left[\, g^{(l)} \odot h_t^{(l)} + \bigl(1 - g^{(l)}\bigr) \odot h_v^{(l)} \,\right],
\qquad
g^{(l)} = \sigma\!\left(W_g^{(l)}\bigl[h_t^{(l)};\, h_v^{(l)}\bigr]\right),
$$

where $h_t^{(l)}$ and $h_v^{(l)}$ are the text and vision states, $g^{(l)}$ is a learned gate, and $W_f^{(l)}$ projects the fused state at each layer (Gu et al., 30 Oct 2025).
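A minimal PyTorch sketch of a fusion module of this form follows; the concatenation-based gate, module names, and dimensions are illustrative assumptions consistent with the equation above, not a reproduction of any cited architecture.

```python
import torch
import torch.nn as nn

class GatedCrossModalFusion(nn.Module):
    """Fuse per-layer text and vision hidden states with a learned gate."""

    def __init__(self, d_model: int):
        super().__init__()
        self.gate_proj = nn.Linear(2 * d_model, d_model)  # W_g
        self.fuse_proj = nn.Linear(d_model, d_model)      # W_f

    def forward(self, h_text: torch.Tensor, h_vision: torch.Tensor) -> torch.Tensor:
        # h_text, h_vision: (batch, seq_len, d_model), aligned per position.
        gate = torch.sigmoid(self.gate_proj(torch.cat([h_text, h_vision], dim=-1)))
        fused = gate * h_text + (1.0 - gate) * h_vision
        return self.fuse_proj(fused)

# Example: fuse 4 aligned positions with hidden size 16.
fusion = GatedCrossModalFusion(d_model=16)
out = fusion(torch.randn(1, 4, 16), torch.randn(1, 4, 16))
print(out.shape)  # torch.Size([1, 4, 16])
```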
Recent iMCoT systems such as ThinkMorph (Gu et al., 30 Oct 2025) and CMMCoT (Zhang et al., 7 Mar 2025) support both token-level interleaving and external retrieval-based memory modules for cross-image or multi-step visual deliberation, allowing stepwise dynamic comparison across images.
3. Inference Procedures and Visual Information Selection
A central challenge in iMCoT is when and how to introduce visual content into the reasoning chain. Several selection and triggering mechanisms have been proposed:
- Passive attention-based selection: Visual regions are selected based on cross-modal attention heatmaps when signaled (e.g., newline tokens) during autoregressive decoding (Gao et al., 29 Nov 2024).
- Active information-driven selection: AIMCoT reframes region selection as an information gain maximization problem, using active visual probing (AVP) to select image regions that maximally reduce uncertainty about the next token. Candidate sets combine high-attention and exploratory patches, and a greedy, near-submodular selection process is applied dynamically based on attention shift statistics (Li et al., 30 Sep 2025).
- Reinforcement-driven action selection: OmniDrive-R1 (Zhang et al., 16 Dec 2025) integrates zoom-in or region-crop actions as explicit steps within policy learning, using RL with a CLIP-based process grounding reward.
The overall iMCoT inference process alternates between generating reasoning tokens and injecting selected visual tokens or transformed image states. Dynamic triggers such as attention-shift metrics initiate visual evidence insertion only when the model's attention dynamics indicate that additional visual grounding is needed.
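As a schematic, the decode loop below combines an attention-shift trigger with Top-K patch injection. Every function, threshold, and the `model.step` interface are assumptions for illustration; the sketch mirrors the general ICoT/AIMCoT recipe rather than either method's exact procedure.

```python
def interleaved_decode(model, question_tokens, image_patch_tokens,
                       max_steps=256, shift_threshold=0.3, top_k=4):
    """Alternate text generation with visual-token injection when attention shifts.

    `model.step` is assumed to return (next_token, attention_over_patches, done);
    this interface is hypothetical and stands in for an autoregressive VLM.
    """
    sequence = list(question_tokens)
    prev_attention = None
    for _ in range(max_steps):
        token, attention, done = model.step(sequence, image_patch_tokens)
        sequence.append(token)
        if done:
            break
        # Trigger: inject visual evidence only when attention over the image
        # shifts sharply, a proxy for the model "needing to look again".
        if prev_attention is not None:
            shift = sum(abs(a - b) for a, b in zip(attention, prev_attention))
            if shift > shift_threshold:
                top_patches = sorted(range(len(attention)),
                                     key=lambda i: attention[i], reverse=True)[:top_k]
                sequence.extend(image_patch_tokens[i] for i in top_patches)
        prev_attention = attention
    return sequence
```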
4. Training Objectives and Data Generation
Training iMCoT models requires data with densely annotated interleaved reasoning traces, which specify ordered sequences of textual rationales and visual operations/edit representations. Datasets such as those used by ThinkMorph (24,990 traces), Simple o3 (TWI-Tools-146K), and CMMCoT (CMMCoT-260K) are generated through a combination of:
- prompting LLMs or MLLMs with “observe–reason–act” templates,
- executing model- or tool-specified visual transformations,
- rigorous verification and answer-level consistency checking using separate models (Gu et al., 30 Oct 2025, Wang et al., 16 Aug 2025, Zhang et al., 7 Mar 2025).
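A schematic of this generation-and-verification pipeline is sketched below; the prompt wording, the `generator`/`verifier` handles, and the trace interface are all hypothetical and do not reproduce the cited datasets' actual construction scripts.

```python
OBSERVE_REASON_ACT_PROMPT = (
    "Observe the image and note what is relevant to the question, reason about "
    "it in words, and if needed specify a visual action (e.g., crop a region) "
    "before giving the final answer.\n\nQuestion: {question}"
)

def build_trace(generator, verifier, tools, image, question, reference_answer):
    """Generate one interleaved trace and keep it only if an independent
    verifier model reproduces the reference answer from the finished trace."""
    trace = generator.generate(
        OBSERVE_REASON_ACT_PROMPT.format(question=question), image=image)
    # Execute any tool calls the generator requested and append the results
    # as image thoughts.
    for step in list(trace.steps):
        if step.is_tool_call:
            trace.append_image(tools[step.tool_name](image, *step.args))
    # Answer-level consistency check with a separate model.
    return trace if verifier.answer(trace) == reference_answer else None
```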
The loss is typically a composite of next-token negative log-likelihood (for textual outputs), image-token reconstruction (MSE or diffusion losses for visual outputs), and auxiliary alignment losses such as a mean-squared error between projected visual and symbolic hidden states at each interleaved step (Gu et al., 30 Oct 2025, Zhang et al., 7 Mar 2025). In RL-based settings (e.g., OmniDrive-R1), process- and outcome-level rewards derived from CLIP similarity and downstream task accuracy are used (Zhang et al., 16 Dec 2025).
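A minimal sketch of such a composite objective in PyTorch is given below; the loss weights and the exact form of the alignment term are illustrative assumptions, not the implemented losses of ThinkMorph or CMMCoT.

```python
import torch
import torch.nn.functional as F

def imcot_loss(text_logits, text_targets, image_pred, image_target,
               vis_hidden, sym_hidden, w_img=1.0, w_align=0.1):
    """Composite iMCoT training loss:
    next-token NLL + image-token reconstruction + cross-modal alignment."""
    # Next-token prediction over textual outputs.
    nll = F.cross_entropy(text_logits.reshape(-1, text_logits.size(-1)),
                          text_targets.reshape(-1))
    # Reconstruction of emitted visual tokens (MSE variant; diffusion losses
    # are an alternative in some systems).
    recon = F.mse_loss(image_pred, image_target)
    # Alignment between projected visual and symbolic hidden states at each
    # interleaved step.
    align = F.mse_loss(vis_hidden, sym_hidden)
    return nll + w_img * recon + w_align * align

# Example with toy shapes: 2 sequences of 5 text tokens over a 100-token vocabulary.
loss = imcot_loss(torch.randn(2, 5, 100), torch.randint(0, 100, (2, 5)),
                  torch.randn(2, 8, 16), torch.randn(2, 8, 16),
                  torch.randn(2, 3, 16), torch.randn(2, 3, 16))
print(loss.item())
```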
5. Empirical Performance, Emergent Capabilities, and Interpretability
Across a range of benchmarks (VQA, chart reasoning, scene navigation, multi-hop QA, fine-grained perception), iMCoT frameworks consistently outperform text-only CoT and pipeline-based multimodal CoT. For example, ThinkMorph yields an absolute accuracy improvement of +34.7% on vision-centric tasks over Bagel-7B (Gu et al., 30 Oct 2025); Simple o3 delivers a +7.4 to +12.9 point improvement on HR-4K and V*Bench over the base Qwen2.5-VL-7B (Wang et al., 16 Aug 2025); AIMCoT surpasses prior ICoT methods by +5.5% (M3CoT), +4.08% (ScienceQA), and +18.25% (LLaVA-W ROUGE-L) (Li et al., 30 Sep 2025).
Key emergent properties include:
- Unsupervised visual actions: OmniDrive-R1 and ThinkMorph perform visual manipulations such as zoom, inpainting, and overlay without explicit supervision (Zhang et al., 16 Dec 2025, Gu et al., 30 Oct 2025).
- Autonomous mode selection: Models self-select unimodal (text-only) reasoning steps when additional visual input is redundant, improving sample efficiency (Gu et al., 30 Oct 2025).
- Interpretable traces: Each reasoning step is verifiably grounded to specific pixels or regions; user studies report reduced hallucination and more transparent model deliberation (Gao et al., 29 Nov 2024, Wang et al., 16 Aug 2025).
6. Limitations, Theoretical Insights, and Future Directions
Despite empirical gains, open challenges persist:
- Adaptive interleaving: Current models often insert visual evidence based on fixed or heuristic triggers; learning adaptive, task-aware interleaving remains open (Gu et al., 30 Oct 2025, Li et al., 30 Sep 2025).
- Richness and expressivity: Most methods focus on bounding boxes, overlays, or basic image edits. Extending iMCoT to support sketching, segmentation, or simulation-based thoughts is an unresolved direction (Gu et al., 30 Oct 2025, Wang et al., 16 Aug 2025).
- Efficiency trade-offs: Interleaving roughly doubles the number of tokens that must be decoded, increasing decoding time and compute requirements by 30–50% (Lin et al., 17 Feb 2025).
- Cross-modal verification: Practical, automatic consistency metrics for ensuring visual tokens truly advance the symbolic logic are under-developed (Gu et al., 30 Oct 2025, Zhang et al., 7 Mar 2025).
- Interactive co-thinking: Enabling human–AI collaborative reasoning with stepwise interventions is largely unexplored.
Theoretical analyses (e.g., clarity/conciseness metrics, attention flow ablations) reveal that the performance of iMCoT depends directly on the succinctness and informativeness of the visual thoughts; models that generate clear, minimal, and highly diagnostic visual steps yield greater reasoning improvements (Cheng et al., 21 May 2025).
7. Comparative Summary of iMCoT Methods
| Framework | Visual Step Triggering | Selection Strategy |
|---|---|---|
| ThinkMorph | Autoregressive, trained signal | Manual curation, task-specific |
| ICoT | Signal token (e.g., newline) | Attention-driven Top-K (ADS) |
| AIMCoT | Attention shift (DAT) | Info-gain maximization (AVP) |
| Simple o3 | Tool invocation, action step | Reason-act cycle, tool intent |
| OmniDrive-R1 | Policy learning (RL) | CLIP GRPO, action sampling |
Performance results and qualitative analyses uniformly demonstrate that active, purpose-driven information foraging and precise integration yield substantially higher accuracy, sample efficiency, and interpretability compared to text-only or passively interleaved methods (Gu et al., 30 Oct 2025, Li et al., 30 Sep 2025, Wang et al., 16 Aug 2025, Zhang et al., 16 Dec 2025, Cheng et al., 21 May 2025).
References
- "ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning" (Gu et al., 30 Oct 2025)
- "Interleaved-Modal Chain-of-Thought" (Gao et al., 29 Nov 2024)
- "AIMCoT: Active Information-driven Multimodal Chain-of-Thought for Vision-Language Reasoning" (Li et al., 30 Sep 2025)
- "Simple o3: Towards Interleaved Vision-Language Reasoning" (Wang et al., 16 Aug 2025)
- "OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving" (Zhang et al., 16 Dec 2025)
- "Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought" (Cheng et al., 21 May 2025)
- "Investigating Inference-time Scaling for Chain of Multi-modal Thought: A Preliminary Study" (Lin et al., 17 Feb 2025)
- "CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation" (Zhang et al., 7 Mar 2025)