ThinkMorph: Unified Multimodal Reasoning
- ThinkMorph is a unified multimodal model architecture that interleaves textual and visual reasoning steps to produce emergent chains-of-thought.
- It employs specialized control tokens and curated interleaved traces from diverse datasets to support precise visual manipulations and robust generalization.
- The design leverages adaptive modality switching and best-of-N sampling, yielding significant performance gains in tasks like spatial navigation and visual puzzles.
ThinkMorph is the designation for a unified multimodal model architecture and methodology for emergent interleaved chain-of-thought reasoning, distinguished by its direct enforcement of complementarity between textual and visual modalities. Developed as an empirical platform to investigate and characterize emergent intelligence in multimodal LLMs (MM LLMs), ThinkMorph is engineered for fine-grained, progressive manipulation of visual information balanced with coherent verbal logic. Its central contributions span model architecture, curated interleaved reasoning traces, quantitative performance, and observed emergent properties in multimodal cognition (Gu et al., 30 Oct 2025).
1. Foundational Principle: Complementarity in Interleaved Reasoning
ThinkMorph is rooted in the principle that text and image “thoughts” within a chain-of-thought should not merely mirror (isomorphic) or redundantly restate one another. Instead, they act as complementary steps wherein each modality fills informational gaps left by the other and jointly advances problem-solving. The architecture is designed to emulate iterative human-like “think-and-sketch” strategies: text steps hypothesize, describe, or compute aspects of a task, while image steps visualize, transform, and synthesize corresponding intermediate states.
This principle is embedded throughout data curation and model instruction—visual reasoning steps are constructed not to re-express text, but to manipulate or extend it in a task-directed fashion (e.g., bounding, overlaying, or altering images synchronized to the evolving verbal logic).
2. Model Architecture and Operational Mechanism
ThinkMorph is instantiated by supervised fine-tuning of Bagel-7B, a unified MM LLM, on a dataset of 24,990 manually curated interleaved traces. The model tokenizes both text (standard autoregressive language tokens) and images (latent codes, bracketed with <image_start>/<image_end>) within a joint stream. Crucially, specialized control tokens signal modality switching, enabling arbitrary interleaving during both training and inference.
Autoregressive production operates as follows:

$$p_\theta(y_{1:T} \mid x) \;=\; \prod_{t=1}^{T} p_\theta\!\left(y_t \mid y_{<t},\, x\right),$$

where each token $y_t$ is cast as either text ($y_t \in \mathcal{V}_{\text{text}}$) or visual ($y_t \in \mathcal{V}_{\text{image}}$), and the choice of modality per step is governed by the evolving context plus delimiter signals. At each <image_start>, the visual decoder composes or manipulates image output conditioned on prior tokens, supporting context-sensitive, progressive visual edits.
Loss functions combine a negative log-likelihood term ($\mathcal{L}_{\text{text}}$) for text tokens and a mean-squared error term ($\mathcal{L}_{\text{image}}$) for image latent tokens, optimized jointly. All model weights are unfrozen during fine-tuning.
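A minimal sketch of this joint objective, assuming PyTorch-style tensors; the function name, tensor shapes, and the equal weighting of the two terms are illustrative assumptions rather than the released implementation:

```python
import torch.nn.functional as F

def interleaved_loss(text_logits, text_targets, image_pred_latents, image_target_latents):
    """Joint objective sketch: NLL over text tokens plus MSE over image latent tokens.

    text_logits:          (num_text_tokens, vocab_size) logits at text positions
    text_targets:         (num_text_tokens,)            ground-truth text token ids
    image_pred_latents:   (num_image_tokens, latent_dim) predicted image latent codes
    image_target_latents: (num_image_tokens, latent_dim) target image latent codes
    """
    # Negative log-likelihood (cross-entropy) over the text positions.
    loss_text = F.cross_entropy(text_logits, text_targets)
    # Mean-squared error over the image latent positions.
    loss_image = F.mse_loss(image_pred_latents, image_target_latents)
    # Both terms are optimized jointly; equal weighting here is an assumption.
    return loss_text + loss_image
```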
3. Dataset of High-Quality Interleaved Reasoning Traces
The systemic advancement in ThinkMorph is underpinned by its bespoke training data:
- Jigsaw Assembly (6K samples): Alternating verbal description of parts and visual assembly/hypothesis.
- Spatial Navigation (6K): Interleaved text layout reasoning, image solution overlay, and final verification.
- Visual Search (6.99K): Textual hints, image bounding, attribute confirmation.
- Chart Refocus (6K): Language selection of relevant chart regions, active image overlays, and computation steps.
All samples emphasize task-grounded, verifiable visual manipulations synchronized to non-redundant textual reasoning, operationalizing the complementarity principle. Verbal steps typically introduce hypothesis, deduction, or local computation; visual steps execute higher-level manipulations or highlight essential elements, providing direct evidence for the correctness of each multimodal chain.
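To make the trace format concrete, the sketch below shows one hypothetical interleaved sample and its flattening into the joint token stream described above; the field names, file names, and helper function are illustrative assumptions, not the released dataset schema:

```python
# Hypothetical record for a single curated interleaved trace (spatial navigation).
trace = {
    "task": "spatial_navigation",
    "question_image": "maze_0421.png",
    "question_text": "Find a path from S to G that avoids the obstacles.",
    "steps": [
        {"modality": "text",  "content": "S is in the top-left cell; two obstacles block the middle row."},
        {"modality": "image", "content": "maze_0421_overlay.png"},  # candidate path drawn on the maze
        {"modality": "text",  "content": "The overlaid path avoids both obstacles, so the route is valid."},
    ],
    "answer": "right, right, down, down",
}

def to_token_stream(trace):
    """Flatten a trace into one interleaved stream, bracketing image steps
    with the <image_start>/<image_end> control tokens."""
    stream = [trace["question_text"]]
    for step in trace["steps"]:
        if step["modality"] == "text":
            stream.append(step["content"])
        else:
            stream += ["<image_start>", step["content"], "<image_end>"]
    stream.append(trace["answer"])
    return stream
```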
4. Progressive Multimodal Reasoning and Visual Manipulation
ThinkMorph advances iterative multimodal cognition via progressive generation of interleaved reasoning steps:
- Text steps introduce, describe, or verify partial solutions and hypotheses.
- Image steps perform active content manipulation: e.g., drawing bounding boxes, overlaying solution paths, assembling pieces, inpainting, zooming, region cropping.
This alternation is regulated by delimiter signals, allowing flexible depth and structure in chain-of-thought traces. During inference, the architecture can autonomously determine the number and ordering of text or image thoughts, conditioned on task demands.
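A simplified decoding loop illustrating how delimiter signals can steer modality during inference; `model.next_token`, `model.decode_image`, and the `<eos>` handling are assumed interfaces for illustration, not the actual ThinkMorph API:

```python
def generate_interleaved(model, prompt_tokens, max_steps=2048):
    """Emit text tokens autoregressively; a control token hands off to the visual decoder."""
    context = list(prompt_tokens)
    thoughts = []
    for _ in range(max_steps):
        token = model.next_token(context)                # assumed: one autoregressive step
        if token == "<image_start>":
            # The visual decoder composes or edits an image conditioned on all prior tokens.
            image_latents = model.decode_image(context)  # assumed visual-decoder call
            context += ["<image_start>", image_latents, "<image_end>"]
            thoughts.append(("image", image_latents))
        elif token == "<eos>":                           # assumed end-of-sequence token
            break
        else:
            context.append(token)
            thoughts.append(("text", token))
    return thoughts
```

In this sketch the number and ordering of text and image thoughts fall out of the loop itself, mirroring the observation that the model determines chain structure from task demands.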
5. Performance Metrics and Generalization
On a suite of vision-centric and multimodal reasoning benchmarks, ThinkMorph achieves the following accuracies (all values in %):
| Model | Size | VSP | VisPuzzle | ChartQA | VStar | BLINK-J | MMVP | SAT | BLINK | CV-Bench |
|---|---|---|---|---|---|---|---|---|---|---|
| ThinkMorph | 7B | 75.83 | 79.00 | 78.10 | 67.02 | 72.00 | 80.33 | 52.67 | 60.07 | 80.82 |
| Bagel (base) | 7B | 0.83 | 35.00 | 61.82 | 55.49 | 67.33 | 70.33 | 44.67 | 47.66 | 76.03 |
| Qwen2.5-VL-72B | 72B | 41.83 | 40.00 | 82.03 | 85.86 | 61.33 | 82.0 | 64.67 | 61.91 | 82.54 |
| InternVL3.5-38B | 38B | 20.16 | 36.50 | 80.44 | 76.96 | 80.67 | 80.33 | 49.33 | 62.65 | 85.96 |
| Gemini 2.5 Flash | - | 59.33 | 47.00 | 83.79 | 70.68 | 66.00 | 80.33 | 56.00 | 67.49 | 85.07 |
The architecture delivers a +34.7% absolute gain over the base Bagel-7B on benchmarks that reward concrete visual manipulation and interleaved reasoning; for spatial navigation, the improvement reaches +85.8% (from 0.83% to 86.67%). On out-of-domain tasks (e.g., SAT), ThinkMorph matches or surpasses larger and proprietary models such as InternVL3.5-38B and Gemini 2.5 Flash. Interleaved chains also outperform text-only and image-only chains by 5.33%, and they remain robust under distributional shift because best-of-N sampling at inference increases solution diversity.
Generalization is robust: despite its limited training data (24,990 traces) and moderate size (7B), ThinkMorph maintains high accuracy on unseen task types and altered image inputs, exceeding much larger open and proprietary models on reasoning-centric tasks.
6. Emergent Properties of Unified Multimodal Reasoning
The paper identifies several empirically emergent capabilities:
6.1 Unseen Visual Manipulation Skills
At test time, ThinkMorph performs visual edits such as zoom-in, inpainting, multi-box generation, motion prediction, or elimination, even in scenarios not covered in training data. These manipulations are functionally and contextually matched to ongoing reasoning—requests to "zoom" or "focus" trigger appropriate visual edits.
6.2 Adaptive Modality Switching
The model autonomously chooses to drop image steps and switch to text-only reasoning in about 5% of out-of-domain cases, increasing accuracy by +7.29% on those samples. This switching is context-sensitive and task-adaptive, mirroring human reasoning strategies for efficient problem-solving.
6.3 Improved Test-Time Scaling via Solution Diversity
Best-of-N sampling of multimodal chains at inference yields improvements that grow monotonically, or at least hold up robustly, as N increases (+8% on BLINK-J), because solution diversity expands. Unimodal chains, in contrast, plateau or decline under the same scaling, especially in OOD regimes.
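A minimal best-of-N sketch; `model.sample_chain` and `model.score` are assumed interfaces, and selecting the highest-scoring chain is an illustrative choice rather than the paper's exact selection rule:

```python
def best_of_n(model, prompt, n=8, temperature=0.9):
    """Sample N interleaved reasoning chains and keep the one the scorer ranks highest."""
    chains = [model.sample_chain(prompt, temperature=temperature) for _ in range(n)]
    return max(chains, key=lambda chain: model.score(prompt, chain))
```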
| Property | Description & Evidence |
|---|---|
| Unseen Visual Manipulation | Visual edits (zoom, inpaint, crop, etc.) contextually match step logic, emergent at test time. |
| Autonomous Mode Switching | Text-only chains selected adaptively, boosting sample efficiency and accuracy on certain tasks. |
| Better Test-Time Scaling | Interleaved chains maintain accuracy growth as N increases; unimodal chains stagnate. |
7. Implications for Multimodal Model Design and Future Work
ThinkMorph demonstrates that properly curated interleaved chain-of-thought training, emphasizing complementary rather than isomorphic traces, yields greater-than-sum multimodal intelligence. Joint reasoning by text and vision supports behaviors not accessible to either alone, including advanced manipulation, adaptive reasoning, and robust generalization.
For the field, this suggests new directions in:
- Strengthening adaptive modality selection within unified models.
- Deepening cross-modal alignment and logic propagation.
- Scaling curated interleaved data to enhance generalization and emergent capabilities.
- Further exploration of solution diversity via best-of-N sampling.
These directions are expected to shape future research in unified multimodal reasoning systems.
8. Key Equations and Technical Formulations
Chain-of-Thought Sequence Generation:

$$p_\theta(y_{1:T} \mid x) \;=\; \prod_{t=1}^{T} p_\theta\!\left(y_t \mid y_{<t},\, x\right), \qquad y_t \in \mathcal{V}_{\text{text}} \cup \mathcal{V}_{\text{image}}$$

Loss Functions (with $\mathcal{T}_{\text{text}}$ and $\mathcal{T}_{\text{image}}$ indexing the text and image positions of the sequence, and $z_t$, $\hat{z}_t$ the target and predicted image latent codes):
- Text: $\mathcal{L}_{\text{text}} = -\sum_{t \in \mathcal{T}_{\text{text}}} \log p_\theta(y_t \mid y_{<t}, x)$
- Image: $\mathcal{L}_{\text{image}} = \sum_{t \in \mathcal{T}_{\text{image}}} \lVert \hat{z}_t - z_t \rVert_2^2$

Test-Time Scaling with Best-of-N:

$$\hat{y} \;=\; \mathrm{Select}\!\left(\{c_i\}_{i=1}^{N}\right), \qquad c_i \sim p_\theta(\cdot \mid x),$$

where $c_i$ is the $i$-th reasoning chain sampled.
ThinkMorph establishes a comprehensive framework and empirical baseline for unified multimodal reasoning that leverages interleaved, complementary chains-of-thought. Through high-quality interleaved supervision and architectural support for flexible modality alternation, ThinkMorph attains superior performance, emergent manipulation skills, and adaptive cognition, advancing the understanding and capabilities of large-scale multimodal models for complex visual-linguistic tasks (Gu et al., 30 Oct 2025).