Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens (2506.17218v1)

Published 20 Jun 2025 in cs.CV and cs.AI

Abstract: Vision-language models (VLMs) excel at multimodal understanding, yet their text-only decoding forces them to verbalize visual reasoning, limiting performance on tasks that demand visual imagination. Recent attempts train VLMs to render explicit images, but the heavy image-generation pre-training often hinders the reasoning ability. Inspired by the way humans reason with mental imagery, the internal construction and manipulation of visual cues, we investigate whether VLMs can reason through interleaved multimodal trajectories without producing explicit images. To this end, we present a Machine Mental Imagery framework, dubbed Mirage, which augments VLM decoding with latent visual tokens alongside ordinary text. Concretely, whenever the model chooses to "think visually", it recasts its hidden states as next tokens, thereby continuing a multimodal trajectory without generating pixel-level images. We begin by supervising the latent tokens through distillation from ground-truth image embeddings, then switch to text-only supervision so that the latent trajectory aligns tightly with the task objective. A subsequent reinforcement learning stage further enhances the multimodal reasoning capability. Experiments on diverse benchmarks demonstrate that Mirage unlocks stronger multimodal reasoning without explicit image generation.

Summary

  • The paper introduces the Mirage framework, which integrates latent visual tokens with textual reasoning to enhance multimodal analysis.
  • The paper employs a two-stage fine-tuning paradigm that jointly supervises text and latent tokens, significantly boosting performance on spatial and formal reasoning tasks.
  • The paper's experiments demonstrate improved performance on benchmarks such as VSP and SAT compared to traditional text-only models.

Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens

The paper "Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens" presents the Mirage framework, which enhances Vision-LLMs (VLMs) by incorporating latent visual tokens to facilitate multimodal reasoning without explicit image generation.

Introduction and Motivation

VLMs have shown significant progress in multimodal understanding tasks. However, existing models typically rely on text-only decoding, which limits their performance in tasks requiring visual imagination or reasoning. The Mirage framework addresses this limitation by introducing latent visual tokens that allow VLMs to interleave visual and textual reasoning without the need for pixel-level image generation.

The Mirage approach is inspired by human mental imagery, where individuals construct and manipulate simplified visual cues internally, rather than producing detailed visual images. This simplifies the reasoning process and makes it computationally efficient.
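To make the decoding mechanism concrete, the sketch below shows how such interleaved latent/text decoding could be wired up for a decoder-only VLM. It assumes a HuggingFace-style interface (`inputs_embeds`, `output_hidden_states`, `get_input_embeddings`); the special-token id, the number of latent steps, and greedy sampling are illustrative choices, not details taken from the paper.

```python
import torch

THINK_VISUAL_ID = 32001   # hypothetical id of a "think visually" special token
NUM_LATENT_STEPS = 4      # illustrative number of latent visual tokens per thought

@torch.no_grad()
def interleaved_decode(model, tokenizer, input_embeds, max_new_tokens=64):
    """Greedy decoding that mixes discrete text tokens with latent visual tokens."""
    embeds = input_embeds                                   # (1, seq_len, hidden)
    text_ids = []
    for _ in range(max_new_tokens):
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        next_id = out.logits[:, -1].argmax(dim=-1)          # greedy text choice
        if next_id.item() == THINK_VISUAL_ID:
            # "Think visually": reuse the model's own last hidden state as the
            # next input embedding instead of embedding a discrete text token.
            for _ in range(NUM_LATENT_STEPS):
                latent = out.hidden_states[-1][:, -1:, :]   # (1, 1, hidden)
                embeds = torch.cat([embeds, latent], dim=1)
                out = model(inputs_embeds=embeds, output_hidden_states=True)
        else:
            text_ids.append(next_id.item())
            if next_id.item() == tokenizer.eos_token_id:
                break
            tok_embed = model.get_input_embeddings()(next_id)[:, None, :]
            embeds = torch.cat([embeds, tok_embed], dim=1)
    return tokenizer.decode(text_ids)
```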

Mirage Framework

Mirage operates in a two-stage fine-tuning paradigm:

  1. Stage 1: Joint Supervision for Latent Grounding. In the first stage, Mirage supervises both text and latent visual tokens: the VLM is trained to predict the next word and to reconstruct a compact latent visual vector from compressed image embeddings. This stage anchors the latent tokens in the visual subspace, ensuring they carry meaningful visual cues for later reasoning steps (Figure 1).

    Figure 1: Pipeline of Mirage Framework. Stage 1 jointly supervises text and latent visual tokens, grounding the latter in the visual subspace; Stage 2 drops the latent supervision, anchoring the grounded latent tokens for subsequent text generation.

  2. Stage 2: Text-Only Supervision with Latent Relaxation. The second stage removes direct supervision on the latent vectors and optimizes only the text tokens. The model treats its autoregressively generated latent embeddings as priors that guide subsequent word generation, yielding flexible interleaved reasoning without enforcing predefined embeddings.

This framework allows VLMs to produce interleaved reasoning trajectories that blend latent visual tokens with text, enhancing the model's ability to perform complex multimodal reasoning tasks.
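The two stages can be read as two loss functions on the same interleaved sequence. The sketch below is a minimal PyTorch rendering of that idea, assuming batches that carry boolean masks marking text and latent positions and pre-computed compressed image embeddings as distillation targets; the field names, the cosine-distance distillation term, and the loss weighting are assumptions for illustration rather than the paper's exact choices.

```python
import torch.nn.functional as F

def stage1_loss(model, batch, latent_weight=1.0):
    """Stage 1: joint supervision on text tokens and latent visual tokens."""
    out = model(inputs_embeds=batch["input_embeds"], output_hidden_states=True)
    # Next-token cross-entropy on positions holding text labels.
    text_loss = F.cross_entropy(out.logits[batch["text_mask"]],
                                batch["text_labels"])
    # Distillation: hidden states at latent positions should match the
    # compressed ground-truth (helper) image embeddings.
    pred_latents = out.hidden_states[-1][batch["latent_mask"]]
    latent_loss = 1.0 - F.cosine_similarity(
        pred_latents, batch["image_embed_targets"], dim=-1
    ).mean()
    return text_loss + latent_weight * latent_loss

def stage2_loss(model, batch):
    """Stage 2: drop latent supervision; the model's own latent embeddings act
    only as priors for subsequent text, so only the text loss remains."""
    out = model(inputs_embeds=batch["input_embeds"])
    return F.cross_entropy(out.logits[batch["text_mask"]],
                           batch["text_labels"])
```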

Experimental Validation

The experiments demonstrate Mirage's effectiveness across diverse benchmarks, including spatial reasoning tasks (VSP, BLINK-Jigsaw, SAT) and formal spatial reasoning in mathematical contexts (COMT). The evaluations show that Mirage significantly enhances reasoning ability compared to text-only baselines and performs comparably to unified multimodal models that require pixel-level supervision (Figure 2).

Figure 2: Performance with Helper Images as Input Priors. We evaluate model accuracy using synthesized helper images under both zero-shot and fine-tuned settings. The results highlight the informativeness of the generated images and confirm their high data quality.

Analysis and Insights

Data Generation Quality

The paper introduces a data-generation pipeline for creating the task-specific helper images used during training. These helper images significantly improve model performance by providing informative visual cues essential for multimodal reasoning (Figure 3).

Figure 3: Data-generation Pipeline. For each question-answer pair, we first create a helper image with task-specific tools (here, annotating the map with arrows), then prompt a VLM to produce textual reasoning that embeds this image. The text and helper image together form the synthetic multimodal trajectory used for training.
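As a rough illustration of this pipeline, the sketch below builds one synthetic multimodal trajectory for a maze-style navigation question. The helpers `draw_path_arrows` and `vlm_generate` are hypothetical stand-ins for a task-specific annotation tool and an off-the-shelf VLM call, and the prompt wording is likewise an assumption.

```python
def build_training_example(question, answer, image, draw_path_arrows, vlm_generate):
    """Create one synthetic multimodal trajectory (text + helper image)."""
    # 1. Use a task-specific tool to render a helper image, e.g. annotate the
    #    map with arrows along the ground-truth path.
    helper_image = draw_path_arrows(image, answer)

    # 2. Prompt a VLM to write step-by-step reasoning that refers to the
    #    helper image at the point where a visual cue is useful.
    prompt = (
        f"Question: {question}\n"
        "Reason step by step, inserting a reference to the annotated image "
        "where it helps, and finish with the final answer."
    )
    reasoning = vlm_generate(images=[image, helper_image], prompt=prompt)

    # 3. Text and helper image together form the training trajectory.
    return {"question": question,
            "images": [image, helper_image],
            "reasoning": reasoning,
            "answer": answer}
```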

Latent Embeddings

An analysis of the latent embeddings using t-SNE shows that the latent tokens remain closely associated with the visual representation subspace while retaining the adaptability introduced in the second training stage. This indicates that Mirage preserves useful visual information within flexible reasoning trajectories (Figure 4).

Figure 4: Visualization of Latent Embeddings. We visualize our latent tokens along with text and image embeddings with t-SNE. Our latent tokens cluster near, yet just outside, the visual representation subspace, consistent with the two-stage training design.
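A similar analysis can be reproduced with standard tooling once the three sets of embeddings have been collected. The sketch below assumes the embeddings are available as NumPy arrays and uses scikit-learn's t-SNE with matplotlib; it is an illustrative recipe, not the authors' own analysis code.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embedding_spaces(text_emb, image_emb, latent_emb):
    """Project text, image, and latent token embeddings to 2-D and plot them."""
    all_emb = np.concatenate([text_emb, image_emb, latent_emb], axis=0)
    coords = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(all_emb)
    splits = np.cumsum([len(text_emb), len(image_emb)])          # group boundaries
    for part, label in zip(np.split(coords, splits), ["text", "image", "latent"]):
        plt.scatter(part[:, 0], part[:, 1], s=4, label=label)
    plt.legend()
    plt.title("t-SNE of text, image, and latent token embeddings")
    plt.show()
```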

Conclusion and Future Directions

Mirage offers a novel approach to enhancing VLMs by integrating latent visual tokens for interleaved multimodal reasoning. By eliminating the need for pixel-level image generation, Mirage enables efficient reasoning while maintaining rich visual information. Future research could explore the integration of Mirage into unified models and its application to a broader range of multimodal and textual tasks. Additionally, improving the quality of synthetic multimodal trajectories remains a critical area for development.

Mirage thus unlocks deeper multimodal reasoning in VLMs and points to promising directions for future advances in AI reasoning.
