
Mull-Tokens: Modality-Agnostic Latent Thinking (2512.10941v1)

Published 11 Dec 2025 in cs.CV and cs.AI

Abstract: Reasoning goes beyond language; the real world requires reasoning about space, time, affordances, and much more that words alone cannot convey. Existing multimodal models exploring the potential of reasoning with images are brittle and do not scale. They rely on calling specialist tools, costly generation of images, or handcrafted reasoning data to switch between text and image thoughts. Instead, we offer a simpler alternative -- Mull-Tokens -- modality-agnostic latent tokens pre-trained to hold intermediate information in either image or text modalities to let the model think free-form towards the correct answer. We investigate best practices to train Mull-Tokens inspired by latent reasoning frameworks. We first train Mull-Tokens using supervision from interleaved text-image traces, and then fine-tune without any supervision by only using the final answers. Across four challenging spatial reasoning benchmarks involving tasks such as solving puzzles and taking different perspectives, we demonstrate that Mull-Tokens improve upon several baselines utilizing text-only reasoning or interleaved image-text reasoning, achieving a +3% average improvement and up to +16% on a puzzle solving reasoning-heavy split compared to our strongest baseline. Adding to conversations around challenges in grounding textual and visual reasoning, Mull-Tokens offers a simple solution to abstractly think in multiple modalities.

Summary

  • The paper introduces a novel modality-agnostic Mull token system that unifies text and image reasoning to enhance spatial understanding.
  • The methodology employs a two-stage process: pre-training Mull tokens on interleaved text-image traces, then fine-tuning with supervision from final answers alone, yielding up to a 16% accuracy improvement on complex tasks.
  • This approach reduces computational complexity and is adaptable for real-time applications in robotics, autonomous navigation, and interactive AI.

"Mull-Tokens: Modality-Agnostic Latent Thinking" (2512.10941)

Introduction

The paper "Mull-Tokens: Modality-Agnostic Latent Thinking" introduces an approach to enhance spatial reasoning in multimodal tasks through a modality-agnostic latent token system called Mull. Traditional methods rely heavily on specialist tools or interleaved text-image thoughts, which can be cumbersome and inefficient. Mull offers a simpler alternative: latent tokens that can hold intermediate information in either the image or text modality, letting the model think abstractly across modalities rather than within a purely text- or image-centric reasoning framework.

Methodology

Modality-Agnostic Mull

Mull operates on a two-stage processing paradigm inspired by latent reasoning in NLP models. The first stage involves pre-training Mull tokens using paired text-image data, anchoring each token to a relevant concept across these modalities. This enables the tokens to capture essential features required for tasks such as spatial reasoning, symbolic mapping, and visual layout comprehension. Following pre-training, Mull undergoes fine-tuning to optimize latent trajectories, removing explicit supervision constraints and allowing free-form optimization based solely on final answers.
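The two-stage objective can be sketched as follows. This is an illustrative reconstruction, not the paper's released code: the function names, the use of an MSE alignment term, and the exact loss composition are assumptions.

```python
# Hypothetical sketch of the two-stage Mull training objective.
import torch
import torch.nn.functional as F

def stage1_loss(latent_states, trace_targets, answer_logits, answer_ids):
    """Stage 1 (warm-up): anchor the latent Mull states to targets distilled
    from interleaved text-image reasoning traces, while also predicting
    the final answer."""
    align = F.mse_loss(latent_states, trace_targets)        # trace supervision
    answer = F.cross_entropy(answer_logits, answer_ids)     # final-answer loss
    return align + answer

def stage2_loss(answer_logits, answer_ids):
    """Stage 2: drop the trace supervision entirely; gradients reach the
    latent tokens only through the final-answer loss, so their
    trajectories are free-form optimized."""
    return F.cross_entropy(answer_logits, answer_ids)
```

The key design choice, per the paper, is that stage 2 removes all intermediate supervision, so the latents are shaped only by what helps reach the correct answer.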

The architecture leverages Qwen2.5-VL, a multimodal LLM, as its backbone, employing discrete Mull tokens embedded within the model's reasoning pathways. This configuration facilitates effective internal computations without necessitating the external decoding of text or images during intermediate processing. The flexibility of this system allows Mull to function as a scratchpad for internal thought processes, aligning well with both the image and text modalities without explicit modality switching.
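Mechanically, embedding latent tokens in the reasoning pathway amounts to splicing learned embeddings into the input sequence before decoding the answer. The sketch below is a simplified illustration; the actual integration with Qwen2.5-VL, and names like `mull_table`, are assumptions.

```python
# Illustrative: splicing learned latent "scratchpad" tokens into the sequence.
import torch

def insert_mull_tokens(query_emb, mull_table, n_mull=8):
    """Append n_mull learned latent embeddings after the embedded query so
    the model can compute internally before the answer is decoded.

    query_emb:  (batch, seq, dim) embedded image+text query
    mull_table: (vocab, dim) learned latent-token embedding table
    """
    batch = query_emb.size(0)
    latents = mull_table[:n_mull].unsqueeze(0).expand(batch, -1, -1)
    return torch.cat([query_emb, latents], dim=1)  # (batch, seq + n_mull, dim)
```

Because the latents are never decoded into text or pixels, intermediate "thoughts" stay in the model's hidden space, which is what makes the scheme modality-agnostic.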

Figure 1: Compared to existing approaches for reasoning in text or interleaving image and text, we offer a simpler alternative - modality-agnostic thinking in the vision-language space using Mull to answer visual queries.

Results and Analysis

The evaluation of Mull across visual reasoning benchmarks such as BLINK, SAT-R, and VSI-Bench demonstrates consistent gains over existing reasoning methods, particularly on complex tasks requiring spatial manipulation and perspective-taking. For instance, Mull improves accuracy by up to 16% on a reasoning-heavy puzzle-solving split compared to the strongest baseline, including interleaved image-text reasoning models. These results affirm Mull's efficacy on reasoning-intensive tasks while using fewer tokens than models that rely on verbose reasoning chains.

Additionally, the research investigates the effect of token count on performance: increasing the number of Mull tokens initially improves results, but gains diminish beyond a certain threshold. This underscores the importance of tuning the number of active tokens to balance accuracy against computational cost.
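That diminishing-returns pattern suggests a simple recipe for choosing a token budget on a validation split. The sketch below is purely illustrative and not from the paper; `evaluate_fn` stands in for measured benchmark accuracy.

```python
# Toy helper: pick the smallest latent-token budget past which accuracy
# gains fall below a threshold (all values are hypothetical).
def pick_token_budget(evaluate_fn, budgets, min_gain=0.5):
    """Walk increasing budgets; stop once the accuracy gain over the
    previous budget drops below min_gain percentage points."""
    best = budgets[0]
    prev = evaluate_fn(best)
    for b in budgets[1:]:
        acc = evaluate_fn(b)
        if acc - prev < min_gain:
            break  # diminishing returns: extra tokens no longer pay off
        best, prev = b, acc
    return best
```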

Figure 2: Our Mull training involves two stages inspired by approaches in latent reasoning. We first pre-train/warm-up the Mull to hold both image and text modalities depending on the context image/video and query. Next, the model free-form optimizes these Mull to reach the final correct answer. Pre-training the Mull to hold both image and text reasoning traces is key, as opposed to using them simply as extra compute without such pre-training, or using text alone.

Implications and Speculations

Practically, Mull represents a step towards more efficient and versatile multimodal reasoning systems. By circumventing the need for modality-specific reasoning supervision or large task-specific datasets, Mull reduces computational overhead and broadens applicability across task domains. This facilitates models capable of complex visual reasoning in real time, which is critical for applications in robotics, autonomous navigation, and interactive AI systems.

Theoretically, Mull opens several avenues for further exploration, such as extending its utility to encompass even more intricate and multidimensional data types like 3D environments and continuous video feeds. Future work could also explore refining the interpretability of Mull tokens to provide more insights into their internal reasoning processes, potentially leading to more transparent AI models.

Conclusion

The paper outlines a significant advancement in multimodal reasoning through the introduction of Mull, a modality-agnostic latent thinking framework. Mull proves to be a formidable alternative to traditional reasoning methods, offering improved accuracy and reduced complexity. As AI continues to evolve, Mull could play a crucial role in bridging the gap between human-like reasoning capabilities and machine intelligence, paving the way for more robust and adaptable AI models.
