Mull-Tokens: Discrete Latent Reasoning
- Mull-Tokens are modality-agnostic discrete latent tokens that serve as internal scratchpads to represent free-form, nondecodable information across text, image, and speech modalities.
- They integrate into transformer architectures by prepending special tokens and training with a three-stage paradigm (multimodal CoT warm-up, relaxed finetuning, RL refinement) that strengthens multimodal reasoning.
- Empirical results demonstrate performance gains of 3-16 percentage points on benchmarks and efficiency improvements in speech recognition, underlining their significance for scalable deep learning.
Mull-Tokens are modality-agnostic, trainable discrete latent tokens designed to serve as internal “scratchpads” for intermediate reasoning in deep learning systems. Unlike conventional architectures that hard-switch between modalities (e.g., text and images), Mull-Tokens provide a shared latent space capable of representing free-form, nondecodable information invariant to input type, thus enabling robust and scalable multimodal reasoning. The paradigm has seen rapid uptake in multimodal language and vision-language modeling, chain-of-thought reinforcement learning, and efficient speech-token representations, catalyzing recent empirical and theoretical advances in cognitive architectures for artificial intelligence (Ray et al., 11 Dec 2025, Jain et al., 25 Sep 2025, Cui et al., 13 Sep 2024).
1. Formal Characterization and Latent Structure
Mull-Tokens are instantiated as a sequence of $K$ special, trainable discrete tokens prepended to (or interleaved with) standard LLM inputs. For a multimodal model, a generic input sequence at inference assumes the form $[x;\, m_1, \dots, m_K]$, where $x$ is the text/context and $m_1, \dots, m_K$ are the Mull-Tokens. Each Mull-Token $m_i$ obtains a hidden state $h_i \in \mathbb{R}^d$ from the transformer backbone, forming the matrix $H = [h_1; \dots; h_K] \in \mathbb{R}^{K \times d}$. These states participate in the full self-attention mechanism and are free to represent textual, visual, or abstract intermediate representations. Crucially, Mull-Tokens are not decoded into surface text or image but instead act as shared persistent latents, to which all subsequent computation—including autoregressive answer decoding—can attend (Ray et al., 11 Dec 2025).
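The mechanics above can be sketched numerically: append learnable latent embeddings after the context, run one self-attention pass, and read off the Mull hidden states. This is a minimal illustrative sketch (single head, numpy, random embeddings standing in for a trained backbone), not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T = 16, 4, 6   # hidden dim, number of Mull-Tokens, context length (illustrative sizes)

x = rng.normal(size=(T, d))   # stand-in for text/image context embeddings
m = rng.normal(size=(K, d))   # stand-in for trainable Mull-Token embeddings

# Input sequence [x; m]: Mull-Tokens appended after the context.
seq = np.concatenate([x, m], axis=0)          # shape (T + K, d)

# One full self-attention pass: every position can attend to every other,
# so later answer decoding can read the Mull-Token hidden states.
scores = seq @ seq.T / np.sqrt(d)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
hidden = attn @ seq                           # joint representations

H = hidden[T:]                                # Mull-Token hidden states, shape (K, d)
```

Because the Mull positions sit inside ordinary self-attention, no architectural change beyond the extra sequence slots is needed.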
2. Integration into Model Architectures
The canonical implementation in (Ray et al., 11 Dec 2025) augments a Qwen2.5-VL (7B) backbone, reserving new vocabulary slots for the Mull special tokens and inserting them after the input question context. Raw images/videos are encoded by a frozen image encoder (ResNet or ViT). The full sequence, including Mull-Tokens, is fed into the transformer, which computes joint representations. During decoding, answer tokens attend to both the original input and the Mull-Tokens’ hidden states. Only the token vocabulary and slight changes to input composition are required; the transformer layers, encoder, and answer decoding pipeline remain architecturally untouched. Self-attention connects all positions, ensuring that Mull-Tokens act as both recipients and broadcasters of intermediate multimodal context (Ray et al., 11 Dec 2025).
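Reserving vocabulary slots amounts to extending the embedding table and splicing the new ids into the input. The sketch below assumes illustrative sizes and small-scale random initialization for the new rows; it is not the paper's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, K = 100, 16, 4            # base vocab size, hidden dim, number of Mull slots (illustrative)

embed = rng.normal(size=(V, d))                 # existing token embedding table
mull_ids = np.arange(V, V + K)                  # new vocabulary ids reserved for Mull-Tokens
new_rows = 0.02 * rng.normal(size=(K, d))       # assumed small-scale init for the new slots
embed = np.concatenate([embed, new_rows], axis=0)

# Insert the Mull-Token ids right after the question/context ids;
# the transformer stack itself is untouched.
question_ids = np.array([5, 17, 42])
input_ids = np.concatenate([question_ids, mull_ids])
inputs = embed[input_ids]                       # (len(question) + K, d), fed to the backbone
```

In a Hugging Face-style stack the same effect would come from adding special tokens to the tokenizer and resizing the model's embedding matrix.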
3. Training Paradigms and Objective Functions
Training utilizes a three-stage curriculum:
- Stage 1: Warm-Up on Multimodal Chain-of-Thought (CoT): Interleaved sequences contain both Mull-Tokens and intermediate text or subgoal image tokens. Losses include cross-entropy (for text), cosine loss (for image mimicking), and standard autoregressive answer loss:
  $$\mathcal{L}_{\text{stage1}} = \mathcal{L}_{\text{CE}} + \lambda\,\mathcal{L}_{\cos} + \mathcal{L}_{\text{ans}},$$
  where $\mathcal{L}_{\cos} = \frac{1}{K}\sum_{i}\bigl(1 - \cos(h_i, e_i)\bigr)$ aligns each Mull hidden state $h_i$ with a target image embedding $e_i$, and $\mathcal{L}_{\text{ans}}$ is the autoregressive cross-entropy on the final answer.
- Stage 2: Relaxed Finetuning: Losses focus only on maximizing final answer likelihood, dropping explicit Mull supervision: $\mathcal{L}_{\text{stage2}} = -\log p_\theta(y \mid x, m_{1:K})$.
- Stage 3: RL (GRPO) Refinement: Sequences are treated as RL trajectories (policy ). Rewards are given by task correctness, and group relative policy optimization (with KL anchor regularization) is performed to shape token reasoning chains causally linked to reward (Ray et al., 11 Dec 2025).
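The stage-1 versus stage-2 loss composition can be sketched as follows; the weighting `lam` and the random inputs are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_entropy(logits, target):
    """Token-level cross-entropy, used for text and answer supervision."""
    z = logits - logits.max()
    return float(np.log(np.exp(z).sum()) - z[target])

def cosine_loss(h, e):
    """1 - cos(h, e): push a Mull hidden state toward a target image embedding."""
    return float(1.0 - h @ e / (np.linalg.norm(h) * np.linalg.norm(e)))

logits = rng.normal(size=20)                    # stand-in decoder logits
h, e = rng.normal(size=8), rng.normal(size=8)   # Mull hidden state, target embedding
lam = 0.5                                       # assumed image-loss weight

# Stage 1: joint supervision over text, image-mimicking, and answer terms.
loss_stage1 = cross_entropy(logits, 3) + lam * cosine_loss(h, e) + cross_entropy(logits, 7)

# Stage 2: only the final-answer likelihood term survives.
loss_stage2 = cross_entropy(logits, 7)
```

Dropping the explicit Mull supervision in stage 2 lets the latent tokens drift to whatever representation best serves the answer.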
Alternative implementations within RLVR frameworks use mixture-of-token generation (MoT-G), passing mixture embeddings instead of hard token decisions, and redefining policy optimization over these continuous mixtures. This has been shown to substantially enhance training efficiency and solution diversity (Jain et al., 25 Sep 2025).
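The hard-versus-mixture distinction can be shown in a few lines: instead of feeding back the embedding of the single argmax token, MoT-G feeds a probability-weighted mixture of the top-$k$ embeddings. Sizes, `k`, and the random logits below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 50, 8                        # vocab size, embedding dim (illustrative)
embed = rng.normal(size=(V, d))     # stand-in token embedding table
logits = rng.normal(size=V)         # stand-in next-token logits

# Hard decision: feed back the embedding of the single argmax token.
hard = embed[np.argmax(logits)]

# MoT-G style: feed back the renormalized top-k probability mixture instead.
k = 4
p = np.exp(logits - logits.max())
p /= p.sum()
topk = np.argsort(p)[-k:]
w = p[topk] / p[topk].sum()         # renormalized mixture weights over top-k tokens
mixture = w @ embed[topk]           # continuous mixture embedding, shape (d,)
```

Passing `mixture` rather than `hard` keeps several candidate continuations alive in the hidden state, which is what supports the reported gains in exploration and diversity.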
4. Empirical Evaluation and Benchmark Performance
Mull-Tokens provide substantial improvements across diverse spatial and visual reasoning benchmarks. On BLINK (spatial puzzle tasks), SAT-Real (action-motion), VSI-Bench (video reasoning), and ERQA (robotics), the introduction of Mull-Tokens yields, on average, +3 percentage point (pp) gains over the strongest prior baseline (Direct Answer FT), and up to +16 pp on jigsaw puzzle splits. A detailed summary table excerpt:
| Model | BLINK_MV | BLINK_RelDep | Jig | IQT | Avg(all) |
|---|---|---|---|---|---|
| Qwen2.5-VL (ZS) | 39.0 | 61.3 | 58.7 | 25.3 | 44.3 |
| Direct Answer FT | 57.1 | 87.1 | 58.7 | 30.0 | 50.9 |
| Interleave Im-Txt | 57.1 | 69.4 | 68.7 | 25.3 | 50.5 |
| Mull (stage2) | 63.2 | 83.1 | 74.0 | 32.0 | 53.9 |
| Mull + GRPO | 64.7 | 83.9 | 74.7 | 30.7 | 54.0 |
Ablation studies confirm that multimodal warm-up is essential; discrete latent tokens outperform continuous variants; and performance saturates at roughly $30$ tokens, with overthinking observed for larger token budgets (Ray et al., 11 Dec 2025). RL refinement further improves how performance scales with the number of Mull-Tokens.
5. Mechanistic Analysis and Qualitative Reasoning Traces
Concrete examined traces demonstrate that Mull-Tokens enable models to encode spatial/visual features without explicit surface-level annotation, such as latent encoding of image patch comparators or motion cues. Attention maps reveal that final Mull-Tokens specialize to abstract affordance reasoning (e.g., puzzle-edge matching, vertical motion in camera shifts). In RL and chain-of-thought contexts, Mull-Tokens—via mixture embeddings—maintain higher hidden-state entropy and support broader token exploration per step than single-token policies, creating richer and less myopic reasoning chains. Gram matrix eigenvalue entropy and token-level diversity metrics increase consistently in Mull-trained checkpoints (Jain et al., 25 Sep 2025), providing direct evidence of enhanced internal representational fidelity.
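One of the diagnostics mentioned above, Gram-matrix eigenvalue entropy, is straightforward to compute: diverse hidden states spread spectral mass across many eigenvalues, while collapsed states concentrate it on one. This is a generic sketch of the metric, not the paper's exact evaluation code.

```python
import numpy as np

def gram_eigen_entropy(H):
    """Entropy of the normalized Gram-matrix eigenvalue spectrum of hidden states H, shape (n, d)."""
    lam = np.clip(np.linalg.eigvalsh(H @ H.T), 0.0, None)  # eigenvalues are nonnegative up to noise
    p = lam / lam.sum()
    p = p[p > 1e-12]                                       # drop numerically-zero modes
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
diverse = rng.normal(size=(8, 16))          # spread-out hidden states: high entropy
collapsed = np.tile(diverse[:1], (8, 1))    # identical hidden states (rank 1): entropy near 0
```

Higher values of this entropy in Mull-trained checkpoints indicate that the latent tokens occupy more independent directions of the representation space.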
6. Cross-Domain Applications: Speech and Beyond
In multilingual ASR, “Mull-Tokens” refers to discrete SSL tokens obtained via clustering or quantization of frame-level speech embeddings (e.g., k-means cluster indices over Transformer output or EnCodec codebooks). Used as model inputs in place of traditional Fbank features, these tokens enable marked computational gains—training times reduced to ~45% (on average) of Fbank baselines and 2–3× throughput speedup during inference—while delivering equivalent or improved automatic speech recognition accuracy (e.g., –1.76% absolute, –15.70% relative test-set WER improvement, with up to –6.82% absolute gain for Polish) (Cui et al., 13 Sep 2024). Similar tokenization strategies can serve as a general template for compressing continuous modalities into efficient, model-friendly representations across audio, vision, and text.
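The clustering-based tokenization works by assigning each frame embedding to its nearest codebook centroid; a run-length dedup then shortens the token stream. The sketch below uses random frames and a randomly initialized codebook as stand-ins (a real pipeline would use trained k-means centroids over SSL embeddings).

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, C = 200, 16, 8                 # frames, SSL embedding dim, codebook size (illustrative)

frames = rng.normal(size=(T, d))                    # stand-in frame-level SSL embeddings
codebook = frames[rng.choice(T, C, replace=False)]  # stand-in centroids (assume k-means trained)

# Quantize: each frame maps to the index of its nearest centroid = one discrete speech token.
dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
tokens = dists.argmin(axis=1)                       # (T,) integer token ids

# Optional run-length dedup: merge consecutive repeats to shorten the sequence.
dedup = tokens[np.concatenate([[True], tokens[1:] != tokens[:-1]])]
```

Replacing dense Fbank features with such short integer sequences is what yields the reported training-time and inference-throughput gains.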
7. Prospective Directions and Research Outlook
Future research avenues include: extending modality-agnostic Mull spaces to 3D point cloud data and environmental control (via world-modelling), optimizing token chain aggregation via learned compressive networks, and dynamic hybridization of soft (mixture) and hard reasoning phases. Investigations into “curriculum-conditioned k-scaling”—the adaptive control of the number of mixture components and Mull tokens during training—aim to balance early expressiveness with late-stage task precision (Jain et al., 25 Sep 2025, Ray et al., 11 Dec 2025). A plausible implication is that Mull-Tokens provide a tractable, interpretable latent substrate for cognitive agents, bridging the divide between symbolic reasoning and deep continuous computation.
References:
- "Mull-Tokens: Modality-Agnostic Latent Thinking" (Ray et al., 11 Dec 2025)
- "Learning to Reason with Mixture of Tokens" (Jain et al., 25 Sep 2025)
- "Exploring SSL Discrete Tokens for Multilingual ASR" (Cui et al., 13 Sep 2024)