MiMo-7B: Multi-Modal 7B Transformer Suite

Updated 29 May 2026

MiMo-7B is a suite of open-source 7-billion-parameter transformer models designed for language, vision, and audio tasks with scalable multi-modal reasoning.
It incorporates innovative architecture elements like pre-RMSNorm, SwiGLU activations, and grouped-query attention to enable rapid inference and extended context processing.
Advanced pre-training strategies combined with reinforcement learning optimization deliver state-of-the-art performance in complex reasoning and diverse benchmark tests.

MiMo-7B denotes a collection of large, open-source transformer models at the 7-billion-parameter scale, developed by the Xiaomi MiMo team, optimized for high-level reasoning across text, vision, and audio. It is the central backbone in three major lines: MiMo-7B for language modeling and code/math reasoning, MiMo-VL-7B for vision–language tasks, and MiMo-Audio-7B for speech/audio understanding and generation. The MiMo-7B series is notable for combining innovations in pre-training data curation, architectural efficiency, multi-modal alignment, and reinforcement learning (RL)-based post-training, consistently achieving state-of-the-art performance among open-source systems in complex reasoning, multimodal grounding, and rapid inference.

1. Architecture and Core Modules

The MiMo-7B architecture is based on deep decoder-only transformers, tailored for efficient, scalable reasoning and multi-modal fusion. All core models employ approximately 36 transformer layers, a hidden size of 4096, and about 32 attention heads. Distinctive architectural features across the models include pre-RMSNorm and SwiGLU activations, grouped-query attention (GQA) for accelerated inference, and sliding-window or extended rotary (RoPE/MRoPE) positional embeddings that enable context windows of up to 32 768 tokens for text and about 8k tokens for multi-modal inputs.
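To make the grouped-query attention idea concrete, the following is a minimal NumPy sketch, not the MiMo implementation. The number of key/value heads is not stated above, so the head counts and tensor sizes in the demo are illustrative assumptions; the function name `grouped_query_attention` is hypothetical.

```python
import numpy as np

def grouped_query_attention(q, k, v, causal=True):
    """Toy GQA. q: (n_q_heads, T, d); k, v: (n_kv_heads, T, d)."""
    n_q_heads, T, d = q.shape
    n_kv_heads = k.shape[0]
    assert n_q_heads % n_kv_heads == 0, "query heads must split evenly into groups"
    group = n_q_heads // n_kv_heads
    # Each K/V head is shared by `group` consecutive query heads,
    # which is what shrinks the KV cache relative to full multi-head attention.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)      # (n_q_heads, T, T)
    if causal:
        future = np.triu(np.ones((T, T), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)       # hide future positions
    scores = scores - scores.max(axis=-1, keepdims=True) # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Illustrative sizes only (assumed): 8 query heads sharing 2 K/V heads.
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 5, 16))
k = rng.standard_normal((2, 5, 16))
v = rng.standard_normal((2, 5, 16))
out = grouped_query_attention(q, k, v)
print(out.shape)  # (8, 5, 16)
```

With causal masking, the first position can only attend to itself, so its output equals the value vector of the K/V head its group shares. That gives a quick sanity check on the head-to-group mapping.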

Vision-Language Instantiation (MiMo-VL-7B (Team et al., 4 Jun 2025)):

Three modules: a native-resolution Vision Transformer (ViT), a 2-layer MLP projector (vision-to-language embedding), and a causal transformer LLM.
The ViT leverages 32 layers, 16 heads, and a hidden dimension of 1280. Projected visual tokens are prepended or interleaved with text tokens at the LLM input, enabling single-stream cross-modal attention.
MLP Projector warmup and MRoPE support stable alignment and long-context fusion.
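The projection-and-prepend step described above can be sketched as follows. This is a hedged illustration only: the activation (GELU), the projector's hidden width, the initialization, and all function names are assumptions, and the demo uses scaled-down dimensions in place of the 1280-dimensional ViT features and 4096-dimensional LLM embeddings.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU; the projector's actual activation is an assumption
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def init_projector(vit_dim, llm_dim, rng):
    # Two linear layers: vit_dim -> llm_dim -> llm_dim (hidden width assumed)
    return {
        "w1": rng.standard_normal((vit_dim, llm_dim)) * 0.02,
        "b1": np.zeros(llm_dim),
        "w2": rng.standard_normal((llm_dim, llm_dim)) * 0.02,
        "b2": np.zeros(llm_dim),
    }

def project(visual_tokens, p):
    """Map ViT outputs (n_vis, vit_dim) into the LLM embedding space (n_vis, llm_dim)."""
    hidden = gelu(visual_tokens @ p["w1"] + p["b1"])
    return hidden @ p["w2"] + p["b2"]

def build_input_sequence(visual_tokens, text_embeds, p):
    # Prepend projected visual tokens to the text embeddings so that one causal
    # transformer attends over both modalities in a single stream.
    return np.concatenate([project(visual_tokens, p), text_embeds], axis=0)

rng = np.random.default_rng(0)
VIT_DIM, LLM_DIM = 20, 64          # scaled-down stand-ins for 1280 and 4096
params = init_projector(VIT_DIM, LLM_DIM, rng)
visual_tokens = rng.standard_normal((6, VIT_DIM))   # 6 image patches
text_embeds = rng.standard_normal((4, LLM_DIM))     # 4 text tokens
sequence = build_input_sequence(visual_tokens, text_embeds, params)
print(sequence.shape)  # (10, 64)
```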

Audio-Language Instantiation (MiMo-Audio-7B (Team et al., 29 Dec 2025)):

An audio tokenizer (1.2 B parameters) producing 8 quantized tokens per 40 ms frame using RVQ; a 6-layer patch encoder; a 36-layer LLM; and a 16-layer autoregressive patch decoder for speech generation.
Audio and text tokens are interleaved in the LLM input, supporting joint learning of speech-text mappings.
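The tokenizer figures above imply a fixed token rate, which the short calculation below spells out. It only uses the two numbers given in the text (8 RVQ tokens per 40 ms frame); the helper name is hypothetical, and the patch size used by the patch encoder is not stated here, so it is left out.

```python
FRAME_MS = 40          # frame length from the text above
TOKENS_PER_FRAME = 8   # RVQ tokens per frame from the text above

def rvq_token_counts(duration_s):
    """Return (whole frames, RVQ tokens) for an audio clip of the given length."""
    # Work in integer milliseconds to avoid float rounding at frame boundaries.
    frames = int(round(duration_s * 1000)) // FRAME_MS
    return frames, frames * TOKENS_PER_FRAME

frames_per_second, tokens_per_second = rvq_token_counts(1.0)
print(frames_per_second, tokens_per_second)  # 25 200
```

At 200 RVQ tokens per second, raw token sequences grow quickly with clip length, which is consistent with the design choice of feeding the LLM patch-level representations instead of individual tokens.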

2. Pre-training Data and Strategies

MiMo-7B applies aggressive data scaling and curation to enhance reasoning competencies and generalizability.

Language/Coding Pre-training (MiMo-7B-Base (Xiaomi et al., 12 May 2025)):

Up to 25 trillion tokens, mixing high-quality multi-domain, STEM-focused, code, and synthetic reasoning data.
Three-stage mixing: (1) general multi-domain data (~8 k tokens/context), (2) 70% STEM/code upsampling, and (3) 10% synthetic reasoning (LLM-generated chain-of-thought).
Global deduplication (MinHash, URL, semantic scoring) ensures non-trivial, high-quality samples.
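As an illustration of the MinHash part of the deduplication step, here is a small standard-library sketch. It is not the MiMo pipeline: the shingle size, the number of hash functions, the similarity threshold, and the function names are all assumptions, and the pairwise comparison is quadratic (production pipelines typically add LSH banding).

```python
import hashlib

def shingles(text, n=3):
    """Set of lowercase word n-grams; short texts yield a single shingle."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(items, num_perm=64):
    """One minimum per salted hash function approximates a random permutation."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little")
            for s in items))
    return signature

def estimate_jaccard(sig_a, sig_b):
    # The fraction of matching slots estimates the Jaccard similarity of the sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def deduplicate(docs, threshold=0.8):
    """Keep the first document of every near-duplicate cluster (O(n^2) sketch)."""
    kept, kept_sigs = [], []
    for idx, doc in enumerate(docs):
        sig = minhash_signature(shingles(doc))
        if all(estimate_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(idx)
            kept_sigs.append(sig)
    return kept

docs = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "The quick  brown fox jumps over the lazy dog near the river bank",
    "reinforcement learning improves reasoning in seven billion parameter models",
]
print(deduplicate(docs))  # [0, 2]
```

The second document differs from the first only in casing and whitespace, so its shingle set, and therefore its signature, is identical and it is dropped.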

Vision–Language Pre-training (MiMo-VL-7B (Team et al., 4 Jun 2025)):

Four sequential stages (total 2.4 T tokens):
1. MLP Projector warmup (image–caption pairs, 300 B tokens, 8 k context).
2. Vision–language alignment (web/book images with text).
3. Multimodal and GUI/grounding/CoT infusion (1.4 T tokens).
4. Long-context SFT (text, long-form CoT, high-res images/videos, 32 k context).
Phash-based deduplication prevents test leakage.
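Perceptual-hash deduplication of the kind mentioned above can be sketched with a DCT-based hash. This is a generic pHash variant under stated assumptions, not the project's code: images are assumed to be already converted to 32×32 grayscale arrays, the hash keeps the 8×8 low-frequency block, and a training image would be flagged when its Hamming distance to any benchmark image falls below a threshold chosen by the practitioner.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis as an (n, n) matrix."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    m[0] /= np.sqrt(2.0)
    return m

def phash(gray32):
    """64-bit perceptual hash of a 32x32 grayscale image (input size is an assumption)."""
    d = dct_matrix(32)
    coeffs = d @ gray32 @ d.T            # 2-D DCT
    low = coeffs[:8, :8]                 # keep low frequencies only
    median = np.median(low.flatten()[1:])  # threshold excludes the DC term
    return (low > median).flatten()

def hamming(hash_a, hash_b):
    return int(np.count_nonzero(hash_a != hash_b))

rng = np.random.default_rng(0)
image = rng.random((32, 32))
brighter = image + 0.1                   # same content, uniform brightness shift
unrelated = rng.random((32, 32))

print(hamming(phash(image), phash(brighter)))   # small: only the DC term moves
print(hamming(phash(image), phash(unrelated)))  # large for unrelated content
```

A uniform brightness change only affects the DC coefficient, so the hash is essentially unchanged, while unrelated images disagree on roughly half of the bits.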

Audio Pre-training (MiMo-Audio-7B (Team et al., 29 Dec 2025)):

Over 100 million hours of curated audio: podcasts, news, interviews, audiobooks, and in-the-wild sources, paired when possible with corresponding transcripts.
Audio tokens are derived via the MiMo-Audio-Tokenizer; patch-level representations are fed alongside text into the LLM for joint causal modeling.

3. Reasoning Optimization and Multi-Token Prediction

MiMo-7B explicitly targets mathematical, coding, and logical reasoning via both curriculum design and auxiliary objectives:

Multi-Token Prediction (MTP) in MiMo-7B (Xiaomi et al., 12 May 2025): Adds a loss on predicting future tokens up to M steps ahead

$\mathcal{L}_\mathrm{total} = \mathcal{L}_\mathrm{next} + \lambda_\mathrm{MTP}\cdot\mathcal{L}_\mathrm{MTP}$

and supports speculative decoding at inference time. After pre-training, the parallel MTP layers are fine-tuned, yielding a 1.8–2.2× speedup for chain-of-thought (CoT) inference with minimal quality degradation.
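The combined objective can be illustrated with a small NumPy sketch. The details are assumptions made for the example: each MTP head is taken to predict the token a fixed number of steps further ahead, the per-depth losses are averaged into $\mathcal{L}_\mathrm{MTP}$, and the value of $\lambda_\mathrm{MTP}$ is arbitrary. The function names are hypothetical.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean token-level cross-entropy. logits: (T, V); targets: (T,) ints."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def total_loss(tokens, next_logits, mtp_logits, lambda_mtp=0.3):
    """L_total = L_next + lambda_MTP * L_MTP.

    next_logits[t] predicts tokens[t + 1].
    mtp_logits[j][t] predicts tokens[t + 2 + j] (assumed layout, one head per depth).
    """
    T = len(tokens)
    l_next = cross_entropy(next_logits[:T - 1], tokens[1:])
    depth_losses = []
    for j, logits in enumerate(mtp_logits):
        offset = j + 2
        depth_losses.append(cross_entropy(logits[:T - offset], tokens[offset:]))
    # Averaging over depths is an assumption of this sketch.
    l_mtp = float(np.mean(depth_losses)) if depth_losses else 0.0
    return l_next + lambda_mtp * l_mtp, l_next, l_mtp

rng = np.random.default_rng(0)
V, T = 10, 6
tokens = rng.integers(0, V, size=T)
uniform = np.zeros((T, V))  # uniform predictions: every cross-entropy equals log(V)
total, l_next, l_mtp = total_loss(tokens, uniform, [uniform, uniform], lambda_mtp=0.3)
print(np.isclose(total, 1.3 * np.log(V)))  # True
```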

Chain-of-Thought Data: Synthetic CoT (long-form, multi-step) is injected directly during pre-training and long-context SFT. Experiments (Team et al., 4 Jun 2025) show linear, unsaturated improvements in reasoning metrics as CoT volume increases.

4. Reinforcement Learning and Post-Training Enhancement

MiMo-7B employs advanced RL schemes to surpass purely supervised LLMs:

Group Relative Policy Optimization (GRPO) is the principal method for on-policy RL in both text and multimodal settings (Xiaomi et al., 12 May 2025, Team et al., 4 Jun 2025, Li et al., 19 Dec 2025). For each prompt, a group of outputs is sampled and each output's advantage is its reward normalized within that group; KL penalties are often omitted to improve exploration. Soft, test-difficulty–aware reward signals densify gradients on hard problems, and dynamic resampling of easy data stabilizes RL convergence, maintaining performance across all skill ranges.
Mixed RL for Multimodal: MiMo-VL-7B and MiMo-VL-Miloco-7B combine verifiable (rule-based, e.g., Math-Verify, IoU for GUI/temporal grounding) and human-preference (RLHF with Bradley–Terry) rewards. RL optimization is subject to domain interference: gains in reasoning may regress grounding accuracy and vice versa.
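The two ingredients above, a verifiable rule-based reward and a group-normalized advantage, can be sketched together. This is a simplified illustration, not the training code: it uses IoU against a target box as the reward for a grounding-style task and the commonly used GRPO normalization (reward minus group mean, divided by group standard deviation). The box format, the `eps` constant, and the function names are assumptions.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples drawn for the same prompt."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# One prompt, four sampled box predictions scored against a target box.
target = (10, 10, 50, 50)
samples = [(10, 10, 50, 50), (20, 20, 60, 60), (100, 100, 120, 120), (10, 10, 30, 50)]
rewards = [iou(box, target) for box in samples]
advantages = group_relative_advantages(rewards)
print(np.round(rewards, 3))
print(np.round(advantages, 3))
```

Because the advantages are centered within the group, samples that beat the group average are reinforced and the rest are discouraged, without a separate learned value function. If every sample in a group receives the same reward, all advantages are zero and that prompt contributes no gradient.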

5. Domain Adaptation and Specialization

Home-centric and audio-centric variants demonstrate the extensibility of the MiMo-7B backbone via domain-adaptive pipelines:

MiMo-VL-Miloco-7B (Li et al., 19 Dec 2025): Reuses MiMo-VL-7B weights and adds SFT on proprietary home-scenario videos (gestures, activities), with token-budget–aware prompting and chain-of-thought. RL via GRPO, with dense multimodal rewards, recovers the generalization lost during SFT.
MiMo-Audio-7B (Team et al., 29 Dec 2025): Trains with interleaved audio–text; instruction tuning and chain-of-thought integration further enhance few-shot learning and zero-shot generalization to unseen speech tasks, dialogue, and voice transfer.
Quantization: MiMo-VL-Miloco-7B-GGUF provides 4-bit quantized weights for on-device deployment, using per-channel symmetric quantization. Activations are calibrated at 8 bits.
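Per-channel symmetric quantization, as named above, can be illustrated generically in NumPy. This sketch is not the GGUF on-disk format or the released conversion code: the choice of one scale per output row, the symmetric integer range [-7, 7], and the function names are assumptions made for the example.

```python
import numpy as np

def quantize_per_channel_symmetric(w, bits=4):
    """Quantize each row (output channel) of w with its own scale, zero-point fixed at 0."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for 4-bit
    max_abs = np.abs(w).max(axis=1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / qmax, 1.0)  # avoid division by zero
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float64) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 16))
w[3] *= 10.0                                         # one channel with a much larger range
q, scale = quantize_per_channel_symmetric(w)
w_hat = dequantize(q, scale)
# Rounding error per element is bounded by half of that channel's scale.
print(np.all(np.abs(w - w_hat) <= scale / 2 + 1e-12))  # True
```

Giving each channel its own scale keeps a single large-magnitude row, such as the last one in the demo, from inflating the quantization error of every other row.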

6. Evaluation Results and Benchmarks

MiMo-7B models set new open-source standards across language, vision, GUI, and audio domains. Key metrics are summarized below.

Model	Key Domain	Headline Benchmarks
MiMo-7B-Base (Xiaomi et al., 12 May 2025)	Text/Code/Math	BBH 75.2%, SuperGPQA 25.1%, LiveCodeBench v5 32.9%, AIME 2024 32.9%
MiMo-7B-RL (Xiaomi et al., 12 May 2025)	Text/Code/Math	MATH500 Pass@1 95.8%, LiveCodeBench v6 49.3%, AIME 2025 55.4%
MiMo-VL-7B-RL (Team et al., 4 Jun 2025)	Vision/GUI	OlympiadBench 59.4, OSWorld-G 56.1, MMMU 66.7, ScreenSpot-Pro 41.9
MiMo-VL-Miloco-7B (Li et al., 19 Dec 2025)	Home/Video	Daily F1 up to 99.2%, ScreenSpot v2 92.1%, MMMU-Pro 55.7, MMLU-Pro 68.5%
MiMo-Audio-7B-Base (Team et al., 29 Dec 2025)	Audio/Speech	SpeechMMLU S2S 69.1, MMAU overall 66.0, ASR/Seed-TTS WERs 1.96–5.37 (best-open)
MiMo-Audio-7B-Instruct (Team et al., 29 Dec 2025)	Audio/Dialogue	Instruct-TTS Eval EN overall 72.6, ZH overall 70.5 (open-source state-of-the-art)

MiMo-7B models consistently outperform similarly sized open-source baselines and, on selected tasks (e.g., AIME, MATH, GUI grounding), rival much larger or closed models such as OpenAI o1-mini and Gemini-2.5-Pro.

7. Implementation, Limitations, and Prospects

All MiMo-7B family models are released as open source, with pre-training and post-training code, processed data, configurations, and checkpoints published in domain-specific repositories. Quantized weights and evaluation suites are provided for reproducibility and deployment benchmarking.

Limitations:

Multi-domain RL suffers from interference: optimizing for one modality or reasoning format can regress others. Current work explores multi-head policies and decoupled curricula to address this.
Audio few-shot learning lacks robustness for complex background music and long-form sound generation; speech dialogue can suffer from style, prosody, or timbre instability.
Scaling beyond 32 k context may require further architectural innovations such as sparse attention or retrieval augmentation.

Ongoing and future developments include adapter modules to mitigate domain-specific conflicts, improved RL-driven stabilization for audio/text instruction following, and further exploitation of long-form reasoning data scalability.

References:

(Xiaomi et al., 12 May 2025, Team et al., 4 Jun 2025, Li et al., 19 Dec 2025, Team et al., 29 Dec 2025)

References (4)

MiMo-VL Technical Report (2025)

MiMo-Audio: Audio Language Models are Few-Shot Learners (2025)

MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining (2025)

Xiaomi MiMo-VL-Miloco Technical Report (2025)
