MiMo-7B: Multi-Modal 7B Transformer Suite
- MiMo-7B is a suite of open-source 7-billion-parameter transformer models designed for language, vision, and audio tasks with scalable multi-modal reasoning.
- It incorporates innovative architecture elements like pre-RMSNorm, SwiGLU activations, and grouped-query attention to enable rapid inference and extended context processing.
- Advanced pre-training strategies combined with reinforcement learning optimization deliver state-of-the-art performance in complex reasoning and diverse benchmark tests.
MiMo-7B denotes a collection of large, open-source transformer models at the 7-billion-parameter scale, developed by the Xiaomi MiMo team and optimized for high-level reasoning across text, vision, and audio. The 7B backbone anchors three model lines: MiMo-7B for language modeling and code/math reasoning, MiMo-VL-7B for vision–language tasks, and MiMo-Audio-7B for speech/audio understanding and generation. The series combines innovations in pre-training data curation, architectural efficiency, multi-modal alignment, and reinforcement learning (RL)-based post-training, and reports state-of-the-art results among open-source systems in complex reasoning, multimodal grounding, and rapid inference.
1. Architecture and Core Modules
The MiMo-7B architecture is based on deep decoder-only transformers, tailored for efficient, scalable reasoning and multi-modal fusion. All core models use approximately 36 transformer layers, a hidden size of 4096, and about 32 attention heads. Shared architectural features include pre-RMSNorm and SwiGLU activations, grouped-query attention (GQA) for faster inference, and sliding-window or extended rotary (RoPE/MRoPE) positional embeddings, which enable context windows of up to 32,768 tokens for text and roughly 8k tokens for multi-modal inputs.
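To illustrate why grouped-query attention matters at long context, the sketch below estimates key/value cache size from the layer count, hidden size, and head count quoted above. The number of key/value heads (8) and the 2-byte cache entries are assumptions chosen for illustration, not figures from the source, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The leading factor 2 covers the separate key and value tensors per layer.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

LAYERS = 36                    # from the text (approximate)
HIDDEN = 4096                  # from the text
Q_HEADS = 32                   # from the text (approximate)
HEAD_DIM = HIDDEN // Q_HEADS   # 128
SEQ_LEN = 32768                # maximum text context from the text
KV_HEADS_GQA = 8               # ASSUMPTION: not stated in the source

# Full multi-head attention keeps one K/V head per query head.
mha = kv_cache_bytes(LAYERS, Q_HEADS, HEAD_DIM, SEQ_LEN)
# GQA shares each K/V head across Q_HEADS // KV_HEADS_GQA query heads.
gqa = kv_cache_bytes(LAYERS, KV_HEADS_GQA, HEAD_DIM, SEQ_LEN)

print(f"MHA cache: {mha / 2**30:.1f} GiB")
print(f"GQA cache: {gqa / 2**30:.1f} GiB")
```

Under these assumptions the cache shrinks by the ratio of query heads to key/value heads (4× here), which is the main source of the inference speed-up attributed to GQA.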
Vision-Language Instantiation (MiMo-VL-7B (Team et al., 4 Jun 2025)):
- Three modules: a native-resolution Vision Transformer (ViT), a 2-layer MLP projector (vision-to-language embedding), and a causal transformer LLM.
- The ViT leverages 32 layers, 16 heads, and a hidden dimension of 1280. Projected visual tokens are prepended or interleaved with text tokens at the LLM input, enabling single-stream cross-modal attention.
- A projector warmup stage and MRoPE support stable alignment and long-context fusion.
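The single-stream design described above can be sketched in a few lines: vision features pass through a two-layer MLP into the LLM embedding space and are then placed in the same sequence as the text embeddings. This is a toy illustration only. The dimensions are shrunk, the GELU activation and random weights are assumptions, and `project` and `build_stream` are hypothetical names.

```python
import math
import random

def gelu(x):
    # tanh approximation of GELU (the activation choice is an assumption)
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def linear(vec, weight, bias):
    # weight has shape [out_dim][in_dim]
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]

def make_linear(in_dim, out_dim, rng):
    weight = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim

def project(patch, layer1, layer2):
    """Two-layer MLP: vision feature -> LLM embedding space."""
    hidden = [gelu(h) for h in linear(patch, *layer1)]
    return linear(hidden, *layer2)

def build_stream(segments, layer1, layer2):
    """Concatenate text embeddings and projected patches in document order."""
    stream = []
    for kind, vectors in segments:
        if kind == "image":
            vectors = [project(v, layer1, layer2) for v in vectors]
        stream.extend(vectors)
    return stream

rng = random.Random(0)
VIT_DIM, LLM_DIM = 8, 16  # toy sizes; the text quotes 1280 and 4096
layer1 = make_linear(VIT_DIM, LLM_DIM, rng)
layer2 = make_linear(LLM_DIM, LLM_DIM, rng)

segments = [
    ("text", [[0.0] * LLM_DIM for _ in range(3)]),
    ("image", [[rng.random() for _ in range(VIT_DIM)] for _ in range(4)]),
    ("text", [[0.0] * LLM_DIM for _ in range(2)]),
]
stream = build_stream(segments, layer1, layer2)
print(len(stream), len(stream[0]))  # 9 positions, each of width LLM_DIM
```

Because every position ends up with the same width, the causal LLM can attend across text and image tokens without a separate cross-attention module.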
Audio-Language Instantiation (MiMo-Audio-7B (Team et al., 29 Dec 2025)):
- An audio tokenizer (1.2 B parameters) producing 8 quantized tokens per 40 ms frame using RVQ; a 6-layer patch encoder; a 36-layer LLM; and a 16-layer autoregressive patch decoder for speech generation.
- Audio and text tokens are interleaved in the LLM input, supporting joint learning of speech-text mappings.
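As a quick arithmetic check on the tokenizer figures above, the sketch below converts an audio duration into frame and token counts. It assumes non-overlapping 40 ms frames with exactly 8 RVQ tokens each and ignores any grouping done by the patch encoder, whose patch size the text does not state. `audio_token_count` is a hypothetical helper.

```python
FRAME_MS = 40          # frame length from the text
TOKENS_PER_FRAME = 8   # quantized tokens per frame from the text

def audio_token_count(duration_s):
    """Return (frames, tokens) for a clip, dropping any partial final frame."""
    frames = int(round(duration_s * 1000)) // FRAME_MS
    return frames, frames * TOKENS_PER_FRAME

frames, tokens = audio_token_count(10.0)
print(frames, tokens)  # 250 frames, 2000 tokens
```

At 25 frames per second this works out to 200 tokens per second of audio.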
2. Pre-training Data and Strategies
MiMo-7B applies aggressive data scaling and curation to enhance reasoning competencies and generalizability.
Language/Coding Pre-training (MiMo-7B-Base (Xiaomi et al., 12 May 2025)):
- Up to 25 trillion tokens, mixing high-quality multi-domain, STEM-focused, code, and synthetic reasoning data.
- Three-stage mixing: (1) general multi-domain data (~8 k tokens/context), (2) 70% STEM/code upsampling, and (3) 10% synthetic reasoning (LLM-generated chain-of-thought).
- Global deduplication (MinHash, URL, semantic scoring) ensures non-trivial, high-quality samples.
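The MinHash step mentioned above can be illustrated with a small, standard-library sketch. It is not the authors' pipeline: the shingle size, the number of hash functions, the 0.8 threshold, and all function names are illustrative assumptions, and the URL and semantic-scoring filters are not shown.

```python
import hashlib

def shingles(text, n=3):
    """Set of overlapping n-word windows."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(text, num_perm=64):
    """One minimum per salted hash function approximates a random permutation."""
    items = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(), "big"
            )
            for s in items
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching slots estimates the Jaccard similarity of the shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def deduplicate(docs, threshold=0.8):
    """Keep a document only if it is not a near-duplicate of one already kept."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept

docs = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "the quick brown fox jumps over the lazy dog near the river bank",
    "gradient descent updates parameters along the negative gradient direction",
]
print(len(deduplicate(docs)))  # the exact duplicate is dropped
```

The pairwise comparison here is quadratic in the number of documents; corpus-scale pipelines typically bucket signatures with locality-sensitive hashing instead.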
Vision–Language Pre-training (MiMo-VL-7B (Team et al., 4 Jun 2025)):
- Four sequential stages (total 2.4 T tokens).
- Perceptual-hash (pHash) deduplication prevents test leakage.
Audio Pre-training (MiMo-Audio-7B (Team et al., 29 Dec 2025)):
- Over 100 million hours of curated audio: podcasts, news, interviews, audiobooks, and in-the-wild sources, paired when possible with corresponding transcripts.
- Audio tokens are derived via the MiMo-Audio-Tokenizer; patch-level representations are fed alongside text into the LLM for joint causal modeling.
3. Reasoning Optimization and Multi-Token Prediction
MiMo-7B explicitly targets mathematical, coding, and logical reasoning via both curriculum design and auxiliary objectives:
- Multi-Token Prediction (MTP) in MiMo-7B (Xiaomi et al., 12 May 2025): Adds an auxiliary loss on predicting future tokens up to M steps ahead and applies speculative decoding at inference time. After training, parallel MTP layers are fine-tuned, yielding 1.8–2.2× acceleration for chain-of-thought (CoT) inference with minimal degradation.
- Chain-of-Thought Data: Synthetic CoT (long-form, multi-step) is injected directly during pre-training and long-context SFT. Experiments (Team et al., 4 Jun 2025) show linear improvements in reasoning metrics as CoT volume increases, with no saturation observed.
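The text describes the MTP objective only in words. One common way to write such a loss, given here as an illustrative assumption rather than the authors' exact formulation, adds weighted cross-entropy terms for tokens further ahead to the usual next-token loss. Here $x_1, \dots, x_T$ is the token sequence, $p_\theta^{(m)}$ is the distribution produced by the $m$-th MTP layer, and $\lambda_m$ are hypothetical weighting coefficients.

```latex
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{NTP}} \;+\; \sum_{m=1}^{M} \lambda_m \, \mathcal{L}_{\mathrm{MTP}}^{(m)},
\qquad
\mathcal{L}_{\mathrm{NTP}} \;=\; -\frac{1}{T-1} \sum_{t=1}^{T-1} \log p_\theta\!\left(x_{t+1} \mid x_{\le t}\right),
\qquad
\mathcal{L}_{\mathrm{MTP}}^{(m)} \;=\; -\frac{1}{T-m-1} \sum_{t=1}^{T-m-1} \log p_\theta^{(m)}\!\left(x_{t+m+1} \mid x_{\le t}\right)
```

In this form the $m$-th term trains a head to guess the token $m$ positions beyond the ordinary next token, which is what lets those heads propose draft tokens for speculative decoding.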
4. Reinforcement Learning and Post-Training Enhancement
MiMo-7B employs advanced RL schemes to surpass purely supervised LLMs:
- Group Relative Policy Optimization (GRPO) is the principal method for on-policy RL in both text and multimodal settings (Xiaomi et al., 12 May 2025, Team et al., 4 Jun 2025, Li et al., 19 Dec 2025). For each prompt, a group of outputs is sampled and each output's reward is normalized against the rest of its group to give its advantage; KL penalties are often omitted to improve exploration. Soft, test-difficulty–aware reward signals densify gradients on hard problems, and dynamic resampling of easy data stabilizes RL convergence, maintaining performance across all skill ranges.
- Mixed RL for Multimodal: MiMo-VL-7B and MiMo-VL-Miloco-7B combine verifiable (rule-based, e.g., Math-Verify, IoU for GUI/temporal grounding) and human-preference (RLHF with Bradley–Terry) rewards. RL optimization is subject to domain interference: gains in reasoning may regress grounding accuracy and vice versa.
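The group-relative advantage at the heart of GRPO can be sketched as follows. The normalization (subtract the group mean, divide by the group standard deviation) is the standard GRPO form. The pass-fraction reward is a simplified stand-in for the soft, difficulty-aware reward described above, not the authors' scheme, and the function names are hypothetical.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

def soft_test_reward(passed, total):
    """ASSUMPTION: partial credit as the fraction of unit tests passed."""
    return passed / total if total else 0.0

# Four sampled solutions to one coding prompt with five unit tests.
rewards = [soft_test_reward(p, 5) for p in (5, 3, 0, 0)]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f} advantage={a:+.2f}")
```

A group in which every sample receives the same reward yields all-zero advantages and therefore no gradient signal, which is one reason partial-credit rewards help on hard problems where full passes are rare.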
5. Domain Adaptation and Specialization
Home-centric and audio-centric variants demonstrate the extensibility of the MiMo-7B backbone via domain-adaptive pipelines:
- MiMo-VL-Miloco-7B (Li et al., 19 Dec 2025): Reuses MiMo-VL-7B weights and adds SFT on proprietary home-scenario videos (gestures, activities), with token-budget–aware prompting and chain-of-thought. RL via GRPO with dense multimodal rewards recovers the generalization lost during SFT.
- MiMo-Audio-7B (Team et al., 29 Dec 2025): Trains with interleaved audio–text; instruction tuning and chain-of-thought integration further enhance few-shot learning and zero-shot generalization to unseen speech tasks, dialogue, and voice transfer.
- Quantization: MiMo-VL-Miloco-7B-GGUF provides 4-bit quantized weights for on-device deployment, using per-channel symmetric quantization. Activations are calibrated at 8 bits.
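Per-channel symmetric quantization, as named in the last item above, can be illustrated with a minimal sketch: each output channel (row) gets its own scale, and weights are rounded to signed integers in a range symmetric around zero. This shows the general technique only. It does not reproduce the GGUF file format or the released model's actual quantization code, and the function names are hypothetical.

```python
def quantize_per_channel(weights, bits=4):
    """weights: list of rows, one row per output channel."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit symmetric
    quantized, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        # An all-zero row gets scale 1.0 to avoid dividing by zero.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        quantized.append([max(-qmax, min(qmax, round(w / scale))) for w in row])
        scales.append(scale)
    return quantized, scales

def dequantize(quantized, scales):
    return [[q * s for q in row] for row, s in zip(quantized, scales)]

weights = [
    [0.7, -0.35, 0.1, 0.0],      # large-magnitude channel
    [0.02, -0.01, 0.005, 0.02],  # small-magnitude channel
]
q, scales = quantize_per_channel(weights)
restored = dequantize(q, scales)
print(q)
print(scales)
```

Giving each channel its own scale keeps the small-magnitude row from being rounded to zero, which is what would happen if a single scale were set by the largest weight in the whole matrix.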
6. Evaluation Results and Benchmarks
MiMo-7B models set new open-source standards across language, vision, GUI, and audio domains. Key metrics are summarized below.
| Model | Key Domain | Headline Benchmarks |
|---|---|---|
| MiMo-7B-Base (Xiaomi et al., 12 May 2025) | Text/Code/Math | BBH 75.2%, SuperGPQA 25.1%, LiveCodeBench v5 32.9%, AIME 2024 32.9% |
| MiMo-7B-RL (Xiaomi et al., 12 May 2025) | Text/Code/Math | MATH500 Pass@1 95.8%, LiveCodeBench v6 49.3%, AIME 2025 55.4% |
| MiMo-VL-7B-RL (Team et al., 4 Jun 2025) | Vision/GUI | OlympiadBench 59.4, OSWorld-G 56.1, MMMU 66.7, ScreenSpot-Pro 41.9 |
| MiMo-VL-Miloco-7B (Li et al., 19 Dec 2025) | Home/Video | Daily F1 up to 99.2%, ScreenSpot v2 92.1%, MMMU-Pro 55.7, MMLU-Pro 68.5% |
| MiMo-Audio-7B-Base (Team et al., 29 Dec 2025) | Audio/Speech | SpeechMMLU S2S 69.1, MMAU overall 66.0, ASR/Seed-TTS WERs 1.96–5.37 (best among open models) |
| MiMo-Audio-7B-Instruct (Team et al., 29 Dec 2025) | Audio/Dialogue | Instruct-TTS Eval EN overall 72.6, ZH overall 70.5 (open-source state-of-the-art) |
MiMo-7B models consistently outperform similarly sized open-source baselines and, on selected tasks (e.g., AIME, MATH, GUI grounding), rival much larger or closed models such as OpenAI o1-mini and Gemini-2.5-Pro.
7. Implementation, Limitations, and Prospects
All models in the MiMo-7B family are released as open source: pre-training and post-training code, processed data, configurations, and checkpoints are published in domain-specific repositories. Quantized weights and evaluation suites are provided for reproducibility and deployment benchmarking.
Limitations:
- Multi-domain RL suffers from interference: optimizing for one modality or reasoning format can regress others. Current work explores multi-head policies and decoupled curricula to address this.
- Audio few-shot learning lacks robustness for complex background music and long-form sound generation; speech dialogue can suffer from unstable style, prosody, or timbre.
- Scaling beyond 32 k context may require further architectural innovations such as sparse attention or retrieval augmentation.
Ongoing and future developments include adapter modules to mitigate domain-specific conflicts, improved RL-driven stabilization for audio/text instruction following, and further scaling of long-form reasoning data.
References:
(Xiaomi et al., 12 May 2025, Team et al., 4 Jun 2025, Li et al., 19 Dec 2025, Team et al., 29 Dec 2025)