MiMo-7B: Multi-Modal 7B Transformer Suite

Updated 29 May 2026
  • MiMo-7B is a suite of open-source 7-billion-parameter transformer models designed for language, vision, and audio tasks with scalable multi-modal reasoning.
  • It incorporates innovative architecture elements like pre-RMSNorm, SwiGLU activations, and grouped-query attention to enable rapid inference and extended context processing.
  • Advanced pre-training strategies combined with reinforcement learning optimization deliver state-of-the-art performance in complex reasoning and diverse benchmark tests.

MiMo-7B is a family of open-source transformer models at the 7-billion-parameter scale, developed by the Xiaomi MiMo team and optimized for high-level reasoning across text, vision, and audio. It serves as the backbone for three model lines: MiMo-7B for language modeling and code/math reasoning, MiMo-VL-7B for vision–language tasks, and MiMo-Audio-7B for speech/audio understanding and generation. The series combines innovations in pre-training data curation, architectural efficiency, multi-modal alignment, and reinforcement learning (RL)-based post-training, and reports state-of-the-art results among open-source systems of comparable size in complex reasoning, multimodal grounding, and rapid inference.

1. Architecture and Core Modules

The MiMo-7B architecture is based on deep decoder-only transformers, tailored for efficient, scalable reasoning and multi-modal fusion. All core models employ approximately 36 transformer layers, a hidden size of 4096, and about 32 attention heads. Distinctive architectural features across the models include pre-RMSNorm and SwiGLU activations, grouped-query attention (GQA) for accelerated inference, and sliding-window or extended rotary (RoPE/MRoPE) positional embeddings that enable context windows of up to 32,768 tokens for text and about 8k tokens for multi-modal inputs.
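To make the effect of grouped-query attention concrete, the following sketch estimates the key/value cache size at a 32,768-token context from the approximate dimensions quoted above. The number of key/value heads (8) and the 2-byte cache element size are illustrative assumptions; neither figure is stated in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderConfig:
    # Approximate values quoted in the text above.
    num_layers: int = 36
    hidden_size: int = 4096
    num_query_heads: int = 32
    # Assumption for illustration only: the KV head count is not given in the text.
    num_kv_heads: int = 8
    # Assumption: keys and values are cached in a 2-byte format such as bf16.
    bytes_per_element: int = 2

    @property
    def head_dim(self) -> int:
        return self.hidden_size // self.num_query_heads


def kv_cache_bytes(cfg: DecoderConfig, seq_len: int, kv_heads: int) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2 accounts for storing both a key and a value per head per layer.
    per_token = 2 * cfg.num_layers * kv_heads * cfg.head_dim * cfg.bytes_per_element
    return per_token * seq_len


cfg = DecoderConfig()
mha = kv_cache_bytes(cfg, 32_768, cfg.num_query_heads)  # one K/V pair per query head
gqa = kv_cache_bytes(cfg, 32_768, cfg.num_kv_heads)     # query heads share K/V in groups
print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB, ratio: {mha / gqa:.1f}x")
```

Under these assumptions the cache shrinks by the ratio of query heads to key/value heads, which is where much of the inference speedup attributed to GQA at long context comes from.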

Vision-Language Instantiation (MiMo-VL-7B (Team et al., 4 Jun 2025)):

  • Three modules: a native-resolution Vision Transformer (ViT), a 2-layer MLP projector (vision-to-language embedding), and a causal transformer LLM.
  • The ViT leverages 32 layers, 16 heads, and a hidden dimension of 1280. Projected visual tokens are prepended or interleaved with text tokens at the LLM input, enabling single-stream cross-modal attention.
  • MLP Projector warmup and MRoPE support stable alignment and long-context fusion.
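The three-module layout above can be sketched as follows. This is a toy illustration in plain Python, not the MiMo-VL implementation: the dimensions are scaled-down stand-ins for 1280 and 4096, the GELU activation and the projector's hidden width are assumptions, and all function names are hypothetical.

```python
import math
import random


def gelu(x: float) -> float:
    # tanh approximation of GELU; the choice of activation is an assumption.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))


def linear(x, weight, bias):
    """y = W x + b for a single vector x; weight has shape [out][in]."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def init_linear(rng, d_in, d_out):
    scale = 1.0 / math.sqrt(d_in)
    weight = [[rng.uniform(-scale, scale) for _ in range(d_in)] for _ in range(d_out)]
    return weight, [0.0] * d_out


def project(vit_token, layer1, layer2):
    """Two-layer MLP mapping a ViT feature into the LLM embedding space."""
    hidden = [gelu(h) for h in linear(vit_token, *layer1)]
    return linear(hidden, *layer2)


rng = random.Random(0)
D_VIT, D_LLM = 8, 16  # toy stand-ins for the ViT (1280) and LLM (4096) widths
layer1 = init_linear(rng, D_VIT, D_LLM)
layer2 = init_linear(rng, D_LLM, D_LLM)

vit_tokens = [[rng.gauss(0, 1) for _ in range(D_VIT)] for _ in range(3)]
text_embeddings = [[rng.gauss(0, 1) for _ in range(D_LLM)] for _ in range(5)]

visual_embeddings = [project(t, layer1, layer2) for t in vit_tokens]
# Prepend visual tokens so the causal LLM attends over one combined stream.
llm_input = visual_embeddings + text_embeddings
print(len(llm_input), len(llm_input[0]))
```

Because the projected visual tokens live in the same embedding space as text tokens, the LLM needs no separate cross-attention module; interleaving instead of prepending only changes the order in which the two lists are merged.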

Audio-Language Instantiation (MiMo-Audio-7B (Team et al., 29 Dec 2025)):

  • An audio tokenizer (1.2 B parameters) producing 8 quantized tokens per 40 ms frame using RVQ; a 6-layer patch encoder; a 36-layer LLM; and a 16-layer autoregressive patch decoder for speech generation.
  • Audio and text tokens are interleaved in the LLM input, supporting joint learning of speech-text mappings.
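The tokenizer figures above imply a fixed token rate, which the short calculation below works out. The frame length and tokens per frame come from the text; the patch size of 4 frames is an assumption chosen only to show how a patch encoder reduces the sequence length seen by the LLM.

```python
from fractions import Fraction

FRAME_MS = 40          # from the text: one frame covers 40 ms
TOKENS_PER_FRAME = 8   # from the text: 8 RVQ tokens per frame
PATCH_FRAMES = 4       # assumption for illustration; not stated in the text


def rates(frame_ms: int, tokens_per_frame: int, patch_frames: int):
    """Return (frames/s, RVQ tokens/s, patches/s) as exact fractions."""
    frames_per_s = Fraction(1000, frame_ms)
    tokens_per_s = frames_per_s * tokens_per_frame
    patches_per_s = frames_per_s / patch_frames
    return frames_per_s, tokens_per_s, patches_per_s


frames_per_s, tokens_per_s, patches_per_s = rates(FRAME_MS, TOKENS_PER_FRAME, PATCH_FRAMES)
print(f"{float(frames_per_s)} frames/s, {float(tokens_per_s)} RVQ tokens/s, "
      f"{float(patches_per_s)} patches/s")
```

At 25 frames per second the tokenizer emits 200 discrete tokens for each second of audio, so grouping frames into patches before the LLM keeps speech sequences closer in length to their text transcripts.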

2. Pre-training Data and Strategies

MiMo-7B applies aggressive data scaling and curation to enhance reasoning competencies and generalizability.

Language/Coding Pre-training (MiMo-7B-Base (Xiaomi et al., 12 May 2025)):

  • Up to 25 trillion tokens, mixing high-quality multi-domain, STEM-focused, code, and synthetic reasoning data.
  • Three-stage mixing: (1) general multi-domain data (~8 k tokens/context), (2) 70% STEM/code upsampling, and (3) 10% synthetic reasoning (LLM-generated chain-of-thought).
  • Global deduplication (MinHash, URL, semantic scoring) ensures non-trivial, high-quality samples.
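As an illustration of the MinHash step mentioned above, here is a minimal near-duplicate filter using only the Python standard library. It is a sketch, not the pipeline used for MiMo-7B: the shingle size, number of hash functions, and 0.8 threshold are assumptions, and the pairwise comparison would be replaced by locality-sensitive hashing at corpus scale.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set[str]:
    """Word n-grams used as the document's feature set."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def _hash(seed: int, item: str) -> int:
    # A salted 64-bit BLAKE2b digest acts as one member of a hash family.
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(features: set[str], num_perm: int = 128) -> list[int]:
    """The minimum under each seeded hash approximates a random permutation."""
    return [min(_hash(seed, f) for f in features) for seed in range(num_perm)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a document only if it is not a near-duplicate of one already kept."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(shingles(doc))
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept


docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",
    "reinforcement learning improves mathematical reasoning in small models",
]
print(len(deduplicate(docs)))
```

The exact duplicate is dropped because its signature matches in every slot, while the unrelated document shares no shingles and is kept.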

Vision–Language Pre-training (MiMo-VL-7B (Team et al., 4 Jun 2025)):

  • Four sequential stages (total 2.4 T tokens):

    1. MLP Projector warmup (image–caption pairs, 300 B tokens, 8 k context).
    2. Vision–language alignment (web/book images with text).
    3. Multimodal and GUI/grounding/CoT infusion (1.4 T tokens).
    4. Long-context SFT (text, long-form CoT, high-res images/videos, 32 k context).
  • Perceptual-hash (pHash) deduplication against benchmark images prevents test leakage.
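The leakage filter in the last bullet can be sketched as a Hamming-distance check between perceptual hashes. The example assumes the 64-bit hashes have already been computed by some pHash implementation; the distance threshold, the function names, and the sample hash values are all hypothetical.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two 64-bit perceptual hashes."""
    return bin(a ^ b).count("1")


def drop_test_leaks(train_hashes: dict[str, int], test_hashes: set[int],
                    max_distance: int = 4) -> list[str]:
    """Return IDs of training images that are not near any benchmark image."""
    return [
        image_id
        for image_id, h in train_hashes.items()
        if all(hamming(h, t) > max_distance for t in test_hashes)
    ]


test_hashes = {0xF0F0F0F0F0F0F0F0}
train_hashes = {
    "near_copy": 0xF0F0F0F0F0F0F0F1,  # differs from the test hash in one bit
    "unrelated": 0x0123456789ABCDEF,
}
print(drop_test_leaks(train_hashes, test_hashes))
```

Visually similar images produce hashes that differ in only a few bits, so a small Hamming threshold catches resized or recompressed copies of benchmark images that exact matching would miss.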

Audio Pre-training (MiMo-Audio-7B (Team et al., 29 Dec 2025)):

  • Over 100 million hours of curated audio: podcasts, news, interviews, audiobooks, and in-the-wild sources, paired when possible with corresponding transcripts.
  • Audio tokens are derived via the MiMo-Audio-Tokenizer; patch-level representations are fed alongside text into the LLM for joint causal modeling.

3. Reasoning Optimization and Multi-Token Prediction

MiMo-7B explicitly targets mathematical, coding, and logical reasoning through both curriculum design and an auxiliary multi-token prediction (MTP) objective. The MTP loss is added to the standard next-token loss, weighted by $\lambda_\mathrm{MTP}$:

$$\mathcal{L}_\mathrm{total} = \mathcal{L}_\mathrm{next} + \lambda_\mathrm{MTP}\cdot\mathcal{L}_\mathrm{MTP}$$

At inference, the model uses the MTP layers for speculative decoding. After training, parallel MTP layers are fine-tuned, yielding a 1.8–2.2× speedup for chain-of-thought (CoT) inference with minimal quality degradation.

  • Chain-of-Thought Data: Synthetic CoT (long-form, multi-step) is directly injected during pre-training and long-context SFT. Experiments (Team et al., 4 Jun 2025) show reasoning metrics improving linearly, without saturation, as CoT volume increases.
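The combined objective in this section, L_total = L_next + λ_MTP · L_MTP, and the speculative decoding it enables can be sketched as below. This is a simplified illustration on plain probability lists: the weight of 0.3, the averaging over MTP heads, and the greedy acceptance rule are assumptions, not details confirmed by the MiMo-7B reports.

```python
import math


def cross_entropy(probs: list[float], target: int) -> float:
    """Negative log-likelihood of the target token under a probability vector."""
    return -math.log(probs[target])


def total_loss(next_probs, next_target, mtp_probs, mtp_targets, lambda_mtp=0.3):
    """L_total = L_next + lambda_MTP * L_MTP, with L_MTP averaged over MTP heads."""
    l_next = cross_entropy(next_probs, next_target)
    l_mtp = sum(cross_entropy(p, t) for p, t in zip(mtp_probs, mtp_targets)) / len(mtp_probs)
    return l_next + lambda_mtp * l_mtp


def accepted_prefix(draft_tokens: list[int], verified_tokens: list[int]) -> list[int]:
    """Greedy speculative decoding: keep drafted tokens until the first mismatch."""
    accepted = []
    for draft, verified in zip(draft_tokens, verified_tokens):
        if draft != verified:
            break  # the main model's token replaces the draft from here on
        accepted.append(draft)
    return accepted


loss = total_loss(
    next_probs=[0.5, 0.5], next_target=0,       # main head predicts token t+1
    mtp_probs=[[0.25, 0.75]], mtp_targets=[1],  # one MTP head predicts token t+2
)
print(round(loss, 4))
print(accepted_prefix([5, 7, 9], [5, 7, 2]))
```

The speedup from speculative decoding depends on how often drafted tokens are accepted: each accepted token saves one sequential forward pass of the main model, while a rejected draft costs only the verification that would have run anyway.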

4. Reinforcement Learning and Post-Training Enhancement

MiMo-7B employs advanced RL schemes to surpass purely supervised LLMs:

  • Group Relative Policy Optimization (GRPO) is the principal method for on-policy RL in both text and multimodal settings (Xiaomi et al., 12 May 2025, Team et al., 4 Jun 2025, Li et al., 19 Dec 2025). For each prompt, a group of outputs is sampled and each output is credited with its reward normalized against the rest of the group; KL penalties are often omitted to improve exploration. Soft, test-difficulty–aware reward signals densify gradients on hard problems, and dynamic resampling of easy data stabilizes RL convergence while maintaining performance across all skill ranges.
  • Mixed RL for Multimodal: MiMo-VL-7B and MiMo-VL-Miloco-7B combine verifiable (rule-based, e.g., Math-Verify, IoU for GUI/temporal grounding) and human-preference (RLHF with Bradley–Terry) rewards. RL optimization is subject to domain interference: gains in reasoning may regress grounding accuracy and vice versa.
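The two ingredients described above, group-normalized advantages and a rule-based IoU reward, can be sketched in a few lines. This is a minimal illustration rather than the training code: the function names, the epsilon term, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each sampled output's reward by its group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def box_area(box: tuple[float, float, float, float]) -> float:
    return (box[2] - box[0]) * (box[3] - box[1])


def iou_reward(pred: tuple[float, float, float, float],
               gold: tuple[float, float, float, float]) -> float:
    """Intersection-over-union of two (x1, y1, x2, y2) boxes as a verifiable reward."""
    ix1, iy1 = max(pred[0], gold[0]), max(pred[1], gold[1])
    ix2, iy2 = min(pred[2], gold[2]), min(pred[3], gold[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = box_area(pred) + box_area(gold) - inter
    return inter / union if union > 0 else 0.0


# Four sampled answers to one prompt, scored by a rule-based checker (1 = correct).
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))
print(iou_reward((0, 0, 2, 2), (1, 1, 3, 3)))
```

When every output in a group receives the same reward, all advantages are zero and the prompt contributes no gradient, which is one reason difficulty-aware rewards and data resampling matter for keeping the training signal dense.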

5. Domain Adaptation and Specialization

Home-centric and audio-centric variants, such as MiMo-VL-Miloco-7B and MiMo-Audio-7B, demonstrate the extensibility of the MiMo-7B backbone via domain-adaptive pipelines.

6. Evaluation Results and Benchmarks

MiMo-7B models set new open-source standards across language, vision, GUI, and audio domains. Key metrics are summarized below.

| Model | Key Domain | Headline Benchmarks |
|---|---|---|
| MiMo-7B-Base (Xiaomi et al., 12 May 2025) | Text/Code/Math | BBH 75.2%, SuperGPQA 25.1%, LiveCodeBench v5 32.9%, AIME 2024 32.9% |
| MiMo-7B-RL (Xiaomi et al., 12 May 2025) | Text/Code/Math | MATH500 Pass@1 95.8%, LiveCodeBench v6 49.3%, AIME 2025 55.4% |
| MiMo-VL-7B-RL (Team et al., 4 Jun 2025) | Vision/GUI | OlympiadBench 59.4, OSWorld-G 56.1, MMMU 66.7, ScreenSpot-Pro 41.9 |
| MiMo-VL-Miloco-7B (Li et al., 19 Dec 2025) | Home/Video | Daily F1 up to 99.2%, ScreenSpot v2 92.1%, MMMU-Pro 55.7, MMLU-Pro 68.5% |
| MiMo-Audio-7B-Base (Team et al., 29 Dec 2025) | Audio/Speech | SpeechMMLU S2S 69.1, MMAU overall 66.0, ASR/Seed-TTS WERs 1.96–5.37 (best-open) |
| MiMo-Audio-7B-Instruct (Team et al., 29 Dec 2025) | Audio/Dialogue | Instruct-TTS Eval EN overall 72.6, ZH overall 70.5 (open-source state-of-the-art) |

MiMo-7B models consistently outperform similarly sized open-source baselines and, on selected tasks (e.g., AIME, MATH, GUI grounding), rival much larger or closed models such as OpenAI o1-mini and Gemini-2.5-Pro.

7. Implementation, Limitations, and Prospects

All MiMo-7B family models are released as open source, with pre-training and post-training code, processed data, configurations, and checkpoints available at domain-specific repositories. Quantized weights and evaluation suites are provided for reproducibility and deployment benchmarking.

Limitations:

  • Multi-domain RL suffers from interference: optimizing for one modality or reasoning format can regress others. Current work explores multi-head policies and decoupled curricula to address this.
  • Audio few-shot learning lacks robustness for complex background music and long-form sound generation; speech dialogue can suffer from instability in style, prosody, or timbre.
  • Scaling beyond 32 k context may require further architectural innovations such as sparse attention or retrieval augmentation.

Ongoing and future developments include adapter modules to mitigate domain-specific conflicts, improved RL-driven stabilization for audio/text instruction following, and further exploitation of long-form reasoning data scalability.

References:

(Xiaomi et al., 12 May 2025, Team et al., 4 Jun 2025, Li et al., 19 Dec 2025, Team et al., 29 Dec 2025)
