MiMo-VL Technical Report (2506.03569v1)

Published 4 Jun 2025 in cs.CL

Abstract: We open-source MiMo-VL-7B-SFT and MiMo-VL-7B-RL, two powerful vision-LLMs delivering state-of-the-art performance in both general visual understanding and multimodal reasoning. MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35 out of 40 evaluated tasks, and scores 59.4 on OlympiadBench, surpassing models with up to 78B parameters. For GUI grounding applications, it sets a new standard with 56.1 on OSWorld-G, even outperforming specialized models such as UI-TARS. Our training combines four-stage pre-training (2.4 trillion tokens) with Mixed On-policy Reinforcement Learning (MORL) integrating diverse reward signals. We identify the importance of incorporating high-quality reasoning data with long Chain-of-Thought into pre-training stages, and the benefits of mixed RL despite challenges in simultaneous multi-domain optimization. We also contribute a comprehensive evaluation suite covering 50+ tasks to promote reproducibility and advance the field. The model checkpoints and full evaluation suite are available at https://github.com/XiaomiMiMo/MiMo-VL.

Summary

  • The paper presents MiMo-VL-7B, a 7B-parameter model that integrates a native-resolution ViT encoder, an MLP projector, and a 36-layer LLM backbone for advanced multimodal processing.
  • The paper details a four-stage pre-training strategy using 2.4 trillion tokens followed by mixed on-policy reinforcement learning to enhance visual-language alignment and reasoning.
  • The paper demonstrates state-of-the-art performance across diverse benchmarks, outperforming comparable open-source models in visual understanding, complex reasoning, and GUI interaction.

Here is a detailed summary of the "MiMo-VL Technical Report" (2506.03569):

The report introduces MiMo-VL-7B, a powerful 7B-parameter vision-language model developed by Xiaomi, released in two versions: MiMo-VL-7B-SFT (Supervised Fine-Tuning) and MiMo-VL-7B-RL (Reinforcement Learning). These models are designed as foundational backbones for multimodal AI systems, targeting general visual understanding, complex multimodal reasoning, and interaction with digital interfaces through GUI grounding. The report emphasizes practical implementation details, training strategies, and comprehensive evaluation results.

Architecture:

MiMo-VL-7B comprises three key components:

  1. A native-resolution Vision Transformer (ViT) encoder, based on Qwen2.5-ViT [Qwen25VL], to handle fine-grained visual details.
  2. A Multi-Layer Perceptron (MLP) projector for aligning the visual encodings with the LLM's latent space. This projector is initialized randomly.
  3. An LLM backbone, initialized from MiMo-7B-Base [xia2025mimo], which is optimized for complex reasoning tasks. The LLM configuration includes 36 layers, a hidden size of 4096, and an intermediate size of 11008, differing slightly from other 7B LLMs like Qwen2.5-VL-7B.
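
For concreteness, here is a minimal PyTorch-style sketch of how these three components compose. Only the stated LLM dimensions (36 layers, hidden size 4096, intermediate size 11008) come from the report; the class names, the ViT feature dimension, and the token-packing step are illustrative assumptions rather than the released implementation.

```python
# Illustrative sketch only (not the released implementation): a native-resolution
# ViT encoder feeds a randomly initialized MLP projector, whose outputs are passed
# to the MiMo-7B LLM alongside text embeddings.
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Maps ViT patch features into the LLM's 4096-d hidden space."""
    def __init__(self, vit_dim: int = 1280, llm_dim: int = 4096):  # vit_dim is an assumed value
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        return self.net(patch_features)

class MiMoVLSketch(nn.Module):
    """ViT encoder + MLP projector + LLM backbone, composed as described in the report."""
    def __init__(self, vision_encoder: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder  # native-resolution ViT, initialized from Qwen2.5-ViT
        self.projector = MLPProjector()       # randomly initialized
        self.llm = llm                        # MiMo-7B-Base: 36 layers, hidden 4096, intermediate 11008

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor):
        visual_tokens = self.projector(self.vision_encoder(pixel_values))
        # Simplification: prepend visual tokens to the text embeddings; the actual
        # interleaving of image and text tokens is more involved.
        fused = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=fused)  # assumes an HF-style embeds interface
```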

Training Methodology:

The development involves a two-phase process:

  1. Four-Stage Pre-training: This phase consumes 2.4 trillion tokens and draws on a diverse, high-quality multimodal dataset curated from open-source and synthetic sources. The data includes image captions, interleaved image-text data, OCR data, grounding data, video content, GUI interactions, reasoning examples (with long Chain-of-Thought), and text-only sequences. Image deduplication using perceptual hashing (pHash) is performed to prevent data contamination.
    • Stage 1 (Projector Warmup): Freezes ViT and LLM, trains only the MLP projector using image-caption pairs to establish initial visual-language alignment.
    • Stage 2 (Vision-Language Alignment): Unfreezes ViT and projector, uses interleaved data to strengthen visual-language connections and improve ViT robustness.
    • Stage 3 (General Multimodal Pre-training): All parameters are trainable. Introduces a broad mix of data and tasks (OCR, grounding, video, GUI, QA, instruction, reasoning, pure text) to build general multimodal capabilities.
    • Stage 4 (Long-context SFT): Extends sequence length from 8K to 32K tokens. Incorporates long-context data types (long text, high-resolution images, long documents, extended videos, long reasoning data) and significantly increases the proportion of reasoning data, including long-form reasoning patterns. This stage is found to be crucial for boosting reasoning abilities and produces MiMo-VL-7B-SFT.
  2. Post-training (Mixed On-policy Reinforcement Learning - MORL): This phase builds upon the SFT model to further enhance performance, particularly on challenging tasks and human preference alignment. MORL integrates two types of RL:
    • Reinforcement Learning with Verifiable Rewards (RLVR): Uses rule-based reward functions for tasks where correctness can be precisely validated. Tasks include Visual Reasoning (STEM problems), Text Reasoning (complex math problems, checked with the Math-Verify library), Image Grounding (using GIoU for bounding boxes or point-in-box checks for points), Visual Counting (accuracy), and Temporal Video Grounding (using IoU for video segments).
    • Reinforcement Learning from Human Feedback (RLHF): Aligns the model with human preferences. It uses a diverse set of multimodal and text-only queries collected from open-source and in-house sources. Responses from MiMo-VL-7B and other VLMs are pairwise ranked by an advanced VLM to create a dataset for training dual reward models (text-only and multimodal).
    • RL Algorithm: A fully on-policy variant of GRPO [shao2024deepseekmath] is adopted for its stability and exploration capabilities. The algorithm samples multiple responses for each query and updates the policy using group-relative advantages computed from the rewards of the sampled group (a minimal sketch of this update appears after this list). Key advancements from [xia2025mimo], such as dynamic sampling and easy-data filtering, are integrated.
    • Reward-as-a-Service (RaaS): A unified interface in which a reward router dynamically selects the appropriate reward function (rule-based or model-based) according to the task type; an illustrative routing sketch follows directly below. Reward models are deployed as standalone services for low latency, and all rewards are normalized to [0, 1].
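
To make the reward routing concrete, the sketch below shows how such a router could dispatch between rule-based verifiable rewards and a model-based preference reward while keeping every reward in [0, 1]. The function names, the plain-IoU grounding reward (the report uses GIoU), and the service stub are illustrative assumptions, not the authors' implementation.

```python
# Illustrative Reward-as-a-Service sketch (not the authors' implementation).
# Rule-based verifiable rewards are computed locally; model-based preference
# rewards would come from a standalone reward-model service. All rewards in [0, 1].
from typing import Sequence

def box_iou(a: Sequence[float], b: Sequence[float]) -> float:
    """IoU of two boxes (x1, y1, x2, y2). The report uses GIoU for grounding;
    plain IoU is shown here for brevity."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def counting_reward(pred: int, target: int) -> float:
    """Exact-match accuracy reward for visual counting."""
    return 1.0 if pred == target else 0.0

def reward_model_service(task: str, prediction, reference) -> float:
    """Hypothetical stub for the standalone reward-model service used for
    preference-style (RLHF) queries; assumed to return a score in [0, 1]."""
    raise NotImplementedError("served by a deployed reward model in practice")

def route_reward(task: str, prediction, reference) -> float:
    """Dispatch to a rule-based verifier or the reward-model service by task type."""
    if task == "image_grounding":
        return box_iou(prediction, reference)
    if task == "visual_counting":
        return counting_reward(prediction, reference)
    if task == "temporal_video_grounding":
        # 1-D segment IoU, reusing the box IoU with a dummy second axis
        return box_iou((prediction[0], 0.0, prediction[1], 1.0),
                       (reference[0], 0.0, reference[1], 1.0))
    return reward_model_service(task, prediction, reference)
```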

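The group-relative advantage that drives the on-policy GRPO update can likewise be summarized in a few lines. The sketch below assumes scalar rewards such as those produced by the router above and standard per-group mean/std normalization; it is illustrative, not the authors' training code.

```python
# Illustrative GRPO-style update (not the authors' training code).
# For each query, G responses are sampled from the current policy; each response's
# advantage is its reward standardized against the group's mean and std.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (G,) scalar rewards in [0, 1] for one query's G sampled responses."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def on_policy_loss(sequence_logprobs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Fully on-policy variant: a plain policy-gradient loss over the sampled group.
    sequence_logprobs: (G,) summed token log-probs of each response under the current policy.
    With no off-policy reuse, no importance ratios or clipping are needed."""
    return -(advantages.detach() * sequence_logprobs).mean()

# Example: rewards for 4 sampled responses to one query; groups whose rewards are
# all identical carry zero advantage and could be filtered out (dynamic sampling).
rewards = torch.tensor([1.0, 0.0, 1.0, 0.5])
advantages = group_relative_advantages(rewards)
```
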
Key Findings and Observations:

  • Reasoning Data Importance: Incorporating high-quality, synthetic reasoning data with long Chain-of-Thought into the later pre-training stages (especially Stage 4) significantly boosts model performance on complex reasoning tasks (MMMU, OSWorld-G, OlympiadBench) and leads to deeper reasoning (longer responses). Performance continues to improve without saturation in this stage.
  • MORL Effectiveness & Challenges: MORL successfully enhances performance across diverse capabilities, with notable gains on tasks like VibeEval and CountBench. However, achieving stable simultaneous improvements across all tasks during MORL remains challenging, potentially because of conflicting optimization objectives (reasoning tasks encourage longer responses while grounding/counting tasks favor shorter ones), disparities in task difficulty, and reward-hacking risks.
  • On-Policy RL Advantages: Experiments on text-only reasoning show that the on-policy GRPO variant continues to improve as more training data is used, unlike vanilla GRPO, which saturates earlier.

Evaluation:

The models were evaluated on a comprehensive suite of over 50 tasks, categorized into General Visual Understanding, General Grounding/Counting, Document/Chart Understanding, Video Understanding/Localization, GUI Understanding/Grounding, Text-only Benchmarks, Multimodal Reasoning, and Text Reasoning. Comparisons were made against open-source models (Qwen2.5-VL, InternVL3, Gemma-3, QVQ-72B-Preview, Qwen2.5), proprietary models (GPT-4o, Claude 3.7 Sonnet, Gemini 2.5 Pro), and specialized GUI models (UI-TARS, Aguvis, OS-Atlas), using the authors' adapted evaluation framework based on LMMs-Eval [lmmseval], which is open-sourced for reproducibility.

Key Results:

  • MiMo-VL-7B-SFT and MiMo-VL-7B-RL achieved state-of-the-art performance among open-source models of comparable scale across numerous vision-language and text benchmarks, often surpassing much larger models and sometimes specialized models (e.g., on OSWorld-G compared to UI-TARS).
  • MiMo-VL-7B-RL demonstrated superior performance on multimodal and text reasoning benchmarks compared to other open-source models, validating the effectiveness of their pre-training and MORL approach for reasoning.
  • On GUI tasks, MiMo-VL-7B achieved performance comparable to or exceeding specialized GUI models, particularly on challenging benchmarks like ScreenSpot-Pro and OSWorld-G.
  • An in-house bilingual user preference evaluation using GPT-4o judgments showed MiMo-VL-7B-RL achieving the highest Elo rating among open-source VLMs, closely approaching proprietary models like Claude 3.7 Sonnet. MORL provided a significant boost (22+ Elo points) over the SFT model.
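
For reference, these Elo ratings are derived from pairwise preference judgments with GPT-4o as the judge. A standard Elo update for such pairwise outcomes is sketched below; the K-factor and initial ratings are illustrative assumptions, and the report's exact rating procedure may differ.

```python
# Standard Elo update from pairwise preference judgments (illustrative;
# the report's exact K-factor, initialization, and aggregation are not given here).
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A is preferred over model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a: float, rating_b: float, a_preferred: bool, k: float = 16.0):
    """Update both ratings after one pairwise judgment (e.g., one GPT-4o verdict)."""
    score_a = 1.0 if a_preferred else 0.0
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

# Example: both models start at 1000; model A is preferred in one comparison.
r_a, r_b = elo_update(1000.0, 1000.0, a_preferred=True)  # r_a == 1008.0, r_b == 992.0
```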

Practical Applications & Contributions:

The models are presented as capable components for real-world applications requiring sophisticated visual perception, complex reasoning, and interaction with digital interfaces. The agentic capabilities for GUI interaction are highlighted with a case study (navigating a website to add an item to a wishlist). Other case studies showcase strong OCR-to-table conversion and detailed step-by-step reasoning for complex STEM problems.

To promote transparency and future research, the authors open-source the MiMo-VL-7B model checkpoints and their comprehensive evaluation suite, including prompts and protocols.

In conclusion, the report details the development of the MiMo-VL-7B models through a multi-stage pre-training approach incorporating diverse, high-quality data, especially reasoning data, followed by a novel Mixed On-policy Reinforcement Learning phase. The models achieve state-of-the-art results among open-source alternatives and demonstrate strong capabilities across general vision-language tasks, reasoning, and GUI interaction, while the authors candidly discuss the challenges of achieving stable joint optimization in MORL across highly diverse tasks.
