MiMo-VL-7B-RL: Open-Source Multimodal Reasoning and GUI Grounding

Last updated: June 10, 2025

Significance and Background

MiMo-VL-7B-RL is an open-source vision-language model (VLM) designed for high-performance general visual understanding and complex multimodal reasoning at a compact 7B-parameter scale. Unlike most open-source VLMs, which achieve strong results primarily at substantially larger (≥70B) parameter counts or within narrow domains, MiMo-VL-7B-RL pursues broad generalization across domains such as visual question answering, document and chart understanding, video analysis, and particularly graphical user interface (GUI) grounding, all without resorting to a specialist architecture or excessive model size (Team et al., 4 Jun 2025).

A central goal in the development of MiMo-VL-7B-RL is to leverage high-quality reasoning data, especially long chain-of-thought (CoT) traces, throughout both pre-training and reinforcement learning (RL). Its evaluation spans an extensive suite of over fifty tasks, including general vision-language benchmarks, STEM and agentic tasks, GUI alignment, and multimodal reasoning, using open-sourced datasets and evaluation logic to promote reproducibility (Team et al., 4 Jun 2025).

Foundational Concepts

High-Resolution Multimodal Backbone

MiMo-VL-7B-RL utilizes a high-resolution multimodal architecture structured as follows (Team et al., 4 Jun 2025):

  • Native-Resolution ViT Encoder: Based on Qwen2.5-ViT, this module provides high-fidelity feature extraction from images and videos (32 layers, 16 heads, hidden size 1280, 2D RoPE positional encoding).
  • MLP Projector: A multi-layer perceptron that projects visual features into the shared latent space required by the LLM, promoting robust cross-modal alignment.
  • MiMo-7B LLM Backbone: A 36-layer, 4096-hidden-unit transformer with 32 attention heads and MRoPE positional encoding, optimized specifically for reasoning tasks and multimodal context.

All supported modalities, including images, GUI screenshots, video frames, and text, are processed in high resolution and handled as a unified, context-aware token sequence (Team et al., 4 Jun 2025).
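
The following is a minimal PyTorch-style sketch of how these three components could be wired together. The layer, head, and hidden-size figures come from the description above; the module composition, class names, and forward pass are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

VIT_HIDDEN = 1280   # Qwen2.5-ViT: 32 layers, 16 heads, hidden size 1280
LLM_HIDDEN = 4096   # MiMo-7B backbone: 36 layers, 32 heads, hidden size 4096

class VisionLanguageProjector(nn.Module):
    """MLP that maps native-resolution ViT patch features into the LLM embedding space."""
    def __init__(self, vit_dim: int = VIT_HIDDEN, llm_dim: int = LLM_HIDDEN):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        return self.mlp(patch_features)

class MiMoVLSketch(nn.Module):
    """Vision encoder -> MLP projector -> reasoning LLM, over one unified token sequence."""
    def __init__(self, vision_encoder: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder   # native-resolution ViT (frozen in stage 1)
        self.projector = VisionLanguageProjector()
        self.llm = llm                         # MiMo-7B LLM backbone

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor):
        visual_tokens = self.projector(self.vision_encoder(pixel_values))
        # Visual and text tokens are interleaved into one context-aware sequence.
        fused = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(fused)
```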

Four-Stage Pre-training Leveraging Chain-of-Thought Data

Pre-training proceeds in four distinct stages, summarized below and captured as a data schedule in the sketch after the list (Team et al., 4 Jun 2025):

  1. Projector Warmup: The projector is initialized with image-caption data, while both the ViT and LLM are frozen (sequence length 8K, 300B tokens).
  2. Vision-Language (VL) Alignment: Training is extended to interleaved image-text content (PDFs, books, web), focusing on knowledge-rich and well-annotated datasets (ViT & projector trainable, 167B tokens).
  3. General Multimodal Learning: Expansive pre-training on OCR, grounding, question answering, video, GUI, and reasoning datasets, with all model parameters trainable (1.4T tokens).
  4. Long-Context Supervised Fine-Tuning (SFT): Incorporation of long-form documents, videos, high-resolution images, and, critically, synthetic long chain-of-thought reasoning traces for robust deep reasoning (sequence length 32K, 550B tokens).
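
The schedule above can be captured compactly as data. In the sketch below, token budgets and the stage-1/stage-4 sequence lengths are taken from the list; the dataclass, field names, and unspecified sequence lengths are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class PretrainStage:
    name: str
    trainable: Tuple[str, ...]   # modules that receive gradients in this stage
    seq_len: Optional[int]       # max sequence length (None where not stated above)
    tokens: str                  # approximate training budget

STAGES = [
    PretrainStage("projector_warmup",   ("projector",),              8_192,  "300B"),
    PretrainStage("vl_alignment",       ("vit", "projector"),        None,   "167B"),
    PretrainStage("general_multimodal", ("vit", "projector", "llm"), None,   "1.4T"),
    PretrainStage("long_context_sft",   ("vit", "projector", "llm"), 32_768, "550B"),
]

for stage in STAGES:
    print(f"{stage.name}: trainable={stage.trainable}, seq_len={stage.seq_len}, ~{stage.tokens} tokens")
```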

Deduplication is rigorously enforced using techniques such as perceptual hashing (phash) to avoid contamination from test sets (Team et al., 4 Jun 2025).
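
As a concrete illustration, a phash-based decontamination pass might look like the following sketch, using the widely available imagehash library. The Hamming-distance threshold and file handling are assumptions; only the use of perceptual hashing for test-set decontamination comes from the report.

```python
from PIL import Image
import imagehash

HAMMING_THRESHOLD = 5  # hashes at or below this distance count as near-duplicates (assumed value)

def build_benchmark_hashes(benchmark_image_paths):
    """Hash every benchmark/test-set image once."""
    return {imagehash.phash(Image.open(path)) for path in benchmark_image_paths}

def is_contaminated(train_image_path, benchmark_hashes):
    """Flag a training image that perceptually matches any benchmark image."""
    candidate = imagehash.phash(Image.open(train_image_path))
    # Subtracting two ImageHash objects yields their Hamming distance.
    return any(candidate - ref <= HAMMING_THRESHOLD for ref in benchmark_hashes)
```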

A distinguishing methodological choice is the deep integration of CoT data into all pre-training stages, not just as a fine-tuning step. This approach supports generalizable and robust problem-solving far beyond pattern-based question answering (Team et al., 4 Jun 2025).

Mixed On-Policy Reinforcement Learning (MORL)

MiMo-VL-7B-RL is further optimized using Mixed On-policy Reinforcement Learning (MORL), which simultaneously incorporates reward signals from several sources spanning perception, grounding, reasoning, and agentic tasks (Team et al., 4 Jun 2025).

Reward computation is implemented modularly through a "Reward-as-a-Service" pipeline, using HTTP-based routing to select among heterogeneous reward models and batch-normalize their outputs. RL updates employ an on-policy variant of Group Relative Policy Optimization (GRPO), with observed improvements in stability and scalability over common off-policy schemes (Team et al., 4 Jun 2025):

$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q \sim D,\ \{o_i\} \sim \pi_\theta}\left[ \frac{1}{\sum_{i=1}^{G} |o_i|} \sum_{i=1}^{G} \sum_{j=1}^{|o_i|} A_{i,j} \right]$$

with

$$A_{i,j} = \frac{r_i - \mathrm{mean}(\{r_i\})}{\mathrm{std}(\{r_i\})}$$

where $r_i$ is the reward assigned to output $o_i$, $G$ the number of outputs sampled per query, $D$ the dataset of queries, and $\pi_\theta$ the current policy.
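
A minimal sketch of the group-relative advantage normalization and token-averaged objective is shown below. Weighting each token's advantage by its log-probability is a standard policy-gradient detail assumed here for completeness; it is not spelled out in the simplified objective above.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize per-output rewards within one group of G sampled outputs."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_loss(token_logprobs: list, rewards: torch.Tensor) -> torch.Tensor:
    """token_logprobs[i] holds the log-probabilities of the |o_i| tokens of output i."""
    advantages = grpo_advantages(rewards)
    total_tokens = sum(lp.numel() for lp in token_logprobs)
    # Every token of output i shares the output-level advantage A_i; the sum is
    # divided by the total token count, matching the 1 / sum_i |o_i| factor above.
    weighted = sum((adv * lp).sum() for adv, lp in zip(advantages, token_logprobs))
    return -weighted / total_tokens
```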

One repercussion of this multifaceted RL objective is the possibility of "objective interference," where, for example, the preference for longer, detailed CoT outputs may conflict with optimal performance on succinct spatial-grounding tasks. Nevertheless, empirical results indicate that MORL achieves greater overall progress, particularly in GUI and agentic domains, than single-objective or domain-specific RL pipelines (Team et al., 4 Jun 2025).

Key Developments and Findings

Architecture and Pre-training

MiMo-VL-7B-RL employs a comprehensive three-part architecture, with all trainable modules exposed in later pre-training and SFT stages. It is pre-trained on 2.4 trillion tokens, one of the largest efforts for a model of this size, with dataset curation that emphasizes diversity, quality, and deduplication (including phash-based image filtering and prompt-level rule-based decontamination). Synthetic CoT and reasoning data, generated and filtered using advanced LLMs, contribute strongly to multi-step problem-solving capabilities (Team et al., 4 Jun 2025).
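
One plausible form of the prompt-level, rule-based decontamination mentioned above is an n-gram overlap check against benchmark prompts, sketched below. The n-gram length and matching rule are assumptions; the report states only that rule-based decontamination was applied.

```python
NGRAM_LEN = 13  # assumed n-gram length; the report does not specify the rule

def ngrams(text: str, n: int = NGRAM_LEN):
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_benchmark_index(benchmark_prompts, n: int = NGRAM_LEN):
    """Collect all n-grams appearing in any benchmark prompt."""
    index = set()
    for prompt in benchmark_prompts:
        index |= ngrams(prompt, n)
    return index

def is_prompt_contaminated(train_prompt: str, benchmark_index: set, n: int = NGRAM_LEN) -> bool:
    """Drop a training sample whose prompt shares a long n-gram with a benchmark prompt."""
    return not ngrams(train_prompt, n).isdisjoint(benchmark_index)
```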

MORL Implementation and Effects

The MORL framework allows for simultaneous optimization across perception, reasoning, GUI, and agentic domains by normalizing and combining diverse reward signals. On-policy RL is applied with batchwise advantage normalization, and all reward computation is handled via a modular, scalable service infrastructure. This supports multi-domain optimization and faster progress as the RL dataset grows (Team et al., 4 Jun 2025).

The MORL setup mitigates (but does not eliminate) the "plateauing" observed in off-policy RL while enabling competitive or state-of-the-art results across multiple domains. The primary challenge, cross-task interference, remains an open area for improvement (Team et al., 4 Jun 2025).

Evaluation Methodology

MiMo-VL-7B-RL is evaluated on an open, comprehensive suite of more than fifty benchmarks spanning general vision-language understanding, document and chart comprehension, video analysis, STEM and multimodal reasoning, GUI grounding, and agentic tasks (Team et al., 4 Jun 2025).

All prompts, evaluation logic, and deduplicated test sets are open-sourced to support reproducibility and fair comparison.

Performance Metrics and Outcomes

MiMo-VL-7B-RL sets new open-source standards for general and specialist vision-language performance in the 7B class, supported by extensive benchmark data (Team et al., 4 Jun 2025):

| Task/Domain | MiMo-VL-7B-RL | Notable Comparisons |
|---|---|---|
| MMMU | 66.7 | Qwen2.5-VL-7B: 58.6; Gemma 3-27B: 64.9 |
| OlympiadBench | 59.4 | Qwen2.5-VL-72B: 37.2; GPT-4o: 25.9 |
| OSWorld-G (GUI) | 56.1 (center accuracy) | Qwen2.5-VL-7B: 37.5; UI-TARS: 53.2 |
| CharXiv-RQ / ChartQA | 56.5 / SOTA | Highest among open-source models |
| Elo (arena preference) | Highest open-source | Approaching Claude 3.7 Sonnet |

The model excels in advanced reasoning domains, with particularly strong generalization on long-form CoT tasks; for example, output length on the MMMU benchmark increases from 680 to 2,500 tokens after CoT integration, reflecting the depth of multi-step output (Team et al., 4 Jun 2025). In GUI grounding and spatial alignment, the model outperforms even specialized systems without recourse to niche architectural modifications.

Emerging Trends and Future Directions

Deep Integration of Long-Form Reasoning

Ablation studies underscore that integrating extensive, high-quality chain-of-thought data throughout pre-training, not merely at the SFT stage, reliably elevates both perception and reasoning metrics (Team et al., 4 Jun 2025). Models tuned on short-answer or repetitive data alone plateau at lower performance.

Multidomain RL and Reward as a Service

The modular, externally routable "Reward-as-a-Service" design for reward computation enables rapid adoption of new reward signals and scalable multi-domain learning. Supporting all domains (perception, grounding, reasoning, agentics) within a single RL infrastructure appears influential for broad-based progress (Team et al., 4 Jun 2025).
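
To make the idea concrete, the sketch below routes rollouts over HTTP to task-specific reward endpoints and normalizes the returned scores per batch. The endpoint URLs, payload schema, and normalization details are hypothetical and shown purely for illustration.

```python
import statistics
import requests

REWARD_ENDPOINTS = {  # hypothetical service registry, one endpoint per reward domain
    "grounding": "http://reward-svc/grounding",
    "reasoning": "http://reward-svc/reasoning",
    "gui":       "http://reward-svc/gui",
}

def score_rollouts(rollouts):
    """rollouts: list of dicts with 'task_type', 'prompt', and 'response' keys."""
    raw = []
    for rollout in rollouts:
        url = REWARD_ENDPOINTS[rollout["task_type"]]
        resp = requests.post(
            url,
            json={"prompt": rollout["prompt"], "response": rollout["response"]},
            timeout=30,
        )
        raw.append(resp.json()["reward"])
    # Normalize per batch so heterogeneous reward scales can be mixed in one RL update.
    mean = statistics.fmean(raw)
    std = statistics.pstdev(raw) or 1.0
    return [(score - mean) / std for score in raw]
```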

Transparency and Community Engagement

By releasing all model checkpoints, the entire evaluation suite (tasks, code, logic), and comprehensive performance results, MiMo-VL-7B-RL prioritizes reproducibility and community benchmarking. This fosters fair, ongoing, and open comparison as the field advances (Team et al., 4 Jun 2025).

Limitations and Open Questions

While MORL enables substantial progress, simultaneous optimization for conflicting objectives occasionally leads to cross-domain interference, such as longer CoT outputs negatively impacting tasks that benefit from brevity. Present data suggest that further research in RL objective scheduling, dynamic reward balancing, or modular adaptation will be necessary to resolve these interactions (Team et al., 4 Jun 2025).

Speculative Note: The capacity of this approach to scale with the addition of new modalities (such as audio) or further reward domains remains to be demonstrated with concrete data.

Conclusion

MiMo-VL-7B-RL combines a robust, high-resolution architecture, extensive pre-training inclusive of chain-of-thought reasoning, and a sophisticated mixed-reward RL regime to deliver state-of-the-art performance for both general vision-language understanding and specialized tasks such as GUI grounding, all within a compact 7B-parameter footprint. Its fully open-source release, open evaluation framework, and deep commitment to high-quality, reasoning-centered training establish a concrete foundation for future work in multimodal RL and vision-language systems (Team et al., 4 Jun 2025).

References

MiMo-VL-7B-RL model checkpoints, datasets, and evaluation suite: https://github.com/XiaomiMiMo/MiMo-VL

Key Source:

MiMo-VL Technical Report (Team et al., 4 Jun 2025)