MiMo-VL-7B-RL: Open-Source 7B VLM
- MiMo-VL-7B-RL is an open-source vision-language model with 7B parameters that delivers robust visual understanding and multimodal reasoning.
- It employs a staged pre-training strategy and a unified mixed on-policy reinforcement learning framework to fuse vision and language effectively.
- The model sets new records among compact models in visual reasoning, GUI grounding, and video analysis, driving advances in multimodal AI applications.
MiMo-VL-7B-RL is an open-source, 7-billion-parameter vision-language model (VLM) developed for state-of-the-art performance in both general visual understanding and multimodal reasoning. Combining a modern VLM architecture, a staged pre-training strategy, and a unified mixed on-policy reinforcement learning (MORL) post-training phase, MiMo-VL-7B-RL establishes new milestones among open-source compact VLMs across a broad set of practical benchmarks and applications.
1. Model Architecture
MiMo-VL-7B-RL consists of three main components:
- Vision Encoder:
  - Backbone: Qwen2.5-ViT, a 32-layer Vision Transformer with 16 attention heads, hidden size 1280, patch size 14, and 2D rotary positional encoding (RoPE).
  - Designed to process visual data including images and video frames, yielding high-fidelity visual features at native resolution.
- MLP Projector:
  - Multi-layer perceptron module that projects vision encoder outputs into the embedding space of the LLM.
  - Facilitates efficient and effective cross-modal fusion.
- LLM Backbone:
  - Derived from MiMo-7B-Base: a 36-layer, hidden-size-4096 transformer model (intermediate size 11008, 32 attention heads, multi-axial RoPE).
  - This LLM is specifically optimized for reasoning, with added capacity from extra depth and a wider hidden dimension relative to the backbones of comparable 7B VLMs.
The model supports sequence lengths up to 32K tokens, enabling long-context multimodal reasoning.
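A minimal sketch of this three-component data flow, assuming generic PyTorch modules; the class name, projector shape, and HF-style `inputs_embeds` call are illustrative assumptions, not the released code:

```python
import torch
import torch.nn as nn

class MiMoVLSketch(nn.Module):
    """Illustrative three-part VLM: ViT encoder -> MLP projector -> LLM."""

    def __init__(self, vit: nn.Module, llm: nn.Module,
                 vit_dim: int = 1280, llm_dim: int = 4096):
        super().__init__()
        self.vit = vit  # e.g. a 32-layer ViT with hidden size 1280
        # MLP projector mapping visual features into the LLM embedding space
        self.projector = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm  # e.g. a 36-layer decoder with hidden size 4096

    def forward(self, pixel_values: torch.Tensor,
                text_embeds: torch.Tensor) -> torch.Tensor:
        # Encode image patches (native resolution) into visual tokens
        vis_tokens = self.vit(pixel_values)          # (B, N_vis, vit_dim)
        vis_embeds = self.projector(vis_tokens)      # (B, N_vis, llm_dim)
        # Prepend visual tokens to the text embeddings (interleaving omitted)
        inputs = torch.cat([vis_embeds, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)        # HF-style call, assumed
```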
2. Pre-Training Strategy
MiMo-VL-7B-RL is pre-trained in four distinct stages using 2.4 trillion tokens balanced across multiple modalities and tasks:
| Stage | Main Focus | Data & Tasks | Learning Rate | Parameters Trained |
|---|---|---|---|---|
| 1 | Projector warmup | Image-caption pairs | 1e-3 | Projector only |
| 2 | Vision-language alignment | Interleaved mixed-modality data | 1e-4 / 1e-5 | ViT + projector |
| 3 | General multimodal pre-training | OCR, GUI, video, reasoning, instructions | 1e-5 | All |
| 4 | Long-context SFT | Reasoning-rich, long-form input/output | 2.5e-5 | All (32K seq. length) |
Key elements include:
- Aggressive deduplication and benchmark decontamination.
- Direct incorporation of large volumes of synthetic, multi-step reasoning data (long Chain-of-Thought or CoT) from advanced LLMs, curated for clarity and correctness.
- Modality coverage including images, GUIs, documents, high-resolution video, synthetic reasoning, and interleaved instructions.
Integrating reasoning and CoT tasks directly into early pre-training, rather than restricting them to later fine-tuning, is shown to provide significant and sustained performance improvements.
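The staged schedule above can be captured as a per-stage configuration that controls which components are unfrozen and at what learning rate; a hypothetical sketch (the stage names and `set_trainable` helper are illustrative, not the authors' training code):

```python
# Hypothetical per-stage configuration mirroring the table above; component
# names match the architecture sketch in Section 1 (vit, projector, llm).
STAGES = [
    {"name": "projector_warmup",          "train": ["projector"],               "lr": 1e-3},
    {"name": "vision_language_alignment", "train": ["vit", "projector"],        "lr": 1e-4},
    {"name": "general_multimodal",        "train": ["vit", "projector", "llm"], "lr": 1e-5},
    # Stage 4 also raises the sequence length to 32K for long-context SFT.
    {"name": "long_context_sft",          "train": ["vit", "projector", "llm"], "lr": 2.5e-5},
]

def set_trainable(model, components):
    """Freeze all parameters, then unfreeze only the listed components."""
    for p in model.parameters():
        p.requires_grad = False
    for name in components:
        for p in getattr(model, name).parameters():
            p.requires_grad = True
```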
3. Mixed On-Policy Reinforcement Learning (MORL)
After pre-training, MiMo-VL-7B-RL undergoes post-training using a unified MORL framework that combines on-policy RL across different reward types:
- Reinforcement Learning with Verifiable Reward (RLVR):
  - Rule-based, automatically verifiable tasks (a minimal reward-function sketch follows this list), such as:
    - Math answer checking (visual and text)
    - GUI grounding (bounding box regression, pointing)
    - Counting and temporal localization in video
    - Science and logic proofs
- Reinforcement Learning from Human Feedback (RLHF):
  - Alignment targeting helpfulness, safety, and user preference, via prompts and reward models.
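A minimal sketch of what such verifiable rewards can look like; the task formats and the 0.5 IoU threshold are assumptions for illustration, not the released reward implementations:

```python
def math_reward(model_answer: str, reference: str) -> float:
    """1.0 if the final answer exactly matches the reference, else 0.0."""
    return float(model_answer.strip() == reference.strip())

def iou(box_a, box_b) -> float:
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def gui_grounding_reward(pred_box, gold_box, threshold: float = 0.5) -> float:
    """Binary reward: 1.0 when the predicted box overlaps gold enough."""
    return float(iou(pred_box, gold_box) >= threshold)
```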
The optimization is performed via a variant of Group Relative Policy Optimization (GRPO). In its on-policy form the objective is

$$
\mathcal{J}_{\mathrm{GRPO}}(\theta)=\mathbb{E}_{\,q\sim\mathcal{D},\;\{o_i\}_{i=1}^{G}\sim\pi_\theta(\cdot\mid q)}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\hat{A}_i\,\log\pi_\theta\!\left(o_{i,t}\mid q,\,o_{i,<t}\right)\right]
$$

with

$$
\hat{A}_i=\frac{r_i-\operatorname{mean}\!\left(\{r_j\}_{j=1}^{G}\right)}{\operatorname{std}\!\left(\{r_j\}_{j=1}^{G}\right)},
$$

where $q$ are prompts, $o_i$ are the $G$ sampled outputs per prompt, and $r_i$ are the corresponding reward values.
The MORL framework enables the model to continually improve over long RL schedules and prevents early saturation observed in off-policy or single-domain RL.
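For concreteness, a minimal sketch of the group-relative advantage and the resulting on-policy loss from the objective above, assuming per-token log-probabilities have already been gathered (a simplified illustration, not the production trainer):

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Normalize rewards within each prompt's group of G samples.

    rewards: (num_prompts, G) reward values for G sampled outputs per prompt.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

def grpo_loss(logprobs: torch.Tensor, advantages: torch.Tensor,
              mask: torch.Tensor) -> torch.Tensor:
    """On-policy policy-gradient loss weighted by group-relative advantages.

    logprobs:   (num_prompts, G, T) per-token log-probs of sampled outputs.
    advantages: (num_prompts, G)    one advantage per sampled output.
    mask:       (num_prompts, G, T) 1.0 for real tokens, 0.0 for padding.
    """
    per_token = logprobs * advantages.unsqueeze(-1) * mask
    # Average over valid tokens per sequence, then over the group and batch
    per_seq = per_token.sum(-1) / mask.sum(-1).clamp(min=1)
    return -per_seq.mean()
```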
4. Performance Benchmarks
MiMo-VL-7B-RL advances the state of the art among compact VLMs, setting new records for 7B models on diverse public and community benchmarks:
| Metric / Benchmark | MiMo-VL-7B-RL | Peer Comparison |
|---|---|---|
| OlympiadBench (reasoning) | 59.4 | Qwen2.5-VL-72B: 37.2 |
| OSWorld-G (GUI grounding) | 56.1 | Qwen2.5-VL-7B: 37.5; UI-TARS: 51.2 |
| MMMU (general multimodal, val) | 66.7 | Highest among 7B open-source VLMs |
| CharXiv RQ (chart/doc QA) | 56.5 | Qwen2.5-VL-7B: 42.5 |
| MMLU-Pro (text reasoning) | 64.8 EM | Stronger than peer 7B LLMs |
| Charades-STA (video) | 50.0 mIoU | Leading performance |
| Elo (user preference) | Highest among open VLMs | Nearly matches proprietary Claude 3.7 Sonnet |
Across an evaluation suite of 50+ open tasks, MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35 of the 40 directly comparable visual benchmarks.
5. Comprehensive Evaluation Suite
To enable transparent benchmarking and future research, MiMo-VL-7B-RL is accompanied by a publicly released, multi-domain evaluation suite covering:
- General visual understanding (e.g., AI2D, MMMU, MME, CV-Bench)
- OCR and chart understanding (e.g., InfoVQA, ChartQA, DocVQA)
- Video understanding (e.g., Video-MME, Video-MMMU, Charades-STA)
- Grounding and counting (RefCOCO, CountBench, PixmoCount)
- GUI understanding (WebSrc, VisualWebBench, ScreenSpot, OSWorld-G)
- Multimodal and textual reasoning (OlympiadBench, MathVision, DynaMath, MathVista)
- Text understanding and QA (DROP, MMLU-Pro)
- User preference (bilingual, pairwise prompts)
Datasets and prompts are designed according to open protocols for reproducibility and extensibility.
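For the user-preference entry above, pairwise judgments are commonly aggregated into Elo ratings; a generic sketch of the standard Elo update (the K-factor of 32 is an assumption, not necessarily the authors' exact protocol):

```python
def elo_update(rating_a: float, rating_b: float,
               a_wins: bool, k: float = 32.0) -> tuple[float, float]:
    """One pairwise Elo update; returns the new (rating_a, rating_b)."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b
```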
6. Applications and Practical Significance
MiMo-VL-7B-RL enables a broad set of vision-language applications:
- Digital agents and UI automation: Accurate GUI element localization and instruction following in web/mobile/desktop environments.
- STEM multimodal reasoning: Step-by-step visual math, logic, and science problem solving utilizing diagrams, video, and charts.
- Document and chart analysis: High-fidelity extraction and comprehension of OCR and chart data for enterprise, finance, and administrative contexts.
- Video event analysis: Temporal and spatial localization within video, supporting egocentric and multi-modal perception tasks.
- Conversational AI: Handles complex, multi-turn queries with bilingual support across multiple modalities.
- Agentic intelligence: Foundation for compact, robust, open-source multimodal assistants.
Its release, combined with its evaluation suite, enables transparent comparison and rapid extension for new research tasks.
7. Research Implications and Future Directions
The MiMo-VL-7B-RL approach demonstrates several critical advances for the VLM field:
- Small model scaling: MiMo-VL-7B-RL, despite its modest size (7B), achieves or exceeds the performance of much larger open and proprietary VLMs (up to 78B) through data curation and carefully staged learning.
- Reasoning-centric data: Direct integration of synthetic long-form CoT during pre-training, instead of limited fine-tuning, significantly boosts reasoning generalization.
- Unified RL for multimodality: The MORL pipeline achieves joint optimization over diverse modalities and tasks, though with open challenges in fine-grained task balance and scaling.
- Open science: Reproducibility is enabled by full release of models, evaluation code, and benchmark data, fostering rapid adoption and transparent progress.
- Agent foundation: Robust GUI grounding and cross-domain reasoning point toward compact, open generalist AI agents in digital environments.
Summary Table: MiMo-VL-7B-RL Characteristics
| Aspect | Details |
|---|---|
| Model size | 7B parameters |
| Architecture | Qwen2.5-ViT (32L) + MLP projector + MiMo-7B-Base (36L, hidden 4096) |
| Pre-training | 4-stage, 2.4T tokens, multimodal & CoT-centric, 32K context |
| RL post-training | Mixed on-policy RL (MORL): RLHF + RLVR, optimized with on-policy GRPO |
| Evaluation | 50+ open benchmarks (images, video, GUI, text), state-of-the-art results |
| Open source | Full checkpoints, code, and evaluation suite on GitHub |