MiMo-VL-7B-RL: Open-Source 7B VLM
- MiMo-VL-7B-RL is an open-source vision-language model with 7B parameters that delivers robust visual understanding and multimodal reasoning.
- It employs a staged pre-training strategy and a unified mixed on-policy reinforcement learning framework to fuse vision and language effectively.
- The model sets new records among compact models in visual reasoning, GUI grounding, and video analysis, driving advances in multimodal AI applications.
MiMo-VL-7B-RL is an open-source, 7-billion-parameter vision-language model (VLM) developed for state-of-the-art performance in both general visual understanding and multimodal reasoning. Combining a modern VLM architecture, a staged pre-training strategy, and a unified mixed on-policy reinforcement learning (MORL) post-training phase, MiMo-VL-7B-RL establishes new milestones among open-source compact VLMs across a broad set of practical benchmarks and applications.
1. Model Architecture
MiMo-VL-7B-RL consists of three main components:
- Vision Encoder:
  - Backbone: Qwen2.5-ViT, a 32-layer Vision Transformer with 16 attention heads, hidden size 1280, patch size 14, and 2D rotary positional encoding (RoPE).
  - Designed to process visual data including images and video frames, yielding high-fidelity visual features at native resolution.
- MLP Projector:
  - Multi-layer perceptron module that projects vision encoder outputs into the embedding space of the LLM.
  - Facilitates efficient and effective cross-modal fusion.
- LLM Backbone:
  - Derived from MiMo-7B-Base: a 36-layer, hidden-size-4096 transformer model (intermediate size 11008, 32 attention heads, multi-axial RoPE).
  - This LLM is specifically optimized for reasoning, with added capacity from extra depth and a wider hidden dimension relative to the backbones of comparable 7B VLMs.
The model supports sequence lengths up to 32K tokens, enabling long-context multimodal reasoning.
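A minimal sketch of this three-component data flow, assuming generic PyTorch modules; the class name, projector shape, and HF-style `inputs_embeds` call are illustrative assumptions, not the released code:

```python
import torch
import torch.nn as nn

class MiMoVLSketch(nn.Module):
    """Illustrative three-part VLM: ViT encoder -> MLP projector -> LLM."""

    def __init__(self, vit: nn.Module, llm: nn.Module,
                 vit_dim: int = 1280, llm_dim: int = 4096):
        super().__init__()
        self.vit = vit  # e.g. a 32-layer ViT with hidden size 1280
        # MLP projector mapping visual features into the LLM embedding space
        self.projector = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm  # e.g. a 36-layer decoder with hidden size 4096

    def forward(self, pixel_values: torch.Tensor,
                text_embeds: torch.Tensor) -> torch.Tensor:
        # Encode image patches (native resolution) into visual tokens
        vis_tokens = self.vit(pixel_values)          # (B, N_vis, vit_dim)
        vis_embeds = self.projector(vis_tokens)      # (B, N_vis, llm_dim)
        # Prepend visual tokens to the text embeddings (interleaving omitted)
        inputs = torch.cat([vis_embeds, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)        # HF-style call, assumed
```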
2. Pre-Training Strategy
MiMo-VL-7B-RL is pre-trained in four distinct stages using 2.4 trillion tokens balanced across multiple modalities and tasks:
| Stage | Main Focus | Data & Tasks | Learning Rate | Parameters Trained |
|---|---|---|---|---|
| 1 | Projector warmup | Image-caption pairs | 1e-3 | Projector only |
| 2 | Vision-language alignment | Interleaved mixed-modality data | 1e-4 / 1e-5 | ViT + projector |
| 3 | General multimodal pre-training | OCR, GUI, video, reasoning, instructions | 1e-5 | All |
| 4 | Long-context SFT | Reasoning-rich, long-form input/output | 2.5e-5 | All (32K seq. length) |
Key elements include:
- Aggressive deduplication and benchmark decontamination.
- Direct incorporation of large volumes of synthetic, multi-step reasoning data (long Chain-of-Thought or CoT) from advanced LLMs, curated for clarity and correctness.
- Modality coverage including images, GUIs, documents, high-resolution video, synthetic reasoning, and interleaved instructions.
Integrating reasoning and CoT tasks directly into early pre-training, rather than restricting them to later fine-tuning, is shown to provide significant and sustained performance improvements.
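The staged schedule above can be captured as a per-stage configuration that controls which components are unfrozen and at what learning rate; a hypothetical sketch (the stage names and `set_trainable` helper are illustrative, not the authors' training code):

```python
# Hypothetical per-stage configuration mirroring the table above; component
# names match the architecture sketch in Section 1 (vit, projector, llm).
STAGES = [
    {"name": "projector_warmup",          "train": ["projector"],               "lr": 1e-3},
    {"name": "vision_language_alignment", "train": ["vit", "projector"],        "lr": 1e-4},
    {"name": "general_multimodal",        "train": ["vit", "projector", "llm"], "lr": 1e-5},
    # Stage 4 also raises the sequence length to 32K for long-context SFT.
    {"name": "long_context_sft",          "train": ["vit", "projector", "llm"], "lr": 2.5e-5},
]

def set_trainable(model, components):
    """Freeze all parameters, then unfreeze only the listed components."""
    for p in model.parameters():
        p.requires_grad = False
    for name in components:
        for p in getattr(model, name).parameters():
            p.requires_grad = True
```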
3. Mixed On-Policy Reinforcement Learning (MORL)
After pre-training, MiMo-VL-7B-RL undergoes post-training using a unified MORL framework that combines on-policy RL across different reward types:
- Reinforcement Learning with Verifiable Reward (RLVR):
  - Rule-based, automatically verifiable tasks (a minimal reward-function sketch follows this list), such as:
    - Math answer checking (visual and text)
    - GUI grounding (bounding box regression, pointing)
    - Counting and temporal localization in video
    - Science and logic proofs
- Reinforcement Learning from Human Feedback (RLHF):
  - Alignment targeting helpfulness, safety, and user preference, via prompts and reward models.
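A minimal sketch of what such verifiable rewards can look like; the task formats and the 0.5 IoU threshold are assumptions for illustration, not the released reward implementations:

```python
def math_reward(model_answer: str, reference: str) -> float:
    """1.0 if the final answer exactly matches the reference, else 0.0."""
    return float(model_answer.strip() == reference.strip())

def iou(box_a, box_b) -> float:
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def gui_grounding_reward(pred_box, gold_box, threshold: float = 0.5) -> float:
    """Binary reward: 1.0 when the predicted box overlaps gold enough."""
    return float(iou(pred_box, gold_box) >= threshold)
```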
The optimization is performed via a variant of Group Relative Policy Optimization (GRPO). In its on-policy form the objective is

$$
\mathcal{J}_{\mathrm{GRPO}}(\theta)=\mathbb{E}_{\,q\sim\mathcal{D},\;\{o_i\}_{i=1}^{G}\sim\pi_\theta(\cdot\mid q)}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\hat{A}_i\,\log\pi_\theta\!\left(o_{i,t}\mid q,\,o_{i,<t}\right)\right]
$$

with

$$
\hat{A}_i=\frac{r_i-\operatorname{mean}\!\left(\{r_j\}_{j=1}^{G}\right)}{\operatorname{std}\!\left(\{r_j\}_{j=1}^{G}\right)},
$$

where $q$ are prompts, $o_i$ are the $G$ sampled outputs per prompt, and $r_i$ are the corresponding reward values.
The MORL framework enables the model to continually improve over long RL schedules and prevents early saturation observed in off-policy or single-domain RL.
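For concreteness, a minimal sketch of the group-relative advantage and the resulting on-policy loss from the objective above, assuming per-token log-probabilities have already been gathered (a simplified illustration, not the production trainer):

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Normalize rewards within each prompt's group of G samples.

    rewards: (num_prompts, G) reward values for G sampled outputs per prompt.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

def grpo_loss(logprobs: torch.Tensor, advantages: torch.Tensor,
              mask: torch.Tensor) -> torch.Tensor:
    """On-policy policy-gradient loss weighted by group-relative advantages.

    logprobs:   (num_prompts, G, T) per-token log-probs of sampled outputs.
    advantages: (num_prompts, G)    one advantage per sampled output.
    mask:       (num_prompts, G, T) 1.0 for real tokens, 0.0 for padding.
    """
    per_token = logprobs * advantages.unsqueeze(-1) * mask
    # Average over valid tokens per sequence, then over the group and batch
    per_seq = per_token.sum(-1) / mask.sum(-1).clamp(min=1)
    return -per_seq.mean()
```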
4. Performance Benchmarks
MiMo-VL-7B-RL advances the state of the art among compact VLMs, setting new records for 7B models on diverse public and community benchmarks:
| Metric / Benchmark | MiMo-VL-7B-RL | Peer Comparison |
|---|---|---|
| OlympiadBench (reasoning) | 59.4 | Qwen2.5-VL-72B: 37.2 |
| OSWorld-G (GUI grounding) | 56.1 | Qwen2.5-VL-7B: 37.5; UI-TARS: 51.2 |
| MMMU (general multimodal, val) | 66.7 | Highest among 7B open-source VLMs |
| CharXiv RQ (chart/doc QA) | 56.5 | Qwen2.5-VL-7B: 42.5 |
| MMLU-Pro (text reasoning) | 64.8 EM | Stronger than peer 7B LLMs |
| Charades-STA (video) | 50.0 mIoU | Leading performance |
| Elo (user preference) | Highest among open VLMs | Nearly matches proprietary Claude 3.7 Sonnet |
Across an evaluation suite of 50+ open tasks, MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35 of the 40 directly comparable visual benchmarks.
5. Comprehensive Evaluation Suite
To enable transparent benchmarking and future research, MiMo-VL-7B-RL is accompanied by a publicly released, multi-domain evaluation suite covering:
- General visual understanding (e.g., AI2D, MMMU, MME, CV-Bench)
- OCR and chart understanding (e.g., InfoVQA, ChartQA, DocVQA)
- Video understanding (e.g., Video-MME, Video-MMMU, Charades-STA)
- Grounding and counting (RefCOCO, CountBench, PixmoCount)
- GUI understanding (WebSrc, VisualWebBench, ScreenSpot, OSWorld-G)
- Multimodal and textual reasoning (OlympiadBench, MathVision, DynaMath, MathVista)
- Text understanding and QA (DROP, MMLU-Pro)
- User preference (bilingual, pairwise prompts)
Datasets and prompts are designed according to open protocols for reproducibility and extensibility.
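For the user-preference entry above, pairwise judgments are commonly aggregated into Elo ratings; a generic sketch of the standard Elo update (the K-factor of 32 is an assumption, not necessarily the authors' exact protocol):

```python
def elo_update(rating_a: float, rating_b: float,
               a_wins: bool, k: float = 32.0) -> tuple[float, float]:
    """One pairwise Elo update; returns the new (rating_a, rating_b)."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b
```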
6. Applications and Practical Significance
MiMo-VL-7B-RL enables a broad set of vision-language applications:
- Digital agents and UI automation: Accurate GUI element localization and instruction following in web/mobile/desktop environments.
- STEM multimodal reasoning: Step-by-step visual math, logic, and science problem solving utilizing diagrams, video, and charts.
- Document and chart analysis: High-fidelity extraction and comprehension of OCR and chart data for enterprise, finance, and administrative contexts.
- Video event analysis: Temporal and spatial localization within video, supporting egocentric and multi-modal perception tasks.
- Conversational AI: Handles complex, multi-turn queries with bilingual support across multiple modalities.
- Agentic intelligence: Foundation for compact, robust, open-source multimodal assistants.
Its release, combined with its evaluation suite, enables transparent comparison and rapid extension for new research tasks.
7. Research Implications and Future Directions
The MiMo-VL-7B-RL approach demonstrates several critical advances for the VLM field:
- Small model scaling: MiMo-VL-7B-RL, despite its modest size (7B), achieves or exceeds the performance of much larger open and proprietary VLMs (up to 78B) through data curation and carefully staged learning.
- Reasoning-centric data: Direct integration of synthetic long-form CoT during pre-training, instead of limited fine-tuning, significantly boosts reasoning generalization.
- Unified RL for multimodality: The MORL pipeline achieves joint optimization over diverse modalities and tasks, though with open challenges in fine-grained task balance and scaling.
- Open science: Reproducibility is enabled by full release of models, evaluation code, and benchmark data, fostering rapid adoption and transparent progress.
- Agent foundation: Robust GUI grounding and cross-domain reasoning point toward compact, open generalist AI agents in digital environments.
Summary Table: MiMo-VL-7B-RL Characteristics
| Aspect | Details |
|---|---|
| Model size | 7B parameters |
| Architecture | Qwen2.5-ViT (32L) + MLP projector + MiMo-7B-Base (36L, hidden 4096) |
| Pre-training | 4-stage, 2.4T tokens, multimodal & CoT-centric, 32K context |
| RL post-training | Mixed on-policy RL (MORL): RLHF + RLVR, optimized with on-policy GRPO |
| Evaluation | 50+ open benchmarks (images, video, GUI, text), state-of-the-art results |
| Open source | Full checkpoints, code, and evaluation suite on GitHub |