MiMo-VL-7B-SFT: Multimodal Reasoning Model

Updated 30 June 2025
  • MiMo-VL-7B-SFT is an open-source 7B parameter vision-language model that integrates advanced visual encoding with deep chain-of-thought reasoning.
  • The model employs a four-stage pre-training strategy to progressively align modalities, handling diverse tasks like OCR, GUI navigation, video analysis, and STEM problem-solving.
  • Benchmark evaluations across 50+ tasks indicate that MiMo-VL-7B-SFT outperforms larger models, demonstrating robust performance in visual, textual, and agentic domains.

MiMo-VL-7B-SFT is an open-source 7-billion parameter vision-LLM designed to provide state-of-the-art performance across general visual understanding and multimodal reasoning, including STEM problem-solving, GUI grounding, OCR, video, and agentic tasks. Its development and evaluation are centered on four-stage multimodal pre-training, an architecture optimized for long-chain-of-thought (CoT) reasoning, and an evaluation strategy spanning 50+ open benchmarks.

1. Model Architecture

MiMo-VL-7B-SFT is built on a three-part architecture:

  1. Vision Encoder:
    • A Vision Transformer (ViT) analogous to Qwen2.5-ViT.
    • 32 layers, 16 attention heads, hidden size 1280, 2D rotary position encoding (RoPE).
    • Native-resolution image support.
  2. MLP Projector:
    • Trained to align the ViT output with the LLM's input embedding space.
    • Random initialization, with warmup and progressive training, ensures stable vision-language alignment.
  3. LLM Backbone:
    • Based on MiMo-7B-Base, designed specifically for deep reasoning.
    • 36 transformer layers, 32 attention heads, hidden size 4096, intermediate size 11008, with MRoPE positional encoding.
    • Architecture is selected for high performance on reasoning and chain-of-thought (CoT) data.

Configuration Table:

                        Vision Encoder   LLM
  Layers                32               36
  Attention Heads       16               32
  Hidden Size           1280             4096
  Intermediate Size     3456             11008
  Positional Encoding   2D RoPE          MRoPE
  Patch Size            14               —

This design ensures both high-quality visual perception and deep reasoning capabilities.
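
The table above can be read as three composable modules. The PyTorch sketch below illustrates how they fit together; the class names, the two-layer projector design, and the forward pass are assumptions for illustration, not the released implementation.

```python
# Minimal structural sketch of the three-part MiMo-VL-7B architecture.
# Layer counts and hidden sizes follow the configuration table above; the
# concrete vision encoder and LLM modules are passed in and are not defined here.
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Maps ViT output features (hidden 1280) into the LLM embedding space (hidden 4096)."""
    def __init__(self, vision_dim: int = 1280, llm_dim: int = 4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        return self.net(vision_feats)

class MiMoVL(nn.Module):
    """Vision encoder -> MLP projector -> LLM backbone."""
    def __init__(self, vision_encoder: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder   # 32-layer ViT, hidden 1280, 2D RoPE
        self.projector = MLPProjector()        # randomly initialized, warmed up in Stage 1
        self.llm = llm                         # 36-layer MiMo-7B-Base, hidden 4096, MRoPE

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        vision_tokens = self.projector(self.vision_encoder(pixel_values))
        # Projected vision tokens are concatenated with text embeddings before the LLM.
        return self.llm(torch.cat([vision_tokens, text_embeds], dim=1))
```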

2. Four-Stage Pre-Training Pipeline

MiMo-VL-7B-SFT employs a four-stage pre-training strategy with 2.4 trillion multimodal tokens:

  Stage   Purpose                       Data Type                              LR          Params Updated
  1       Projector Warmup              Image-caption (MetaCLIP-style)         1e-3        Projector
  2       Vision-Language Alignment     Interleaved multimodal                 1e-4/1e-5   ViT + Projector
  3       General Multimodal Pretrain   Diverse multimodal (OCR, GUI, etc.)    1e-5        All
  4       Long-Context SFT              Long sequences, CoT reasoning          2.5e-5      All

  • Stage 1 locks all weights except the projector, preventing early training instability.
  • Stage 2 progressively unfreezes the ViT, introducing complex interleaved image-text data to refine multimodal alignment.
  • Stage 3 unfreezes all weights, introducing the full diversity of tasks, modalities, and increased context lengths (up to 32k tokens), including synthetic reasoning, OCR, video, GUI, and agentic navigation.
  • Stage 4 emphasizes high-quality, long-form chain-of-thought (CoT) reasoning data for robust reasoning skill acquisition.

This methodology prioritizes robust cross-modal feature learning, minimizes catastrophic forgetting, and ensures that generalization and deep reasoning emerge as global model properties.
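
To make the freezing and learning-rate schedule concrete, the sketch below maps each stage to the parameter groups it updates, assuming a model object with `projector` and `vision_encoder` attributes like the architecture sketch in Section 1; the optimizer choice (AdamW) and the single learning rate per stage are assumptions.

```python
import torch

def configure_stage(model, stage: int) -> torch.optim.Optimizer:
    """Freeze/unfreeze modules and select the learning rate for a pre-training stage."""
    # Freeze everything, then re-enable only the modules trained in this stage.
    for p in model.parameters():
        p.requires_grad = False

    if stage == 1:        # projector warmup on image-caption pairs
        trainable, lr = [model.projector], 1e-3
    elif stage == 2:      # vision-language alignment (table lists 1e-4/1e-5 across this stage)
        trainable, lr = [model.vision_encoder, model.projector], 1e-4
    elif stage == 3:      # general multimodal pre-training, all parameters
        trainable, lr = [model], 1e-5
    else:                 # stage 4: long-context SFT with CoT reasoning data
        trainable, lr = [model], 2.5e-5

    for module in trainable:
        for p in module.parameters():
            p.requires_grad = True

    return torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=lr)
```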

3. Data Curation and Distribution

The data pipeline integrates:

  • Image-caption pairs: Extensive extraction, recaptioning (MetaCLIP-style), and deduplication.
  • Interleaved multimodal data: Captions, accompanying text, and visual content extracted from diverse web, book, and research sources, carefully filtered for domain quality and balance.
  • OCR and grounding: Blend of natural scene text, handwritten/blurry regions, and explicit region annotations.
  • Video: Temporally aligned, event-level, and global semantic labels, spanning diverse genres and lengths.
  • GUI/HCI: Combination of open and synthetic GUI/human-computer data, supporting multi-modality action representations.
  • Reasoning: Synthetic and filtered real-world CoT data, specifically targeting math, chart, document, and agentic reasoning.

Data filtering uses pHash deduplication, multi-phase language balancing, and aggressive filtering for clarity, relevance, and reasoning chain evidence, especially in later training stages.
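
As a concrete illustration of the pHash deduplication step, the sketch below uses the `imagehash` and Pillow packages; this is an implementation assumption, since the paper's actual pipeline is not described at code level, and the distance threshold is illustrative.

```python
# A minimal sketch of perceptual-hash (pHash) deduplication over an image list.
from PIL import Image
import imagehash

def deduplicate(image_paths, max_distance: int = 4):
    """Keep only images whose pHash differs from every kept hash by more than max_distance."""
    kept, hashes = [], []
    for path in image_paths:
        h = imagehash.phash(Image.open(path))
        if all(h - prev > max_distance for prev in hashes):  # Hamming distance between hashes
            kept.append(path)
            hashes.append(h)
    return kept
```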

4. Generalization and Mixed On-Policy Reinforcement Learning (MORL)

While MiMo-VL-7B-SFT is trained with supervised multimodal data, the architecture and data pipeline are developed for compatibility with Mixed On-Policy Reinforcement Learning (MORL), which is applied to yield MiMo-VL-7B-RL.

  • RLVR (Reinforcement Learning with Verifiable Rewards): Applied to math, grounding, counting, and video, with rule/verifier-based scoring (e.g., Math-Verify, GIoU, IoU).
  • RLHF (Human Feedback): Pairwise preference models (text-only and multimodal) trained on expert-judged rankings.
  • On-Policy GRPO Algorithm: For each query, a group of outputs is sampled and their rewards are standardized within the group to form advantages (see the sketch at the end of this section). Only on-policy updates are performed, supporting stable exploration and scaling to new reward signals; no surrogate loss is used, only direct optimization.

The SFT-trained model is designed to gain immediate benefits from MORL-based post-training, especially in reasoning-heavy and verifiable visual domains. The modular reward and task routing ensures extensibility to document reasoning, navigation, and user interaction scenarios.
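
To make the group-relative reward standardization concrete, the sketch below scores a group of sampled grounding outputs with a plain IoU verifier and converts the rewards into group-relative advantages. The function names, the box format, and the choice of IoU rather than GIoU are assumptions for illustration, not the released MORL implementation.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

def group_relative_advantages(predicted_boxes, target_box):
    """Score each sampled output with a verifiable reward, then standardize within the group."""
    rewards = np.array([iou(b, target_box) for b in predicted_boxes])
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)
```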

5. Performance Benchmarks and Comparative Results

MiMo-VL-7B-SFT is evaluated on a comprehensive open-source benchmark suite covering more than 50 tasks, including image/video understanding, OCR, GUI, multimodal reasoning, and STEM domains.

  • MMMU (val): 64.6 (SFT), 66.7 (RL) vs. 58.6 (Qwen2.5-VL-7B)
  • CharXiv-RQ: 56.5 (RL) vs. 42.5 (Qwen2.5-VL-7B)
  • CountBench: 90.4 (RL) vs. 74.1 (Qwen2.5-VL-7B)
  • OSWorld-G (GUI): 56.1 (RL) vs. 37.5 (Qwen2.5-VL-7B), surpassing UI-TARS
  • OlympiadBench: 59.4 (SFT/RL) vs. 37.2 (Qwen2.5-VL-72B)

On MathVision, CharXiv, MathVerse, and OlympiadBench, MiMo-VL-7B-(SFT/RL) outperforms much larger models, including those up to 72B parameters, and leads among all open-source vision-LLMs. These trends suggest that the four-stage pre-training with extensive reasoning exposure is critical: benchmark performance increases with greater synthetic CoT coverage and longer sequence handling.

6. Significance of Design Choices

  • Four-Stage Pre-Training: Freezing the pretrained encoder and LLM early stabilizes training by keeping noisy gradients from the randomly initialized projector from degrading the pretrained components. The staged introduction of interleaved/complex tasks and large volumes of synthetic CoT data produces reliable deep reasoning and multimodal generalization.
  • Long-Context Capability: Training to 32k tokens allows the model to handle full documents, entire STEM exams, and temporally extended video or GUI navigation.
  • Reward Routing in RL: Modular, normalized reward definition with per-task routing supports efficient cross-domain learning and rapid integration of new feedback modalities; a minimal routing sketch follows this list.
  • Evaluation Suite: Open protocols and prompts provide transparent, reproducible comparison metrics for all major task types.
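
The sketch below illustrates the per-task reward-routing idea; the task names and the reward stubs are hypothetical placeholders standing in for the verifiers and preference models described in Section 4.

```python
from typing import Callable, Dict

def math_reward(prediction: str, reference: str) -> float:
    # Stand-in for a Math-Verify-style rule check: exact match after whitespace stripping.
    return float(prediction.strip() == reference.strip())

def counting_reward(prediction: str, reference: int) -> float:
    # Stand-in for a rule-based count verifier.
    try:
        return float(int(prediction.strip()) == reference)
    except ValueError:
        return 0.0

REWARD_ROUTER: Dict[str, Callable[..., float]] = {
    "math": math_reward,
    "counting": counting_reward,
}

def score(task: str, prediction, reference) -> float:
    """Dispatch a rollout to the normalized reward function registered for its task."""
    return REWARD_ROUTER[task](prediction, reference)
```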

7. Applications and Future Prospects

MiMo-VL-7B-SFT enables practical use cases including cross-modal question answering, document/OCR reasoning, GUI navigation and control, visual agentic planning, and STEM education. The robust reasoning skillset, generalization to diverse task types, and high parameter efficiency position the model as a foundation for both research and applied AI systems in vision-language understanding, document automation, digital agents, and instructional tools.

Anticipated future directions include the mitigation of task interference in MORL (reasoning versus grounding/counting), data expansion for underrepresented modalities (video, logic, robotics), extension to real-world perception-action loops, and the improvement of alignment for diverse end-user groups.

8. Summary

MiMo-VL-7B-SFT is distinguished by its staged, multimodal pre-training, reasoning-optimized backbone, and broad evaluation, consistently setting state-of-the-art results across open-source VLMs. Its architecture and training pipeline demonstrate that compact models, with targeted data curation and methodical reasoning exposure, can outperform much larger alternatives, especially in complex visual, mathematical, and agentic domains. Open-sourcing the model checkpoints and evaluation suite promotes broad adoption and reproducibility for advancing multimodal intelligence.