MiMo-VL-Miloco-7B: Home-Centric Vision-Language Model
- The paper presents a two-stage training pipeline combining supervised fine-tuning and GRPO that significantly improves performance on home-centric tasks while preserving broad multimodal competence.
- The model employs a native-resolution Vision Transformer with integrated chain-of-thought reasoning and token-budget awareness, supporting precise gesture capture and latency-efficient inference.
- Empirical results show state-of-the-art performance, with gains of 4–18 F1 points on home-scenario tasks and over 99% accuracy retention in the quantized GGUF variant.
MiMo-VL-Miloco-7B is an open-source vision-LLM tailored for home-centric multimodal understanding, built on the MiMo-VL-7B backbone, with adaptations for scenario-specific perception, chain-of-thought (CoT) reasoning, and efficient inference. It leverages a two-stage pipeline combining supervised fine-tuning (SFT) and reinforcement learning via Group Relative Policy Optimization (GRPO), enabling state-of-the-art performance in both dedicated home-domain tasks and broad multimodal benchmarks. A quantized version, MiMo-VL-Miloco-7B-GGUF, supports efficient edge deployment with minimal loss in accuracy. Model checkpoints and an evaluation toolkit are publicly available to support reproducibility and further research (Li et al., 19 Dec 2025).
1. Model Architecture and Home-Centric Adaptations
MiMo-VL-Miloco-7B comprises a Vision Transformer (ViT) encoder processing raw video frames, an MLP projector aligning visual embeddings with the LLM input space, and a 7B-parameter LLM backbone initialized from MiMo-VL-7B. Home-centric modifications include the use of a native-resolution ViT for detailed gesture capture, integration of CoT heads in the LLM for explicit intermediate reasoning about home-related activities, and specialized token-budget prompts to facilitate concise, latency-efficient on-device inference.
The multi-modal input pipeline interleaves visual and text modalities as follows (a minimal sketch appears after this list):
- Video input is sampled at 2 FPS (up to 256 frames per sample), with each frame split into non-overlapping 28×28 patches (at most 4096 patches per image) and encoded by the ViT. The resulting visual embeddings are projected into the LLM token space via the MLP projector.
- Textual data, such as instructions or queries, are tokenized with byte-pair encoding (BPE), optionally augmented with special CoT guide markers.
- The final input sequence concatenates visual and text tokens for LLM consumption.
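The following minimal sketch illustrates this pipeline under the stated constraints (2 FPS sampling, a 256-frame cap, 28×28 patches, a 4096-patch budget). The module names (`vit_encoder`, `mlp_projector`, `tokenizer`, `llm_embed`) are placeholders for illustration, not the released implementation:

```python
import torch

PATCH = 28          # patch side length
MAX_PATCHES = 4096  # per-image patch budget
MAX_FRAMES = 256    # per-sample frame budget

def sample_frames(video, fps_in, fps_out=2.0):
    """Subsample a (T, H, W, C) video tensor to ~2 FPS, capped at MAX_FRAMES."""
    stride = max(1, int(round(fps_in / fps_out)))
    return video[::stride][:MAX_FRAMES]

def patchify(frame):
    """Split an (H, W, C) frame into non-overlapping 28x28 patches."""
    H, W, C = frame.shape
    h, w = H // PATCH, W // PATCH
    patches = (frame[: h * PATCH, : w * PATCH]
               .reshape(h, PATCH, w, PATCH, C)
               .permute(0, 2, 1, 3, 4)
               .reshape(h * w, PATCH, PATCH, C))
    return patches[:MAX_PATCHES]

def build_inputs(video, fps_in, prompt, vit_encoder, mlp_projector, tokenizer, llm_embed):
    """Interleave projected visual tokens with BPE text tokens for the LLM.

    All four model components are stand-ins; frames are assumed to share one resolution.
    """
    frames = sample_frames(video, fps_in)
    patch_sets = torch.stack([patchify(f) for f in frames])   # (T, N, 28, 28, C)
    vis_feats = vit_encoder(patch_sets)                        # (T, N, d_vit)
    vis_tokens = mlp_projector(vis_feats)                      # (T, N, d_llm)
    txt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    txt_tokens = llm_embed(txt_ids)                            # (1, L, d_llm)
    # Final sequence: visual tokens first, then text tokens.
    return torch.cat([vis_tokens.flatten(0, 1).unsqueeze(0), txt_tokens], dim=1)
```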
2. Two-Stage Training Pipeline: SFT and GRPO
The training regimen balances home-domain specialization with general multimodal competence through sequential SFT followed by GRPO-based reinforcement learning:
Stage 1: Supervised Fine-Tuning (SFT)
- Training utilizes proprietary datasets of annotated home-scenario videos (five daily activities and five gestures), with each sample labeled and provided with stepwise CoT rationales.
- Token-budget aware prompts and supervision incentivize concise output.
- Additional data from the general MiMo-VL corpus covers VQA, visual grounding, OCR, video QA, and multimodal reasoning.
- The loss consists of cross-entropy on all generated tokens (CoT and answer), with an extra penalty for excessive token generation in token-budget tasks (see the sketch after this list).
- Major hyperparameters: AdamW optimizer, batch size 128, learning rate, and a maximum sequence length of 32,768 tokens.
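A hedged sketch of the SFT loss described above: cross-entropy over CoT and answer tokens plus a length penalty on token-budget tasks. The penalty form and the weight `lambda_len` are assumptions for illustration; the paper's exact formulation is not reproduced here.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, targets, token_budget=None, lambda_len=0.01):
    """Cross-entropy over all supervised tokens (CoT + answer), with an
    optional penalty on tokens generated beyond a per-sample budget.

    logits:       (B, L, V) model outputs
    targets:      (B, L) gold token ids, -100 marking positions to ignore
    token_budget: optional (B,) per-sample budget for token-budget tasks
    """
    ce = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )
    if token_budget is None:
        return ce

    # Count supervised (non-ignored) tokens per sample as a proxy for output
    # length, and penalize only the overshoot beyond the budget.
    gen_len = (targets != -100).sum(dim=1).float()            # (B,)
    overshoot = torch.clamp(gen_len - token_budget.float(), min=0.0)
    return ce + lambda_len * overshoot.mean()
```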
Stage 2: Reinforcement Learning (GRPO)
- GRPO restores and augments general multimodal capacities post-SFT, employing a filtered dataset blend emphasizing temporal grounding, GUI prediction, and STEM-type reasoning.
- The GRPO objective follows the standard group-relative clipped surrogate: for each query $q$, a group of $G$ responses $\{o_i\}_{i=1}^{G}$ is sampled from the old policy and optimized via
  $$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\min\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}\,A_i,\ \mathrm{clip}\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\,1-\epsilon,\,1+\epsilon\right) A_i\right)\right],$$
  with group-normalized advantage
  $$A_i = \frac{r_i - \mathrm{mean}(\{r_1, \dots, r_G\})}{\mathrm{std}(\{r_1, \dots, r_G\})}.$$
- Rewards mix accuracy terms (e.g., IoU or exact match) with format-compliance terms; a minimal code sketch follows this list.
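A minimal sketch of the group-normalized advantage and clipped GRPO surrogate for one sampled group; the clipping range `eps` and the reward mixture are illustrative assumptions rather than the paper's exact settings:

```python
import torch

def grpo_loss(logp_new, logp_old, rewards, eps=0.2):
    """Clipped GRPO surrogate for one query.

    logp_new: (G,) sequence log-probs of the G sampled responses under the
              current policy
    logp_old: (G,) sequence log-probs under the policy that sampled them
    rewards:  (G,) scalar rewards mixing accuracy and format compliance
    """
    # Group-normalized advantage: standardize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    ratio = torch.exp(logp_new - logp_old.detach())
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv

    # Maximize the surrogate => minimize its negative mean over the group.
    return -torch.min(unclipped, clipped).mean()
```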
3. Specialized Reasoning: Chain-of-Thought and Token-Budget Awareness
- Chain-of-Thought (CoT): Home-domain data is systematically annotated with explicit, stepwise rationales. Training examples are prompted by phrases such as “Think step by step:” and supervised via cross-entropy on both reasoning tokens and the answer. This enhances the model’s interpretability and intermediate reasoning for fine-grained activity and gesture classification.
- Token-Budget Awareness: Select tasks employ token-efficient prompts (“Answer concisely:”) and loss penalties for verbosity, directly optimizing for concise outputs conducive to edge-device inference. At inference, this reduces average latency by approximately 30%, with the model defaulting to concise responses unless richer CoT is explicitly requested (an illustrative prompt sketch follows this list).
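The sketch below illustrates the two prompt modes quoted above; the budget-hint wording and the helper name `build_prompt` are assumptions, not the released prompting interface:

```python
def build_prompt(question, mode="concise", budget_tokens=None):
    """Compose a home-scenario query in either CoT or token-budget mode.

    The guide phrases mirror those quoted in the text ("Think step by step:",
    "Answer concisely:"); the budget hint wording is a hypothetical addition.
    """
    if mode == "cot":
        return f"Think step by step: {question}"
    prompt = f"Answer concisely: {question}"
    if budget_tokens is not None:
        # Hypothetical budget hint appended for token-budget-aware decoding.
        prompt += f" (Respond in at most {budget_tokens} tokens.)"
    return prompt

# Example usage:
# build_prompt("What activity is the person doing?", mode="cot")
# build_prompt("Which gesture is shown?", mode="concise", budget_tokens=16)
```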
4. Quantization and Edge Deployment: GGUF Variant
To enable efficient edge deployment, MiMo-VL-Miloco-7B is quantized to the GGUF format, converting transformer weights to 4-bit integers (LSQ). Attention and MLP layers employ NF4 quantization; layernorm and embedding weights retain 16-bit float precision for stability. This reduces the memory footprint from ~28 GB (fp16) to ~7 GB (4-bit) with minimal accuracy loss (≤ 0.5% absolute), and yields a ~20% speedup on a single edge GPU owing to reduced memory-bandwidth pressure. The GGUF variant retains more than 99% of the full-precision model's performance on both domain-specific and general tasks.
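As an illustration of how such a GGUF checkpoint can be served on-device, the sketch below uses llama-cpp-python; the model path, context size, and generation settings are assumptions, and vision input (the multimodal projector) is omitted for brevity:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Hypothetical local path to the 4-bit GGUF checkpoint.
llm = Llama(
    model_path="mimo-vl-miloco-7b-q4.gguf",
    n_ctx=4096,        # context window for the session
    n_gpu_layers=-1,   # offload all layers to the edge GPU if available
)

out = llm(
    "Answer concisely: list three common living-room activities.",
    max_tokens=64,
    temperature=0.2,
)
print(out["choices"][0]["text"])
```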
5. Empirical Results and Benchmark Performance
MiMo-VL-Miloco-7B demonstrates leading performance across home-scenario and general vision-language benchmarks. The following tables summarize select results (Li et al., 19 Dec 2025):
Home-Scenario Activity and Gesture Recognition (per-category F1)
| Model | Watch TV | Reading | Phone | Esports | Workout | OK | Thumbs Up | V Sign | Shaka | Open Palm |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 81.2 | 78.7 | 88.4 | 95.2 | 88.9 | 73.8 | 78.6 | 67.0 | 33.3 | 53.4 |
| Gemini-2.5-Pro | 93.8 | 88.9 | 81.9 | 95.9 | 91.9 | 81.8 | 86.9 | 86.7 | 70.1 | 87.4 |
| Miloco-7B | 98.3 | 90.8 | 90.5 | 99.2 | 96.7 | 86.3 | 90.5 | 90.3 | 88.3 | 87.5 |
| Miloco-7B-GGUF | 97.9 | 90.4 | 90.0 | 98.8 | 96.2 | 85.9 | 90.1 | 89.9 | 87.8 | 87.1 |
Multimodal Video and Language Understanding
| Model | Video-MME | Video-MMMU | Charades-STA | MMMU-Pro (std) | MMMU-Pro (vision) | MMLU-Pro (EM) |
|---|---|---|---|---|---|---|
| Miloco-7B | 68.0 | 63.6 | 46.6 | 55.7 | 47.2 | 68.5 |
| Miloco-7B-GGUF | 67.7 | 63.2 | 46.2 | 55.3 | 46.9 | 68.1 |
Miloco-7B consistently surpasses open-source (Qwen2.5-VL, InternVL3-8B) and closed-source (Gemini-2.5-Pro) baselines on home-scenario tasks (by 4–18 F1 points), and sets new open-source records on generalized multimodal benchmarks (e.g., +6 points on Video-MMMU compared to the prior best). The GGUF variant matches >99% of full-precision accuracy on all tasks.
6. Analysis, Limitations, and Future Directions
SFT yields substantial improvements in household activity and gesture detection, at the cost of only minor forgetting (drops of <3 points) on long-document OCR and some GUI grounding tasks. The subsequent reinforcement learning stage largely restores these general capabilities, minimizing the trade-off.
Minor regressions remain on document-centric tasks (e.g., 1-point reduction on OCRBench). Future research directions include multi-objective tuning for domain balancing, integration of additional modalities (audio, motion sensors), and further quantization/pruning to target sub-5 GB memory footprints on resource-constrained devices.
A plausible implication is that targeted scenario-specialization, when combined with robust recovery via reinforcement learning, allows for high-impact application of vision-LLMs in real-world settings such as the smart home domain, without significant sacrifice to generalization (Li et al., 19 Dec 2025).
7. Relationship to MiMo-VL-7B Family and Broader Research Context
MiMo-VL-Miloco-7B extends the MiMo-VL-7B-RL model family (Team et al., 4 Jun 2025), which pioneered a four-stage, 2.4-trillion-token pre-training pipeline combining massive scale with CoT-focused data mixing and a Mixed On-policy Reinforcement Learning (MORL) regime. MiMo-VL-Miloco-7B adapts this backbone to home environments, refining the CoT annotation methodology and introducing explicit token-budget calibration.
Both MiMo-VL-Miloco-7B and MiMo-VL-7B models adopt group-normalized or on-policy RL objectives, employ verifiable and human preference rewards, and leverage extensive, diverse multi-domain datasets. The research outputs highlight the utility of scenario-specific adaptations, explicit reasoning supervision, and efficient quantization for practical, deployable vision-LLMs (Team et al., 4 Jun 2025, Li et al., 19 Dec 2025).