
MiMo-VL-Miloco-7B-GGUF: Quantized Vision-Language Model

Updated 22 December 2025
  • MiMo-VL-Miloco-7B-GGUF is a quantized vision-language model that combines a ViT encoder and a 7B LLM for efficient home-scenario understanding and multimodal reasoning.
  • It employs a two-stage training strategy using supervised fine-tuning and GRPO-based reinforcement learning to enhance performance on gesture, activity, and reasoning tasks.
  • The GGUF quantization reduces memory footprint and inference latency with minimal accuracy trade-offs, making it ideal for real-time smart-home applications.

MiMo-VL-Miloco-7B-GGUF is a quantized vision-language model tailored for home-scenario understanding and general multimodal reasoning. Developed by Xiaomi and built on the MiMo-VL-7B architecture, it combines a Vision Transformer (ViT) encoder, an MLP projector for cross-modal alignment, and a 7B-parameter LLM. The GGUF variant provides memory-efficient on-device deployment with minimal accuracy trade-offs and supports both video and image tasks. The model demonstrates leading performance on gesture, activity, and multimodal reasoning benchmarks, alongside robust language understanding, and leverages chain-of-thought supervision, token-budget-aware training, and a two-stage pipeline combining supervised fine-tuning with Group Relative Policy Optimization (GRPO) based reinforcement learning (Li et al., 19 Dec 2025).

1. Model Architecture and Data Pipeline

MiMo-VL-Miloco-7B is specialized from the MiMo-VL-7B backbone, inheriting a native-resolution Vision Transformer encoder and an autoregressive LLM of 7B parameters. A lightweight MLP projector is introduced between the ViT and LLM, converting visual features to a compatible latent space. Input modalities include video (up to 256 frames at 2 FPS) or static images (up to 4096 visual patches), tokenized by the ViT and projected to yield a sequence of visual tokens $\{\tilde v_t\}_{t=1}^T$. These are concatenated with byte-pair-encoded text tokens $\{x_i\}_{i=1}^N$ before input to the LLM.
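
A rough illustration of these input budgets is sketched below in Python: which frames to keep when sampling a video at 2 FPS up to the 256-frame cap, and how an image's patch grid might be reduced to stay within 4096 patches. The patch size and the downscaling rule are assumptions for illustration, not details of the released preprocessing pipeline.

```python
# Input-budget sketch; PATCH_SIZE and the shrink rule are assumed, not from the paper.
MAX_FRAMES = 256      # per-video frame cap
SAMPLE_FPS = 2        # frames sampled per second of video
MAX_PATCHES = 4096    # per-image visual-patch cap
PATCH_SIZE = 14       # assumed ViT patch edge length in pixels

def video_frame_indices(duration_s: float, native_fps: float) -> list:
    """Indices of the frames kept for a video of the given duration."""
    n_sampled = min(int(duration_s * SAMPLE_FPS), MAX_FRAMES)
    step = native_fps / SAMPLE_FPS
    return [round(i * step) for i in range(n_sampled)]

def image_patch_grid(height: int, width: int) -> tuple:
    """Patch grid for a native-resolution image, shrunk to fit the cap."""
    rows, cols = height // PATCH_SIZE, width // PATCH_SIZE
    while rows * cols > MAX_PATCHES:        # downscale until within budget
        rows, cols = int(rows * 0.9), int(cols * 0.9)
    return rows, cols

print(video_frame_indices(duration_s=90, native_fps=30)[:5])  # [0, 15, 30, 45, 60]
print(image_patch_grid(2048, 2048))                           # e.g. (60, 60)
```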

The LLM processes the multimodal sequence and produces generative predictions, optionally containing explicit chain-of-thought explanations or direct answers, guided by the prompt during both training and inference. This pipeline facilitates flexible deployment across video, image, and text scenarios while retaining compatibility with a unified architecture (Li et al., 19 Dec 2025).
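
The token path of this section (ViT features passed through an MLP projector and concatenated with text-token embeddings) can be sketched schematically as follows; the module layout, hidden dimensions, and projector depth are illustrative assumptions rather than the released MiMo-VL-Miloco-7B implementation.

```python
# Schematic ViT -> MLP projector -> LLM token path; dimensions are assumed.
import torch
import torch.nn as nn

class MultimodalProjector(nn.Module):
    """Maps ViT patch features into the LLM embedding space."""
    def __init__(self, vit_dim: int = 1152, llm_dim: int = 3584):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vit_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, vit_features: torch.Tensor) -> torch.Tensor:
        return self.proj(vit_features)        # (B, T, llm_dim) visual tokens

def build_llm_inputs(visual_tokens: torch.Tensor,
                     text_embeddings: torch.Tensor) -> torch.Tensor:
    """Concatenate projected visual tokens with text-token embeddings."""
    return torch.cat([visual_tokens, text_embeddings], dim=1)

# One sample with 4096 visual patches and 32 text tokens.
projector = MultimodalProjector()
visual = projector(torch.randn(1, 4096, 1152))
inputs = build_llm_inputs(visual, torch.randn(1, 32, 3584))
print(inputs.shape)                           # torch.Size([1, 4128, 3584])
```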

2. Training Strategy: Supervised Fine-Tuning and GRPO Reinforcement Learning

The model is trained in a two-stage process:

Stage 1: Supervised Fine-Tuning (SFT)

Fine-tuning uses proprietary home-scenario datasets, including videos of daily activities (e.g., watching TV, playing on the phone) and gestures (e.g., thumbs up, OK, shaka sign), as well as broader multimodal corpora covering VQA, object grounding, OCR, and STEM reasoning. Each example is paired with either a chain-of-thought (CoT) rationale or a token-budget-constrained direct-answer prompt. Training uses standard cross-entropy on all generated tokens with AdamW (batch size 128, learning rate $1\times10^{-5}$, sequence length up to 32,768, warmup ratio 0.03).
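
A minimal runnable sketch of this SFT configuration is given below, with a small linear layer standing in for the 7B model so the snippet executes; only the stated hyperparameters (AdamW, learning rate $1\times10^{-5}$, batch size 128, warmup ratio 0.03) come from the description above, while the total step count, scheduler shape, and vocabulary size are assumptions.

```python
# Stage-1 SFT sketch: AdamW, lr 1e-5, batch 128, warmup ratio 0.03.
# The tiny stand-in model, step count, and vocab size are assumptions.
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = nn.Linear(16, 32_000)                 # stand-in for the 7B VLM head

LR, BATCH, WARMUP_RATIO, TOTAL_STEPS = 1e-5, 128, 0.03, 1_000
optimizer = AdamW(model.parameters(), lr=LR)
warmup_steps = max(1, int(WARMUP_RATIO * TOTAL_STEPS))
scheduler = LambdaLR(optimizer, lambda s: min(1.0, (s + 1) / warmup_steps))

# Cross-entropy over generated tokens only; prompt/visual positions would be
# masked out with the ignore index.
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

for step in range(3):                         # toy loop over a few batches
    hidden = torch.randn(BATCH, 16)           # stands in for LLM hidden states
    labels = torch.randint(0, 32_000, (BATCH,))
    loss = loss_fn(model(hidden), labels)
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```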

Stage 2: Reinforcement Learning via Group Relative Policy Optimization (GRPO)

To compensate for SFT-induced declines in generalization (notably on temporal grounding, GUI, and STEM reasoning), the model undergoes RL with GRPO. For each query $q$, a group of output sequences $\{o_i\}_{i=1}^G$ is sampled from the old policy. The importance ratio $r_i(\theta)$ and the group-normalized advantage $\hat A_i$ are computed as:

$$r_i(\theta)=\frac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\rm old}}(o_i\mid q)}$$

$$\hat A_i = \frac{r_i^{\text{raw}}-\bar r}{\mathrm{std}(\{r_j^{\text{raw}}\})}$$ where $r_i^{\text{raw}}$ is the raw reward of output $o_i$ and $\bar r$ is its mean over the group. The clipped GRPO objective is then maximized:

$$J_{\rm GRPO}(\theta) = \mathbb{E}_{q,\{o_i\}\sim\pi_{\theta_{\rm old}}} \left[ \frac{1}{G}\sum_{i=1}^{G} \min\left( r_i(\theta)\hat A_i,\ \mathrm{clip}(r_i(\theta),1-\epsilon,1+\epsilon)\hat A_i \right) - \beta\, D_{\rm KL}\left(\pi_\theta(\cdot\mid q)\,\|\,\pi_{\rm ref}(\cdot\mid q)\right) \right]$$

Multi-part rewards include task accuracy (e.g., 1D/2D IoU for grounding, exact match for math Q&A) and answer-format compliance, weighted by $\lambda_{\rm acc}=0.9$ and $\lambda_{\rm fmt}=0.1$, enabling efficient task generalization (Li et al., 19 Dec 2025).
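
The sketch below turns the objective and reward weighting above into a small PyTorch computation over one group of rollouts; the sequence-level log-probabilities, group size, KL estimator, and clipping/KL coefficients are toy assumptions for illustration, not the released training code.

```python
# GRPO sketch: group-normalized advantages, clipped ratio surrogate, KL penalty,
# and the stated reward weighting (lambda_acc = 0.9, lambda_fmt = 0.1).
import torch

def total_reward(acc_reward, fmt_reward, lam_acc=0.9, lam_fmt=0.1):
    """Accuracy term (e.g. IoU or exact match) plus format compliance."""
    return lam_acc * acc_reward + lam_fmt * fmt_reward

def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.01):
    """logp_*: sequence log-probs for one group of G sampled outputs."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # \hat A_i
    ratio = torch.exp(logp_new - logp_old)                      # r_i(theta)
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
    # One common low-variance estimate of the KL toward the reference policy.
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1
    return -(surrogate - beta * kl).mean()                      # maximize J

G = 8                                                           # group size
logp_new = torch.randn(G, requires_grad=True)
logp_old, logp_ref = logp_new.detach() + 0.1, logp_new.detach()
rewards = total_reward(torch.rand(G), torch.randint(0, 2, (G,)).float())
print(grpo_loss(logp_new, logp_old, logp_ref, rewards))
```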

3. Chain-of-Thought Supervision and Token-Budget Awareness

Every home-scenario fine-tuning example is annotated either with a stepwise CoT explanation followed by the label or with a succinct answer constrained by a token budget. This dual supervision regime enables the model to internalize explicit logical reasoning while also optimizing for concise outputs. Inference can toggle between verbose CoT (for traceability) and direct answers (for low-latency applications).
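
A hypothetical prompt-building helper like the one below shows how the two modes could be toggled at inference time; the instruction strings and the 64-token default budget are invented for illustration and are not the released prompts.

```python
# Hypothetical toggle between verbose CoT and budgeted direct answers.
def build_prompt(question: str, use_cot: bool, token_budget: int = 64) -> dict:
    if use_cot:
        instruction = "Think step by step, then state the final answer."
        max_new_tokens = 1024                 # room for the rationale
    else:
        instruction = f"Answer directly in at most {token_budget} tokens."
        max_new_tokens = token_budget
    return {"prompt": f"{instruction}\n{question}",
            "max_new_tokens": max_new_tokens}

print(build_prompt("What gesture is the person making?", use_cot=False))
```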

This training regimen supports interpretability for downstream applications that require reasoning explanations, while permitting edge deployments where resource constraints dictate minimal response latency (Li et al., 19 Dec 2025).

4. Quantization (GGUF) and Real-World Deployment

The GGUF variant quantizes the LLM weights (typically blockwise Int4 or Int8) while preserving the visual encoder in FP16. The quantized model reduces memory footprint from approximately 14 GB (FP16) to 4 GB, with a 30–40% reduction in end-to-end inference latency, at a cost of ≤1 point drop in F1 on gesture benchmarks and ≤0.5% on MMMU-Pro accuracy.
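
To make the blockwise idea concrete, the NumPy sketch below applies per-block absmax Int8 quantization to a weight vector and reports the resulting size reduction; the 32-element block size and this simple scheme mirror the spirit of GGUF quantization but are assumptions, not the actual GGUF kernels.

```python
# Blockwise absmax Int8 quantization sketch (block size of 32 is assumed).
import numpy as np

def quantize_blockwise_int8(w: np.ndarray, block: int = 32):
    """Quantize a 1-D float weight vector to int8 with one scale per block."""
    pad = (-len(w)) % block
    w_blocks = np.pad(w, (0, pad)).reshape(-1, block)
    scales = np.abs(w_blocks).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0                        # avoid divide-by-zero
    q = np.round(w_blocks / scales).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize(q: np.ndarray, scales: np.ndarray, n: int) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(-1)[:n]

w = np.random.randn(10_000).astype(np.float32)
q, s = quantize_blockwise_int8(w)
w_hat = dequantize(q, s, len(w))
print("max abs error:", float(np.abs(w - w_hat).max()))
print("bytes fp16 -> int8+scales:",
      w.astype(np.float16).nbytes, "->", q.nbytes + s.nbytes)
```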

This enables deployment on memory-constrained devices without substantial loss of inference quality, supporting real-time smart-home scenarios and broader accessibility (Li et al., 19 Dec 2025).

5. Benchmark Performance and Evaluation

MiMo-VL-Miloco-7B demonstrates leading results across home-scenario, video, and language benchmarks:

| Task / Benchmark | MiMo-VL-Miloco-7B | Baseline (Gemini-2.5 / Qwen2.5 / Gemma-3) | MiMo-VL-SFT |
|---|---|---|---|
| Watch TV (F1) | 98.3 | 93.8 | 86.0 |
| Shaka Sign (F1) | 88.3 | 70.1 | 67.4 |
| MMMU-Pro (std) | 55.7 | 37.8 (Gemma-3) | 43.0 |
| Video-MMMU | 63.6 | 47.4 (Qwen2.5) | 57.6 |
| Charades-STA | 46.6 | 43.6 | 44.3 |
| MMLU-Pro (EM) | 68.5 | 48.7 (Qwen2.5) | 67.1 |
| ScreenSpot | 89.8 | 84.7 (Qwen2.5) | 89.5 |
| DocVQA | 95.2 | – | 95.5 |
| MATH500 (Pass@1) | 95.2 | – | 96.8 |

Improvement is most pronounced on gestures (up to +18 F1 over the baseline), daily-activity recognition, and multimodal reasoning tasks (e.g., +6 points on Video-MMMU after RL), with only a minor regression of 1–2 points on document/OCR tasks (Li et al., 19 Dec 2025).

6. Trade-offs, Limitations, and Future Directions

Home-scenario specialization introduces notable accuracy gains for domestic activities and gestures, while naïve supervised fine-tuning may marginally impact document-centric and general reasoning benchmarks. RL with GRPO substantially mitigates these gaps for video and GUI grounding.

The integration of chain-of-thought and token-budget-aware supervision enables a functional balance between interpretability and inference efficiency. Known limitations include the documented minor trade-offs on document/OCR-dominated tasks and the current absence of audio or mm-wave input modalities.

Proposed directions for enhancement include extending multi-modal input coverage, exploring more aggressive compression, and deploying multi-objective RL to further close remaining performance gaps outside the home-scenario domain (Li et al., 19 Dec 2025).

7. Resources and Reproducibility

Model checkpoints—including MiMo-VL-Miloco-7B and the GGUF-quantized variant—along with an extensive home-scenario evaluation suite, are publicly released at https://github.com/XiaoMi/xiaomi-mimo-vl-miloco. This facilitates research and application in smart-home environments, promoting reproducibility and benchmarking for the broader vision-language modeling community (Li et al., 19 Dec 2025).

References

Li et al., 19 Dec 2025.
