
MobileVLM V2: Efficient On-Device VLM

Updated 25 February 2026
  • MobileVLM V2 is a family of vision-language models designed for efficient on-device reasoning that achieves high performance with low memory and compute requirements.
  • The methodology employs a fixed CLIP ViT-L/14 encoder, a token-reducing LDPv2 module, and adaptable language models, enhanced by a two-stage training and distillation process.
  • Empirical results show that MobileVLM V2 attains strong benchmark scores and scalable throughput, enabling practical deployment on mobile and edge devices.

MobileVLM V2 is a family of vision-language models (VLMs) optimized for efficient on-device reasoning under resource constraints. It demonstrates that careful architectural optimization, scalable training strategies, and high-quality data curation produce lightweight models whose performance matches or surpasses that of larger systems. MobileVLM V2 achieves strong benchmark results while maintaining low memory and inference requirements, making it suitable for deployment on mobile and edge devices (Chu et al., 2024).

1. Model Architecture

MobileVLM V2 implements a modular vision-language pipeline, consisting of a frozen vision encoder, an efficient vision projector, and a compact LLM. The system is tailored for hardware efficiency, token reduction, and scalable throughput.

  • Vision Encoder: A frozen CLIP ViT-L/14 backbone (303M parameters) extracts image features. For a 336×336 RGB image, it produces $N_v = (336/14)^2 = 576$ patch embeddings $f_v \in \mathbb{R}^{576 \times D_v}$ via $f_v = F_{\text{enc}}(X_v)$, where $D_v$ is the embedding dimension.
  • Vision Projector (LDPv2): The LDPv2 module compresses the visual tokens from 576 to 144 through two point-wise convolution layers, a GELU activation, and a $2\times2$ average-pooling layer:

$f_0 = \text{PW}_2(\text{GELU} \circ \text{PW}_1(f_v)), \quad f_1 = \text{AvgPool}_{2\times2}(f_0), \quad H_v = f_1 + \text{DW}(f_1)$

A lightweight depth-wise positional encoding generator (PEG, ≈0.02M parameters) enhances local positional information.

  • LLM: MobileLLaMA variants (1.4B, 2.7B) or Vicuna-7B serve as the text decoder. The architecture supports scaling by swapping in larger LLMs without modifying the visual pipeline.
  • Cross-Modal Fusion: The visual tokens $H_v$ and text tokens $H_q$ are concatenated and fed into the decoder-only Transformer. The cross-modal attention matrix at each layer is lower-triangular in block form, enabling vision self-attention, text self-attention, and text-to-vision cross-attention.
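This block-wise attention pattern can be sketched with a small helper. This is an illustrative sketch only (`build_causal_mask` is not part of the released code), assuming the standard decoder-only convention of visual tokens preceding text tokens:

```python
def build_causal_mask(n_vision: int, n_text: int) -> list[list[int]]:
    """Lower-triangular mask over the concatenated [H_v; H_q] sequence.

    Position i may attend to position j iff j <= i, so the bottom-left
    block (text rows, vision columns) is all ones: every text token sees
    every visual token, while each modality is causal within itself.
    """
    n = n_vision + n_text
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

mask = build_causal_mask(4, 3)
# text rows (last 3) attend to all 4 visual tokens:
assert all(mask[i][j] == 1 for i in range(4, 7) for j in range(4))
# no token attends to a future position:
assert mask[0][1] == 0
```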

Only the LLM and projector parameters are updated during training; the vision encoder remains fixed.
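The LDPv2 computation above can be sketched in NumPy. This is a minimal sketch under simplifying assumptions (random weights, an arbitrary embedding width, zero padding for the depth-wise PEG), not the released implementation:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ldpv2(f_v, W1, b1, W2, b2, dw_kernel):
    """Sketch of LDPv2: two point-wise convs with GELU, 2x2 average
    pooling (576 -> 144 tokens), and a depth-wise 3x3 conv (the PEG)
    added as a residual."""
    n, d = f_v.shape
    g = int(np.sqrt(n))                          # 24x24 token grid
    # point-wise convolutions act as per-token linear maps
    f0 = gelu(f_v @ W1 + b1) @ W2 + b2
    # 2x2 average pooling over the spatial grid
    f0 = f0.reshape(g, g, d)
    f1 = f0.reshape(g // 2, 2, g // 2, 2, d).mean(axis=(1, 3))
    # depth-wise 3x3 conv with zero padding, residual-added (PEG)
    pad = np.pad(f1, ((1, 1), (1, 1), (0, 0)))
    peg = np.zeros_like(f1)
    for i in range(3):
        for j in range(3):
            peg += pad[i:i + g // 2, j:j + g // 2] * dw_kernel[i, j]
    return (f1 + peg).reshape(-1, d)

D = 32                                           # toy embedding width
rng = np.random.default_rng(0)
f_v = rng.standard_normal((576, D))
H_v = ldpv2(f_v, rng.standard_normal((D, D)), np.zeros(D),
            rng.standard_normal((D, D)), np.zeros(D),
            rng.standard_normal((3, 3, D)))
assert H_v.shape == (144, D)   # 576 visual tokens reduced to 144
```

The pooling step is what delivers the 4× token reduction; the depth-wise residual restores local positional cues lost in pooling.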

2. Training Scheme and Distillation

MobileVLM V2 employs a two-stage training process, freezing the vision encoder at all times and tuning both the projector and LLM jointly. This training approach leads to significant improvements over previous paradigms, which typically froze the LLM during pre-training.

  • Stage 1: Pre-training
    • Dataset: 1.2M ShareGPT4V-PT image–caption pairs
    • Objective: next-token autoregressive cross-entropy loss
    • AdamW optimizer, projector learning rate $10^{-3}$, LLM learning rate $2 \times 10^{-5}$, 3% warm-up, cosine decay, batch size 256, 5K steps
  • Stage 2: Multi-task Instruction Tuning
    • Dataset: 2.4M samples from 8 vision-language tasks, uniformly mixed
    • Objective: continued next-token loss, learning rate $4 \times 10^{-5}$
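The Stage 1 schedule (3% linear warm-up followed by cosine decay over 5K steps) can be sketched as below. Decaying to zero is an assumption; the paper does not state the final learning-rate floor:

```python
import math

def lr_at(step, total_steps=5000, base_lr=1e-3, warmup_frac=0.03):
    """Warm-up + cosine-decay schedule for the Stage 1 projector LR.

    Linear warm-up over the first 3% of steps, then cosine decay
    (assumed here to reach zero at the final step)."""
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = (step - warmup) / (total_steps - warmup)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

assert abs(lr_at(149) - 1e-3) < 1e-12   # peak LR at end of 150-step warm-up
assert lr_at(5000) < 1e-8               # fully decayed at the final step
```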

Align-KD, a cross-modal distillation method (Feng et al., 2024), further transfers alignment knowledge from a 7B teacher to a 1.7B student by incorporating several loss terms:

  • First-layer cross-modal alignment: Mean-squared error (MSE) between the student and teacher text→vision attention matrices from layer 1 ($L_\text{align}$).
  • Vision projection matching: MSE losses for matching student and teacher post-projection vision embeddings (all tokens and text-attended top-K tokens, $L_{V\text{-all}}$ and $L_{V\text{-focus}}$), weighted in combination.
  • Reverse-KLD on token logits: MiniLLM-style reverse Kullback–Leibler divergence to match output distributions ($L_\text{RKLD}$).
  • Total Loss: $L = L_\text{Sup} + L_\text{align} + L_V + L_\text{RKLD}$, with the standard next-token prediction loss ($L_\text{Sup}$) included.

Align-KD delivers +1.4 to +2.0 average benchmark gain, particularly improving compositional and reasoning skills, with a negligible inference overhead (Feng et al., 2024).

3. Data Curation and Benchmark Tasks

MobileVLM V2 is trained and validated using rigorously curated heterogeneous datasets:

  • Pre-training data: ShareGPT4V-PT (1.2M), consisting of GPT-4V expanded captions for diverse visual concepts, object relationships, and world-knowledge attributes.
  • Instruction tuning data: 2.4M samples balanced across eight vision-language tasks, drawing on Visual Dialog, TextVQA, VSR, VIGC, IConQA, SQA, COCO, SBU, and ShareGPT4V-mixed. Benchmarks are selected to emphasize VQA, captioning, reasoning, and OCR.
  • Data cleaning: SBU is manually refined; GPT-4V verifies ShareGPT4V; task balancing ensures comprehensive skill coverage.

Evaluation encompasses six public benchmarks: GQA, ScienceQA-IMG ($\text{SQA}^I$), TextVQA ($\text{VQA}^T$), POPE (object hallucination), MME-Perception ($\text{MME}^P$), and MMBench-dev.

4. Empirical Performance

MobileVLM V2 exhibits favorable trade-offs between model scale, accuracy, and computational efficiency, consistently outperforming alternatives of similar or much larger size.

| Model Variant | GQA | SQA$^I$ | VQA$^T$ | POPE | MME$^P$ (scaled) | MMB-dev | Avg. |
|---|---|---|---|---|---|---|---|
| MobileVLM 1.7B (orig) | 56.1 | 57.3 | 41.5 | 84.5 | 59.8 | 53.2 | 58.7 |
| MobileVLM V2 1.7B | 59.3 | 66.7 | 52.1 | 84.3 | 65.1 | 57.7 | 64.2 |
| MobileVLM V2 3B | 61.1 | 70.0 | 57.5 | 84.7 | 72.0 | 63.2 | 68.1 |
| MobileVLM V2 7B | 62.6 | 74.8 | 62.3 | 85.3 | 78.0 | 69.2 | 72.1 |
| MoE-LLaVA-2.7B×4 | – | – | – | – | – | – | 66.7 |

MobileVLM V2 outperforms MoE-LLaVA-2.7B×4 by +1.4 points at the 3B scale, despite being single-expert and more efficient to deploy (Chu et al., 2024).

Align-KD further improves the 1.7B student model’s average benchmark score from 62.4 to 64.4 (Short prompt), and from 63.7 to 65.1 (Long prompt). Incremental ablations show each KD loss (reverse-KLD, shallow cross-attention, vision projection) adds 0.6 points, with first-layer text→vision attention distillation being most effective (Feng et al., 2024).

5. Efficiency and On-Device Deployment

MobileVLM V2 is designed with the operational constraints of edge hardware in mind:

  • Token Reduction: LDPv2 reduces the number of vision tokens by $4\times$ (576 → 144), shrinking memory and compute.
  • Compact Projector: LDPv2 has 6.32M parameters versus ≈19M in earlier MobileVLM versions.
  • Throughput: On an NVIDIA A100 (batch 1, 256-token sequence), MobileVLM V2 1.7B achieves 37.4 tokens/s, 3B achieves 28.97 tokens/s, MoE-LLaVA 2.7B×4 yields 17.6 tokens/s. This results in 1.65×–2× higher throughput for MobileVLM V2 (Chu et al., 2024).
  • Mobile Hardware: Jetson Orin + 4-bit quantization yields
    • 1.7B: 64.2 avg. accuracy, 51.63 tokens/s, 4.96s per 256-token completion
    • 3B: 68.1, 30.80 tokens/s, 8.31s
    • 7B: 72.1, 15.49 tokens/s, 16.53s
    • The 7B variant is 20% faster than prior LLaVA-1.5/ShareGPT4V 7B, while being more accurate.

Student models (1.7B) require ≈6GB in fp16 and benefit from post-training quantization and pruning for further deployment gains.
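The per-completion latencies above follow directly from the reported throughputs. A trivial helper, assuming steady decode throughput over a fixed-length completion:

```python
def completion_time(tokens: int, tokens_per_s: float) -> float:
    """Decode latency for a fixed-length completion at steady throughput."""
    return tokens / tokens_per_s

# Reproduces the Jetson Orin numbers reported above (4-bit, 256 tokens):
assert round(completion_time(256, 51.63), 2) == 4.96   # 1.7B
assert round(completion_time(256, 30.80), 2) == 8.31   # 3B
assert round(completion_time(256, 15.49), 2) == 16.53  # 7B
```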

6. Distillation Strategies: Align-KD Loss Design

Align-KD introduces lightweight, targeted distillation losses for cross-modal alignment:

  • Shallow-Layer Text-to-Vision Attention Alignment: $L_\text{align} = \text{MSE}(P_\text{attn}(A_{1,t\rightarrow v}^T), A_{1,t\rightarrow v}^S)$, using a 1×1 convolutional adapter to match channel dimensions. Only the text→vision slice of the first layer's attention is distilled, as ablations indicate this is optimal; distilling vision self-attention degrades performance.
  • Vision-Projection Losses: After LDP, both “all-token” (LV-allL_{V\text{-all}}) and “focused-token” (LV-focusL_{V\text{-focus}}) losses are computed, with the focus determined by the teacher’s attention distribution over visual tokens.
  • Reverse-KLD on Output: $L_\text{RKLD} = \sum_y p_S(y) \log \frac{p_S(y)}{p_T(y)}$ matches next-token distributions.

This framework imposes only ≈5% overhead at inference. The final loss for the student model is

$L = L_\text{Sup} + L_\text{align} + L_V + L_\text{RKLD}$

where $L_\text{Sup}$ is the standard supervision loss. No architectural modification of the student is required.
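A minimal sketch of the combined objective, assuming flattened activations and unit per-term weights (the paper's actual weighting and adapter are not reproduced here):

```python
import math

def mse(a, b):
    """Mean-squared error over flattened activations."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def reverse_kld(p_student, p_teacher):
    """Reverse KL divergence KL(p_S || p_T) between next-token distributions."""
    return sum(ps * math.log(ps / pt)
               for ps, pt in zip(p_student, p_teacher) if ps > 0)

def align_kd_loss(l_sup, attn_s, attn_t, vis_s, vis_t, p_s, p_t):
    """L = L_Sup + L_align + L_V + L_RKLD (per-term weights omitted)."""
    return (l_sup
            + mse(attn_s, attn_t)       # first-layer text->vision attention
            + mse(vis_s, vis_t)         # post-projection vision embeddings
            + reverse_kld(p_s, p_t))    # output-distribution matching

p = [0.7, 0.2, 0.1]
assert reverse_kld(p, p) == 0.0         # identical distributions cost nothing
assert align_kd_loss(1.0, [0.0] * 4, [0.0] * 4,
                     [0.0] * 8, [0.0] * 8, p, p) == 1.0
```

Reverse KL (rather than forward KL) makes the student mode-seeking: it is penalized for placing mass where the teacher places little, which suits matching a stronger teacher's confident predictions.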

7. Significance and Implications

MobileVLM V2 advances the state of compact vision-language reasoning for edge devices through concerted model, training, and data enhancements. Its efficient, hardware-friendly design and cross-modal distillation strategy enable both higher accuracy and lower latency than previous systems at equal or smaller scale. The methodology enables scaling to larger LLMs as compute budgets allow, with performance improving monotonically as LLM size increases. MobileVLM V2 and Align-KD constitute a practical blueprint for future on-device VLM research, showing that high-quality cross-modal coupling can be retained in sub-2B parameter regimes with strategic distillation and data curation (Chu et al., 2024, Feng et al., 2024).
