HyperVL: On-Device Multimodal LLM
- HyperVL is a compact multimodal LLM designed for resource-constrained edge devices with adaptive vision-language processing.
- It employs a Visual Resolution Compressor and dual Vision Transformer branches to optimize image tiling and reduce memory usage.
- Quantization and serial tile processing yield a 12.9× latency speedup while maintaining high accuracy in OCR and document tasks.
HyperVL is a compact (<2B parameters) multimodal LLM designed for real-time, on-device inference in resource-constrained environments. Focusing on efficient vision-language understanding, HyperVL introduces a combination of architectural and training innovations—including a Visual Resolution Compressor (VRC), dual Vision Transformer (ViT) branches, and Dual Consistency Learning (DCL)—to achieve cloud-comparable perceptual and reasoning performance within the memory, latency, and power envelopes of modern mobile NPUs (Team et al., 16 Dec 2025).
1. System Architecture and Image Processing Pipeline
The HyperVL system is structured around four main components: the Visual Resolution Compressor (VRC), two Vision Transformer encoders (a SigLIP2-Base and a SigLIP2-Large variant), a vision-language projector, and a shared Qwen3-1.7B LLM. The input image processing flow includes AnyRes scaling (for aspect-ratio preservation), zero-padded resizing to a multiple of the ViT patch size, and partitioning into non-overlapping tiles (e.g., 224×224). Each tile is processed serially to strictly cap peak memory usage on edge NPUs, thereby avoiding the quadratic memory scaling with image resolution that full-image ViT inference typically incurs.
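As a concrete illustration of the pad-and-tile step, the following NumPy sketch zero-pads an image to a multiple of the tile size and cuts it into non-overlapping 224×224 tiles. It is a minimal sketch under stated assumptions (NumPy HWC arrays, AnyRes scaling and normalization omitted), not the released preprocessing code.

```python
import numpy as np

TILE = 224  # tile edge length used in the text

def pad_and_tile(image: np.ndarray, tile: int = TILE) -> list[np.ndarray]:
    """Zero-pad an H x W x C image up to a multiple of `tile`, then split it
    into non-overlapping tile x tile crops (row-major order)."""
    h, w, _ = image.shape
    ph = (tile - h % tile) % tile   # rows of zero padding needed
    pw = (tile - w % tile) % tile   # cols of zero padding needed
    padded = np.pad(image, ((0, ph), (0, pw), (0, 0)), mode="constant")
    tiles = []
    for y in range(0, padded.shape[0], tile):
        for x in range(0, padded.shape[1], tile):
            tiles.append(padded[y:y + tile, x:x + tile])
    return tiles

# Example: a 1024x768 RGB image yields ceil(1024/224) * ceil(768/224) = 5 * 4 = 20 tiles.
tiles = pad_and_tile(np.zeros((1024, 768, 3), dtype=np.uint8))
assert len(tiles) == 20 and tiles[0].shape == (224, 224, 3)
```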
The inference procedure follows:
- The VRC predicts the optimal downsampling ratio $r^{*}$, balancing input quality and accuracy constraints.
- The image is resized by $r^{*}$, padded, and split into tiles.
- Each tile is encoded by the selected ViT branch.
- Visual tokens are projected and compressed via an MLP with pixel-shuffle reduction.
- Visual tokens are concatenated with text tokens and passed to the shared LLM for output generation.
This modular and hardware-aware pipeline enforces localized memory and computation by operating on one tile per step, thus avoiding large DDR–VTCM memory transfers typical in large vision transformers.
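The sketch below makes the serial-tile encoding and pixel-shuffle token compression concrete in PyTorch. The projector widths, the 2×2 shuffle factor, and the 14×14 token grid per 224-pixel tile are illustrative assumptions, and nn.Identity() stands in for the SigLIP2 tile encoder; this is not the released implementation.

```python
import torch
import torch.nn as nn

class PixelShuffleProjector(nn.Module):
    """Merge each 2x2 neighborhood of visual tokens into one token (4x fewer tokens),
    then project into the LLM embedding space with a small MLP. Dimensions are
    illustrative, not taken from the released model."""
    def __init__(self, vit_dim: int = 768, llm_dim: int = 2048, shuffle: int = 2):
        super().__init__()
        self.shuffle = shuffle
        self.mlp = nn.Sequential(
            nn.Linear(vit_dim * shuffle * shuffle, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, grid*grid, vit_dim) patch tokens for one square tile
        b, n, d = tokens.shape
        g, s = int(n ** 0.5), self.shuffle
        assert g % s == 0, "token grid must be divisible by the shuffle factor"
        x = tokens.reshape(b, g, g, d)
        # group each s x s spatial neighborhood and concatenate its channels
        x = x.reshape(b, g // s, s, g // s, s, d).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(b, (g // s) * (g // s), s * s * d)
        return self.mlp(x)

# Serial tile processing: encode one tile at a time to cap peak activation memory.
vit = nn.Identity()                     # stand-in for the SigLIP2 tile encoder
proj = PixelShuffleProjector()
visual_tokens = []
for tile in [torch.randn(1, 196, 768) for _ in range(4)]:   # 14x14 tokens per 224px tile
    visual_tokens.append(proj(vit(tile)))                   # 196 -> 49 tokens per tile
visual_tokens = torch.cat(visual_tokens, dim=1)             # prepended to text tokens downstream
```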
2. Visual Resolution Compressor (VRC)
The VRC is a MobileNet-based neural network plug-in, trained to predict, for any input image $x$, the maximum allowable downsampling ratio $r^{*}$ such that the LLM’s cross-entropy loss increases by no more than a fixed threshold $\epsilon$. Supervision is generated by evaluating multiple compressed versions $x_r$ of the original image, computing the loss increase $\Delta\mathcal{L}(r) = \mathcal{L}(x_r) - \mathcal{L}(x)$ for each candidate ratio $r$, and defining

$$r^{*} = \max\{\, r : \Delta\mathcal{L}(r) \le \epsilon \,\},$$

i.e., the strongest compression whose loss increase stays within the threshold.
For training, $r^{*}$ is converted into a compression label normalized with respect to the constant VRC input size and the original image dimensions. The training objective is the mean squared error between the predicted and target labels.
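A plain-Python sketch of how such supervision labels could be generated follows; loss_fn, the candidate ratio grid, and eps are hypothetical stand-ins for the frozen-MLLM evaluation described above, not the paper’s exact procedure.

```python
def vrc_supervision_label(loss_fn, image, ratios=(4.0, 3.0, 2.0, 1.5, 1.0), eps=0.05):
    """Generate the VRC training target for one image.

    loss_fn(image, r) is a hypothetical callable returning the frozen MLLM's
    cross-entropy loss when the image is downsampled by a factor r before encoding
    (r = 1.0 means the original resolution). The candidate grid and eps are
    illustrative values, not the paper's."""
    base_loss = loss_fn(image, 1.0)
    for r in ratios:                               # strongest compression first
        if loss_fn(image, r) - base_loss <= eps:   # loss increase within the threshold
            return r                               # r* = maximum allowable downsampling
    return 1.0                                     # fall back to the original resolution
```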
At inference, VRC yields average reductions of over 20% in vision token count (up to ≈63% for simple documents) while maintaining approximately 98.7% of baseline accuracy. The module adds <2 ms computational overhead and is quantizable for deployment in mobile scenarios.
3. Dual Consistency Learning (DCL) Framework
HyperVL integrates two parallel ViT encoding branches: a base (SigLIP2-Base, 93M parameters) and a large (SigLIP2-Large, 300M parameters). Both are trained to share a single Qwen3-1.7B LLM “head.” DCL enables dynamic runtime switching between branches depending on latency or power constraints.
Two core strategies underpin DCL:
- Alternating Dual-Branch Training: Each batch activates only one ViT branch (alternating between base and large). The standard image-text next-token cross-entropy loss is computed on the text tokens: $\mathcal{L}_{\mathrm{CE}} = -\sum_{t} \log p_{\theta}(y_t \mid y_{<t}, \mathbf{v})$, where $\mathbf{v}$ denotes the visual tokens produced by the active branch.
- Semantic Consistency Distillation: The large-branch (teacher, $p_{\mathrm{L}}$) next-token distributions serve as soft targets for the small-branch (student, $p_{\mathrm{B}}$), optimized via the KL divergence $\mathcal{L}_{\mathrm{KL}} = \sum_{t} \mathrm{KL}\!\left(p_{\mathrm{L}}(\cdot \mid y_{<t}, \mathbf{v}) \,\|\, p_{\mathrm{B}}(\cdot \mid y_{<t}, \mathbf{v})\right)$.
DCL is applied to text tokens only; the combined objective for small-branch batches is $\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda\, \mathcal{L}_{\mathrm{KL}}$, where $\lambda$ weights the distillation term.
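A hedged PyTorch sketch of this combined objective is given below; the masking convention, the absence of a distillation temperature, and the weighting lam are assumptions for illustration.

```python
import torch.nn.functional as F

def dcl_loss(student_logits, teacher_logits, labels, text_mask, lam=1.0):
    """Small-branch objective: next-token cross-entropy plus a KL term distilling the
    large-branch (teacher) distribution, applied to text tokens only.

    student_logits, teacher_logits: (batch, seq, vocab)
    labels: (batch, seq) target token ids;  text_mask: (batch, seq) bool, True on text tokens
    """
    vocab = student_logits.size(-1)
    ce = F.cross_entropy(student_logits.view(-1, vocab), labels.view(-1), reduction="none")
    ce = (ce * text_mask.view(-1)).sum() / text_mask.sum()

    log_p_student = F.log_softmax(student_logits, dim=-1)
    p_teacher = F.softmax(teacher_logits.detach(), dim=-1)               # teacher soft targets
    kl = F.kl_div(log_p_student, p_teacher, reduction="none").sum(-1)    # KL(teacher || student)
    kl = (kl * text_mask).sum() / text_mask.sum()

    return ce + lam * kl
```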
DCL allows the small branch to closely track the semantic performance of the larger branch, in practice resulting in a gain of up to 22 points on OCRBench and 5 points on AI2D for the base model.
4. Training Procedure and Quantized On-Device Inference
HyperVL’s pretraining sequence involves three main stages: alignment, knowledge enhancement, and multitask learning. Each stage employs a specified mix of frozen and unfrozen modules and learning rates, following the original protocol. During training, one of the two ViT branches (base or large) is randomly activated for each batch, as in the alternating scheme above.
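A minimal sketch of the alternating-branch update, reusing dcl_loss from the previous sketch; the module interfaces (projector, llm, the batch keys) are placeholders, not the actual training code.

```python
import random
import torch

def train_step(batch, base_vit, large_vit, projector, llm, optimizer):
    """One alternating-branch step: activate exactly one ViT branch at random and
    apply the distillation term only when the base (small) branch is active.
    `llm(visual_tokens=..., input_ids=...)` is a placeholder interface."""
    use_large = random.random() < 0.5
    vit = large_vit if use_large else base_vit

    visual = projector(vit(batch["tiles"]))
    logits = llm(visual_tokens=visual, input_ids=batch["input_ids"])

    if use_large:
        # Plain cross-entropy on text tokens (distillation disabled via lam=0).
        loss = dcl_loss(logits, logits.detach(), batch["labels"], batch["text_mask"], lam=0.0)
    else:
        with torch.no_grad():  # teacher forward pass through the large branch
            teacher = llm(visual_tokens=projector(large_vit(batch["tiles"])),
                          input_ids=batch["input_ids"])
        loss = dcl_loss(logits, teacher, batch["labels"], batch["text_mask"])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```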
On-device inference is fully serial, designed for NPU-friendly workloads:
- VRC resolution prediction (~2 ms); adaptive image resizing and tiling;
- ViT branch selection based on budgetary constraints;
- Serial tile-based encoding and pixel-shuffle-based token compression (4× reduction);
- Output generation via shared LLM;
- W4A16 quantization (4-bit weights/16-bit activations) is used to reduce memory bandwidth with negligible accuracy penalty: e.g., DocVQA (91.3→91.2), ChartQA (83.8→83.2), OCRBench (830→815).
This approach yields peak memory usage under ~200 MB for the base branch and a ≈12.9× latency speedup over standard large-ViT-based MLLMs at 1024×1024 resolution.
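To illustrate the W4A16 idea, here is a generic group-wise symmetric 4-bit weight quantization sketch in NumPy; the group size and scheme are common defaults, not necessarily those used on-device.

```python
import numpy as np

def quantize_w4_groupwise(w: np.ndarray, group: int = 128):
    """Symmetric 4-bit group-wise weight quantization. Returns int4-range codes
    (stored in int8) and one fp16 scale per group of `group` input channels."""
    out_features, in_features = w.shape
    assert in_features % group == 0
    wg = w.reshape(out_features, in_features // group, group)
    scale = np.maximum(np.abs(wg).max(axis=-1, keepdims=True) / 7.0, 1e-8)  # range [-7, 7]
    q = np.clip(np.round(wg / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reconstruct fp16 weights for a W4A16 matmul (activations stay in fp16)."""
    return (q.astype(np.float16) * scale).reshape(q.shape[0], -1)

w = np.random.randn(2048, 2048).astype(np.float32)
q, s = quantize_w4_groupwise(w)
err = np.abs(dequantize(q, s).astype(np.float32) - w).mean()
print(f"mean abs quantization error: {err:.4f}")   # small relative to mean |w| of ~0.8
```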
5. Evaluation and Comparative Performance
On the OpenCompass public suite (10 benchmarks), HyperVL Base achieves an average score of 64.5, outperforming or matching several 2B–3B parameter methods, with the large-branch at 66.1. Specific results include:
- OCR/Document: AI2D (81.8), ChartQA (83.8), DocVQA (91.3)
- Math/Reasoning: MathVista (66.2), MMMU (43.4)
- Hallucination handling: HallBench (51.5), CRPE (62.7), POPE (~88.9)
On proprietary benchmarks, HyperVL Base reaches first place in image-text creation (49.8) and relevance ranking (51.5), as well as second place in intent recognition (94.0) and UI understanding (84.2). Ablation shows a 20.2% reduction in visual tokens with VRC while retaining 98.7% of baseline accuracy.
6. Memory, Power, and Deployment Considerations
HyperVL’s serial tiling and local attention design are specifically tuned to fit the VTCM on NPUs such as the Qualcomm 8750. Pixel-shuffling and activation capping lower KV-cache pressure. The self-contained VRC plug-in operates without modifying pretrained MLLMs and adds <2 ms of runtime overhead. Quantized operators (via ONNX or vendor runtimes) maintain accuracy with reduced memory and compute requirements. Dynamic branch switching enables further adaptation to runtime latency or power constraints.
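A trivial sketch of what runtime branch switching might look like; the threshold and branch names are illustrative assumptions, not part of the released system.

```python
def select_vit_branch(latency_budget_ms: float, low_power_mode: bool) -> str:
    """Pick the ViT branch for the next request under latency/power constraints.
    Both branches share the same LLM, so switching does not reload language-model weights."""
    if low_power_mode or latency_budget_ms < 300.0:   # illustrative threshold
        return "siglip2-base"    # 93M-parameter branch: lower latency and power draw
    return "siglip2-large"       # 300M-parameter branch: higher accuracy
```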
7. Significance and Implications
HyperVL demonstrates that hardware-aware architectural, training, and compression strategies can bridge the gap between the emerging capabilities of multimodal LLMs and the resource budgets of modern mobile environments. The combination of adaptive visual resolution, token reduction, dual-branch consistency, and quantization enables edge deployment without the typical trade-off in multimodal reasoning capability. A plausible implication is that similar modular tiling and adaptive resolution frameworks could generalize across related domains where large-scale visual language processing must be performed under strict inference or power budgets (Team et al., 16 Dec 2025).