AuroraEdge-V-2B: Efficient Edge VLLM
- AuroraEdge-V-2B is a visual large language model with 1.90B parameters, designed for efficient multimodal processing on edge devices.
- Its compression-fusion strategy reduces visual tokens from 256 to 64, cutting FLOPs by nearly 90% while ensuring robust performance.
- Benchmarked on 11 datasets, the model delivers superior speed and accuracy for industrial inspection, OCR, and on-device robotics applications.
AuroraEdge-V-2B is a compact, high-throughput Visual LLM (VLLM) tailored for real-time, resource-constrained edge deployment. Built on the LLaVA paradigm, AuroraEdge-V-2B introduces a compression-fusion strategy that significantly reduces computational load and inference latency while maintaining performance across multimodal tasks. With approximately 1.90 billion parameters, it achieves superior throughput and benchmark accuracy relative to competing models in its class, explicitly targeting industrial visual-inspection, real-time OCR, and on-device robotics scenarios (Chen, 23 Jan 2026).
1. Model Architecture and Component Breakdown
AuroraEdge-V-2B employs a modular design featuring five principal components. The Vision Encoder utilizes SigLIP2-so400m-patch16-naflex, a ViT variant pretrained on large-scale image-text datasets, to extract up to 256 visual tokens $V \in \mathbb{R}^{256 \times d_v}$. These are projected via a two-layer MLP into token embeddings $P \in \mathbb{R}^{256 \times d}$, suitable for language interaction.
Compression is achieved by reducing the visual tokens from 256 to 64 using another MLP, yielding $C \in \mathbb{R}^{64 \times d}$ for later fusion. The Fusion Module implements a "Combined" approach, a fusion of cross-attention and a single-layer Transformer decoder, injecting enriched visual signals into the text embeddings $T$ as
$$T' = \mathrm{Decoder}\big(T + \mathrm{CrossAttn}(T, P, P),\; P\big).$$
The cross-attention branch computes
$$\mathrm{CrossAttn}(T, P, P) = \mathrm{softmax}\!\left(\frac{(T W_Q)(P W_K)^{\top}}{\sqrt{d}}\right) P W_V.$$
The fused text tokens are concatenated with the compressed tokens $C$ and decoded via a Qwen2.5-1.5B Transformer stack, cumulatively totaling approximately 1.90 B parameters.
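Based on the reconstruction above, a minimal PyTorch sketch of the "Combined" fusion module follows; the layer names, hidden size, and residual wiring are assumptions for illustration, not the released implementation:

```python
import torch
import torch.nn as nn

class CombinedFusion(nn.Module):
    """Cross-attention followed by a single Transformer decoder layer.

    Injects full-resolution visual tokens P (256 x d) into the text
    embeddings T before the visual stream is compressed to 64 tokens.
    """

    def __init__(self, d_model: int = 1536, n_heads: int = 12):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Single-layer decoder; its built-in cross-attention re-attends to P.
        self.decoder = nn.TransformerDecoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )

    def forward(self, text_emb: torch.Tensor, vis_tokens: torch.Tensor) -> torch.Tensor:
        # Cross-attention branch: text queries attend to visual keys/values.
        attended, _ = self.cross_attn(text_emb, vis_tokens, vis_tokens)
        fused = text_emb + attended                  # residual injection
        return self.decoder(fused, memory=vis_tokens)
```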
The inference workflow proceeds through seven deterministic steps:
- Text embedding: $T = \mathrm{Embed}(x_{\text{text}}) \in \mathbb{R}^{n_t \times d}$
- Visual encoding: $V = \mathrm{SigLIP2}(x_{\text{img}}) \in \mathbb{R}^{256 \times d_v}$
- Projection: $P = \mathrm{MLP}_{\text{proj}}(V) \in \mathbb{R}^{256 \times d}$
- Compression: $C = \mathrm{MLP}_{\text{comp}}(P) \in \mathbb{R}^{64 \times d}$
- Fusion: $T' = \mathrm{Fuse}(T, P)$
- Concatenation: $S = [\,C\,;\,T'\,]$
- Decoding: $y = \mathrm{LLM}(S)$
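These steps map one-to-one onto a forward pass. A schematic sketch, reusing the CombinedFusion module above, is shown below; all module names and shapes are illustrative, and the SigLIP2 backbone and Qwen2.5-1.5B stack are stand-in arguments:

```python
import torch
import torch.nn as nn

class AuroraEdgePipeline(nn.Module):
    """Schematic forward pass: encode, project, compress, fuse, decode."""

    def __init__(self, vision_encoder, llm, d_vis: int = 1152, d_model: int = 1536):
        super().__init__()
        self.vision_encoder = vision_encoder   # stand-in for SigLIP2
        self.projector = nn.Sequential(        # two-layer projection MLP
            nn.Linear(d_vis, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )
        # Compression MLP mixes along the token axis: 256 -> 64 tokens.
        self.compressor = nn.Linear(256, 64)
        self.fusion = CombinedFusion(d_model)  # sketched in the previous block
        self.llm = llm                         # stand-in for Qwen2.5-1.5B

    def forward(self, text_emb: torch.Tensor, image: torch.Tensor):
        v = self.vision_encoder(image)                           # (B, 256, d_vis)
        p = self.projector(v)                                    # (B, 256, d_model)
        c = self.compressor(p.transpose(1, 2)).transpose(1, 2)   # (B, 64, d_model)
        t = self.fusion(text_emb, p)   # fusion sees the FULL 256 tokens
        s = torch.cat([c, t], dim=1)                             # (B, 64 + n_t, d_model)
        return self.llm(inputs_embeds=s)   # HF-style call on the stand-in LLM
```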
This architecture enables efficient multimodal exchange and cuts the floating-point operations required during inference by nearly 90% relative to uncompressed LLaVA-style baselines.
2. Compression-Fusion Strategy
AuroraEdge-V-2B's defining methodological innovation is the compression-fusion method. Compression reduces the decoder's visual-token load from 256 to 64 tokens, shrinking decoder-side visual FLOPs to a factor of $0.25$. The fusion module, compensating for the token reduction, injects the full 256-token representations $P$ into the text tokens prior to compression, maximizing information retention for downstream tasks.
Empirical FLOPs profiling highlights the magnitude of efficiency improvement:
| Model | Visual Tokens | GFLOPs | Relative FLOPs |
|---|---|---|---|
| LLaVA-1.5 | 576 | — | 100% |
| AuroraEdge-V-2B | 64 | 263.8 | 11% |
This reduction aligns with the approximately linear relation $\text{FLOPs} \propto N_v + N_t$, where $N_v$ and $N_t$ denote the visual and text token counts fed to the decoder; with $N_t \ll N_v$, cutting $N_v$ from 576 to 64 yields the observed $\approx 11\%$ relative cost.
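A back-of-the-envelope check of this scaling, as a sketch assuming decoder cost is dominated by the linear per-token term (the source does not state the constants explicitly):

```python
# Linear-in-token-count FLOPs model: FLOPs ∝ N_v + N_t, with N_t << N_v.
LLAVA_TOKENS = 576    # LLaVA-1.5 visual tokens (table above)
AURORA_TOKENS = 64    # AuroraEdge-V-2B tokens after compression

print(f"vs. LLaVA-1.5: {AURORA_TOKENS / LLAVA_TOKENS:.1%}")  # ~11.1%, matches table
print(f"internal 256 -> 64 compression: {64 / 256:.2f}x")    # 0.25, matches above
```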
Ablation results confirm that "Combined" fusion outperforms both isolated cross-attention and decoder strategies, offering the optimal FLOPs-accuracy tradeoff for low-latency edge scenarios.
3. Real-Time Edge Deployment and Resource Profiling
AuroraEdge-V-2B targets hardware-limited environments, such as industrial devices equipped with edge GPUs. It comprises approximately 1.90B parameters, is deployed in full-precision FP32, and supports INT8 quantization for further resource savings.
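The source does not detail the quantization recipe; one plausible route is PyTorch's dynamic INT8 quantization of the linear layers (a sketch, not the authors' procedure):

```python
import torch
import torch.nn as nn

# Stand-in for the FP32 model; real deployment would load the VLLM here.
model = nn.Sequential(nn.Linear(1536, 1536), nn.GELU(), nn.Linear(1536, 1536))

# Dynamic INT8 quantization: weights are stored as int8 and activations
# are quantized on the fly, shrinking linear layers roughly 4x vs. FP32.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(quantized)
```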
Latency and throughput metrics demonstrate a roughly threefold speed improvement relative to contemporaneous 2–3B VLLMs, evaluated on RTX 3090 hardware with single-image inputs at batch size one.
| Model | Params | GFLOPs | Latency (ms) |
|---|---|---|---|
| Qwen2.5-VL-3B | 3.50 B | 2288.7 | 143 |
| Qwen2-VL-2B | 2.06 B | 1645.4 | 116 |
| InternVL-2.5-2B | 1.88 B | 4091.9 | 142 |
| AuroraEdge-V-2B | 1.90 B | 263.8 | 40 |
Throughput measurements report 25 QPS (queries per second) at 40 ms latency, establishing AuroraEdge-V-2B as approximately 3× faster than the next fastest comparable model.
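A minimal forward-pass timing harness for reproducing single-batch latency and QPS figures of this kind; the warm-up count, iteration count, and median statistic are assumptions, and a full generative query may involve more than one forward pass:

```python
import time
import torch

@torch.inference_mode()
def profile(model, example_inputs: dict, warmup: int = 10, iters: int = 100):
    """Median per-query latency (ms) and implied QPS at batch size one."""
    for _ in range(warmup):                  # warm-up: allocator/cache effects
        model(**example_inputs)
    torch.cuda.synchronize()                 # flush pending GPU work (CUDA only)
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        model(**example_inputs)
        torch.cuda.synchronize()             # time completed GPU work only
        times.append(time.perf_counter() - start)
    latency_ms = 1000 * sorted(times)[len(times) // 2]
    return latency_ms, 1000.0 / latency_ms   # e.g. 40 ms -> 25 QPS
```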
4. Benchmark Evaluation and Comparative Metrics
AuroraEdge-V-2B was systematically evaluated on 11 established multimodal benchmarks: ScienceQA, VQAV2, OKVQA, TextVQA, VIZWIZ, GQA, AI2Diagram, OCRVQA, MMBench(cn), MMBench(en), and MME. Performance exceeded that of competing 2–3B-parameter VLLMs on 9 of the 11 datasets.
| Dataset | Qwen2.5-VL-3B | Qwen2-VL-2B | InternVL-2.5-2B | AuroraEdge-V-2B |
|---|---|---|---|---|
| ScienceQA | 80.49 | 77.59 | 95.80 | 76.74 |
| VQAV2 | 80.51 | 80.01 | 75.28 | 83.21 |
| OKVQA | 55.22 | 52.96 | 46.58 | 73.95 |
| TextVQA | 76.82 | 76.87 | 69.44 | 77.26 |
| VIZWIZ | 70.35 | 66.96 | 45.76 | 71.05 |
| GQA | 61.25 | 60.40 | 59.30 | 65.75 |
| AI2Diagram | 60.85 | 55.35 | 73.55 | 92.00 |
| OCRVQA | 74.53 | 74.17 | 32.02 | 68.15 |
| MMBench(cn) | 82.98 | 74.24 | 74.29 | 92.39 |
| MMBench(en) | 82.07 | 73.36 | 71.76 | 83.72 |
| MME | 85.72 | 87.99 | 80.79 | 94.06 |
These results reflect robust generalization and cross-task flexibility, with AuroraEdge-V-2B surpassing similarly sized models such as Qwen2-VL-2B, Qwen2.5-VL-3B, and InternVL-2.5-2B on most metrics; the exceptions are ScienceQA, where InternVL-2.5-2B leads, and OCRVQA, where Qwen2.5-VL-3B leads.
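As a quick sanity check on the 9-of-11 claim, the per-dataset leaders can be tallied directly from the table above (scores transcribed as printed):

```python
scores = {  # dataset: (Qwen2.5-VL-3B, Qwen2-VL-2B, InternVL-2.5-2B, AuroraEdge-V-2B)
    "ScienceQA": (80.49, 77.59, 95.80, 76.74),
    "VQAV2": (80.51, 80.01, 75.28, 83.21),
    "OKVQA": (55.22, 52.96, 46.58, 73.95),
    "TextVQA": (76.82, 76.87, 69.44, 77.26),
    "VIZWIZ": (70.35, 66.96, 45.76, 71.05),
    "GQA": (61.25, 60.40, 59.30, 65.75),
    "AI2Diagram": (60.85, 55.35, 73.55, 92.00),
    "OCRVQA": (74.53, 74.17, 32.02, 68.15),
    "MMBench(cn)": (82.98, 74.24, 74.29, 92.39),
    "MMBench(en)": (82.07, 73.36, 71.76, 83.72),
    "MME": (85.72, 87.99, 80.79, 94.06),
}
wins = sum(row[3] == max(row) for row in scores.values())
print(f"AuroraEdge-V-2B leads on {wins} of {len(scores)} benchmarks")  # -> 9 of 11
```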
5. Industrial Applications and Limitation Analysis
AuroraEdge-V-2B is deployed in edge scenarios requiring multimodal understanding under real-time and resource-limited constraints. Representative use cases include:
- Industrial inspection and automated report generation on edge devices
- Real-time OCR and document understanding in factories
- On-device robotics perception and human–machine interaction
Limitations are primarily domain-specific. Purpose-built, task-specific models may outperform AuroraEdge-V-2B on narrowly defined tasks. The compression ratio (256→64) and the single-layer fusion are currently fixed; deeper fusion architectures and higher compression rates constitute plausible future optimizations. The model lacks native video support; incorporating temporal input modalities remains future work. Furthermore, the connectors are trained exclusively with a text loss; supplementing this with visual-reconstruction objectives may enrich the compressed tokens, as sketched below.
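One way such a visual-reconstruction objective could look, sketched as a hypothetical auxiliary head on the compressed tokens (entirely illustrative; no such loss exists in the current training recipe):

```python
import torch
import torch.nn as nn

class ReconstructionHead(nn.Module):
    """Decode compressed tokens C (64 x d) back toward the projected tokens
    P (256 x d); the reconstruction error becomes an auxiliary loss that
    pressures C to retain visual information, alongside the text loss."""

    def __init__(self, d_model: int = 1536, n_in: int = 64, n_out: int = 256):
        super().__init__()
        self.expand = nn.Linear(n_in, n_out)      # token-axis expansion 64 -> 256
        self.refine = nn.Linear(d_model, d_model)

    def forward(self, c: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
        p_hat = self.refine(self.expand(c.transpose(1, 2)).transpose(1, 2))
        return nn.functional.mse_loss(p_hat, p)

# Hypothetical combined objective with a tunable weight lambda_rec:
# total_loss = text_loss + lambda_rec * ReconstructionHead()(C, P)
```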
6. Context and Prospects
AuroraEdge-V-2B demonstrates that aggressive token-level compression combined with a lightweight fusion architecture can reduce inference FLOPs by nearly 90%, deliver a threefold speedup on edge GPUs, and maintain competitive accuracy on diverse multimodal benchmarks. This approach illustrates a broader shift toward generalizable and efficient VLLMs for industrial and robotics contexts, highlighting the potential for further innovation in compact model architectures for edge deployment (Chen, 23 Jan 2026).