AuroraEdge-V-2B: Efficient Edge VLLM
- AuroraEdge-V-2B is a visual large language model with 1.90B parameters, designed for efficient multimodal processing on edge devices.
- Its compression-fusion strategy reduces visual tokens from 256 to 64, cutting FLOPs by nearly 90% while ensuring robust performance.
- Benchmarked on 11 datasets, the model delivers superior speed and accuracy for industrial inspection, OCR, and on-device robotics applications.
AuroraEdge-V-2B is a compact, high-throughput Visual LLM (VLLM) tailored for real-time, resource-constrained edge deployment. Built on the LLaVA paradigm, AuroraEdge-V-2B introduces a compression-fusion strategy that significantly reduces computational load and inference latency while maintaining performance across multimodal tasks. With approximately 1.90 billion parameters, it achieves superior throughput and benchmark accuracy relative to competing models in its class, explicitly targeting industrial visual-inspection, real-time OCR, and on-device robotics scenarios (Chen, 23 Jan 2026).
1. Model Architecture and Component Breakdown
AuroraEdge-V-2B employs a modular design featuring five principal components. The Vision Encoder utilizes SigLIP2-so400m-patch16-naflex, a ViT variant pretrained on large-scale image-text datasets, to extract up to 256 visual tokens $V \in \mathbb{R}^{256 \times d_v}$. These are projected via a two-layer MLP into token embeddings $P \in \mathbb{R}^{256 \times d}$, suitable for language interaction.
Compression is achieved by reducing the visual tokens from 256 to 64 using another MLP, yielding $C \in \mathbb{R}^{64 \times d}$ for later fusion. The Fusion Module implements a "Combined" approach, a fusion of cross-attention and a single-layer Transformer decoder, injecting enriched visual signals into the text embeddings $T$ as
$$T' = \mathrm{Decoder}\big(T + \mathrm{CrossAttn}(T, P, P),\; P\big).$$
The cross-attention branch computes
$$\mathrm{CrossAttn}(T, P, P) = \mathrm{softmax}\!\left(\frac{(T W_Q)(P W_K)^{\top}}{\sqrt{d}}\right) P W_V.$$
The fused text tokens are concatenated with the compressed tokens $C$ and decoded via a Qwen2.5-1.5B Transformer stack, cumulatively totaling approximately 1.90 B parameters.
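Based on the reconstruction above, a minimal PyTorch sketch of the "Combined" fusion module follows; the layer names, hidden size, and residual wiring are assumptions for illustration, not the released implementation:

```python
import torch
import torch.nn as nn

class CombinedFusion(nn.Module):
    """Cross-attention followed by a single Transformer decoder layer.

    Injects full-resolution visual tokens P (256 x d) into the text
    embeddings T before the visual stream is compressed to 64 tokens.
    """

    def __init__(self, d_model: int = 1536, n_heads: int = 12):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Single-layer decoder; its built-in cross-attention re-attends to P.
        self.decoder = nn.TransformerDecoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )

    def forward(self, text_emb: torch.Tensor, vis_tokens: torch.Tensor) -> torch.Tensor:
        # Cross-attention branch: text queries attend to visual keys/values.
        attended, _ = self.cross_attn(text_emb, vis_tokens, vis_tokens)
        fused = text_emb + attended                  # residual injection
        return self.decoder(fused, memory=vis_tokens)
```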
The inference workflow proceeds through seven deterministic steps:
- Text embedding: $T = \mathrm{Embed}(x_{\text{text}}) \in \mathbb{R}^{n_t \times d}$
- Visual encoding: $V = \mathrm{SigLIP2}(x_{\text{img}}) \in \mathbb{R}^{256 \times d_v}$
- Projection: $P = \mathrm{MLP}_{\text{proj}}(V) \in \mathbb{R}^{256 \times d}$
- Compression: $C = \mathrm{MLP}_{\text{comp}}(P) \in \mathbb{R}^{64 \times d}$
- Fusion: $T' = \mathrm{Fuse}(T, P)$
- Concatenation: $S = [\,C\,;\,T'\,]$
- Decoding: $y = \mathrm{LLM}(S)$
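These steps map one-to-one onto a forward pass. A schematic sketch, reusing the CombinedFusion module above, is shown below; all module names and shapes are illustrative, and the SigLIP2 backbone and Qwen2.5-1.5B stack are stand-in arguments:

```python
import torch
import torch.nn as nn

class AuroraEdgePipeline(nn.Module):
    """Schematic forward pass: encode, project, compress, fuse, decode."""

    def __init__(self, vision_encoder, llm, d_vis: int = 1152, d_model: int = 1536):
        super().__init__()
        self.vision_encoder = vision_encoder   # stand-in for SigLIP2
        self.projector = nn.Sequential(        # two-layer projection MLP
            nn.Linear(d_vis, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )
        # Compression MLP mixes along the token axis: 256 -> 64 tokens.
        self.compressor = nn.Linear(256, 64)
        self.fusion = CombinedFusion(d_model)  # sketched in the previous block
        self.llm = llm                         # stand-in for Qwen2.5-1.5B

    def forward(self, text_emb: torch.Tensor, image: torch.Tensor):
        v = self.vision_encoder(image)                           # (B, 256, d_vis)
        p = self.projector(v)                                    # (B, 256, d_model)
        c = self.compressor(p.transpose(1, 2)).transpose(1, 2)   # (B, 64, d_model)
        t = self.fusion(text_emb, p)   # fusion sees the FULL 256 tokens
        s = torch.cat([c, t], dim=1)                             # (B, 64 + n_t, d_model)
        return self.llm(inputs_embeds=s)   # HF-style call on the stand-in LLM
```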
This architecture enables efficient multimodal exchange and cuts the floating-point operations required during inference by nearly 90% relative to uncompressed LLaVA-style baselines.
2. Compression-Fusion Strategy
AuroraEdge-V-2B's defining methodological innovation is the compression-fusion method. Compression reduces the decoder's visual-token load from 256 to 64 tokens, shrinking decoder-side visual FLOPs to a factor of $0.25$. The fusion module, compensating for the token reduction, injects the full 256-token representations $P$ into the text tokens prior to compression, maximizing information retention for downstream tasks.
Empirical FLOPs profiling highlights the magnitude of efficiency improvement:
| Model | Visual Tokens | GFLOPs | Relative FLOPs |
|---|---|---|---|
| LLaVA-1.5 | 576 | — | 100% |
| AuroraEdge-V-2B | 64 | 263.8 | 11% |
This reduction aligns with the approximately linear relation $\text{FLOPs} \propto N_v + N_t$, where $N_v$ and $N_t$ denote the visual and text token counts fed to the decoder; with $N_t \ll N_v$, cutting $N_v$ from 576 to 64 yields the observed $\approx 11\%$ relative cost.
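A back-of-the-envelope check of this scaling, as a sketch assuming decoder cost is dominated by the linear per-token term (the source does not state the constants explicitly):

```python
# Linear-in-token-count FLOPs model: FLOPs ∝ N_v + N_t, with N_t << N_v.
LLAVA_TOKENS = 576    # LLaVA-1.5 visual tokens (table above)
AURORA_TOKENS = 64    # AuroraEdge-V-2B tokens after compression

print(f"vs. LLaVA-1.5: {AURORA_TOKENS / LLAVA_TOKENS:.1%}")  # ~11.1%, matches table
print(f"internal 256 -> 64 compression: {64 / 256:.2f}x")    # 0.25, matches above
```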
Ablation results confirm that "Combined" fusion outperforms both isolated cross-attention and decoder strategies, offering the optimal FLOPs-accuracy tradeoff for low-latency edge scenarios.
3. Real-Time Edge Deployment and Resource Profiling
AuroraEdge-V-2B targets hardware-limited environments, such as industrial devices equipped with edge GPUs. It comprises approximately 1.90B parameters, is deployed in full-precision FP32, and supports INT8 quantization for further resource savings.
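The source does not detail the quantization recipe; one plausible route is PyTorch's dynamic INT8 quantization of the linear layers (a sketch, not the authors' procedure):

```python
import torch
import torch.nn as nn

# Stand-in for the FP32 model; real deployment would load the VLLM here.
model = nn.Sequential(nn.Linear(1536, 1536), nn.GELU(), nn.Linear(1536, 1536))

# Dynamic INT8 quantization: weights are stored as int8 and activations
# are quantized on the fly, shrinking linear layers roughly 4x vs. FP32.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(quantized)
```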
Latency and throughput metrics demonstrate a roughly threefold speed improvement relative to contemporaneous 2–3B VLLMs, evaluated on RTX 3090 hardware with single-image inputs at batch size one.
| Model | Params | GFLOPs | Latency (ms) |
|---|---|---|---|
| Qwen2.5-VL-3B | 3.50 B | 2288.7 | 143 |
| Qwen2-VL-2B | 2.06 B | 1645.4 | 116 |
| InternVL-2.5-2B | 1.88 B | 4091.9 | 142 |
| AuroraEdge-V-2B | 1.90 B | 263.8 | 40 |
Throughput measurements report 25 QPS (queries per second) at 40 ms latency, establishing AuroraEdge-V-2B as approximately 3× faster than the next fastest comparable model.
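A minimal forward-pass timing harness for reproducing single-batch latency and QPS figures of this kind; the warm-up count, iteration count, and median statistic are assumptions, and a full generative query may involve more than one forward pass:

```python
import time
import torch

@torch.inference_mode()
def profile(model, example_inputs: dict, warmup: int = 10, iters: int = 100):
    """Median per-query latency (ms) and implied QPS at batch size one."""
    for _ in range(warmup):                  # warm-up: allocator/cache effects
        model(**example_inputs)
    torch.cuda.synchronize()                 # flush pending GPU work (CUDA only)
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        model(**example_inputs)
        torch.cuda.synchronize()             # time completed GPU work only
        times.append(time.perf_counter() - start)
    latency_ms = 1000 * sorted(times)[len(times) // 2]
    return latency_ms, 1000.0 / latency_ms   # e.g. 40 ms -> 25 QPS
```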
4. Benchmark Evaluation and Comparative Metrics
AuroraEdge-V-2B was systematically evaluated on 11 established multimodal benchmarks: ScienceQA, VQAV2, OKVQA, TextVQA, VIZWIZ, GQA, AI2Diagram, OCRVQA, MMBench(cn), MMBench(en), and MME. Performance exceeded that of competing 2–3B-parameter VLLMs on 9 of the 11 datasets.
| Dataset | Qwen2.5-VL-3B | Qwen2-VL-2B | InternVL-2.5-2B | AuroraEdge-V-2B |
|---|---|---|---|---|
| ScienceQA | 80.49 | 77.59 | 95.80 | 76.74 |
| VQAV2 | 80.51 | 80.01 | 75.28 | 83.21 |
| OKVQA | 55.22 | 52.96 | 46.58 | 73.95 |
| TextVQA | 76.82 | 76.87 | 69.44 | 77.26 |
| VIZWIZ | 70.35 | 66.96 | 45.76 | 71.05 |
| GQA | 61.25 | 60.40 | 59.30 | 65.75 |
| AI2Diagram | 60.85 | 55.35 | 73.55 | 92.00 |
| OCRVQA | 74.53 | 74.17 | 32.02 | 68.15 |
| MMBench(cn) | 82.98 | 74.24 | 74.29 | 92.39 |
| MMBench(en) | 82.07 | 73.36 | 71.76 | 83.72 |
| MME | 85.72 | 87.99 | 80.79 | 94.06 |
These results reflect robust generalization and cross-task flexibility, with AuroraEdge-V-2B surpassing similarly sized models such as Qwen2-VL-2B, Qwen2.5-VL-3B, and InternVL-2.5-2B on most metrics; the exceptions are ScienceQA, where InternVL-2.5-2B leads, and OCRVQA, where Qwen2.5-VL-3B leads.
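As a quick sanity check on the 9-of-11 claim, the per-dataset leaders can be tallied directly from the table above (scores transcribed as printed):

```python
scores = {  # dataset: (Qwen2.5-VL-3B, Qwen2-VL-2B, InternVL-2.5-2B, AuroraEdge-V-2B)
    "ScienceQA": (80.49, 77.59, 95.80, 76.74),
    "VQAV2": (80.51, 80.01, 75.28, 83.21),
    "OKVQA": (55.22, 52.96, 46.58, 73.95),
    "TextVQA": (76.82, 76.87, 69.44, 77.26),
    "VIZWIZ": (70.35, 66.96, 45.76, 71.05),
    "GQA": (61.25, 60.40, 59.30, 65.75),
    "AI2Diagram": (60.85, 55.35, 73.55, 92.00),
    "OCRVQA": (74.53, 74.17, 32.02, 68.15),
    "MMBench(cn)": (82.98, 74.24, 74.29, 92.39),
    "MMBench(en)": (82.07, 73.36, 71.76, 83.72),
    "MME": (85.72, 87.99, 80.79, 94.06),
}
wins = sum(row[3] == max(row) for row in scores.values())
print(f"AuroraEdge-V-2B leads on {wins} of {len(scores)} benchmarks")  # -> 9 of 11
```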
5. Industrial Applications and Limitation Analysis
AuroraEdge-V-2B is deployed in edge scenarios requiring multimodal understanding under real-time and resource-limited constraints. Representative use cases include:
- Industrial inspection and automated report generation on edge devices
- Real-time OCR and document understanding in factories
- On-device robotics perception and human–machine interaction
Limitations are primarily domain-specific. Purpose-built, task-specific models may outperform AuroraEdge-V-2B on narrowly defined tasks. The compression ratio (256→64) and the single-layer fusion are currently fixed; deeper fusion architectures and higher compression rates constitute plausible future optimizations. The model lacks native video support; incorporating temporal input modalities remains future work. Furthermore, the connectors are trained exclusively with a text loss; supplementing this with visual-reconstruction objectives may enrich the compressed tokens, as sketched below.
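One way such a visual-reconstruction objective could look, sketched as a hypothetical auxiliary head on the compressed tokens (entirely illustrative; no such loss exists in the current training recipe):

```python
import torch
import torch.nn as nn

class ReconstructionHead(nn.Module):
    """Decode compressed tokens C (64 x d) back toward the projected tokens
    P (256 x d); the reconstruction error becomes an auxiliary loss that
    pressures C to retain visual information, alongside the text loss."""

    def __init__(self, d_model: int = 1536, n_in: int = 64, n_out: int = 256):
        super().__init__()
        self.expand = nn.Linear(n_in, n_out)      # token-axis expansion 64 -> 256
        self.refine = nn.Linear(d_model, d_model)

    def forward(self, c: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
        p_hat = self.refine(self.expand(c.transpose(1, 2)).transpose(1, 2))
        return nn.functional.mse_loss(p_hat, p)

# Hypothetical combined objective with a tunable weight lambda_rec:
# total_loss = text_loss + lambda_rec * ReconstructionHead()(C, P)
```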
6. Context and Prospects
AuroraEdge-V-2B demonstrates that aggressive token-level compression combined with a lightweight fusion architecture can reduce inference FLOPs by nearly 90%, deliver a threefold speedup on edge GPUs, and maintain competitive accuracy on diverse multimodal benchmarks. This approach illustrates a broader shift toward generalizable and efficient VLLMs for industrial and robotics contexts, highlighting the potential for further innovation in compact model architectures for edge deployment (Chen, 23 Jan 2026).