
AuroraEdge-V-2B: Efficient Edge VLLM

Updated 30 January 2026
  • AuroraEdge-V-2B is a visual large language model with 1.90B parameters, designed for efficient multimodal processing on edge devices.
  • Its compression-fusion strategy reduces visual tokens from 256 to 64, cutting FLOPs by nearly 90% while ensuring robust performance.
  • Benchmarked on 11 datasets, the model delivers superior speed and accuracy for industrial inspection, OCR, and on-device robotics applications.

AuroraEdge-V-2B is a compact, high-throughput Visual LLM (VLLM) tailored for real-time, resource-constrained edge deployment. Built on the LLaVA paradigm, it introduces a compression-fusion strategy that significantly reduces computational load and inference latency while maintaining performance across multimodal tasks. With approximately 1.90 billion parameters, it achieves superior throughput and benchmark accuracy relative to competing models in its class, explicitly targeting industrial visual inspection, real-time OCR, and on-device robotics scenarios (Chen, 23 Jan 2026).

1. Model Architecture and Component Breakdown

AuroraEdge-V-2B employs a modular design featuring five principal components. The Vision Encoder uses SigLIP2-so400m-patch16-naflex, a ViT variant pretrained on large-scale image-text datasets, to extract up to $N_v = 256$ visual tokens $H_v \in \mathbb{R}^{256 \times D_v}$. These are projected via a two-layer MLP into token embeddings $H_{v2t} \in \mathbb{R}^{N_v \times D_t}$, suitable for language interaction.
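As an illustrative sketch of this projection stage, a two-layer MLP in NumPy; the hidden sizes, initialization scale, and GELU activation are assumptions for the example, not the released configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
N_v, D_v, D_t = 256, 1152, 1536   # token count and hypothetical hidden sizes

# Two-layer MLP projector: visual feature space (D_v) -> text space (D_t)
W1 = rng.standard_normal((D_v, D_t)) * 0.02
W2 = rng.standard_normal((D_t, D_t)) * 0.02

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

H_v = rng.standard_normal((N_v, D_v))   # stand-in for encoder output
H_v2t = gelu(H_v @ W1) @ W2             # projected token embeddings
assert H_v2t.shape == (N_v, D_t)        # (256, D_t), ready for the decoder
```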

Compression is achieved by reducing the visual tokens from $N_v = 256$ to $N_c = 64$ using another MLP, yielding $H_{vc} \in \mathbb{R}^{64 \times D_t}$ for later fusion. The Fusion Module implements a "Combined" approach—a fusion of cross-attention and a single-layer Transformer decoder—injecting enriched visual signals into text embeddings as

H_{ft} = F_{\rm Cross}(H_{v2t}, H_t) + F_{\rm Decoder}(H_{v2t}, H_t)

The cross-attention branch computes

Q_t = W^Q H_t,\quad K_v = W^K H_{v2t},\quad V_v = W^V H_{v2t}

H_{ft}^{(\rm cross)} = \mathrm{Softmax}\!\left(\frac{Q_t K_v^\top}{\sqrt{D_t}}\right) V_v

This is concatenated with the compressed tokens and decoded via a Qwen2.5-1.5B Transformer stack, for a cumulative total of $\approx 1.90$B parameters.
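The cross-attention branch can be sketched in NumPy as follows. This is a single-head toy version with random weights and made-up sizes; the released model's dimensions, head count, and learned parameters will differ. (Code uses the row-per-token convention $H W$ rather than the $W H$ form in the equations above.)

```python
import numpy as np

rng = np.random.default_rng(0)
N_v, N_t, D_t = 256, 8, 64   # toy sizes: visual tokens, text tokens, width

H_v2t = rng.standard_normal((N_v, D_t))   # projected visual tokens
H_t   = rng.standard_normal((N_t, D_t))   # text embeddings

# Query/key/value projections (scaled random stand-ins for learned weights)
W_Q, W_K, W_V = (rng.standard_normal((D_t, D_t)) / np.sqrt(D_t)
                 for _ in range(3))

Q_t = H_t   @ W_Q   # queries come from text
K_v = H_v2t @ W_K   # keys from visual tokens
V_v = H_v2t @ W_V   # values from visual tokens

scores = Q_t @ K_v.T / np.sqrt(D_t)                 # (N_t, N_v)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)            # row-wise softmax
H_ft_cross = attn @ V_v                             # (N_t, D_t)

assert H_ft_cross.shape == (N_t, D_t)   # text tokens, now vision-enriched
```

Note that the output keeps the text-token count: the fusion enriches text embeddings without adding sequence length, which is what lets the decoder run on only 64 compressed visual tokens plus the fused text tokens.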

The inference workflow proceeds through seven deterministic steps:

  1. Text embedding: $H_t = F_{\rm embeds}(X_t)$
  2. Visual encoding: $H_v = F_{\rm ve}(X_v)$
  3. Projection: $H_{v2t} = F_{\rm proj}(H_v)$
  4. Compression: $H_{vc} = F_{\rm compress}(H_{v2t})$
  5. Fusion: $H_{ft} = F_{\rm fusion}(H_{v2t}, H_t)$
  6. Concatenation: $H_m = \mathrm{Concat}(H_{vc}, H_{ft})$
  7. Decoding: $Y = F_{\rm decode}(H_m)$

This architecture enables efficient multimodal exchange and cuts inference FLOPs by nearly 90% relative to the LLaVA-1.5 baseline.

2. Compression-Fusion Strategy

AuroraEdge-V-2B's defining methodological innovation is the compression-fusion method. Compression reduces the decoder's visual-token load from $N_v = 256$ to $N_c = 64$, cutting decoder-side token FLOPs to a quarter of the original. To compensate for the token reduction, the fusion module injects the full, uncompressed $H_{v2t}$ representations into the text tokens, maximizing information retention for downstream tasks.

Empirical FLOPs profiling highlights the magnitude of efficiency improvement:

Model              Visual tokens   GFLOPS   Relative FLOPs
LLaVA-1.5          576             —        100%
AuroraEdge-V-2B    64              263.8    11%

This reduction aligns with the relation $\mathrm{Token\text{-}FLOPs} \propto N \cdot D^2 \cdot L$: with the hidden size $D$ and layer count $L$ fixed, the FLOPs ratio follows the token-count ratio

\frac{64}{576} \approx 0.11
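A one-line sanity check of this ratio, using the token counts from the FLOPs table:

```python
# Token-FLOPs scale as N * D^2 * L; with D and L fixed, the ratio
# reduces to the token-count ratio between the two models.
n_llava, n_aurora = 576, 64
ratio = n_aurora / n_llava
print(f"{ratio:.2f}")   # 0.11 -> matches the reported ~11% relative GFLOPS
```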

Ablation results confirm that "Combined" fusion outperforms both isolated cross-attention and decoder strategies, offering the optimal FLOPs-accuracy tradeoff for low-latency edge scenarios.

3. Real-Time Edge Deployment and Resource Profiling

AuroraEdge-V-2B targets hardware-limited environments, such as industrial devices equipped with edge GPUs. It comprises approximately 1.90B parameters, deploys in full-precision FP32, and supports further quantization to INT8 for greater resource efficiency.
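A back-of-the-envelope estimate of the weight-memory footprint at each precision (parameter count from the document; this ignores activations, KV cache, and runtime overhead, and FP16 is included only for comparison):

```python
# Parameter memory = parameter count * bytes per parameter.
params = 1.90e9
footprint_gb = {name: params * nbytes / 1e9
                for name, nbytes in [("FP32", 4), ("FP16", 2), ("INT8", 1)]}
for name, gb in footprint_gb.items():
    print(f"{name}: {gb:.1f} GB")
# FP32: 7.6 GB, FP16: 3.8 GB, INT8: 1.9 GB -> INT8 cuts weight memory 4x
```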

Latency and throughput metrics demonstrate a threefold speed improvement relative to contemporaneous 2–3B VLLMs, evaluated on RTX 3090 hardware with $640 \times 480$ inputs at batch size one.

Model              Params   GFLOPS   Latency (ms)
Qwen2.5-VL-3B      3.50 B   2288.7   143
Qwen2-VL-2B        2.06 B   1645.4   116
InternVL-2.5-2B    1.88 B   4091.9   142
AuroraEdge-V-2B    1.90 B    263.8    40

Throughput measurements report $\approx 25$ QPS (queries per second) at 40 ms latency, establishing AuroraEdge-V-2B as approximately 3× faster than the next-fastest comparable model.
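The throughput and speedup figures follow directly from the latency table at batch size one:

```python
# QPS at batch size 1 is the reciprocal of per-query latency.
latency_ms = {"Qwen2.5-VL-3B": 143, "Qwen2-VL-2B": 116,
              "InternVL-2.5-2B": 142, "AuroraEdge-V-2B": 40}
qps = {model: 1000 / t for model, t in latency_ms.items()}
print(round(qps["AuroraEdge-V-2B"]))   # 25 QPS

# Speedup over the next-fastest model in the table (Qwen2-VL-2B, 116 ms)
next_fastest = min(t for m, t in latency_ms.items() if m != "AuroraEdge-V-2B")
print(f"{next_fastest / latency_ms['AuroraEdge-V-2B']:.1f}x")   # 2.9x
```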

4. Benchmark Evaluation and Comparative Metrics

AuroraEdge-V-2B was systematically evaluated on 11 established multimodal benchmarks: ScienceQA, VQAV2, OKVQA, TextVQA, VIZWIZ, GQA, AI2Diagram, OCRVQA, MMBench(cn), MMBench(en), and MME. It outperformed competing 2–3B-parameter VLLMs on 9 of the 11 datasets, trailing only on ScienceQA and OCRVQA.

Dataset        Qwen2.5-VL-3B   Qwen2-VL-2B   InternVL-2.5-2B   AuroraEdge-V-2B
ScienceQA      80.49           77.59         95.80             76.74
VQAV2          80.51           80.01         75.28             83.21
OKVQA          55.22           52.96         46.58             73.95
TextVQA        76.82           76.87         69.44             77.26
VIZWIZ         70.35           66.96         45.76             71.05
GQA            61.25           60.40         59.30             65.75
AI2Diagram     60.85           55.35         73.55             92.00
OCRVQA         74.53           74.17         32.02             68.15
MMBench(cn)    82.98           74.24         74.29             92.39
MMBench(en)    82.07           73.36         71.76             83.72
MME            85.72           87.99         80.79             94.06

These results reflect robust generalization and cross-task flexibility, surpassing similarly sized models such as Qwen2-VL-2B, Qwen2.5-VL-3B, and InternVL-2.5-2B on most metrics.
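As a quick sanity check, the win count can be tallied directly from the benchmark table (scores copied verbatim; the last column is AuroraEdge-V-2B):

```python
# Per-dataset scores: (Qwen2.5-VL-3B, Qwen2-VL-2B, InternVL-2.5-2B, AuroraEdge-V-2B)
scores = {
    "ScienceQA":   (80.49, 77.59, 95.80, 76.74),
    "VQAV2":       (80.51, 80.01, 75.28, 83.21),
    "OKVQA":       (55.22, 52.96, 46.58, 73.95),
    "TextVQA":     (76.82, 76.87, 69.44, 77.26),
    "VIZWIZ":      (70.35, 66.96, 45.76, 71.05),
    "GQA":         (61.25, 60.40, 59.30, 65.75),
    "AI2Diagram":  (60.85, 55.35, 73.55, 92.00),
    "OCRVQA":      (74.53, 74.17, 32.02, 68.15),
    "MMBench(cn)": (82.98, 74.24, 74.29, 92.39),
    "MMBench(en)": (82.07, 73.36, 71.76, 83.72),
    "MME":         (85.72, 87.99, 80.79, 94.06),
}
wins = sum(row[-1] == max(row) for row in scores.values())
losses = [name for name, row in scores.items() if row[-1] != max(row)]
print(wins, losses)   # 9 ['ScienceQA', 'OCRVQA']
```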

5. Industrial Applications and Limitation Analysis

AuroraEdge-V-2B is deployed in edge scenarios such as industrial inspection, automated report generation, on-device factory OCR, and robotics perception—domains requiring multimodal understanding under real-time and resource-limited constraints.

Use cases include:

  • Industrial inspection and automated report generation on edge devices
  • Real-time OCR and document understanding in factories
  • On-device robotics perception and human–machine interaction

Limitations are primarily domain-specific. Custom-developed DLMs may outperform AuroraEdge-V-2B on narrowly defined tasks. The compression ratio (256→64) and the single-layer fusion module are currently fixed; deeper fusion architectures and higher compression rates are plausible future optimizations. The model lacks native video support; incorporating temporal input modalities remains future work. Furthermore, the connectors are trained exclusively with a text loss; supplementing it with visual-reconstruction objectives may enrich the compressed tokens.

6. Context and Prospects

AuroraEdge-V-2B demonstrates that aggressive token-level compression combined with lightweight fusion architecture can reduce inference FLOPs by nearly 90%, deliver threefold speedup on edge GPUs, and maintain competitive accuracy on diverse multimodal benchmarks. This approach evidences a shift toward generalizable and efficient VLLMs for industrial and robotics contexts, highlighting the potential for further innovation in compact model architectures for edge deployment (Chen, 23 Jan 2026).
