
Edge VLA (EVLA) for Real-Time Robotic Control

Updated 4 July 2026
  • Edge VLA (EVLA) is a class of models that integrate vision, language, and action to enable real-time robotic control on resource-constrained hardware.
  • The approach uses non-autoregressive prediction and a small language model to achieve up to 7× faster inference and significant memory savings.
  • EVLA systems leverage edge-cloud collaboration and dynamic model partitioning to balance latency, compute, and memory constraints effectively.

Edge VLA (EVLA, “Edge Vision-Language-Action”) denotes Vision-Language-Action models and deployment schemes designed for real-time robotic control on resource-constrained hardware. In the narrow sense, the term refers to the "EdgeVLA" model, which targets real-time inference on edge devices by combining non-autoregressive 7-DoF prediction with a Small LLM; in a broader systems sense, it encompasses analytical performance models, edge-cloud collaborative execution, asynchronous adapters, and hierarchical perception-control architectures intended to make embodied-policy inference feasible under stringent latency, memory, and bandwidth constraints (Budzianowski et al., 18 Jul 2025).

1. Definition and operational scope

Within embodied AI, EVLA is motivated by the requirement that robots and drones must perceive, reason, and act in the real world at control rates of $10\text{–}20\ \mathrm{Hz}$ without off-board assistance. This requirement is difficult to satisfy because Vision-Language-Action models are large, multimodal, and often decode actions through sparse, memory-bound execution paths. The edge setting therefore includes not only compact model design, but also inference-system design, memory-bandwidth analysis, and policies for deciding whether computation should occur on-device, on a nearby server, or in the cloud (Vishwanathan et al., 1 Mar 2026).

The broader EVLA literature treats end-to-end latency as the primary systems objective. In VLA-Perf, the total latency is modeled as

$$L_{\text{total}}=\sum_{m\in M} T_m^{\text{compute}} + \sum_{d\in D} T_d^{\text{net}},$$

with per-operator latency

$$T_o=\max\!\left(\frac{\mathrm{FLOPs}_o}{\mathrm{FLOP/s}_h},\ \frac{\mathrm{Bytes}_o}{\mathrm{MemBW}_h}\right),$$

and network-transfer latency

$$T_d^{\text{net}}=\mathrm{NetLat}+\frac{\mathrm{Bytes}_d}{\mathrm{NetBW}}.$$

This formalization places EVLA at the intersection of model architecture, accelerator characteristics, and network conditions rather than treating it as a purely algorithmic problem (Jiang et al., 20 Feb 2026).
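The latency model above can be sketched in a few lines of Python. This is a minimal illustration rather than VLA-Perf's implementation: the class and function names are hypothetical, and it assumes that a module's compute latency is the sum of its operators' latencies, which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class Hardware:
    flops_per_s: float  # peak compute throughput (FLOP/s_h)
    mem_bw: float       # memory bandwidth in bytes/s (MemBW_h)


@dataclass
class Operator:
    flops: float        # FLOPs_o
    bytes_moved: float  # Bytes_o


def operator_latency(op, hw):
    """T_o = max(FLOPs_o / FLOP/s_h, Bytes_o / MemBW_h)."""
    return max(op.flops / hw.flops_per_s, op.bytes_moved / hw.mem_bw)


def network_latency(payload_bytes, net_lat_s, net_bw):
    """T_d^net = NetLat + Bytes_d / NetBW."""
    return net_lat_s + payload_bytes / net_bw


def total_latency(modules, transfers, net_lat_s, net_bw):
    """L_total = sum of module compute latencies + sum of transfer latencies.

    modules:   list of (operators, hardware) pairs, one per model component.
    transfers: list of payload sizes in bytes, one per network hop.
    Assumption: a module's latency is the sum of its operator latencies.
    """
    compute = sum(operator_latency(op, hw) for ops, hw in modules for op in ops)
    network = sum(network_latency(b, net_lat_s, net_bw) for b in transfers)
    return compute + network
```

Because each operator takes the maximum of its compute and memory terms, the same model can be limited by different resources on different devices, which is why the formulation ties latency to accelerator characteristics as well as to model size.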

A recurring practical distinction is between EVLA as an on-device model and EVLA as a deployment paradigm. "EdgeVLA" is a specific compact VLA model. By contrast, RAPID, RoboECC, AsyncShield, and Agile-VLA address different parts of the edge problem: chunk-level offloading, layer-wise partitioning, asynchronous cloud navigation, and asynchronous industrial pose rectification, respectively. This suggests that EVLA is best understood as a systems category rather than a single architecture.

2. Model architecture and non-autoregressive action generation

The EdgeVLA model is designed end-to-end for real-time robotic control on resource-constrained hardware such as mobile manipulators and Jetson Nano. Its stated goals are to preserve the representational power of large VLA models like OpenVLA ($\sim 7.5\ \mathrm{B}$ params) while achieving real-time inference ($20\text{–}50\ \mathrm{Hz}$) and dramatic memory savings ($< 5\ \mathrm{GB}$). The two core innovations are non-autoregressive joint prediction of 7-DoF end-effector poses and replacement of a large language backbone with the Small LLM Qwen2-0.5B (Budzianowski et al., 18 Jul 2025).

The architecture uses a two-part frozen vision encoder consisting of SigLIP $(\mathrm{ViT\!-\!B/16}) \rightarrow 384\ \mathrm{M}\ \text{parameters}$ and DINOv2 $(\mathrm{ViT\!-\!S/16}) \rightarrow 512\ \mathrm{M}\ \text{parameters}$. Visual embeddings are projected into the language-model token space through learned linear layers $W_1, W_2$, concatenated with text tokens, and processed by a Qwen2 encoder. After fusion, the model branches into two output heads: an action-token head trained with cross-entropy over a discrete action vocabulary, and a 7-dimensional end-effector regression head that uses no causal attention mask (Budzianowski et al., 18 Jul 2025).

The central modeling change is removal of token-by-token pose decoding. Standard VLA formulations emit the 7-dimensional end-effector pose autoregressively, one token per decoding step. EdgeVLA instead predicts all 7 dimensions in parallel. Because no autoregressive loop over 7 tokens is needed, inference throughput improves by roughly the number of removed steps. On A100-40 GB, the reported inference time is […] for EVLA versus […] for OpenVLA, with […] versus […] memory usage (Budzianowski et al., 18 Jul 2025).
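The difference between the two decoding regimes can be illustrated by counting backbone passes. The sketch below is a toy, not EdgeVLA's code: `CountingBackbone`, its `step` and `head` methods, and the dummy values they return are all hypothetical stand-ins for a real model.

```python
def decode_autoregressive(step_fn, context, dims=7):
    """Emit one pose dimension per backbone pass, each conditioned on the prefix."""
    pose = []
    for _ in range(dims):
        pose.append(step_fn(context, tuple(pose)))
    return pose


def decode_parallel(head_fn, context):
    """Emit all pose dimensions from a single backbone pass."""
    return list(head_fn(context))


class CountingBackbone:
    """Hypothetical stand-in for a VLA backbone that counts forward passes."""

    def __init__(self):
        self.passes = 0

    def step(self, context, prefix):
        # One pass per generated token in the autoregressive regime.
        self.passes += 1
        return 0.1 * len(prefix)  # dummy value for the next dimension

    def head(self, context):
        # One pass yields the full 7-dimensional pose in the parallel regime.
        self.passes += 1
        return [0.1 * i for i in range(7)]
```

Running both decoders on the counting backbone gives 7 passes for the autoregressive path and 1 for the parallel path, which is the source of the throughput gain described above. Real speedups also depend on prefill cost and memory traffic, which this toy ignores.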

Training proceeds in two phases. Phase 1 performs multimodal pretraining of a VLM on […] image-caption pairs, using standard next-token cross-entropy on text with frozen vision backbones except projection layers. Phase 2 fine-tunes on […] robot trajectories from OpenX Embodiment and BridgeData V2. The losses are […], […], and […].

The optimizer is AdamW, with […] schedule and weight decay […]; visual augmentations are random crops, color-jitter, and Gaussian noise. On BridgeData V2, EVLA’s cross-entropy and token-accuracy curves nearly overlap OpenVLA’s, and on full OpenX it trains […] faster per iteration while reaching a similar plateau (Budzianowski et al., 18 Jul 2025).

3. Systems bottlenecks on edge hardware

A central finding across the EVLA systems literature is that action generation, not perception, is the dominant performance bottleneck. Vishwanathan et al. characterize MolmoAct-7B on Jetson Orin and Thor and report that on Orin the phase breakdown is approximately […] for perception, […] for reasoning, and […] for action generation, with the action-generation phase consuming up to […] of end-to-end latency. Total latency is […] on Orin and […] on Thor, despite Thor’s much higher raw compute (Vishwanathan et al., 1 Mar 2026).

The analytical explanation is a roofline-style lower bound, in which latency is at least the larger of the compute time and the memory-traffic time. For action generation, the memory term dominates in practice. This is consistent with VLA-Perf, which characterizes the action expert as having low operator intensity […] and therefore being memory-bound on all GPUs, whereas the vision encoder and VLM backbone have high operator intensity […] and can be compute-bound on RTX 4090+ but memory-bound on Thor (Jiang et al., 20 Feb 2026).
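The memory-bound versus compute-bound distinction follows from comparing an operator's intensity with a device's ridge point. The sketch below uses the standard roofline definitions; the function names are hypothetical, and any hardware numbers passed in are illustrative rather than measured figures for the GPUs named above.

```python
def operator_intensity(flops, bytes_moved):
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved


def ridge_point(peak_flops_per_s, mem_bw_bytes_per_s):
    """Intensity at which compute time equals memory time on a device."""
    return peak_flops_per_s / mem_bw_bytes_per_s


def is_memory_bound(flops, bytes_moved, peak_flops_per_s, mem_bw_bytes_per_s):
    """True when the memory term of the roofline bound exceeds the compute term."""
    intensity = operator_intensity(flops, bytes_moved)
    return intensity < ridge_point(peak_flops_per_s, mem_bw_bytes_per_s)
```

An operator with fixed intensity can therefore be compute-bound on a device with a low ridge point and memory-bound on a device whose compute grows faster than its bandwidth, which matches the qualitative pattern reported for the vision encoder and VLM backbone.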

These bottlenecks explain why compact EVLA architectures often focus on action generation rather than only shrinking the vision-language stack. EdgeVLA removes an autoregressive loop over 7 output tokens; RoboECC models compute and data movement per layer for split-point selection; RAPID exploits step-wise redundancy inside action chunks; and AsyncShield treats delayed cloud outputs as stale geometric intents to be realigned on the edge. A plausible implication is that EVLA progress depends at least as much on reducing data movement, decode serialization, and synchronization cost as on reducing parameter count.

Scaling studies reinforce this constraint. VLA-Perf reports that on Jetson Thor […], a […]-parameter VLA runs at […], but a […]-parameter variant drops to […]. It also reports that B100 sustains […] on an […]-parameter VLA, while Thor and RTX 4090 cannot host […]-parameter models in real time (Jiang et al., 20 Feb 2026). In the 100B-parameter projection study, even PIM-augmented systems at […] remain one order of magnitude below the […] target for 100B models (Vishwanathan et al., 1 Mar 2026).

4. Deployment architectures and partitioning strategies

The EVLA literature considers several distinct deployment patterns: fully on-device execution, nearby edge-server execution, cloud execution with asynchronous control, and explicit edge-cloud collaborative partitioning. VLA-Perf provides the clearest high-level trade-off summary. On-device Thor is best when network is below […] or the platform is very mobile, but is limited to […] on a small VLA; edge-server deployment on RTX 4090 or B100 with WiFi 6/7 can easily exceed […]; and cloud deployment requires asynchronous inference to achieve […], since synchronous execution is capped by network delay (Jiang et al., 20 Feb 2026).
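The cap that network delay places on synchronous execution can be made concrete with a small sketch. The model here is an assumption for illustration, not VLA-Perf's formulation: synchronous execution waits for each round trip before issuing the next request, while asynchronous execution is assumed to pipeline requests so that network delay overlaps with compute. All function names are hypothetical.

```python
def synchronous_rate_hz(t_compute_s, t_net_s):
    """Each action waits for compute plus the full network round trip."""
    return 1.0 / (t_compute_s + t_net_s)


def asynchronous_rate_hz(t_compute_s):
    """Assumes requests are pipelined, so only compute limits throughput."""
    return 1.0 / t_compute_s


def action_staleness_s(t_compute_s, t_net_s):
    """Age of the observation behind an action when it arrives; same in both modes."""
    return t_compute_s + t_net_s
```

Under these assumptions, asynchronous inference raises the action rate but does not reduce staleness: every action is still based on an observation that is one compute-plus-network delay old, which is the problem that edge-side realignment schemes are meant to address.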

RoboECC addresses model partitioning directly. The VLA model is partitioned at layer index […] into an edge sub-model […] and a cloud sub-model […], with total latency […]. The search objective is […], subject to cloud-load and edge-memory constraints. Model structure is abstracted as […] with […], […], and […]. For GPU hardware, per-layer latency is modeled by […].

RoboECC then augments the nominal split with a network-aware deployment adjustment loop: historical […] are fed to a lightweight LSTM predictor, which outputs […], and the split is shifted within a parameter-sharing pool according to […] (Zheng et al., 21 Mar 2026).
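A layer-wise split search of this kind can be sketched as an exhaustive scan over candidate split points. This is a generic illustration and not RoboECC's algorithm: it assumes total latency is the sum of edge compute, one activation transfer, and cloud compute, it enforces only an edge-memory budget (the cloud-load constraint is omitted), and the function and parameter names are hypothetical.

```python
def best_split(edge_lat, cloud_lat, act_bytes, layer_mem,
               net_lat_s, net_bw, edge_mem_budget):
    """Return (k, latency) for the best split, with layers [0, k) on the edge.

    edge_lat[i], cloud_lat[i]: latency of layer i on the edge / cloud device.
    act_bytes[k]: bytes sent to the cloud when the first k layers run on the edge.
    layer_mem[i]: memory needed to host layer i on the edge.
    Assumption: latencies add up and a single transfer occurs at the split.
    """
    n = len(edge_lat)
    best = None
    for k in range(n + 1):
        # Skip splits whose edge part does not fit in memory.
        if sum(layer_mem[:k]) > edge_mem_budget:
            continue
        # No transfer is needed when the whole model runs on the edge.
        transfer = 0.0 if k == n else net_lat_s + act_bytes[k] / net_bw
        total = sum(edge_lat[:k]) + transfer + sum(cloud_lat[k:])
        if best is None or total < best[1]:
            best = (k, total)
    return best
```

The scan makes the trade-off visible: running more layers on the edge costs slow edge compute, while splitting early may mean sending a larger payload. A network-aware scheme would rerun or shift this choice as the predicted network conditions change.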

RAPID solves a different partitioning problem. Instead of splitting a model at a layer boundary, it decides at each action chunk whether to continue cached execution on the edge or offload the current observation and instruction to the cloud for a fresh chunk. The total latency is written as […] under memory, bandwidth, smoothness, and robustness constraints. Its dispatcher computes a continuous Action Importance Score from acceleration- and torque-derived anomaly scores, runs in […] CPU complexity per step, uses only a few kilobytes for sliding windows, and does not require any forward pass through the VLA model to make offloading decisions (Zheng et al., 9 Mar 2026).
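A dispatcher of this general shape can be sketched with two sliding windows and a threshold. The sketch is not RAPID's scoring rule: the z-score anomaly measure, the equal weights, the single threshold, and the class name `ChunkDispatcher` are all assumptions chosen for illustration. It recomputes window statistics at every step for clarity, so its per-step cost grows with the window length.

```python
from collections import deque
from statistics import mean, pstdev


class ChunkDispatcher:
    """Decide per step whether to keep the cached chunk or request a fresh one."""

    def __init__(self, window=32, w_acc=0.5, w_torque=0.5, threshold=2.0):
        # Bounded windows keep memory use small and constant.
        self.acc_hist = deque(maxlen=window)
        self.torque_hist = deque(maxlen=window)
        self.w_acc = w_acc
        self.w_torque = w_torque
        self.threshold = threshold  # hypothetical offload threshold

    @staticmethod
    def _anomaly(history, value, eps=1e-6):
        # Assumed anomaly measure: absolute z-score against the recent window.
        if len(history) < 2:
            return 0.0
        return abs(value - mean(history)) / (pstdev(history) + eps)

    def step(self, acceleration, torque):
        """Return (importance_score, offload) using proprioceptive signals only."""
        score = (self.w_acc * self._anomaly(self.acc_hist, acceleration)
                 + self.w_torque * self._anomaly(self.torque_hist, torque))
        self.acc_hist.append(acceleration)
        self.torque_hist.append(torque)
        return score, score > self.threshold
```

The point of the design is that the decision uses only cheap proprioceptive signals: a sudden change in acceleration or torque raises the score and triggers an offload, while steady motion keeps executing the cached chunk without touching the VLA model.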

AsyncShield assumes cloud-based VLA navigation and inserts a fully edge-resident adapter between a low-frequency, high-latency cloud VLA model and the high-frequency local controller. Temporal lag is converted into a spatial offset by realigning anchor-frame waypoints: […]. Intent restoration and physical safety are then balanced by a constrained Markov decision process solved with PPO-Lagrangian, with the safety cost derived from LiDAR and the edge action defined as a local 2D sub-goal (Yang et al., 27 Apr 2026).
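The idea of turning temporal lag into a spatial offset can be illustrated with a planar rigid transform. The sketch below is a generic SE(2) re-expression of waypoints and is not taken from AsyncShield: it assumes the robot's pose at the anchor time and at the current time are both known in a shared odometry frame, and the function name and argument layout are hypothetical.

```python
import math


def realign_waypoints(waypoints_anchor, anchor_pose, current_pose):
    """Re-express waypoints from the anchor robot frame in the current robot frame.

    waypoints_anchor: list of (x, y) points in the robot frame at anchor time.
    anchor_pose, current_pose: (x, y, yaw) of the robot in a shared odometry frame.
    """
    ax, ay, ayaw = anchor_pose
    cx, cy, cyaw = current_pose
    realigned = []
    for px, py in waypoints_anchor:
        # Anchor frame -> odometry frame.
        wx = ax + math.cos(ayaw) * px - math.sin(ayaw) * py
        wy = ay + math.sin(ayaw) * px + math.cos(ayaw) * py
        # Odometry frame -> current robot frame (inverse rotation).
        dx, dy = wx - cx, wy - cy
        realigned.append((math.cos(cyaw) * dx + math.sin(cyaw) * dy,
                          -math.sin(cyaw) * dx + math.cos(cyaw) * dy))
    return realigned
```

If the robot has advanced 1 m along its heading since the anchor frame, a waypoint that was 2 m ahead becomes 1 m ahead, so a delayed cloud output remains usable as a geometric intent even though the robot has moved in the meantime.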

Agile-VLA keeps all execution on the edge but decouples perception from control. A low-rate Perception Stream runs at […], while a high-rate Control Stream runs at […], using timestamped geometric anchors and cubic-spline interpolation to avoid closed-loop instability when […] (Yan et al., 24 Mar 2026).
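The decoupling of a slow perception stream from a fast control stream can be sketched with cubic interpolation between two timestamped anchors. This is an illustration under assumptions, not Agile-VLA's implementation: it uses a single cubic Hermite segment on one scalar coordinate, assumes each anchor carries a time, position, and velocity, and the function names and any rates passed in are hypothetical.

```python
def hermite(t, t0, p0, v0, t1, p1, v1):
    """Cubic Hermite interpolation between anchors (t0, p0, v0) and (t1, p1, v1)."""
    h = t1 - t0
    s = (t - t0) / h
    h00 = 2 * s**3 - 3 * s**2 + 1
    h10 = s**3 - 2 * s**2 + s
    h01 = -2 * s**3 + 3 * s**2
    h11 = s**3 - s**2
    return h00 * p0 + h10 * h * v0 + h01 * p1 + h11 * h * v1


def control_setpoints(anchor0, anchor1, rate_hz):
    """Sample setpoints at the control rate between two perception anchors.

    anchor0, anchor1: (time_s, position, velocity) tuples from the slow stream.
    """
    t0, p0, v0 = anchor0
    t1, p1, v1 = anchor1
    n = int(round((t1 - t0) * rate_hz))
    return [hermite(t0 + i / rate_hz, t0, p0, v0, t1, p1, v1)
            for i in range(n + 1)]
```

The control loop consumes the interpolated setpoints at its own rate, so it never blocks on perception; a new anchor simply replaces the end of the current segment when it arrives.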

One notable point of tension appears in the literature. VLA-Perf states that a device-server split with the VLM on the server and the action expert on the device is almost never beneficial, due to large KV cache transfer, whereas RoboECC reports gains from layer-wise edge-cloud partitioning. This suggests that the benefit of partitioning depends strongly on what is being split, how activations are transmitted, and how network variability is handled.

5. Representative empirical results

Reported EVLA results span compact model acceleration, layer-wise ECC, chunk-level dispatch, asynchronous navigation adaptation, and industrial manipulation. The table summarizes representative outcomes (Budzianowski et al., 18 Jul 2025, Zheng et al., 21 Mar 2026, Zheng et al., 9 Mar 2026, Yang et al., 27 Apr 2026, Yan et al., 24 Mar 2026).

| System | Setting | Reported result |
| --- | --- | --- |
| EdgeVLA | A100-40 GB | […] vs. […]; […] vs. […] |
| RoboECC | Orin + A100, OpenVLA | […] speedup; […] vs. […] |
| RAPID | LIBERO / real manipulator | […]; […] |
| AsyncShield | Unitree Go2, three VLAs | direct VLA only […] SR; […] AsyncShield […] SR |
| Agile-VLA | Jetson Orin Nano | […], […] TCP jitter, […] Avg SR |

For EdgeVLA specifically, the extrapolated edge figures are […] […] and […] on Jetson Orin, and […] […] and […] on Jetson Nano. In early mobile-manipulation benchmarks (simulated pick-and-place, drawer-opening, and button-press tasks), EVLA reports success rates of […] versus […] for OpenVLA, within […] absolute, while running in real time at […] on Orin (Budzianowski et al., 18 Jul 2025).

RoboECC reports results on LIBERO with OpenVLA, on SimplerEnv with CogACT, and on a real AgileX PIPER arm over 1,000 samples. On Orin + A100 with OpenVLA, the speedup over edge-only inference is […], with total latency […] versus […]; on Thor + A100, the speedup is […]. On the real robot, reported latency is […] versus […] edge-only for Orin + A100, and […] versus […] edge-only for Thor + A100. The parameter-sharing pool contributes […] of model size, the LSTM is […], and average split-adjust time is […] versus […] latency reduction (Zheng et al., 21 Mar 2026).

RAPID reports an end-to-end latency reduction of up to […] over a vision-based dynamic partitioning baseline while incurring only […] additional CPU/memory overhead on the edge. In LIBERO simulation, total latency falls from […] for the vision-based baseline to […] for RAPID; in the real-world banana-to-bowl task, it falls from […] to […]. The ablation study reports […] without the compatibility trigger, […] without the redundancy trigger, and […] with the dual-threshold design (Zheng et al., 9 Mar 2026).

AsyncShield emphasizes robustness under irregular network delay rather than raw model throughput. In simulation over 600 episodes, it reports […] success rate, […] cross-track error, and […] risk exposure rate under the ideal profile; under mixed degradation it reports […], […], and […]. In real-world experiments on Unitree Go2 with SocialNav, TrackVLA, and Nav-[…], direct VLA-only success is […], while adding AsyncShield raises success to […] without fine-tuning any cloud-based foundation models (Yang et al., 27 Apr 2026).

Agile-VLA targets industrial pose rectification on Jetson Orin Nano. The asynchronous version reports […] control, overall VRAM footprint […], […] TCP jitter, and […] average success on DID-127, compared with […] for OpenVLA (4-bit). It also reports […] collision rate and […] jerk, along with […]-shot fine-tuning converging in […] (Yan et al., 24 Mar 2026).

6. Limitations, future directions, and acronym ambiguity

The EVLA literature is explicit about unresolved limitations. RoboECC states that its current hardware model is GPU-only, and that extending to CPU/NPU/ASICs requires new pipeline modeling. It also notes that threshold tuning […] depends on the historical […] distribution, and identifies reinforcement-learning or multi-objective search for split-point selection and dynamic batch-sizing as future directions (Zheng et al., 21 Mar 2026).

EdgeVLA leaves several open questions at the model level: multi-arm bimanual tasks, force/torque feedback, and on-device continual finetuning. It also reports planned CPU-only optimizations via FlexAttention and 1-bit quantization for sub-100 ms full-stack loops (Budzianowski et al., 18 Jul 2025). Agile-VLA similarly notes limitations in freely moving objects, dynamic scenes, and multi-object scenarios, and states that tactile feedback is not yet integrated (Yan et al., 24 Mar 2026). These limitations suggest that current EVLA systems are strongest in settings where task geometry, control bandwidth, and deployment topology can be constrained or structured.

A common source of confusion is the acronym itself. In radio astronomy, EVLA denotes the Expanded Very Large Array, the major upgrade of the Very Large Array that provides complete frequency coverage from $1\ \mathrm{GHz}$ to $50\ \mathrm{GHz}$, up to $8\ \mathrm{GHz}$ instantaneous bandwidth per polarization, and the WIDAR correlator with standard $16{,}384$ spectral channels per baseline and a maximum exceeding $4$ million channels (Dougherty et al., 2010, Perley et al., 2011). In embodied AI and robotics, by contrast, EVLA refers to Edge Vision-Language-Action. The two usages are unrelated except for the shared acronym.

In the robotics sense, EVLA has evolved into a technical program organized around a single question: how to preserve the generality and capability of large VLA models while meeting real-time control requirements on edge platforms. The present literature answers that question with several non-exclusive strategies—non-autoregressive action prediction, compact language backbones, roofline-guided performance analysis, chunk-level or layer-level edge-cloud collaboration, asynchronous intent realignment, and hierarchical decoupling of perception from control. The diversity of these strategies suggests that EVLA is not a settled architecture, but an active systems-design space defined by compute, memory, bandwidth, and control-loop constraints.
