Edge VLA (EVLA) for Real-Time Robotic Control

Updated 4 July 2026

Edge VLA (EVLA) is a class of models that integrate vision, language, and action to enable real-time robotic control on resource-constrained hardware.
The approach uses non-autoregressive prediction and a small language model to achieve up to 7× faster inference and significant memory savings.
EVLA systems leverage edge-cloud collaboration and dynamic model partitioning to balance latency, compute, and memory constraints effectively.

Edge VLA (EVLA, “Edge Vision-Language-Action”) denotes Vision-Language-Action models and deployment schemes designed for real-time robotic control on resource-constrained hardware. In the narrow sense, the term refers to the "EdgeVLA" model, which targets real-time inference on edge devices by combining non-autoregressive 7-DoF prediction with a Small LLM; in a broader systems sense, it encompasses analytical performance models, edge-cloud collaborative execution, asynchronous adapters, and hierarchical perception-control architectures intended to make embodied-policy inference feasible under stringent latency, memory, and bandwidth constraints (Budzianowski et al., 18 Jul 2025).

1. Definition and operational scope

Within embodied AI, EVLA is motivated by the requirement that robots and drones must perceive, reason, and act in the real world at control rates of $10\text{–}20\ \mathrm{Hz}$ without off-board assistance. This requirement is difficult to satisfy because Vision-Language-Action models are large, multimodal, and often decode actions through sparse, memory-bound execution paths. The edge setting therefore includes not only compact model design, but also inference-system design, memory-bandwidth analysis, and policies for deciding whether computation should occur on-device, on a nearby server, or in the cloud (Vishwanathan et al., 1 Mar 2026).

The broader EVLA literature treats end-to-end latency as the primary systems objective. In VLA-Perf, the total latency is modeled as

$L_{\text{total}}=\sum_{m\in M} T_m^{\text{compute}} + \sum_{d\in D} T_d^{\text{net}},$

with per-operator latency

$T_o=\max\!\left(\frac{\mathrm{FLOPs}_o}{\mathrm{FLOP/s}_h},\ \frac{\mathrm{Bytes}_o}{\mathrm{MemBW}_h}\right),$

and network-transfer latency

$T_d^{\text{net}}=\mathrm{NetLat}+\frac{\mathrm{Bytes}_d}{\mathrm{NetBW}}.$

This formalization places EVLA at the intersection of model architecture, accelerator characteristics, and network conditions rather than treating it as a purely algorithmic problem (Jiang et al., 20 Feb 2026).
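The three expressions above compose directly, which a short Python sketch can make concrete. It is only an illustration under stated assumptions: the `Hardware` class, the function names, and every numeric value are hypothetical and not taken from VLA-Perf; each module is treated as a single operator; and `NetBW` is expressed in bytes per second.

```python
from dataclasses import dataclass


@dataclass
class Hardware:
    flops_per_s: float  # peak compute throughput (FLOP/s), hypothetical
    mem_bw: float       # memory bandwidth (bytes/s), hypothetical


def operator_latency(flops: float, bytes_moved: float, hw: Hardware) -> float:
    """T_o = max(FLOPs / FLOP/s, Bytes / MemBW), in seconds."""
    return max(flops / hw.flops_per_s, bytes_moved / hw.mem_bw)


def network_latency(payload_bytes: float, net_lat: float, net_bw: float) -> float:
    """T_d^net = NetLat + Bytes / NetBW, in seconds."""
    return net_lat + payload_bytes / net_bw


def total_latency(compute_terms, network_terms) -> float:
    """L_total = sum of compute latencies + sum of transfer latencies."""
    return sum(compute_terms) + sum(network_terms)


# Made-up workload numbers, chosen only to exercise the formulas.
hw = Hardware(flops_per_s=100e12, mem_bw=200e9)
vision = operator_latency(flops=2e12, bytes_moved=1e9, hw=hw)   # compute term wins
action = operator_latency(flops=1e10, bytes_moved=4e9, hw=hw)   # memory term wins
uplink = network_latency(payload_bytes=1e6, net_lat=0.005, net_bw=100e6)
print(total_latency([vision, action], [uplink]))
```

With these made-up inputs the action module is limited by memory traffic even though it performs far fewer FLOPs than the vision module, which is the kind of effect the per-operator `max` is meant to expose.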

A recurring practical distinction is between EVLA as an on-device model and EVLA as a deployment paradigm. "EdgeVLA" is a specific compact VLA model. By contrast, RAPID, RoboECC, AsyncShield, and Agile-VLA address different parts of the edge problem: chunk-level offloading, layer-wise partitioning, asynchronous cloud navigation, and asynchronous industrial pose rectification, respectively. This suggests that EVLA is best understood as a systems category rather than a single architecture.

2. Model architecture and non-autoregressive action generation

The EdgeVLA model is designed end-to-end for real-time robotic control on resource-constrained hardware such as mobile manipulators and the Jetson Nano. Its stated goals are to preserve the representational power of large VLA models like OpenVLA $(\sim 7.5\ \mathrm{B}\ \text{params})$ while achieving real-time inference $(20\text{–}50\ \mathrm{Hz})$ and a small memory footprint $(< 5\ \mathrm{GB})$. The two core innovations are non-autoregressive joint prediction of 7-DoF end-effector poses and replacement of a large language backbone with the small language model Qwen2-0.5B (Budzianowski et al., 18 Jul 2025).

The architecture uses a two-part frozen vision encoder consisting of SigLIP $(\mathrm{ViT\!-\!B/16}) \rightarrow 384\ \mathrm{M}\ \text{parameters}$ and DINOv2 $(\mathrm{ViT\!-\!S/16}) \rightarrow 512\ \mathrm{M}\ \text{parameters}$. Visual embeddings are projected into the language-model token space through learned linear layers $W_1, W_2$, concatenated with text tokens, and processed by a Qwen2 encoder. After fusion, the model branches into two output heads: an action-token head with cross-entropy over a discrete action vocabulary, and a 7-dimensional end-effector head with no causal attention mask over the regression head (Budzianowski et al., 18 Jul 2025).

The central modeling change is the removal of token-by-token pose decoding. Standard VLA formulations emit the 7-dimensional end-effector pose $a=(a_1,\dots,a_7)$ autoregressively, conditioning each dimension on the ones before it: $p(a\mid x)=\prod_{i=1}^{7} p(a_i\mid a_{<i},x)$, where $x$ is the fused visual and language input. EdgeVLA instead predicts all 7 dimensions in parallel: $p(a\mid x)=\prod_{i=1}^{7} p(a_i\mid x)$. Because no autoregressive loop over 7 tokens is needed, inference throughput improves by roughly the number of removed steps, that is, up to about $7\times$. On A100-40 GB, the reported inference time is […] for EVLA versus […] for OpenVLA, with […] versus […] memory usage (Budzianowski et al., 18 Jul 2025).
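The difference between the two decoding styles can be seen by counting sequential passes. The following toy Python sketch is not EdgeVLA code: `toy_step` and `toy_head` are hypothetical stand-ins for a decoder step and a regression head, and the only point is that the autoregressive path needs seven dependent passes while the parallel path needs one.

```python
from typing import Callable, List, Sequence, Tuple

POSE_DIMS = 7  # 7-DoF end-effector pose


def decode_autoregressive(
    step_fn: Callable[[Sequence[float], List[float]], float],
    features: Sequence[float],
) -> Tuple[List[float], int]:
    """Emit one pose dimension per pass; each pass sees the earlier outputs."""
    pose: List[float] = []
    passes = 0
    for _ in range(POSE_DIMS):
        pose.append(step_fn(features, pose))
        passes += 1
    return pose, passes


def decode_parallel(
    head_fn: Callable[[Sequence[float]], List[float]],
    features: Sequence[float],
) -> Tuple[List[float], int]:
    """Emit all pose dimensions from a single pass over the fused features."""
    pose = head_fn(features)
    if len(pose) != POSE_DIMS:
        raise ValueError("head must return one value per pose dimension")
    return pose, 1


# Hypothetical stand-ins; a real model would run a network here.
def toy_step(features, previous):
    return sum(features) + len(previous)


def toy_head(features):
    return [sum(features) + i for i in range(POSE_DIMS)]


feats = [0.1, 0.2, 0.3]
ar_pose, ar_passes = decode_autoregressive(toy_step, feats)
par_pose, par_passes = decode_parallel(toy_head, feats)
print(ar_passes, par_passes)
```

In a real model each pass has a roughly fixed cost, so the pass count is a first-order proxy for decode latency; the actual speedup also depends on everything outside the decode loop.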

Training proceeds in two phases. Phase 1 performs multimodal pretraining of a VLM on […] image-caption pairs, using standard next-token cross-entropy on text, with the vision backbones frozen and only the projection layers trained. Phase 2 fine-tunes on […] robot trajectories from OpenX Embodiment and BridgeData V2. The losses are

[…]

[…]

and

[…]

The optimizer is AdamW, with a […] schedule and weight decay […]; visual augmentations are random crops, color-jitter, and Gaussian noise. On BridgeData V2, EVLA’s cross-entropy and token-accuracy curves nearly overlap OpenVLA’s, and on full OpenX it trains […] faster per iteration while reaching a similar plateau (Budzianowski et al., 18 Jul 2025).

3. Systems bottlenecks on edge hardware

A central finding across the EVLA systems literature is that action generation, not perception, is the dominant performance bottleneck. Vishwanathan et al. characterize MolmoAct-7B on Jetson Orin and Thor and report that on Orin the phase breakdown is approximately […] for perception, […] for reasoning, and […] for action generation, with the action-generation phase consuming up to […] of end-to-end latency. Total latency is […] on Orin and […] on Thor, despite Thor’s much higher raw compute (Vishwanathan et al., 1 Mar 2026).

The analytical explanation is a roofline-style lower bound: $T \ge \max\!\left(\frac{\mathrm{FLOPs}}{\mathrm{FLOP/s}_h},\ \frac{\mathrm{Bytes}}{\mathrm{MemBW}_h}\right)$. For action generation, the memory term dominates in practice: $T_{\text{action}} \approx \frac{\mathrm{Bytes}_{\text{action}}}{\mathrm{MemBW}_h}$. This is consistent with VLA-Perf, which characterizes the action expert as having low operator intensity […] and therefore being memory-bound on all GPUs, whereas the vision encoder and VLM backbone have high operator intensity […] and can be compute-bound on RTX 4090+ but memory-bound on Thor (Jiang et al., 20 Feb 2026).
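Whether an operator is memory-bound or compute-bound follows from comparing its operator intensity (FLOPs per byte moved) with the hardware's ridge point (peak FLOP/s divided by memory bandwidth). The Python sketch below illustrates that comparison; the function names and all numbers are hypothetical and do not correspond to any measured device or model in the cited papers.

```python
def operator_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte moved to or from memory."""
    return flops / bytes_moved


def ridge_point(peak_flops_per_s: float, mem_bw_bytes_per_s: float) -> float:
    """Intensity at which compute time and memory time are equal."""
    return peak_flops_per_s / mem_bw_bytes_per_s


def regime(flops: float, bytes_moved: float,
           peak_flops_per_s: float, mem_bw_bytes_per_s: float) -> str:
    """Classify an operator by which roofline term is larger."""
    if operator_intensity(flops, bytes_moved) < ridge_point(
            peak_flops_per_s, mem_bw_bytes_per_s):
        return "memory-bound"
    return "compute-bound"


# Hypothetical accelerator: ridge point = 100e12 / 200e9 = 500 FLOP/byte.
PEAK, BW = 100e12, 200e9

# Hypothetical workloads: a decode-like step that mostly streams weights,
# and a prefill-like step that reuses each byte many times.
decode_like = regime(flops=2e9, bytes_moved=1e9,
                     peak_flops_per_s=PEAK, mem_bw_bytes_per_s=BW)
prefill_like = regime(flops=1e12, bytes_moved=1e9,
                      peak_flops_per_s=PEAK, mem_bw_bytes_per_s=BW)
print(decode_like, prefill_like)
```

The same operator can change regime when moved to hardware with a different ridge point, which is one way to read the observation that a backbone can be compute-bound on one GPU and memory-bound on another.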

These bottlenecks explain why compact EVLA architectures often focus on action generation rather than only shrinking the vision-language stack. EdgeVLA removes an autoregressive loop over 7 output tokens; RoboECC models compute and data movement per layer for split-point selection; RAPID exploits step-wise redundancy inside action chunks; and AsyncShield treats delayed cloud outputs as stale geometric intents to be realigned on the edge. A plausible implication is that EVLA progress depends at least as much on reducing data movement, decode serialization, and synchronization cost as on reducing parameter count.

Scaling studies reinforce this constraint. VLA-Perf reports that on Jetson Thor […], a […]-parameter VLA runs at […], but a […]-parameter variant drops to […]. It also reports that B100 sustains […] on an […]-parameter VLA, while Thor and RTX 4090 cannot host […]-parameter models in real time (Jiang et al., 20 Feb 2026). In the 100B-parameter projection study, even PIM-augmented systems at […] remain one order of magnitude below the […] target for 100B models (Vishwanathan et al., 1 Mar 2026).

4. Deployment architectures and partitioning strategies

The EVLA literature considers several distinct deployment patterns: fully on-device execution, nearby edge-server execution, cloud execution with asynchronous control, and explicit edge-cloud collaborative partitioning. VLA-Perf provides the clearest high-level trade-off summary. On-device Thor is best when the network is below […] or the platform is very mobile, but is limited to […] on a small VLA; edge-server deployment on RTX 4090 or B100 with WiFi 6/7 can easily exceed […]; and cloud deployment requires asynchronous inference to achieve […], since synchronous execution is capped by network delay (Jiang et al., 20 Feb 2026).

RoboECC addresses model partitioning directly. The VLA model is partitioned at layer index […] into an edge sub-model […] and a cloud sub-model […], with total latency

[…]

[…]

The search objective is

[…]

subject to cloud-load and edge-memory constraints. Structure is abstracted as […] with […], […], and […]. For GPU hardware, per-layer latency is modeled by

[…]

RoboECC then augments the nominal split with a network-aware deployment adjustment loop: historical […] are fed to a lightweight LSTM predictor, which outputs […], and the split is shifted within a parameter-sharing pool according to […] (Zheng et al., 21 Mar 2026).
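The general idea of layer-wise split-point selection can be sketched as an exhaustive search over candidate layer boundaries. The Python example below is not RoboECC's formulation: the cost model (edge compute plus one activation transfer plus cloud compute), the function `best_split`, and all inputs are assumptions made for illustration, and the cloud-load constraint and the LSTM bandwidth predictor are omitted.

```python
def best_split(edge_ms, cloud_ms, boundary_bytes, edge_mem_bytes,
               net_lat_ms, net_bw_bytes_per_ms, edge_mem_budget):
    """Return (k, latency_ms) with layers [0, k) on the edge and [k, n) in the cloud.

    boundary_bytes[k] is the payload sent when the cut is placed before
    layer k (index 0 is the raw input). All quantities are hypothetical.
    """
    n = len(edge_ms)
    best = None
    for k in range(n + 1):
        if sum(edge_mem_bytes[:k]) > edge_mem_budget:
            break  # edge memory use only grows with k
        if k == n:
            transfer = 0.0  # everything runs on the edge
        else:
            transfer = net_lat_ms + boundary_bytes[k] / net_bw_bytes_per_ms
        latency = sum(edge_ms[:k]) + transfer + sum(cloud_ms[k:])
        if best is None or latency < best[1]:
            best = (k, latency)
    return best


# Made-up per-layer profile for a 4-layer model.
edge_ms = [2, 2, 8, 8]
cloud_ms = [1, 1, 1, 1]
boundary_bytes = [2000, 800, 100, 400]
edge_mem = [1e9] * 4

good_net = best_split(edge_ms, cloud_ms, boundary_bytes, edge_mem,
                      net_lat_ms=2, net_bw_bytes_per_ms=100,
                      edge_mem_budget=4e9)
poor_net = best_split(edge_ms, cloud_ms, boundary_bytes, edge_mem,
                      net_lat_ms=2, net_bw_bytes_per_ms=1,
                      edge_mem_budget=4e9)
print(good_net, poor_net)
```

Re-running the search with a different bandwidth estimate moves the preferred cut, which is the motivation for adjusting the split as network conditions change; in this toy profile the low-bandwidth case falls back to edge-only execution.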

RAPID solves a different partitioning problem. Instead of splitting a model at a layer boundary, it decides at each action chunk whether to continue cached execution on the edge or offload the current observation and instruction to the cloud for a fresh chunk. The total latency is written as

[…]

under memory, bandwidth, smoothness, and robustness constraints. Its dispatcher computes a continuous Action Importance Score from acceleration- and torque-derived anomaly scores, runs in […] CPU complexity per step, uses only a few kilobytes for sliding windows, and does not require any forward pass through the VLA model to make offloading decisions (Zheng et al., 9 Mar 2026).
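A dispatcher of this kind can be approximated with sliding-window statistics over proprioceptive signals. The Python sketch below is a hypothetical stand-in, not RAPID's actual score: the z-score anomaly measure, the equal weights, the window length, and the single threshold are all assumptions, and the class name `ImportanceDispatcher` is invented for illustration.

```python
from collections import deque
from statistics import fmean, pstdev


class ImportanceDispatcher:
    """Flags steps whose acceleration/torque deviate from recent history."""

    def __init__(self, window=20, w_acc=0.5, w_torque=0.5, threshold=3.0):
        # Fixed-length windows keep memory use small and bounded.
        self.acc = deque(maxlen=window)
        self.torque = deque(maxlen=window)
        self.w_acc = w_acc
        self.w_torque = w_torque
        self.threshold = threshold  # hypothetical offload threshold

    @staticmethod
    def _anomaly(history, value, eps=1e-6):
        """Absolute z-score of `value` against the window (0 until warmed up)."""
        if len(history) < 2:
            return 0.0
        return abs(value - fmean(history)) / (pstdev(history) + eps)

    def step(self, acc_mag, torque_mag):
        """Return (score, offload) without touching the VLA model."""
        score = (self.w_acc * self._anomaly(self.acc, acc_mag)
                 + self.w_torque * self._anomaly(self.torque, torque_mag))
        self.acc.append(acc_mag)
        self.torque.append(torque_mag)
        return score, score > self.threshold


dispatcher = ImportanceDispatcher()
for reading in [1.0, 1.1] * 5:          # quiet, repetitive motion
    dispatcher.step(reading, reading)
print(dispatcher.step(5.0, 5.0))        # abrupt change in both signals
```

Each decision costs a pass over a fixed-size window, so per-step work stays constant for a given window length, and no model inference is involved in deciding whether to request a fresh chunk.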

AsyncShield assumes cloud-based VLA navigation and inserts a fully edge-resident adapter between a low-frequency, high-latency cloud VLA model and the high-frequency local controller. Temporal lag is converted into a spatial offset by realigning anchor-frame waypoints: […]. Intent restoration and physical safety are then balanced by a constrained Markov decision process solved with PPO-Lagrangian, with the safety cost derived from LiDAR and the edge action defined as a local 2D sub-goal (Yang et al., 27 Apr 2026).
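One common way to turn temporal lag into a spatial offset is a planar rigid-body change of frame: a waypoint expressed in the robot frame at the time the observation was captured (the anchor frame) is re-expressed in the robot's current frame using odometry. The Python sketch below shows that SE(2) transform as an assumed illustration; it is not taken from the AsyncShield paper, and the function name and pose convention (x forward, y left, yaw counter-clockwise) are hypothetical.

```python
import math


def realign_waypoint(waypoint_xy, anchor_pose, current_pose):
    """Re-express an anchor-frame waypoint in the current robot frame.

    Poses are (x, y, yaw) in a shared odometry frame; yaw is in radians.
    """
    wx, wy = waypoint_xy
    ax, ay, ayaw = anchor_pose
    cx, cy, cyaw = current_pose

    # Anchor frame -> odometry frame.
    ox = ax + math.cos(ayaw) * wx - math.sin(ayaw) * wy
    oy = ay + math.sin(ayaw) * wx + math.cos(ayaw) * wy

    # Odometry frame -> current robot frame (inverse rotation).
    dx, dy = ox - cx, oy - cy
    return (math.cos(cyaw) * dx + math.sin(cyaw) * dy,
            -math.sin(cyaw) * dx + math.cos(cyaw) * dy)


# The cloud proposed a waypoint 2 m ahead of the anchor pose; by the time
# the reply arrives the robot has moved 1 m forward and turned left 90 deg.
stale = (2.0, 0.0)
realigned = realign_waypoint(stale, anchor_pose=(0.0, 0.0, 0.0),
                             current_pose=(1.0, 0.0, math.pi / 2))
print(realigned)
```

After realignment the goal lies about 1 m to the robot's right rather than 2 m ahead, so the local controller tracks the intended location instead of the stale one.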

Agile-VLA keeps all execution on the edge but decouples perception from control. A low-rate Perception Stream runs at […], while a high-rate Control Stream runs at […], using timestamped geometric anchors and cubic-spline interpolation to avoid closed-loop instability when […] (Yan et al., 24 Mar 2026).
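The decoupling can be pictured as a slow producer of timestamped anchors and a fast consumer that samples a smooth curve through them. The Python sketch below uses a cubic Hermite segment with finite-difference tangents, which is one simple kind of cubic spline and is a substitution made here for illustration; the spline actually used by Agile-VLA, the scalar anchor values, and the rates in the example are all assumptions.

```python
from bisect import bisect_right


def hermite_interpolate(anchors, t):
    """Evaluate a cubic Hermite curve through (timestamp, value) anchors.

    Requires at least two anchors with strictly increasing timestamps.
    """
    times = [a[0] for a in anchors]
    values = [a[1] for a in anchors]
    if t <= times[0]:
        return values[0]
    if t >= times[-1]:
        return values[-1]  # hold the newest anchor instead of extrapolating

    i = bisect_right(times, t) - 1  # segment [times[i], times[i + 1])

    def slope(j):
        # Finite-difference tangent, one-sided at the ends.
        lo, hi = max(j - 1, 0), min(j + 1, len(times) - 1)
        return (values[hi] - values[lo]) / (times[hi] - times[lo])

    h = times[i + 1] - times[i]
    s = (t - times[i]) / h
    h00 = 2 * s**3 - 3 * s**2 + 1
    h10 = s**3 - 2 * s**2 + s
    h01 = -2 * s**3 + 3 * s**2
    h11 = s**3 - s**2
    return (h00 * values[i] + h10 * h * slope(i)
            + h01 * values[i + 1] + h11 * h * slope(i + 1))


# Hypothetical rates: anchors every 0.5 s, control samples every 0.05 s.
anchors = [(0.0, 0.0), (0.5, 0.1), (1.0, 0.3), (1.5, 0.3)]
setpoints = [hermite_interpolate(anchors, k * 0.05) for k in range(31)]
print(len(setpoints), setpoints[0], setpoints[-1])
```

The control loop never waits on perception: it reads whatever anchors exist and evaluates the curve at its own clock, so a late perception update changes future setpoints without stalling the current one.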

One notable point of tension appears in the literature. VLA-Perf states that a device-server split with the VLM on the server and the action expert on the device is almost never beneficial, due to large KV cache transfer, whereas RoboECC reports gains from layer-wise edge-cloud partitioning. This suggests that the benefit of partitioning depends strongly on what is being split, how activations are transmitted, and how network variability is handled.

5. Representative empirical results

Reported EVLA results span compact model acceleration, layer-wise ECC, chunk-level dispatch, asynchronous navigation adaptation, and industrial manipulation. The table summarizes representative outcomes (Budzianowski et al., 18 Jul 2025, Zheng et al., 21 Mar 2026, Zheng et al., 9 Mar 2026, Yang et al., 27 Apr 2026, Yan et al., 24 Mar 2026).

System	Setting	Reported result
EdgeVLA	A100-40 GB	[…] vs. […]; […] vs. […]
RoboECC	Orin + A100, OpenVLA	[…] speedup; […] vs. […]
RAPID	LIBERO / real manipulator	[…]; […]
AsyncShield	Unitree Go2, three VLAs	direct VLA only […] SR; […] AsyncShield […] SR
Agile-VLA	Jetson Orin Nano	[…], […] TCP jitter, […] Avg SR

For EdgeVLA specifically, the extrapolated edge figures are […] and […] on Jetson Orin, and […] and […] on Jetson Nano. In early mobile-manipulation benchmarks (simulated pick-and-place, drawer-opening, and button-press tasks), EVLA reports success rates of […] versus […] for OpenVLA, within […] absolute, while running in real time at […] on Orin (Budzianowski et al., 18 Jul 2025).

RoboECC reports results on LIBERO with OpenVLA, on SimplerEnv with CogACT, and on a real AgileX PIPER arm performing 1,000 samples. On Orin + A100 with OpenVLA, the speedup over edge-only inference is […], with total latency […] versus […]; on Thor + A100, the speedup is […]. On the real robot, reported latency is […] versus […] edge-only for Orin + A100, and […] versus […] edge-only for Thor + A100. The parameter-sharing pool contributes […] of model size, the LSTM is […], and average split-adjust time is […] versus […] latency reduction (Zheng et al., 21 Mar 2026).

RAPID reports an end-to-end latency reduction of up to […] over a vision-based dynamic partitioning baseline while incurring only […] additional CPU/memory overhead on the edge. In LIBERO simulation, total latency falls from […] for the vision-based baseline to […] for RAPID; in the real-world banana-to-bowl task, it falls from […] to […]. The ablation study reports […] without the compatibility trigger, […] without the redundancy trigger, and […] with the dual-threshold design (Zheng et al., 9 Mar 2026).

AsyncShield emphasizes robustness under irregular network delay rather than raw model throughput. In simulation over 600 episodes, it reports […] success rate, […] cross-track error, and […] risk exposure rate under the ideal profile; under mixed degradation it reports […], […], and […]. In real-world experiments on Unitree Go2 with SocialNav, TrackVLA, and Nav-[…], direct VLA-only success is […], while adding AsyncShield raises success to […] without fine-tuning any cloud-based foundation models (Yang et al., 27 Apr 2026).

Agile-VLA targets industrial pose rectification on Jetson Orin Nano. The asynchronous version reports […] control, overall VRAM footprint […], […] TCP jitter, and […] average success on DID-127, compared with […] for OpenVLA (4-bit). It also reports […] collision rate and […] jerk, along with […]-shot fine-tuning converging in […] (Yan et al., 24 Mar 2026).

6. Limitations, future directions, and acronym ambiguity

The EVLA literature is explicit about unresolved limitations. RoboECC states that its current hardware model is GPU-only, and that extending to CPU/NPU/ASICs requires new pipeline modeling. It also notes that threshold tuning […] depends on the historical […] distribution, and identifies reinforcement-learning or multi-objective search for split-point selection and dynamic batch-sizing as future directions (Zheng et al., 21 Mar 2026).

EdgeVLA leaves several open questions at the model level: multi-arm bimanual tasks, force/torque feedback, and on-device continual finetuning. It also reports planned CPU-only optimizations via FlexAttention and 1-bit quantization for sub-100 ms full-stack loops (Budzianowski et al., 18 Jul 2025). Agile-VLA similarly notes limitations in freely moving objects, dynamic scenes, and multi-object scenarios, and states that tactile feedback is not yet integrated (Yan et al., 24 Mar 2026). These limitations suggest that current EVLA systems are strongest in settings where task geometry, control bandwidth, and deployment topology can be constrained or structured.

A common source of confusion is the acronym itself. In radio astronomy, EVLA denotes the Expanded Very Large Array, the major upgrade of the Very Large Array that provides complete frequency coverage from $1\ \mathrm{GHz}$ to $50\ \mathrm{GHz}$, up to $8\ \mathrm{GHz}$ instantaneous bandwidth per polarization, and the WIDAR correlator with standard $16{,}384$ spectral channels per baseline and a maximum exceeding $4\times 10^{6}$ channels (Dougherty et al., 2010, Perley et al., 2011). In embodied AI and robotics, by contrast, EVLA refers to Edge Vision-Language-Action. The two usages are unrelated except for the shared acronym.

In the robotics sense, EVLA has evolved into a technical program organized around a single question: how to preserve the generality and capability of large VLA models while meeting real-time control requirements on edge platforms. The present literature answers that question with several non-exclusive strategies—non-autoregressive action prediction, compact language backbones, roofline-guided performance analysis, chunk-level or layer-level edge-cloud collaboration, asynchronous intent realignment, and hierarchical decoupling of perception from control. The diversity of these strategies suggests that EVLA is not a settled architecture, but an active systems-design space defined by compute, memory, bandwidth, and control-loop constraints.