
NeuroVLA: Neuromorphic Vision-Language-Action System

Updated 28 January 2026
  • NeuroVLA is an embodied intelligence framework that mimics biological nervous system hierarchies by integrating vision, language, and motor modalities.
  • It employs a tri-level architecture—cortical planning, cerebellar adaptation, and spinal fast action generation—using neuromorphic hardware for low-latency, energy-efficient control.
  • Quantitative benchmarks show significant improvements in tremor attenuation, reflex speed, and task success, demonstrating robust and adaptive robotic performance.

Neuromorphic Vision-Language-Action (NeuroVLA) refers to a system-level embodied intelligence framework that emulates the organizational hierarchy of biological nervous systems—specifically mirroring the division of labor among the cortex, cerebellum, and spinal cord—across integrated vision, language, and motor modalities. NeuroVLA enables physical robots to fluidly translate high-level semantic commands into low-latency, adaptive, energetically efficient motor control, with emergent properties such as temporal memory, reflexes, and somatotopic organization. This architecture represents the first neuromorphic VLA design implemented in real-world robotics, systematically closing the gap between large-scale vision-language systems and the core adaptive-motor strengths of animal movement (Guo et al., 21 Jan 2026).

1. System Organization and Data Flow

NeuroVLA is structured as a tri-level hierarchy, each stage mapped to an analogous component in the biological motor system:

  1. Cortical Module (High-Level Goal Planner): Operating on the GPU/CUDA tier, the cortex receives an RGB image $I_t \in \mathbb{R}^{H \times W \times 3}$ and a language instruction $L$. A pretrained vision–language backbone (e.g., Qwen-VL) produces "world-model" features $\mathcal{H}_t = F_{VLM}(I_t, L; \theta_{vlm})$. A lightweight Q-Former module with $K$ learnable queries $Q \in \mathbb{R}^{K \times D}$ attends to intermediate VLM layers, yielding $z_{sem} = \text{Q-Former}(\mathcal{H}_t[l_{start}:l_{end}], Q; \theta_{qf}) \in \mathbb{R}^{K \times D_{action}}$, which encodes "what to do" in action space, factored apart from low-level dynamics.
  2. Cerebellar Module (Adaptive Stabilizer): Also on CUDA, the cerebellum processes proprioceptive ($s_{t-h:t} \in \mathbb{R}^{H \times D_s}$) and force (6-DoF wrench) sensory histories at high rates. It infers a recurrent hidden state $h_t = \text{GRU}(s_{t-h:t}; \theta_{gru})$ capturing system velocity and impulses. Gated FiLM layers generate scale and shift parameters $(\gamma_t, \beta_t)$:

$$g_t = \sigma(W_g \cdot \text{Proj}(h_t)), \quad z_{mod} = (1 + \gamma_t) \odot (z_{sem} \odot g_t) + \beta_t$$

An iterative internal loop predicts $s_{t+1}$ and refines $z_{mod}$, enabling predictive, feedback-based stabilization analogous to adaptive PID control.
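The gated FiLM modulation above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the weight names `W_proj`, `W_g`, `W_gamma`, and `W_beta` are hypothetical stand-ins for the learned parameters, which the paper does not name individually.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cerebellar_modulation(z_sem, h_t, W_proj, W_g, W_gamma, W_beta):
    """Gated FiLM modulation of the semantic code z_sem by cerebellar state h_t.

    z_sem : (K, D) semantic queries from the cortical Q-Former
    h_t   : (H,)   GRU hidden state summarizing proprioceptive history
    The W_* matrices are hypothetical learned parameters.
    """
    p = W_proj @ h_t                     # Proj(h_t): project hidden state to D dims
    g_t = sigmoid(W_g @ p)               # gate in (0, 1), shape (D,)
    gamma_t = W_gamma @ p                # FiLM scale, shape (D,)
    beta_t = W_beta @ p                  # FiLM shift, shape (D,)
    # z_mod = (1 + gamma) ⊙ (z_sem ⊙ g) + beta, broadcast over the K queries
    return (1.0 + gamma_t) * (z_sem * g_t) + beta_t

# Toy shapes for demonstration
K, D, H = 4, 8, 6
rng = np.random.default_rng(0)
z_sem = rng.standard_normal((K, D))
h_t = rng.standard_normal(H)
W_proj = rng.standard_normal((D, H))
W_g = rng.standard_normal((D, D))
W_gamma = 0.1 * rng.standard_normal((D, D))
W_beta = 0.1 * rng.standard_normal((D, D))
z_mod = cerebellar_modulation(z_sem, h_t, W_proj, W_g, W_gamma, W_beta)
print(z_mod.shape)  # (4, 8)
```

Because $\gamma_t$, $\beta_t$, and $g_t$ depend only on the proprioceptive state, the cerebellum can rescale and shift the semantic plan without re-running the expensive cortical backbone.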

  3. Spinal Module (Fast Action Generator): Implemented as a deep spiking ResNet of Leaky Integrate-and-Fire (LIF) neurons on a neuromorphic chip (FPGA), the spinal layer maintains stateful membrane-potential dynamics and residual connections across layers:

$$u_i^{(l)}[\tau] = \beta\, u_i^{(l)}[\tau-1] + \sum_j w_{ij}\, s_j^{(l-1)}[\tau] - s_i^{(l)}[\tau-1]\,\theta_{thr}$$
$$s_i^{(l)}[\tau] = \Theta\big(u_i^{(l)}[\tau] - \theta_{thr}\big)$$
$$x^{(l+1)} = x^{(l)} + \text{LIF}\big(\text{Linear}(x^{(l)})\big)$$

Spikes are routed to continuous-integration output neurons that generate motor actions $a_t[\tau] = W_{out}\, u_{out}[\tau]$ in real time.
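A minimal NumPy sketch of the LIF dynamics and the spiking residual connection follows. It assumes a soft reset (subtracting $\theta_{thr}$ via the $-s_i^{(l)}[\tau-1]\,\theta_{thr}$ term), consistent with the equations above; layer sizes and the 30% input spike density are illustrative, not from the paper.

```python
import numpy as np

def lif_layer(x_seq, W, beta=0.9, theta=1.0):
    """Leaky Integrate-and-Fire layer unrolled over T timesteps.

    Implements u[t] = beta*u[t-1] + W @ x[t] - s[t-1]*theta (soft reset)
    and s[t] = Heaviside(u[t] - theta).
    x_seq : (T, D_in) input spike trains; returns (T, D_out) output spikes.
    """
    T = x_seq.shape[0]
    D_out = W.shape[0]
    u = np.zeros(D_out)   # membrane potential (stateful across timesteps)
    s = np.zeros(D_out)   # previous spikes, used for the soft reset
    out = np.zeros((T, D_out))
    for t in range(T):
        u = beta * u + W @ x_seq[t] - s * theta   # leak + input - reset
        s = (u >= theta).astype(float)            # spike when above threshold
        out[t] = s
    return out

def spiking_residual_block(x_seq, W):
    """x^(l+1) = x^(l) + LIF(Linear(x^(l))), with matching dimensions."""
    return x_seq + lif_layer(x_seq, W)

T, D = 16, 8
rng = np.random.default_rng(1)
x = (rng.random((T, D)) < 0.3).astype(float)   # sparse input spike train
W = 0.5 * rng.standard_normal((D, D))
y = spiking_residual_block(x, W)
print(y.shape)  # (16, 8)
```

The leak factor $\beta < 1$ means the membrane potential carries a decaying trace of past inputs across timesteps, which is the mechanism behind the temporal working memory discussed in Section 4.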

Dataflow Summary

| Input/Output | Module | Description |
| --- | --- | --- |
| $(I_t, L)$ | Cortex | RGB frame and language instruction → semantic intent $z_{sem}$ |
| $s_{t-h:t}$ | Cerebellum | Proprioceptive history → modulated code $z_{mod}$ |
| $z_{mod}$ | Spinal | Fast, event-driven action command $a_t$ |

2. Neuromorphic Processing Hardware

NeuroVLA utilizes custom neuromorphic hardware for real-time, energy-efficient actuation:

  • Platform: Custom systolic-array neuromorphic core realized on FPGA at 20 MHz.
  • Resources: 51,953 LUTs, 27,880 FFs, 169 BRAMs, supporting $10^4$–$10^5$ spiking neurons and $\sim 10^6$ synapses.
  • Performance: Single inference pass latency is 2.19 ms; energy per inference is 0.87 mJ.
  • Throughput: At 200 Hz (real-time control), the inference workload consumes approximately 0.174 W (excluding overhead), with total system draw $\approx 0.4$ W.
  • Communication:
    • Cortex and cerebellum exchange continuous-valued FiLM parameters.
    • Cerebellar outputs ($z_{mod}$) are encoded as spike-rate targets for the spinal stage.
    • Within the chip, event-driven Address-Event Representation (AER) and spike-sparsity modules minimize bandwidth by filtering inactive channels.
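The AER idea, transmitting only the addresses of neurons that actually fire, can be illustrated with a minimal encoder. This is a toy sketch of the representation, not the on-chip protocol:

```python
from typing import List, Tuple

def aer_encode(spike_frames: List[List[int]]) -> List[Tuple[int, int]]:
    """Address-Event Representation: emit (timestep, neuron address) pairs
    only for active neurons, so silent channels consume no bandwidth."""
    events = []
    for t, frame in enumerate(spike_frames):
        for addr, spiked in enumerate(frame):
            if spiked:                  # event-driven: inactive channels are skipped
                events.append((t, addr))
    return events

# 3 timesteps x 5 neurons, mostly silent
frames = [[0, 1, 0, 0, 0],
          [0, 0, 0, 0, 1],
          [0, 0, 0, 0, 0]]
events = aer_encode(frames)
print(events)                            # [(0, 1), (1, 4)]
print(len(events), "events instead of", 3 * 5, "dense values")
```

With sparse activity, the event list is far smaller than the dense spike tensor, which is why spike sparsity translates directly into bandwidth and energy savings.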

This hardware-software co-design enables continuous, low-power, event-driven control loops directly within robotic actuators (Guo et al., 21 Jan 2026).
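The reported latency, energy, and power figures are mutually consistent, which a quick check confirms:

```python
# Figures reported for the neuromorphic core
energy_per_inference_j = 0.87e-3   # 0.87 mJ per inference pass
rate_hz = 200                      # real-time control rate
latency_s = 2.19e-3                # single inference pass latency

power_w = energy_per_inference_j * rate_hz
print(f"{power_w:.3f} W")          # 0.174 W, matching the reported figure

# The 2.19 ms pass also fits comfortably inside the 5 ms period at 200 Hz
print(latency_s < 1 / rate_hz)     # True
```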

3. Quantitative Performance and Benchmarks

NeuroVLA demonstrates superior performance across multiple robotic control metrics, both in simulation and real-world physical deployments:

  • Low-Frequency Tremor Attenuation:
    • Mean Absolute Jerk: reduced by 75.6% (up to 80%)
    • Mean Absolute Acceleration: reduced by 32.8–58.0% across axes
  • Collision Reflex and Withdrawal:
    • Reflex latency < 20 ms (monosynaptic-like), compared to > 200 ms for standard vision–language actuator loops
    • Under severe force spikes ($\mu_{Fx} \approx -37$ N, $\mu_{Fy} \approx 8.3$ N, $\sigma^2_{Fy} \approx 3.27$), baseline methods fail to replan, while NeuroVLA reroutes successfully in 100% of simulated runs and 54.8% of real runs
  • Rhythmic Motor Memory:
    • Sustained, phase-locked shake cycles maintained under visual occlusion
    • Success rates remain stable (within ±2%) across lighting and texture disturbances; baseline methods lose > 30% success in such conditions
  • Ablation Studies (LIBERO):
    • Multi-step SNN: 82% success on long-horizon tasks
    • Single-step SNN: 65% (−17% relative to multi-step)
    • No-cerebellum: 54%
  • Comparative Benchmarks: The table below compares NeuroVLA to OpenVLA, UniVLA, and WorldVLA. All differences are statistically significant ($p < 0.01$).

| Task | NeuroVLA | OpenVLA | UniVLA | WorldVLA |
| --- | --- | --- | --- | --- |
| Relocate test tubes | 88% | 47% | 51% | 55% |
| Pour liquid | 78% | 39% | 44% | 48% |
| Shake flask rhythmicity | 75% | 32% | 35% | 38% |
| Organize items | 90% | 60% | 62% | 65% |
| Discard waste | 83% | 55% | 58% | 61% |
| Safety-critical collision | 54.8% | 0% | 0% | 0% |
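Smoothness metrics such as mean absolute jerk can be computed from a sampled trajectory by finite differences. The sketch below is a generic illustration of the metric, not the paper's evaluation code; the 200 Hz sampling rate and tremor frequency are assumed for the example.

```python
import numpy as np

def mean_abs_jerk(positions: np.ndarray, dt: float) -> float:
    """Mean absolute jerk of a sampled 1-D trajectory.

    Jerk is the third time-derivative of position, approximated here
    by repeated finite differences."""
    vel = np.diff(positions) / dt
    acc = np.diff(vel) / dt
    jerk = np.diff(acc) / dt
    return float(np.mean(np.abs(jerk)))

dt = 0.005                                       # 200 Hz sampling
t = np.arange(0, 1, dt)
smooth = np.sin(2 * np.pi * t)                   # smooth reference motion
trembling = smooth + 0.01 * np.sin(2 * np.pi * 30 * t)  # small 30 Hz tremor
print(mean_abs_jerk(trembling, dt) > mean_abs_jerk(smooth, dt))  # True
```

Even a tremor of tiny amplitude dominates the jerk metric, because jerk scales with the cube of the disturbance frequency; this is why jerk is a sensitive measure of tremor attenuation.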

4. Emergence of Biological Motor Properties

NeuroVLA exhibits several properties analogous to natural animal motor control:

  • Temporal Working Memory: The combination of LIF neuron leak currents and deep residual connections yields an intrinsic short-term memory mechanism. Multi-step SNNs outperformed single-step models by 17%, demonstrating phase tracking and working memory without explicit sequential supervision.
  • Reflex Arcs: Monosynaptic-like safety reflexes are realized by routing 6-DoF wrench feedback rapidly from the cerebellum to the spinal cord, triggering withdrawal in less than 20 ms—faster than the cortex can replan, and with no need for extra offline supervision.
  • Event-Driven Sparsity and Somatotopy: During periods with no required motion ("static holds"), mean firing rates within the spinal module dropped by 85%, conserving energy. Spinal activation visualizations (t-SNE) exposed spontaneous clustering into motor primitives controlling different degrees of freedom, with spatially segregated subpopulations—mirroring somatotopic mapping seen in biological organisms.
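The leak-based memory effect can be illustrated with a single LIF membrane: after input ceases, the potential decays geometrically with factor $\beta$ rather than vanishing, so past input continues to influence future spiking. This is an illustrative sketch; the $\beta = 0.95$ value and drive pattern are assumptions for the demo.

```python
# Single leaky membrane driven for 5 steps, then left to decay
beta = 0.95            # leak factor: fraction of potential retained per step
u = 0.0
trace = []
for t in range(40):
    inp = 1.0 if t < 5 else 0.0    # input only during the first 5 steps
    u = beta * u + inp
    trace.append(u)

# The potential persists long after input stops, decaying as beta**k:
# an intrinsic short-term memory with no explicit recurrence.
print(trace[4] > trace[20] > 0.0)   # True
```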

5. Implications and Context

By structurally decomposing vision–language–action intelligence into distinct, synergistic substrates, NeuroVLA represents a scalable paradigm for robust and low-power embodied intelligence:

  • Division of Labor: Decoupling semantic intention (cortex), dynamic adaptation (cerebellum), and event-based actuation (spinal cord) enables rapid, context-sensitive control with attributes difficult to attain in monolithic, non-neuromorphic VLA stacks.
  • Energy Efficiency: With a total power draw of $\approx$ 0.4 W on neuromorphic hardware, NeuroVLA attains high performance at a fraction of the energy cost of conventional robotic controllers.
  • Generalization and Robustness: The architecture maintains fluid trajectories (> 75% jitter reduction under disturbance), ultra-fast reflexes, and emergent temporal memory without task-specific auxiliary losses or additional dataset annotation.
  • Practical Integration: Successful benchmarks across multiple manipulation tasks and safety scenarios highlight the applicability to diverse embodied settings and critical real-world deployments.

A plausible implication is that biologically inspired separation of supervisory planning, adaptive modulation, and reflexive actuation may be fundamental for the next generation of low-latency, energy-efficient, and resilient autonomous agents (Guo et al., 21 Jan 2026).
