NeuroVLA: Neuromorphic Vision-Language-Action System
- NeuroVLA is an embodied intelligence framework that mimics biological nervous system hierarchies by integrating vision, language, and motor modalities.
- It employs a tri-level architecture—cortical planning, cerebellar adaptation, and spinal fast action generation—using neuromorphic hardware for low-latency, energy-efficient control.
- Reported benchmarks include a 75.6% reduction in mean absolute jerk, reflex latency under 20 ms, and higher task success than OpenVLA, UniVLA, and WorldVLA on every listed task.
Neuromorphic Vision-Language-Action (NeuroVLA) is a system-level embodied intelligence framework that emulates the organizational hierarchy of biological nervous systems across integrated vision, language, and motor modalities, mirroring the division of labor among the cortex, cerebellum, and spinal cord. NeuroVLA enables physical robots to translate high-level semantic commands into low-latency, adaptive, energy-efficient motor control, with emergent properties such as temporal memory, reflexes, and somatotopic organization. The architecture is presented as the first neuromorphic VLA design implemented on real-world robots, aiming to close the gap between large-scale vision-language systems and the adaptive motor strengths of animal movement (Guo et al., 21 Jan 2026).
1. System Organization and Data Flow
NeuroVLA is structured as a tri-level hierarchy, each stage mapped to an analogous component in the biological motor system:
- Cortical Module (High-Level Goal Planner): Running on the GPU/CUDA tier, the cortex receives an RGB image and a language instruction. A pretrained vision-language model (VLM) backbone (e.g., Qwen-VL) produces "world-model" features. A lightweight Q-Former module with learnable queries attends to intermediate VLM layers, yielding a semantic intent representation that encodes "what to do" in action space, factored out from low-level dynamics.
- Cerebellar Module (Adaptive Stabilizer): Also on CUDA, the cerebellum processes proprioceptive and force (6-DoF wrench) sensory histories at high rates. It infers a recurrent hidden state capturing system velocity and impulses. Gated FiLM (feature-wise linear modulation) layers generate scale and shift parameters.
An iterative internal loop predicts and then refines this output, enabling predictive, feedback-based stabilization analogous to adaptive PID control.
- Spinal Module (Fast Action Generator): Implemented as a deep spiking ResNet of leaky integrate-and-fire (LIF) neurons on an FPGA-based neuromorphic chip, the spinal layer maintains stateful membrane potential dynamics and residual connections across layers.
Spikes are routed to continuous-integration output neurons for generating motor actions in real time.
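The lower two levels described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the gating form in `gated_film`, the hard-reset LIF update, the residual wiring, and every name and constant (`LIFLayer`, `spinal_step`, `decay`, `threshold`, `alpha`) are assumptions chosen for illustration, since the source equations are not available here.

```python
import numpy as np


def gated_film(intent, gamma, beta, gate):
    """Feature-wise linear modulation with a scalar gate in [0, 1].

    gate = 1 applies the full scale-and-shift; gate = 0 passes intent through.
    The gating form is an assumption made for this sketch.
    """
    return gate * (gamma * intent + beta) + (1.0 - gate) * intent


class LIFLayer:
    """Leaky integrate-and-fire layer whose membrane potential persists across steps."""

    def __init__(self, weights, decay=0.5, threshold=1.0):
        self.weights = np.asarray(weights, dtype=float)
        self.decay = decay
        self.threshold = threshold
        self.v = np.zeros(self.weights.shape[0])

    def step(self, x):
        # Leak the old potential, then integrate the weighted input current.
        self.v = self.decay * self.v + self.weights @ x
        spikes = (self.v >= self.threshold).astype(float)
        # Hard reset of every neuron that fired.
        self.v = np.where(spikes > 0, 0.0, self.v)
        return spikes


def spinal_step(layers, readout, drive, state, alpha=0.8):
    """One control tick: residual LIF stack, then a non-spiking leaky readout."""
    h = drive
    for layer in layers:
        h = layer.step(h) + h  # residual connection around each spiking layer
    # Continuous output neurons integrate activity instead of spiking.
    return alpha * state + (1.0 - alpha) * (readout @ h)


intent = np.array([0.4, 0.2])  # stand-in for the cortical intent vector
drive = gated_film(intent, gamma=np.array([1.5, 1.0]),
                   beta=np.array([0.0, 0.1]), gate=1.0)
layers = [LIFLayer(np.eye(2)), LIFLayer(np.eye(2))]
action = np.zeros(2)
for _ in range(5):
    action = spinal_step(layers, np.eye(2), drive, action)
```

Because each `LIFLayer` keeps its membrane potential between calls, the action at one tick depends on earlier inputs, which is the stateful behavior the spinal description refers to.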
Dataflow Summary
| Module | Input | Output |
|---|---|---|
| Cortex | RGB frame, language instruction | Semantic intent |
| Cerebellum | Proprioceptive history | Modulation |
| Spinal | Spike-rate targets from the cerebellum | Fast, event-driven action command |
2. Neuromorphic Processing Hardware
NeuroVLA utilizes custom neuromorphic hardware for real-time, energy-efficient actuation:
- Platform: Custom systolic-array neuromorphic core realized on FPGA at 20 MHz.
- Resources: 51,953 LUTs, 27,880 FFs, and 169 BRAMs, which host the spiking neurons and synapses.
- Performance: Single inference pass latency is 2.19 ms; energy per inference is 0.87 mJ.
- Throughput: At 200 Hz (real-time control), inference alone consumes approximately 0.174 W (0.87 mJ × 200 Hz, excluding overhead), with the total system draw reported as 0.4 W.
- Communication:
  - Cortex and cerebellum exchange continuous-valued FiLM parameters.
  - Cerebellar outputs are encoded as spike-rate targets for the spinal stage.
  - Within the chip, event-driven Address-Event Representation (AER) and spike-sparsity modules minimize bandwidth by filtering inactive channels.
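To make the bandwidth argument concrete, the sketch below shows the general idea behind address-event encoding: only neurons that actually spike produce traffic, as (timestep, address) pairs. The function names and the tuple layout are hypothetical; the on-chip event format used by NeuroVLA is not described in the source.

```python
from typing import List, Tuple


def to_aer(raster: List[List[int]]) -> List[Tuple[int, int]]:
    """Convert a dense spike raster[t][neuron] into (timestep, address) events."""
    return [
        (t, addr)
        for t, row in enumerate(raster)
        for addr, spike in enumerate(row)
        if spike  # silent channels generate no event at all
    ]


def from_aer(events: List[Tuple[int, int]], n_steps: int, n_neurons: int) -> List[List[int]]:
    """Rebuild the dense raster from its event list."""
    raster = [[0] * n_neurons for _ in range(n_steps)]
    for t, addr in events:
        raster[t][addr] = 1
    return raster


raster = [
    [0, 0, 1, 0],
    [0, 0, 0, 0],  # a fully silent timestep costs nothing on the bus
    [1, 0, 1, 0],
]
events = to_aer(raster)
```

Here 12 raster entries shrink to 3 events, and the saving grows as activity becomes sparser.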
This hardware-software co-design enables continuous, low-power, event-driven control loops directly within robotic actuators (Guo et al., 21 Jan 2026).
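The figures quoted in this section can be cross-checked with simple arithmetic. The sketch below uses only the reported latency, energy, and control rate; the back-to-back maximum rate is an assumption that ignores pipelining and any host-side overhead.

```python
ENERGY_PER_INFERENCE_J = 0.87e-3  # 0.87 mJ per inference (reported)
LATENCY_S = 2.19e-3               # 2.19 ms per inference pass (reported)
CONTROL_RATE_HZ = 200.0           # real-time control rate (reported)

# Average inference power is energy per inference times inferences per second.
average_power_w = ENERGY_PER_INFERENCE_J * CONTROL_RATE_HZ

# Fraction of each 5 ms control period spent computing.
control_period_s = 1.0 / CONTROL_RATE_HZ
duty_cycle = LATENCY_S / control_period_s

# Upper bound on the control rate if inferences ran back to back (assumption).
max_rate_hz = 1.0 / LATENCY_S

print(f"average power: {average_power_w:.3f} W")
print(f"duty cycle:    {duty_cycle:.1%}")
print(f"max rate:      {max_rate_hz:.0f} Hz")
```

The product 0.87 mJ × 200 Hz gives the 0.174 W quoted above, and the 2.19 ms pass occupies under half of each 5 ms control period.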
3. Quantitative Performance and Benchmarks
NeuroVLA demonstrates superior performance across multiple robotic control metrics, both in simulation and real-world physical deployments:
- Low-Frequency Tremor Attenuation:
  - Mean absolute jerk: reduced by 75.6% (up to 80%)
  - Mean absolute acceleration: reduced by 32.8–58.0% across axes
- Collision Reflex and Withdrawal:
  - Reflex latency under 20 ms (monosynaptic-like), compared with 200 ms for standard vision–language actuator loops
  - Under severe force spikes:
    - Baseline methods fail to replan
    - NeuroVLA reroutes successfully in 100% of simulated runs and 54.8% of real-world runs
- Rhythmic Motor Memory:
  - Sustained, phase-locked shake cycles are maintained under visual occlusion
  - Success rates remain stable across lighting and texture disturbances, whereas baseline methods lose 30% success under the same conditions
- Ablation Studies (LIBERO):
  - Multi-step SNN: 82% success on long-horizon tasks
  - Single-step SNN: 65% (a 17-point drop)
  - No-cerebellum: 54%
- Comparative Benchmarks: The table below compares NeuroVLA with OpenVLA, UniVLA, and WorldVLA; all differences are reported as statistically significant.
| Task | NeuroVLA | OpenVLA | UniVLA | WorldVLA |
|---|---|---|---|---|
| Relocate test tubes | 88% | 47% | 51% | 55% |
| Pour liquid | 78% | 39% | 44% | 48% |
| Shake flask rhythmicity | 75% | 32% | 35% | 38% |
| Organize items | 90% | 60% | 62% | 65% |
| Discard waste | 83% | 55% | 58% | 61% |
| Safety-critical collision | 54.8% | 0% | 0% | 0% |
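The tremor metrics listed earlier in this section (mean absolute jerk and acceleration) can be estimated from a sampled trajectory by repeated finite differencing. The source does not state how the authors computed them, so the sketch below is one plausible reading; the function names and the synthetic 8 Hz tremor are illustrative assumptions.

```python
import numpy as np


def mean_abs_derivative(positions, dt, order):
    """Mean absolute n-th time derivative per axis, via repeated finite differences.

    order=2 gives mean absolute acceleration, order=3 mean absolute jerk.
    """
    d = np.asarray(positions, dtype=float)
    for _ in range(order):
        d = np.diff(d, axis=0) / dt
    return np.abs(d).mean(axis=0)


def percent_reduction(baseline, improved):
    """Relative reduction of a metric, in percent."""
    return 100.0 * (baseline - improved) / baseline


dt = 0.01
t = np.arange(0.0, 1.0, dt)
smooth = np.stack([t, t**2, np.zeros_like(t)], axis=1)          # (T, 3) positions
tremor = smooth + 0.002 * np.sin(2 * np.pi * 8.0 * t)[:, None]  # synthetic tremor

jerk_reduction = percent_reduction(
    mean_abs_derivative(tremor, dt, order=3),
    mean_abs_derivative(smooth, dt, order=3),
)
```

Finite differences amplify sensor noise, so a real evaluation would normally low-pass filter the trajectory before differentiating.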
4. Emergence of Biological Motor Properties
NeuroVLA exhibits several properties analogous to natural animal motor control:
- Temporal Working Memory: The combination of LIF neuron leak currents and deep residual connections yields an intrinsic short-term memory mechanism. Multi-step SNNs outperformed single-step models by 17 percentage points (82% vs. 65% on long-horizon LIBERO tasks), consistent with phase tracking and working memory arising without explicit sequential supervision.
- Reflex Arcs: Monosynaptic-like safety reflexes are realized by routing 6-DoF wrench feedback rapidly from the cerebellum to the spinal cord, triggering withdrawal in less than 20 ms—faster than the cortex can replan, and with no need for extra offline supervision.
- Event-Driven Sparsity and Somatotopy: During periods with no required motion ("static holds"), mean firing rates within the spinal module dropped by 85%, conserving energy. Spinal activation visualizations (t-SNE) exposed spontaneous clustering into motor primitives controlling different degrees of freedom, with spatially segregated subpopulations—mirroring somatotopic mapping seen in biological organisms.
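The temporal-memory property can be illustrated with a single LIF neuron. In the sketch below, inputs too weak to trigger a spike on their own accumulate across steps only when the membrane potential is carried forward. Reading "single-step" as discarding the membrane state on every tick is an assumption made for illustration, as are the function name and all constants.

```python
def count_spikes(inputs, decay=0.9, threshold=1.0, carry_state=True):
    """Count the spikes one LIF neuron emits over a sequence of input currents."""
    v = 0.0
    spikes = 0
    for current in inputs:
        if not carry_state:
            v = 0.0  # single-step variant: no memory of earlier inputs
        v = decay * v + current  # leak, then integrate
        if v >= threshold:
            spikes += 1
            v = 0.0  # hard reset after firing
    return spikes


weak_inputs = [0.4] * 10  # each input alone stays below the threshold
multi_step = count_spikes(weak_inputs, carry_state=True)    # 3 spikes
single_step = count_spikes(weak_inputs, carry_state=False)  # 0 spikes
```

With state carried over, the neuron fires on every third input. Without it, the same input stream never produces a spike, so information about input timing is lost.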
5. Implications and Context
By structurally decomposing vision–language–action intelligence into distinct, synergistic substrates, NeuroVLA represents a scalable paradigm for robust and low-power embodied intelligence:
- Division of Labor: Decoupling semantic intention (cortex), dynamic adaptation (cerebellum), and event-based actuation (spinal cord) enables rapid, context-sensitive control with attributes difficult to attain in monolithic, non-neuromorphic VLA stacks.
- Energy Efficiency: With total power draw of 0.4 W on neuromorphic hardware, NeuroVLA attains high performance at a fraction of the energy costs of conventional robotic controllers.
- Generalization and Robustness: The architecture maintains fluid trajectories (roughly 75% lower mean absolute jerk under disturbance), reflexes under 20 ms, and emergent temporal memory without task-specific auxiliary losses or additional dataset annotation.
- Practical Integration: Successful benchmarks across multiple manipulation tasks and safety scenarios highlight the applicability to diverse embodied settings and critical real-world deployments.
A plausible implication is that biologically inspired separation of supervisory planning, adaptive modulation, and reflexive actuation may be fundamental for the next generation of low-latency, energy-efficient, and resilient autonomous agents (Guo et al., 21 Jan 2026).