ViVLA: One-Shot Robotic Skill Transfer

Updated 12 December 2025
  • ViVLA is a vision-language-action model that enables one-shot robotic skill acquisition from a single expert demonstration video, without task-specific fine-tuning.
  • It processes expert videos, natural language instructions, and real-time images using a unified transformer-based architecture to generate latent action plans and control signals.
  • Leveraging nearly 900,000 expert-agent training pairs, ViVLA outperforms earlier methods, with absolute gains of over 30 points on unseen tasks and markedly stronger cross-embodiment transfer.

The acronym ViVLA has been used in two distinct research domains: (1) visual analytics for galactic-scale astrophysics (Vitello et al., 2018), where it denotes a specialized analytics toolkit for Galactic Plane star formation studies, and (2) generalist robotic manipulation with one-shot video imitation (Chen et al., 8 Dec 2025). The following exposition focuses on the robotics context, reflecting ViVLA’s most recent and most widely cited usage.

ViVLA (“See Once, Then Act”) is a generalist robotic manipulation policy that achieves efficient task learning from a single expert demonstration video at test time. The approach processes an expert demonstration video in conjunction with a language instruction and the robot’s real-time visual observations, enabling the distillation and transfer of fine-grained manipulation knowledge from expert behavior to the agent. ViVLA leverages a scalable expert-agent data generation pipeline producing nearly 900,000 expert-agent pairs, enabling immediate generalization to novel manipulation tasks and embodiment shifts, outperforming prior Vision-Language-Action and one-shot imitation learning methods (Chen et al., 8 Dec 2025).

1. Model Principles and One-Shot Learning Paradigm

ViVLA is defined by its ability to acquire new manipulation skills from a single expert demonstration video at test time, without any task-specific fine-tuning. The system jointly processes: (a) a sparsely sampled video of an expert (human or robot) performing a novel, previously unseen task, (b) a natural language instruction, and (c) the robot’s current camera images. These inputs are fused to yield both a latent action plan (encoding the demonstration’s fine-grained semantics) and a direct policy for generating low-level control signals.

This one-shot learning paradigm departs fundamentally from previous models such as RT-2, OpenVLA, or π₀, which require large-scale in-domain data or extensive fine-tuning to generalize to new tasks. ViVLA explicitly bridges the task and embodiment gap between disparate agents (e.g., human video → robot policy), providing robust generalization to both unseen tasks and unseen robot morphologies (Chen et al., 8 Dec 2025).

2. Architecture and Latent Action Modeling

Input Encoding and Joint Processing

ViVLA encodes three input modalities (a minimal fusion sketch follows this list):

  • Expert demonstration video: Frames $\{v_1, \ldots, v_T\}$ are sparsely sampled and embedded using a Vision Transformer (ViT) with window-based attention (Qwen2.5-VL backbone). The embeddings are merged into the LLM token space via a lightweight MLP vision–language fusion block.
  • Language instruction: Processed using the Qwen2.5 tokenizer and transformer layers.
  • Robot observation: Real-time images $\{o_t\}$ are embedded using the same ViT, merged to match the LLM’s latent space.
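A minimal sketch of this multimodal fusion is shown below; the module structure, dimensions, and names (e.g., `MultimodalEncoder`, `vit_dim`, `llm_dim`) are illustrative assumptions rather than the released implementation, which builds on Qwen2.5-VL.

```python
import torch
import torch.nn as nn

class MultimodalEncoder(nn.Module):
    """Illustrative fusion of expert-video, instruction, and observation tokens
    into a single sequence for the LLM backbone (all dimensions assumed)."""

    def __init__(self, vit_dim=1024, llm_dim=3584):
        super().__init__()
        # Lightweight MLP projecting ViT patch embeddings into the LLM token space.
        self.vision_proj = nn.Sequential(
            nn.Linear(vit_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, video_feats, obs_feats, text_embeds):
        # video_feats: (B, T_v, vit_dim)  sparsely sampled expert-video features
        # obs_feats:   (B, T_o, vit_dim)  real-time robot camera features
        # text_embeds: (B, T_l, llm_dim)  instruction embeddings from the tokenizer
        video_tokens = self.vision_proj(video_feats)
        obs_tokens = self.vision_proj(obs_feats)
        # One joint token sequence: expert video, then instruction, then observation.
        return torch.cat([video_tokens, text_embeds, obs_tokens], dim=1)
```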

Latent Action Tokenizer (“Expert–Agent Bridge”)

A central architectural innovation is the explicit construction of a latent action vocabulary shared by both expert and agent domains. The latent action encoder $\mathcal{E}$ extracts DINOv2 image features from frame pairs $(f_t, f_{t+H})$, concatenates them with $l_z$ learnable tokens, and processes them via a multi-layer spatiotemporal transformer. The resultant continuous latent tokens $z^e_t$ are vector-quantized:

$$z^q_t = \mathrm{VQ}(z^e_t) \in \{1, \ldots, K\}^{l_z}$$

The decoder $\mathcal{D}$ reconstructs next-frame predictions from these quantized latents, independent of action labels. This process enables robust alignment between video-based demonstration semantics and robot control spaces.
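A compact sketch of the vector-quantization step is given below; the codebook size, latent dimension, and the nearest-neighbor lookup with a straight-through estimator are standard VQ-VAE choices assumed here, not necessarily the paper's exact implementation.

```python
import torch
import torch.nn as nn

class LatentActionQuantizer(nn.Module):
    """Nearest-neighbor vector quantization of continuous latent action tokens
    z^e_t into discrete codes z^q_t (codebook size K and dim are assumed)."""

    def __init__(self, codebook_size=64, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z_e):
        # z_e: (B, l_z, dim) continuous latents from the spatiotemporal encoder E,
        # computed from DINOv2 features of the frame pair (f_t, f_{t+H}).
        codes = self.codebook.weight[None].expand(z_e.size(0), -1, -1)
        indices = torch.cdist(z_e, codes).argmin(dim=-1)   # codes in {1..K}^{l_z}
        z_q = self.codebook(indices)
        # Straight-through estimator: copy gradients from z_q back to the encoder.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices
```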

Policy Decoding Pipeline

The core ViVLA pipeline utilizes the frozen Qwen2.5-VL transformer augmented with “query tokens” for both latent actions (LACT) and robot actions (ACT). ViVLA conducts parallel decoding: a sequence of LACT tokens is predicted to yield latent action plans, while ACT tokens decode to continuous robot actions. The action decoder pools ACT embeddings through attention and an MLP head, yielding the predicted control signals $\hat{a}_t$.
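The attention-pooling action head can be sketched as follows; the hidden size, number of heads, action chunk length, and action dimensionality are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ActionDecoder(nn.Module):
    """Pools ACT query-token embeddings from the frozen backbone via attention and
    maps them to continuous robot actions (chunk length and action_dim assumed)."""

    def __init__(self, hidden=3584, action_dim=7, chunk=8, heads=8):
        super().__init__()
        self.pool_query = nn.Parameter(torch.randn(1, chunk, hidden) * 0.02)
        self.attn = nn.MultiheadAttention(hidden, num_heads=heads, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden // 4), nn.GELU(), nn.Linear(hidden // 4, action_dim)
        )

    def forward(self, act_embeds):
        # act_embeds: (B, N_act, hidden) ACT token outputs, decoded in parallel
        # alongside the LACT (latent action) tokens by the backbone.
        query = self.pool_query.expand(act_embeds.size(0), -1, -1)
        pooled, _ = self.attn(query, act_embeds, act_embeds)
        return self.head(pooled)   # (B, chunk, action_dim) predicted actions \hat{a}_t
```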

3. Training Regimes and Objective Functions

ViVLA employs multi-component objectives across its modular pipeline:

  • Latent action VQ-VAE loss: Enforces discrete codebook alignment of latent action tokens.
  • Reconstruction loss: Minimizes the $L_1$ error between predicted and ground-truth future frames.
  • Action-centric cycle consistency: Constrains feature representations by reconstructing image–latent–image cycles.
  • Adversarial alignment: Employs local and global discriminators for domain adaptation between video and robot feature distributions.
  • Latent action prediction loss: Cross-entropy loss on predicted latent actions.
  • Robot action regression: $L_1$ loss between predicted and ground-truth robot actions.

The aggregate objective is

$$\mathcal{L}_{\mathrm{total}} = \lambda_{VQ}\mathcal{L}_{VQ} + \lambda_{rec}\mathcal{L}_{rec} + \lambda_{C}\mathcal{L}_{C} + \lambda_{GAN}\mathcal{L}_{GAN} + \lambda_{z}\mathcal{L}_{z} + \lambda_{a}\mathcal{L}_{a},$$

where each term is weighted as determined during experimentation.
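In code, the aggregate objective reduces to a weighted sum of the individual terms; the weight values below are placeholders, not the settings reported in the paper.

```python
# Weighted multi-task objective; the lambda values are illustrative placeholders.
LOSS_WEIGHTS = {"vq": 1.0, "rec": 1.0, "cycle": 0.5, "gan": 0.1, "latent": 1.0, "action": 1.0}

def total_loss(losses, weights=LOSS_WEIGHTS):
    """losses: dict of scalar tensors keyed like `weights` (vq, rec, cycle, ...)."""
    return sum(weights[k] * losses[k] for k in weights)
```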

Training uses the AdamW optimizer, a batch size of 256, and temporal–spatial masking for regularization. Fine-tuning for new robot targets updates the action decoder fully and applies LoRA to the vision–language backbone.
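A minimal sketch of this fine-tuning setup using the Hugging Face `peft` library follows; the LoRA rank, alpha, target module names, and learning rate are assumptions, and `backbone` / `action_decoder` stand in for the vision–language model and the action head.

```python
import torch
from peft import LoraConfig, get_peft_model  # Hugging Face PEFT

def build_finetune_optimizer(backbone, action_decoder, lr=1e-4):
    """Apply LoRA to the vision-language backbone and fully fine-tune the action
    decoder; rank, alpha, and target modules are illustrative assumptions."""
    lora_cfg = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    )
    backbone = get_peft_model(backbone, lora_cfg)       # only adapters are trainable
    params = [
        {"params": [p for p in backbone.parameters() if p.requires_grad]},
        {"params": action_decoder.parameters()},        # full decoder updates
    ]
    return backbone, torch.optim.AdamW(params, lr=lr, weight_decay=0.01)
```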

4. Expert–Agent Data Generation and Scaling

ViVLA’s broad generalization capability is supported by a large-scale expert-agent data synthesis pipeline:

  • Human video sources: Over 7,400 egocentric human demonstration videos (Ego4D, EgoDex) covering 100+ tasks.
  • Synthetic robot pairing: Hand/object poses estimated with HaMeR and FoundationPose are mapped to 6D robot end-effector trajectories. Each video is automatically segmented into grasp and manipulation sub-clips. 3D Gaussian Splatting is used to build 4D scenes (objects and simulated robot URDFs), which are then used for replay via motion planning.
  • Augmentations: Multi-view rendering and randomized textures/light conditions ensure robustness.
  • Dataset summary: 89,736 synthetic human-to-robot pairs and 803,175 public expert-agent pairings (selected via high similarity in sentence-BERT embedding space; see the sketch at the end of this section), yielding 892,911 trajectories.

This approach supports coverage of over 100 manipulation tasks, including all 130 tasks in LIBERO’s benchmark.
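The instruction-similarity pairing used to select public expert-agent pairs can be sketched with the `sentence-transformers` library as below; the embedding model name and similarity threshold are assumptions.

```python
from sentence_transformers import SentenceTransformer, util

def pair_by_instruction(expert_texts, agent_texts, threshold=0.8):
    """Match each expert demo to the agent demo whose task description is closest
    in sentence-BERT embedding space (model and threshold are assumptions)."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    e_emb = model.encode(expert_texts, convert_to_tensor=True)
    a_emb = model.encode(agent_texts, convert_to_tensor=True)
    sims = util.cos_sim(e_emb, a_emb)               # (n_expert, n_agent) cosine matrix
    pairs = []
    for i in range(sims.size(0)):
        j = int(sims[i].argmax())
        if float(sims[i, j]) >= threshold:          # keep only high-similarity pairs
            pairs.append((i, j))
    return pairs
```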

5. Evaluation, Results, and Ablation Studies

Benchmark Results

ViVLA demonstrates state-of-the-art performance on the LIBERO suite and real-robot experiments:

  • LIBERO Benchmarks:
    • On unseen LIBERO tasks, ViVLA achieves a 0.70 success rate, an absolute gain of more than 30 points over prior VLA and OSIL methods.
    • Cross-embodiment (UR→Franka): ViVLA_R achieves 0.71 success, over twice that of AWDA_R at 0.32.
  • Real-World Human Video Transfer:
    • On 12 real-world tasks (Franka robot), ViVLA attains seen/unseen rates of 0.96/0.74, whereas the best baseline, AWDA, achieves only 0.36 on unseen tasks.

Method         LIBERO Seen   LIBERO Unseen
Diff. Policy   0.76          0.01
OpenVLA        0.82          0.05
UniVLA         0.95          0.16
AWDA           0.71          0.40
ViVLA          0.98          0.70

Ablations

  • Removing latent-action prediction decreases unseen task performance from 0.71 to 0.48.
  • Removing temporal–spatial masking, adversarial discriminators, or using autoregressive decoding all result in significant performance drops on unseen tasks.
  • Both language and video inputs are essential; ablating either results in unseen-task success falling below 0.5.

Robustness

ViVLA is robust to variations in object counts, spatial layouts, and environmental factors (viewpoint, lighting), experiencing less than 10% performance degradation.

6. Comparative Position and Limitations

ViVLA establishes a new capability frontier for one-shot skill acquisition from demonstration in robotics. In contrast to prior Vision-Language-Action and one-shot imitation approaches (OSIL, AWDA, T-OSVI), which have limited cross-embodiment transfer (≤ 0.40 success on unseen tasks), ViVLA achieves > 0.65 on unseen tasks and robust cross-agent generalization.

Limitations include failure cases due to perceptual occlusions (e.g., static camera with occluded gripper or low visual coverage). Proposed mitigations include egocentric wrist-mounted cameras and further data augmentation with recovery trajectory samples. Current expert–agent data synthesis relies on manual curation; scaling to Internet-wide automatic video mining is a planned future direction (Chen et al., 8 Dec 2025).


For the astrophysics visual analytics system VIA Lactea Visual Analytics (ViVLA), see (Vitello et al., 2018). For an alternative approach to latent action modeling within the Vision-Language-Latent-Action paradigm for manipulation robotics, see also villa-X (Chen et al., 31 Jul 2025).
