- The paper introduces a novel task-centric latent action learning approach that decouples task-relevant dynamics from task-irrelevant variations, learned from videos in an unsupervised manner without action annotations.
- It employs an auto-regressive vision-language model to predict compressed latent action tokens, drastically reducing pretraining compute (960 vs 21,500 A100-hours).
- The framework achieves superior performance across tasks, with manipulation success at 95.4% and navigation oracle rate at 47.1%, while generalizing across diverse robot embodiments.
UniVLA (2505.06111) is a unified vision-language-action (VLA) framework designed to enable robot policy learning across diverse environments and embodiments without relying heavily on action-annotated data. The core idea is to learn task-centric latent action representations directly from videos, including heterogeneous robot datasets and egocentric human videos. This approach aims to overcome the limitations of traditional VLA models that are often restricted to specific physical specifications and struggle with knowledge transfer across different robot types and tasks.
The framework consists of three main stages:
- Task-centric Latent Action Learning: This stage focuses on creating an embodiment-agnostic action space by deriving pseudo action labels (latent action tokens) from large-scale videos in an unsupervised manner.
- Inverse Dynamics Model (IDM): An IDM-based encoder I(a_t | o_t, o_{t+k}) is trained to infer latent actions from video frame pairs {o_t, o_{t+k}} separated by a fixed interval.
- Forward Dynamics Model (FDM): An FDM-based decoder F(o_{t+k} | o_t, a_t) is trained to predict future observations given the current observation and the inferred latent action.
- DINOv2 Features: Instead of raw pixels, DINOv2 spatial patch features [oquab2023dinov2] are used as inputs and prediction targets (O_t, O_{t+k}). This provides semantic richness and object-centric priors, making the model less susceptible to noisy, task-irrelevant visual details than pixel-based methods [ye2024lapa]. The objective minimizes the DINOv2 embedding reconstruction error ||Ô_{t+k} − O_{t+k}||².
- Latent Action Quantization: Latent actions are discretized with a VQ-VAE [van2017neural] using a codebook C of fixed size |C|. This compresses information and aligns with the discrete token space used by transformer policies.
- Task-centric Decoupling: A novel two-stage training process is introduced to disentangle task-centric dynamics from task-irrelevant changes.
- Stage 1: Language instructions (ℓ) condition both the encoder and the decoder. Because the instruction already conveys the task-relevant information, the initial set of latent actions (ã_TI) is pushed to encode task-irrelevant environmental changes.
- Stage 2: The codebook for ã_TI is frozen, and a new set of task-centric latent actions (ã_TC) is learned on top of it. These new tokens are optimized to capture the task-related dynamics (e.g., object manipulation). This explicit decoupling makes the latent action space more informative for policy learning (a simplified sketch of the latent action model follows below).
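The snippet below is a minimal PyTorch sketch of this latent-action objective, assuming DINOv2 features pooled to a single vector per frame; the paper's actual IDM/FDM are transformer-based, operate on patch-level features, and include the language-conditioned two-stage codebook, all omitted here for brevity. Module names, dimensions, and loss weights are illustrative rather than the authors' configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionModel(nn.Module):
    """Toy IDM/FDM pair with a VQ bottleneck over (pooled) DINOv2 features."""

    def __init__(self, feat_dim=1024, codebook_size=16, n_tokens=4, hidden=512):
        super().__init__()
        # IDM encoder I(a_t | o_t, o_{t+k}): frame pair -> continuous latent action
        self.idm = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.GELU(),
            nn.Linear(hidden, n_tokens * hidden),
        )
        self.codebook = nn.Embedding(codebook_size, hidden)  # VQ codebook C
        # FDM decoder F(o_{t+k} | o_t, a_t): current frame + latent action -> future frame
        self.fdm = nn.Sequential(
            nn.Linear(feat_dim + n_tokens * hidden, hidden), nn.GELU(),
            nn.Linear(hidden, feat_dim),
        )
        self.n_tokens, self.hidden = n_tokens, hidden

    def quantize(self, z):
        # Nearest-neighbour lookup in the codebook.
        dist = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)  # (B, N, |C|)
        idx = dist.argmin(-1)                                           # discrete token ids
        z_q = self.codebook(idx)
        # VQ losses: move codes toward encoder outputs, commit the encoder to the codes.
        vq_loss = F.mse_loss(z_q, z.detach()) + 0.25 * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()                                    # straight-through grad
        return z_q, idx, vq_loss

    def forward(self, o_t, o_tk):
        # o_t, o_tk: DINOv2 features pooled to shape (B, feat_dim) for brevity.
        z = self.idm(torch.cat([o_t, o_tk], dim=-1)).view(-1, self.n_tokens, self.hidden)
        z_q, idx, vq_loss = self.quantize(z)
        o_pred = self.fdm(torch.cat([o_t, z_q.flatten(1)], dim=-1))
        recon = F.mse_loss(o_pred, o_tk)   # || O_hat_{t+k} - O_{t+k} ||^2
        return recon + vq_loss, idx
```

The discrete indices returned by the forward pass play the role of the pseudo action labels: they are what the generalist policy is later trained to predict, while the straight-through estimator lets gradients flow through the codebook lookup during this stage.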
- Pretraining of Generalist Policy: An auto-regressive vision-language model is trained to predict the next latent action token sequence, given visual observations and language instructions.
- Architecture: Built upon the Prismatic-7B VLM [karamcheti2024prismatic], which includes fused SigLIP [zhai2023siglip] and DINOv2 visual encoders, a projection layer, and a LLaMA-2 LLM [touvron2023llama].
- Latent Action Tokens: The LLM's vocabulary is extended with special tokens {ACT_1, ..., ACT_C} corresponding to the latent action codebook indices. The policy π_φ(a_{z,i} | o_t, ℓ, a_{z,<i}) is trained to predict the next latent action token a_{z,i} (see the vocabulary-extension sketch below).
- Efficiency: Operating in a compressed latent action space (16^4 vs 256^7 in OpenVLA) significantly reduces pretraining computation. UniVLA is reported to achieve competitive results with drastically less compute (960 A100-hours) compared to OpenVLA (21,500 A100-hours).
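The following sketch illustrates the vocabulary extension and next-token supervision described above, using a small off-the-shelf causal LM from Hugging Face transformers as a stand-in for the Prismatic-7B backbone. Visual embeddings are omitted, and the token names, codebook size, and example indices are assumptions for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CODEBOOK_SIZE = 16   # assumed |C|
N_ACT_TOKENS = 4     # assumed latent action tokens per step

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Extend the vocabulary with {ACT_1, ..., ACT_C} and grow the embedding table.
act_tokens = [f"<ACT_{i}>" for i in range(1, CODEBOOK_SIZE + 1)]
tokenizer.add_tokens(act_tokens, special_tokens=True)
model.resize_token_embeddings(len(tokenizer))

# One training example: a language prompt followed by the latent action tokens
# produced by the (frozen) latent action model for this observation.
prompt = "What action should the robot take to pick up the red block?"
latent_action = "<ACT_3><ACT_11><ACT_1><ACT_7>"   # hypothetical codebook indices
enc = tokenizer(prompt + latent_action, return_tensors="pt")

# Supervise only the latent-action positions (next-token prediction on a_{z,i}).
labels = enc["input_ids"].clone()
labels[:, :-N_ACT_TOKENS] = -100                   # ignore loss on the prompt tokens
out = model(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"], labels=labels)
out.loss.backward()
```

In the actual framework the prompt is interleaved with projected visual embeddings from the fused encoders; only the latent-action positions contribute to the loss, mirroring π_φ(a_{z,i} | o_t, ℓ, a_{z,<i}).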
- Post-training for Deployment: The pretrained policy is adapted to specific robotic systems by decoding latent actions into executable control signals.
- Action Decoder: Specialized policy heads are added to the VLM backbone. Visual embeddings and latent action embeddings from the VLM are processed through attention pooling and then projected linearly to the target robot's action space. The decoder head is lightweight (12.6M parameters); a minimal sketch of such a head follows this list.
- Efficient Adaptation: Parameter-efficient fine-tuning (LoRA [hu2021lora]) is used, making the total trainable parameters around 123M. The model is trained end-to-end minimizing both next-latent action prediction loss and L1 loss on low-level actions.
- Action Chunks: Latent actions, representing dynamics over ~1 second, are naturally decoded into action chunks, aligning with robot control frequencies for smoother execution.
- History Outputs: Historical latent action outputs from previous timesteps are incorporated into the input prompt at inference time. This provides temporal context and enables the policy to learn from its own past decisions, similar to Chain-of-Thought reasoning in LLMs, which is particularly beneficial for long-horizon tasks.
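Below is a minimal sketch of such a decoder head, assuming attention pooling with a single learnable query followed by a linear map to a flattened action chunk; the embedding dimension, chunk length, and head sizes are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ActionDecoder(nn.Module):
    """Attention pooling over VLM embeddings, then a linear map to an action chunk."""

    def __init__(self, vlm_dim=4096, hidden=512, action_dim=7, chunk_len=12, n_heads=8):
        super().__init__()
        self.down = nn.Linear(vlm_dim, hidden)                   # shrink VLM embeddings
        self.query = nn.Parameter(torch.randn(1, 1, hidden))     # learnable pooling query
        self.pool = nn.MultiheadAttention(hidden, n_heads, batch_first=True)
        self.proj = nn.Linear(hidden, chunk_len * action_dim)    # -> flattened action chunk
        self.chunk_len, self.action_dim = chunk_len, action_dim

    def forward(self, visual_emb, latent_action_emb):
        # visual_emb: (B, N_v, D) visual embeddings from the VLM
        # latent_action_emb: (B, N_a, D) embeddings of the predicted latent action tokens
        ctx = self.down(torch.cat([visual_emb, latent_action_emb], dim=1))
        pooled, _ = self.pool(self.query.expand(ctx.size(0), -1, -1), ctx, ctx)
        return self.proj(pooled.squeeze(1)).view(-1, self.chunk_len, self.action_dim)

# Adaptation minimizes an L1 loss between decoded chunks and ground-truth robot actions.
decoder = ActionDecoder()
pred = decoder(torch.randn(2, 256, 4096), torch.randn(2, 4, 4096))
loss = nn.functional.l1_loss(pred, torch.zeros_like(pred))       # placeholder targets
```

Only this head, plus the LoRA adapters in the backbone, is trained per embodiment, which is what keeps adaptation to a new robot lightweight.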
Here's a simplified conceptual architecture diagram:
```
Observation (o_t)         -> DINOv2 Encoder -> Visual Embeddings (E_v)
Instruction (l)           -> T5 Encoder     -> Instruction Embeddings (ell)
Historical Latent Actions -> Tokenization   -> History Tokens

[Visual Embeddings, Instruction Embeddings, History Tokens]
        -> Prismatic-7B VLM -> Predicted Next Latent Action Tokens (a_z)

Predicted Latent Action Tokens (a_z) + Visual Embeddings (E_v)
        -> Action Decoder -> Robot Actions
```
Practical Implementation Details & Considerations:
- Data Preprocessing: Videos are paired with future frames at a fixed interval (calibrated to ~1 second per dataset) for latent action labeling. DINOv2 features are extracted for image representation. Text instructions are processed by T5.
- Latent Action Space Size: The codebook size |C| and the number N of latent action tokens per step are hyperparameters that determine the complexity of the action space. UniVLA uses a compressed space (e.g., 16^4), which contributes to training efficiency; see the sketch after this list for the arithmetic.
- Computational Requirements: Pretraining is significantly cheaper than previous VLA methods due to the compressed latent action space and efficient policy backbone fine-tuning. Downstream adaptation requires minimal data and compute. Real-world inference speed of 10Hz on an RTX 4090 is reported, suitable for real-time closed-loop control via action chunking.
- Embodiment Adaptation: The unified latent action space allows the generalist policy to be the same across embodiments. Only the lightweight action decoder needs to be specialized and trained for a new robot's physical action space.
- Dealing with Noise: Using DINOv2 features and the task-centric decoupling mechanism helps make the learned latent actions robust to visual noise and task-irrelevant dynamics common in internet videos.
- Long-Horizon Tasks: Incorporating historical latent actions as input tokens is a simple yet effective technique to improve performance on tasks requiring sequential decision-making and temporal context.
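As referenced above, the numbers behind the ~1-second frame pairing and the compressed action-space claim work out as follows; the 30 fps value and the 16-entry, 4-token codebook shape are assumptions used purely for illustration.

```python
# Frame offset for ~1-second latent action labeling, and the action-space comparison.
fps = 30                                   # placeholder video/control frequency of a dataset
k = round(1.0 * fps)                       # pair (o_t, o_{t+k}) so the gap spans ~1 second

codebook_size, n_latent_tokens = 16, 4     # assumed |C| and latent tokens per step
univla_space = codebook_size ** n_latent_tokens   # 16**4 = 65,536 latent action sequences
openvla_space = 256 ** 7                          # 256 bins per dim, 7 action dims per step
print(k, univla_space, openvla_space)      # 30 65536 72057594037927936 (~7.2e16)
```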
Performance & Applications:
UniVLA demonstrates state-of-the-art performance across various manipulation benchmarks (LIBERO, CALVIN, SimplerEnv) and navigation (R2R). Key results include:
- Manipulation: Outperforms OpenVLA and LAPA significantly on LIBERO, achieving 95.4% average success vs. OpenVLA's 76.5% and LAPA's 65.7%. Shows strong performance even when pretrained only on limited data like Bridge-V2 or human videos.
- Navigation: Achieves 47.1% oracle success rate on R2R val-unseen, significantly surpassing baselines like OpenVLA (17.5%) and achieving comparable results to models using full history (NaVid).
- Real-World: Exhibits superior generalizability and robustness on real-robot tasks under variations in lighting, distractors, and novel objects, with an average success rate of 81.7% and average score of 2.63 across four diverse tasks, outperforming OpenVLA (38.3% success, 1.63 score) and LAPA (45% success, 1.95 score).
This work is a significant step towards developing generalist robot policies that can learn from vast, heterogeneous video data available on the internet, bridging the gap between diverse robot embodiments and even human demonstrations through a shared, task-focused latent action representation.