EgoVLA: Egocentric Vision-Language-Action Model
- EgoVLA is a vision-language-action framework that integrates egocentric human videos to learn robotic manipulation policies.
- It employs an NVILA-2B based backbone with cross-modal fusion and dedicated MLPs to align visual, language, and proprioceptive inputs.
- After fine-tuning on robot teleoperation data, it achieves higher success rates than a variant without human pretraining and an ACT baseline, and it generalizes to unseen visual backgrounds.
EgoVLA is a class of Vision-Language-Action (VLA) models explicitly designed for learning robotic manipulation policies from large-scale egocentric human videos. It integrates multimodal perception, action prediction, and retargeting to robotic platforms, bridging the gap between human demonstration and policy transfer. By leveraging the diversity and scale of human egocentric video, EgoVLA models acquire transferable manipulation priors, enabling effective few-shot generalization to robotic hardware after fine-tuning. The approach is structured around unified data representations, model architectures, and simulation-to-real transfer pipelines (Yang et al., 16 Jul 2025).
1. Model Architecture and Representation
The EgoVLA architecture is centered on a vision–language backbone based on NVILA-2B, incorporating both visual and linguistic modalities. The visual encoder processes 6 egocentric RGB frames, sampled at 0.2 s intervals, while a Transformer-based language encoder embeds a succinct instruction. Cross-modal attention layers then fuse these modalities into a shared latent representation.
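The frame sampling described above can be sketched as follows. This is a minimal illustration, assuming the history consists of the current frame plus five earlier ones and that the camera stream runs at 30 Hz; the function name and both assumptions are hypothetical and not taken from the source.

```python
def history_frame_indices(i_now, stream_hz=30.0, num_frames=6, interval_s=0.2):
    """Indices (oldest first) of the frames in one visual history window.

    Assumes (not stated in the source) that the window is the current frame
    plus earlier frames spaced `interval_s` apart in a `stream_hz` stream.
    """
    stride = round(stream_hz * interval_s)  # 6 raw frames at the assumed 30 Hz
    if stride < 1:
        raise ValueError("stream rate too low for the requested interval")
    # Clamp at 0 so early timesteps repeat the first frame instead of failing.
    return [max(i_now - k * stride, 0) for k in range(num_frames - 1, -1, -1)]


print(history_frame_indices(60))  # [30, 36, 42, 48, 54, 60]
```

With these settings the six frames span one second of history, which matches the 1 s horizon used elsewhere in the description.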
Human proprioceptive state at each timestep is encoded as wrist translation, wrist rotation (rot6D), and hand pose parameters (MANO PCA space). Each component is passed through a dedicated MLP to generate embeddings that are integrated into the shared latent representation, aligning action representation geometry across domains.
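For readers unfamiliar with rot6D, the sketch below shows the commonly used conversion between a 6-D rotation vector and a 3×3 rotation matrix via Gram-Schmidt orthogonalization. It assumes the convention in which the six numbers are the first two columns of the rotation matrix; the source does not state which convention EgoVLA uses, and the helper names are illustrative.

```python
import math


def _normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    if n == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / n for x in v]


def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def rot6d_to_matrix(r6):
    """Map a 6-D rotation vector to a 3x3 rotation matrix (row-major lists).

    Assumed convention: r6 holds the first two matrix columns, possibly
    unnormalized; Gram-Schmidt makes them orthonormal.
    """
    a1, a2 = list(r6[:3]), list(r6[3:])
    b1 = _normalize(a1)
    proj = sum(x * y for x, y in zip(b1, a2))
    b2 = _normalize([y - proj * x for x, y in zip(b1, a2)])
    b3 = _cross(b1, b2)  # third column completes a right-handed frame
    return [[b1[i], b2[i], b3[i]] for i in range(3)]


def matrix_to_rot6d(R):
    """Inverse mapping: keep the first two columns of R."""
    return [R[i][0] for i in range(3)] + [R[i][1] for i in range(3)]
```

Because the mapping orthonormalizes its input, a network can output any six real numbers and still obtain a valid rotation, which is the usual motivation for this parameterization.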
The action head is a 6-layer Transformer (hidden size 1536) that accepts action-query tokens and the encoded proprioceptive state, autoregressively forecasting a sequence of future actions (Yang et al., 16 Jul 2025).
2. Training Data and Preprocessing
EgoVLA is pre-trained on approximately 500K egocentric video–action samples collected from four primary datasets: HOI4D, HOT3D, HoloAssist, and TACO. The data encompass diverse indoor settings and manipulation tasks spanning 151 unique tool–action–object combinations. Visual sequences are sampled at 3 FPS with 1 s sliding windows.
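One plausible reading of the sampling scheme above is sketched here: window start points are taken at 3 FPS from the raw video, and each window covers 1 s of frames. The raw frame rate of 30 FPS, the exact window semantics, and the function name are assumptions made for illustration only.

```python
def sliding_windows(num_frames, raw_fps=30, sample_fps=3, window_s=1.0):
    """Return (start, end) raw-frame index pairs, end exclusive.

    Assumptions (not from the source): windows start every
    raw_fps / sample_fps raw frames and each spans window_s seconds.
    """
    if raw_fps % sample_fps != 0:
        raise ValueError("raw_fps must be a multiple of sample_fps in this sketch")
    stride = raw_fps // sample_fps           # 10 raw frames between window starts
    length = int(round(raw_fps * window_s))  # 30 raw frames per 1 s window
    return [(s, s + length) for s in range(0, num_frames - length + 1, stride)]


print(sliding_windows(60))  # [(0, 30), (10, 40), (20, 50), (30, 60)]
```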
All hand/wrist poses are homogenized into the MANO model’s 15D PCA space, and wrist rotations are parameterized by rot6D. For language, ground-truth captions are used when available; otherwise, placeholders are inserted. The unified proprioceptive-action sequence aligns human and robot data for subsequent retargeting (Yang et al., 16 Jul 2025).
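The unified per-hand representation can be illustrated with a small container that packs wrist translation (3-D), rot6D rotation (6-D), and MANO PCA hand pose (15-D) into one flat vector. The class name and the ordering of the components are hypothetical; the source only specifies the components and their dimensions.

```python
from dataclasses import dataclass
from typing import Sequence

TRANS_DIM, ROT6D_DIM, MANO_PCA_DIM = 3, 6, 15
HAND_DIM = TRANS_DIM + ROT6D_DIM + MANO_PCA_DIM  # 24 values per hand


@dataclass
class HandState:
    translation: Sequence[float]  # wrist translation, 3-D
    rot6d: Sequence[float]        # wrist rotation, 6-D
    mano_pca: Sequence[float]     # hand pose in MANO PCA space, 15-D

    def to_vector(self):
        """Flatten to a 24-D list in the (assumed) order trans, rot6d, pose."""
        parts = (self.translation, self.rot6d, self.mano_pca)
        for part, dim in zip(parts, (TRANS_DIM, ROT6D_DIM, MANO_PCA_DIM)):
            if len(part) != dim:
                raise ValueError(f"expected {dim} values, got {len(part)}")
        return [float(x) for part in parts for x in part]

    @classmethod
    def from_vector(cls, vec):
        """Inverse of to_vector."""
        if len(vec) != HAND_DIM:
            raise ValueError(f"expected {HAND_DIM} values, got {len(vec)}")
        a, b = TRANS_DIM, TRANS_DIM + ROT6D_DIM
        return cls(list(vec[:a]), list(vec[a:b]), list(vec[b:]))
```

Two such vectors, one per hand, give a 48-D human state under this layout.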
3. Human Action Prediction and Losses
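The model’s objective is to accurately forecast short-horizon (1 s) future human hand and wrist poses. The composite loss function is a weighted sum of three terms:
$$\mathcal{L} = \lambda_{\text{trans}}\,\mathcal{L}_{\text{trans}} + \lambda_{\text{rot}}\,\mathcal{L}_{\text{rot}} + \lambda_{\text{joint}}\,\mathcal{L}_{\text{joint}}$$
where $\mathcal{L}_{\text{trans}}$, $\mathcal{L}_{\text{rot}}$, and $\mathcal{L}_{\text{joint}}$ penalize errors in the predicted wrist translation, wrist rotation (rot6D), and hand pose (MANO) parameters, respectively, and $\lambda_{\text{trans}}$, $\lambda_{\text{rot}}$, $\lambda_{\text{joint}}$ are the corresponding weights.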
This loss structure reflects the differing dynamic ranges and importances of translation, orientation, and hand articulation for downstream retargeting and control (Yang et al., 16 Jul 2025).
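A weighted composite loss over translation, rotation, and hand articulation can be sketched as below. This is only an illustration of the structure: it uses mean squared error for every term and default weights of 1.0, both of which are placeholder assumptions, since the per-term definitions and weight values are not given here. The dictionary keys are likewise hypothetical.

```python
def mean_squared_error(pred, target):
    if len(pred) != len(target):
        raise ValueError("prediction and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def composite_loss(pred, target, w_trans=1.0, w_rot=1.0, w_joint=1.0):
    """Weighted sum of per-component errors.

    `pred` and `target` are dicts with keys "trans" (3-D), "rot6d" (6-D),
    and "joint" (15-D). MSE for every term and the default weights are
    assumptions of this sketch, not the published settings.
    """
    terms = {k: mean_squared_error(pred[k], target[k])
             for k in ("trans", "rot6d", "joint")}
    total = (w_trans * terms["trans"]
             + w_rot * terms["rot6d"]
             + w_joint * terms["joint"])
    return total, terms
```

Separate weights let the terms be balanced even though translations (in metres), rot6D entries, and PCA coefficients live on different numeric scales.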
4. Retargeting Human Predictions to Robot Actions
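To enable transfer of human demonstration to robot manipulators, EgoVLA establishes a unified actuation space via the MANO hand model. Given a robot hand pose, the system solves for human MANO parameters $\Theta^*$ that minimize the average fingertip position error:
$$\Theta^* = \arg\min_{\Theta}\ \frac{1}{N}\sum_{i=1}^{N} \ell\big(\mathbf{p}_i(\Theta),\ \mathbf{p}_i^{\text{robot}}\big)$$
where $\mathbf{p}_i(\Theta)$ are the fingertip positions from the kinematic chain, $\mathbf{p}_i^{\text{robot}}$ are the corresponding robot fingertip positions, $N$ is the number of fingertips, and $\ell$ is a per-fingertip position error.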
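The fingertip-matching objective can be illustrated with a toy example. The sketch below uses Euclidean distance as the per-fingertip error and a brute-force search over candidate parameters; both are simplifying assumptions, and `forward_kinematics` stands in for a real MANO forward pass that is not implemented here.

```python
import math


def mean_fingertip_error(human_tips, robot_tips):
    """Average Euclidean distance between matched fingertip positions."""
    if len(human_tips) != len(robot_tips) or not human_tips:
        raise ValueError("need equally sized, non-empty fingertip lists")
    return sum(math.dist(h, r) for h, r in zip(human_tips, robot_tips)) / len(human_tips)


def best_candidate(candidates, forward_kinematics, robot_tips):
    """Pick the candidate parameters whose fingertips best match the robot's.

    `forward_kinematics` is a hypothetical callable mapping parameters to a
    list of 3-D fingertip positions; a real system would use a gradient-based
    optimizer over MANO parameters instead of enumerating candidates.
    """
    return min(candidates,
               key=lambda theta: mean_fingertip_error(forward_kinematics(theta), robot_tips))


# Toy kinematics: one scalar "curl" parameter scales five fingertip offsets.
def toy_fk(curl):
    return [(curl * i, 0.0, 0.0) for i in range(1, 6)]


print(best_candidate([0.0, 0.25, 0.5, 0.75], toy_fk, toy_fk(0.5)))  # 0.5
```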
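Predicted wrist pose $\mathbf{T}_{\text{wrist}}$ in the device camera frame is mapped to the robot’s base frame, and inverse kinematics solves for robot joint angles $\mathbf{q}^*$ via
$$\mathbf{q}^* = \arg\min_{\mathbf{q}}\ \big\lVert \mathrm{FK}(\mathbf{q}) - \mathbf{T}_{\text{wrist}}^{\text{base}} \big\rVert$$
where $\mathrm{FK}$ denotes the robot’s forward kinematics and $\mathbf{T}_{\text{wrist}}^{\text{base}}$ is the wrist pose expressed in the base frame.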
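The change of reference frame in this step amounts to composing homogeneous transforms. The following sketch assumes poses are stored as 4×4 row-major matrices and that the camera-to-base transform is known from calibration; the function names are illustrative, and the inverse-kinematics solve itself is not shown.

```python
def matmul4(A, B):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def camera_to_base(T_base_cam, T_cam_wrist):
    """Express a wrist pose given in the camera frame in the robot base frame.

    T_base_cam: pose of the camera in the base frame (assumed known).
    T_cam_wrist: predicted wrist pose in the camera frame.
    """
    return matmul4(T_base_cam, T_cam_wrist)


# Camera rotated 90 degrees about z and shifted 1 m along x in the base frame.
T_base_cam = [[0, -1, 0, 1],
              [1,  0, 0, 0],
              [0,  0, 1, 0],
              [0,  0, 0, 1]]
# Wrist 1 m along the camera x-axis, no rotation relative to the camera.
T_cam_wrist = [[1, 0, 0, 1],
               [0, 1, 0, 0],
               [0, 0, 1, 0],
               [0, 0, 0, 1]]

T_base_wrist = camera_to_base(T_base_cam, T_cam_wrist)
print([row[3] for row in T_base_wrist[:3]])  # [1, 1, 0]
```

The resulting base-frame pose is what an IK solver would take as its target.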
At inference, a dedicated MLP (four layers, [64, 128, 64] hidden sizes) predicts 12-DOF robot hand commands, trained on retargeted human–robot demonstration pairs (Yang et al., 16 Jul 2025).
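The stated layout (four layers, hidden sizes [64, 128, 64], 12 outputs) can be made concrete by listing the weight shapes. The input dimension is not given in the text, so the value 15 used in the example (five fingertips × 3 coordinates) is purely a hypothetical choice, as are the function names.

```python
def mlp_layer_shapes(input_dim, hidden=(64, 128, 64), output_dim=12):
    """Return (in_features, out_features) for each linear layer."""
    dims = [input_dim, *hidden, output_dim]
    return list(zip(dims[:-1], dims[1:]))


def count_parameters(shapes):
    """Weights plus biases for fully connected layers."""
    return sum(i * o + o for i, o in shapes)


shapes = mlp_layer_shapes(input_dim=15)  # input_dim=15 is an assumption
print(shapes)                    # [(15, 64), (64, 128), (128, 64), (64, 12)]
print(count_parameters(shapes))  # 18380
```

Whatever the true input size, a network of this shape has on the order of tens of thousands of parameters, so it is very light compared with the VLA backbone.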
5. Fine-Tuning and Policy Adaptation on Robot Demonstrations
Following extensive human egocentric pretraining, the VLA is fine-tuned on moderate-scale robot teleoperation data from the Ego Humanoid Manipulation Benchmark. Each of the 12 manipulation tasks has 100 successful teleoperation episodes, collected in simulation with Open-TeleVision and physical hand controllers.
The entire model (vision–language backbone and action head) is fine-tuned for 115 epochs (following 20 epochs of human-only pretraining), with a batch configuration of 16 videos × 8 action chunks × 4 GPUs. The learning rate follows a decay schedule, dropping from its initial value to a lower value after 100 epochs (Yang et al., 16 Jul 2025).
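The training budget above can be tallied with a small configuration object. The sketch assumes the batch figures are per-GPU counts that multiply across devices, which is one reading of "16 videos × 8 action chunks × 4 GPUs"; the class and field names are hypothetical, and learning-rate values are omitted because they are not given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingBudget:
    videos_per_gpu: int = 16
    action_chunks_per_video: int = 8
    num_gpus: int = 4
    pretrain_epochs: int = 20    # human-only pretraining
    finetune_epochs: int = 115   # robot fine-tuning
    lr_decay_epoch: int = 100    # epoch after which the rate is lowered

    @property
    def samples_per_step(self) -> int:
        return self.videos_per_gpu * self.action_chunks_per_video * self.num_gpus

    @property
    def total_epochs(self) -> int:
        return self.pretrain_epochs + self.finetune_epochs


budget = TrainingBudget()
print(budget.samples_per_step)  # 512
print(budget.total_epochs)      # 135
```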
6. Evaluation: Ego Humanoid Manipulation Benchmark
The Ego Humanoid Manipulation Benchmark comprises 12 tasks subdivided into short-horizon (atomic) and long-horizon (multi-stage) manipulation scenarios:
- Short-horizon: Push-Box, Flip-Mug, Pour-Balls, Close-Drawer, Open-Drawer, Open-Laptop, Stack-Can
- Long-horizon: Sort-Cans, Insert-Cans, Unload-Cans, Insert-And-Unload-Cans, Stack-Can-Into-Drawer
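Observations encompass egocentric RGB-D, end-effector states, hand joint configurations, and the language instruction. Each robot action is a 36-dimensional vector spanning both arms and hands, and the control frequency is 30 Hz. The per-hand breakdown of 6-D rot6D rotation, 3-D translation, and 15-D hand parameters (24 dimensions per hand) corresponds to the human MANO-based representation rather than to the 36-D robot command.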
The evaluation protocol rigorously tests generalization by exposing agents to both seen and 22 held-out (unseen) visual configurations, with randomized object positions and backgrounds. Key metrics:
- Success Rate (SR): fraction of episodes where the final goal is reached
- Progress Rate (PSR): fraction of subtasks completed (for composite/hierarchical tasks) (Yang et al., 16 Jul 2025)
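The two metrics can be computed from per-episode records as sketched below. The record format (keys `success`, `subtasks_done`, `subtasks_total`) is hypothetical, and averaging the per-episode completion fraction is an assumption about how the progress rate is aggregated.

```python
def success_rate(episodes):
    """Fraction of episodes in which the final goal was reached."""
    if not episodes:
        raise ValueError("no episodes to evaluate")
    return sum(1 for e in episodes if e["success"]) / len(episodes)


def progress_rate(episodes):
    """Mean fraction of subtasks completed per episode."""
    if not episodes:
        raise ValueError("no episodes to evaluate")
    return sum(e["subtasks_done"] / e["subtasks_total"] for e in episodes) / len(episodes)


episodes = [
    {"success": True,  "subtasks_done": 3, "subtasks_total": 3},
    {"success": False, "subtasks_done": 1, "subtasks_total": 2},
]
print(success_rate(episodes))   # 0.5
print(progress_rate(episodes))  # 0.75
```

The example shows why the two numbers differ: an episode that fails at the last subgoal still contributes partial credit to the progress rate.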
7. Empirical Outcomes, Ablations, and Limitations
On seen backgrounds, EgoVLA achieves a short-horizon SR of 77.8% (vs. 64.6% for EgoVLA-NoPretrain and 24.9% for the ACT baseline) and a long-horizon SR of 45.9% (vs. 26.7% for EgoVLA-NoPretrain and 2.2% for ACT). On unseen backgrounds, the short-horizon SR is 76.3% (vs. 62.6% for EgoVLA-NoPretrain), and the long-horizon SR is 28.8% (vs. 11.2%).
Ablations reveal that halving the robot demonstration scale causes significant SR degradation (short: 77.8%→48.2%; long: 45.9%→7.4%). Increasing human video scale or diversity directly improves downstream SR/PSR. The PSR drop on unseen scenes is minor, suggesting failures cluster at the final subgoals rather than throughout the episode (Yang et al., 16 Jul 2025).
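The gaps implied by the reported success rates can be worked out directly. The snippet below only restates numbers quoted in this section and computes differences in percentage points; the dictionary layout and the label `half_robot_data` are illustrative names, not terms from the source.

```python
# Success rates (%) on seen backgrounds, as quoted in this section.
SEEN_SR = {
    "short": {"EgoVLA": 77.8, "EgoVLA-NoPretrain": 64.6, "ACT": 24.9, "half_robot_data": 48.2},
    "long":  {"EgoVLA": 45.9, "EgoVLA-NoPretrain": 26.7, "ACT": 2.2,  "half_robot_data": 7.4},
}


def gap_points(a, b):
    """Difference in percentage points, rounded to one decimal."""
    return round(a - b, 1)


for horizon, sr in SEEN_SR.items():
    pretrain_gain = gap_points(sr["EgoVLA"], sr["EgoVLA-NoPretrain"])
    data_drop = gap_points(sr["EgoVLA"], sr["half_robot_data"])
    print(horizon, pretrain_gain, data_drop)
# short 13.2 29.6
# long 19.2 38.5
```

On these figures, halving the robot demonstrations costs more success rate than removing human pretraining, for both horizons.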
Key findings:
- Pretraining on egocentric human videos imparts transferable manipulation priors and advances both in-domain and out-of-distribution generalization.
- The unified MANO-based action space enables seamless robot–human alignment and efficient fine-tuning.
- EgoVLA outperforms the specialist ACT baseline on both short-horizon and long-horizon tasks.
Strengths include scalability via abundant human video, robust generalization with limited robot data, and a unified representation. Limitations include the need for precise human hand/wrist annotations and the dependence on modest robotic teleoperation data for final adaptation. Zero-shot robot performance is negligible without such adaptation (Yang et al., 16 Jul 2025).
Summary Table: EgoVLA Key Characteristics
| Aspect | Details | Source |
|---|---|---|
| Architecture | NVILA-2B VLM, cross-modal fusion, proprioception, 6-layer action head transformer | (Yang et al., 16 Jul 2025) |
| Data | egocentric human video (500K samples), 4 datasets, MANO pose unification | (Yang et al., 16 Jul 2025) |
| Retargeting | MANO-based 15D actuation space, IK wrist alignment, 12-DOF hand MLP | (Yang et al., 16 Jul 2025) |
| Robotic Fine-Tuning | Isaac Sim demos, full model finetuning, moderate-scale data requirement | (Yang et al., 16 Jul 2025) |
| Benchmarks/Eval | Diverse bi-manual tasks, SR/PSR metrics, seen/unseen scene evaluation | (Yang et al., 16 Jul 2025) |
| Performance | 77.8% short, 45.9% long horizon SR (seen); strong generalization; outperforms baselines | (Yang et al., 16 Jul 2025) |
EgoVLA represents an end-to-end, generalist framework for robotic manipulation derived from egocentric human demonstrations, demonstrating how large-scale vision-language-action pretraining paired with principled retargeting can enable sample-efficient and robust policy learning for dexterous robot control (Yang et al., 16 Jul 2025).