EgoVLA: Egocentric Vision-Language-Action Model

Updated 2 July 2026

EgoVLA is a vision-language-action framework that integrates egocentric human videos to learn robotic manipulation policies.
It employs an NVILA-2B based backbone with cross-modal fusion and dedicated MLPs to align visual, language, and proprioceptive inputs.
After fine-tuning on robot teleoperation data, it achieves higher success rates than a variant without human-video pretraining and generalizes to unseen visual backgrounds.

EgoVLA is a class of Vision-Language-Action (VLA) models explicitly designed for learning robotic manipulation policies from large-scale egocentric human videos. It integrates multimodal perception, action prediction, and retargeting to robotic platforms, bridging the gap between human demonstration and policy transfer. By leveraging the diversity and scale of human egocentric video, EgoVLA models acquire transferable manipulation priors, enabling effective few-shot generalization to robotic hardware after fine-tuning. The approach is structured around unified data representations, model architectures, and simulation-to-real transfer pipelines (Yang et al., 16 Jul 2025).

1. Model Architecture and Representation

The EgoVLA architecture is centered on a vision–language backbone based on NVILA-2B, incorporating both visual and linguistic modalities. The visual encoder processes 6 egocentric RGB frames ( $384 \times 384$ ), sampled at 0.2 s intervals, while a Transformer-based language encoder embeds a succinct instruction. Cross-modal attention layers then fuse these modalities into a shared latent representation $Z$ .
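The frame sampling described above can be illustrated with a small index calculation. This is a minimal sketch, not the authors' data loader: the function name `history_frame_indices`, the 30 FPS source video rate, the choice to count the current frame as one of the six, and the clamping at the start of a clip are all assumptions made for illustration.

```python
import numpy as np

def history_frame_indices(current_idx, video_fps=30.0, num_frames=6, interval_s=0.2):
    """Return indices of `num_frames` frames ending at `current_idx`,
    spaced `interval_s` seconds apart (oldest first).

    video_fps and the clamping policy are assumptions, not from the source.
    """
    stride = int(round(interval_s * video_fps))            # frames between samples
    offsets = np.arange(num_frames - 1, -1, -1) * stride   # e.g. [30, 24, ..., 0]
    # Clamp at 0 so windows near the start of a clip repeat the first frame.
    return np.maximum(current_idx - offsets, 0)

print(history_frame_indices(90))  # [60 66 72 78 84 90]
```

With these assumed settings the six frames span one second of video ending at the current frame.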

Human proprioceptive state at each timestep is encoded as wrist translation $\mathbf{T}_t \in \mathbb{R}^3$ , wrist rotation $\mathbf{R}_t \in SO(3)$ (rot6D), and hand pose parameters $\Theta_t \in \mathbb{R}^{15}$ (MANO PCA space). Each is passed through dedicated MLPs to generate embeddings that are integrated into $Z$ , aligning action representation geometry across domains.
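As a rough illustration of this state layout, the sketch below packs wrist translation, a rot6D wrist rotation, and 15 hand parameters into one 24-dimensional vector. It assumes the common rot6D convention of keeping the first two columns of the rotation matrix and recovering the third by Gram-Schmidt; the source does not say which convention EgoVLA uses, and the function names are hypothetical.

```python
import numpy as np

def matrix_to_rot6d(R):
    """Flatten the first two columns of a 3x3 rotation matrix into 6 numbers
    (column convention is an assumption)."""
    return np.concatenate([R[:, 0], R[:, 1]])

def rot6d_to_matrix(d6):
    """Rebuild a valid rotation matrix from 6 numbers via Gram-Schmidt."""
    a1, a2 = d6[:3], d6[3:]
    b1 = a1 / np.linalg.norm(a1)
    a2 = a2 - np.dot(b1, a2) * b1        # remove the component along b1
    b2 = a2 / np.linalg.norm(a2)
    b3 = np.cross(b1, b2)                # third axis from the right-hand rule
    return np.stack([b1, b2, b3], axis=1)

def proprio_vector(T, R, theta):
    """Concatenate translation (3), rot6D (6), and hand parameters (15)."""
    return np.concatenate([T, matrix_to_rot6d(R), theta])

angle = 0.3
Rz = np.array([[np.cos(angle), -np.sin(angle), 0.0],
               [np.sin(angle),  np.cos(angle), 0.0],
               [0.0,            0.0,           1.0]])
state = proprio_vector(np.zeros(3), Rz, np.zeros(15))
print(state.shape)  # (24,)
```

The appeal of rot6D is that any six numbers with two linearly independent halves map back to a valid rotation, which avoids the discontinuities of Euler angles or quaternions in regression.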

The action head is a 6-layer Transformer (hidden size 1536) that accepts $H = 30$ action-query tokens and encoded proprioceptive state, autoregressively forecasting a sequence of $H$ actions $\left\{ a_{t}, a_{t+1}, \ldots, a_{t+H-1} \right\}$ , where $a_{t+k} = (\mathbf{T}_{t+k}, \mathbf{R}_{t+k}, \Theta_{t+k})$ (Yang et al., 16 Jul 2025).

2. Training Data and Preprocessing

EgoVLA is pre-trained on approximately 500K egocentric video–action samples collected from four primary datasets: HOI4D, HOT3D, HoloAssist, and TACO. The data encompass diverse indoor settings and manipulation tasks spanning 151 unique tool–action–object combinations. Visual sequences are sampled at 3 FPS with 1 s sliding windows.

All hand/wrist poses are homogenized into the MANO model’s 15D PCA space, and wrist rotations are parameterized by rot6D. For language, ground-truth captions are used when available; otherwise, placeholders are inserted. The unified proprioceptive-action sequence aligns human and robot data for subsequent retargeting (Yang et al., 16 Jul 2025).

3. Human Action Prediction and Losses

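The model’s objective is to accurately forecast short-horizon (1 s) future human hand and wrist poses. The composite loss function is:

$\mathcal{L} = \lambda_{\mathrm{wrist\_trans}}\,\mathcal{L}_{\mathrm{wrist\_trans}} + \lambda_{\mathrm{wrist\_rot}}\,\mathcal{L}_{\mathrm{wrist\_rot}} + \lambda_{\mathrm{joint}}\,\mathcal{L}_{\mathrm{joint}}$

where the three terms penalize errors in the predicted wrist translation $\mathbf{T}$ , wrist rotation $\mathbf{R}$ (in rot6D form), and hand joint parameters $\Theta$ , respectively, and $\lambda_{\mathrm{wrist\_trans}}$ , $\lambda_{\mathrm{wrist\_rot}}$ , $\lambda_{\mathrm{joint}}$ are scalar weights.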

This loss structure reflects the differing dynamic ranges and importances of translation, orientation, and hand articulation for downstream retargeting and control (Yang et al., 16 Jul 2025).
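The weighted sum above can be sketched in a few lines. This is an illustrative version under stated assumptions: each term is taken to be a mean squared error over the prediction horizon (including the rotation term, computed directly on rot6D vectors), and the default weights of 1.0 are placeholders rather than the values used for EgoVLA. The function and key names are hypothetical.

```python
import numpy as np

def composite_loss(pred, target, w_trans=1.0, w_rot=1.0, w_joint=1.0):
    """Weighted sum of three per-step squared errors, averaged over the horizon.

    pred/target: dicts with 'T' (H, 3), 'R6' (H, 6), 'theta' (H, 15).
    The squared-error form and the weights are assumptions for illustration.
    """
    def mse(key):
        return float(np.mean(np.sum((pred[key] - target[key]) ** 2, axis=-1)))

    terms = {"wrist_trans": mse("T"), "wrist_rot": mse("R6"), "joint": mse("theta")}
    total = (w_trans * terms["wrist_trans"]
             + w_rot * terms["wrist_rot"]
             + w_joint * terms["joint"])
    return total, terms

H = 30
target = {"T": np.ones((H, 3)), "R6": np.zeros((H, 6)), "theta": np.zeros((H, 15))}
pred = {"T": np.zeros((H, 3)), "R6": np.zeros((H, 6)), "theta": np.zeros((H, 15))}
total, terms = composite_loss(pred, target, w_trans=2.0)
print(total, terms["wrist_trans"])  # 6.0 3.0
```

Returning the individual terms alongside the total makes it easy to see which component dominates when tuning the weights.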

4. Retargeting Human Predictions to Robot Actions

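To enable transfer of human demonstration to robot manipulators, EgoVLA establishes a unified actuation space via the MANO hand model. Given a robot hand pose, the system solves for the human MANO parameters $\Theta$ that minimize the average fingertip position error: $\Theta^{*} = \arg\min_{\Theta} \frac{1}{N} \sum_{i=1}^{N} \left\lVert \mathbf{p}_i(\Theta) - \mathbf{p}_i^{\mathrm{robot}} \right\rVert$ where $\mathbf{p}_i(\Theta)$ are the fingertip positions from the kinematic chain, $\mathbf{p}_i^{\mathrm{robot}}$ are the corresponding robot fingertip positions, and $N$ is the number of fingertips.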
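The fingertip-matching idea can be illustrated with a toy problem. The sketch below replaces the real MANO kinematic chain, which is nonlinear, with a hypothetical linear fingertip model so that the fit reduces to ordinary least squares; it minimizes the sum of squared fingertip errors rather than the mean distance. `REST`, `BASIS`, and the five-fingertip count are invented for illustration and are not part of MANO or EgoVLA.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_TIPS, THETA_DIM = 5, 15

# Hypothetical linear stand-in for forward kinematics:
# fingertips(theta) = REST + BASIS @ theta, reshaped to (NUM_TIPS, 3).
REST = rng.normal(size=NUM_TIPS * 3)
BASIS = rng.normal(size=(NUM_TIPS * 3, THETA_DIM))

def fingertips(theta):
    return (REST + BASIS @ theta).reshape(NUM_TIPS, 3)

def mean_fingertip_error(theta, robot_tips):
    """Average Euclidean distance between model and robot fingertips."""
    return float(np.mean(np.linalg.norm(fingertips(theta) - robot_tips, axis=-1)))

def fit_theta(robot_tips):
    """Least-squares fit of theta to observed robot fingertip positions."""
    theta, *_ = np.linalg.lstsq(BASIS, robot_tips.reshape(-1) - REST, rcond=None)
    return theta

robot_tips = fingertips(rng.normal(size=THETA_DIM))  # synthetic "robot" fingertips
theta_hat = fit_theta(robot_tips)
print(mean_fingertip_error(theta_hat, robot_tips) < 1e-6)  # True
```

With a real hand model the same objective would typically be handed to an iterative optimizer instead of a closed-form solve.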

Predicted wrist pose $(\mathbf{T}, \mathbf{R})$ in the device camera frame is mapped to the robot’s base frame, and inverse kinematics then solves for the robot joint angles that realize this end-effector pose.
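The change of frame for the predicted wrist pose is a standard composition of rigid transforms. A minimal sketch follows, assuming the camera pose in the robot base frame is known as a 4x4 homogeneous matrix; the names `make_transform` and `camera_to_base` are hypothetical, and the inverse-kinematics step itself is left to whatever solver the robot stack provides.

```python
import numpy as np

def make_transform(R, t):
    """Build a 4x4 homogeneous transform from a rotation matrix and translation."""
    M = np.eye(4)
    M[:3, :3] = R
    M[:3, 3] = t
    return M

def camera_to_base(T_base_cam, R_wrist_cam, t_wrist_cam):
    """Express a wrist pose given in the camera frame in the robot base frame."""
    T_base_wrist = T_base_cam @ make_transform(R_wrist_cam, t_wrist_cam)
    return T_base_wrist[:3, :3], T_base_wrist[:3, 3]

# Assumed example: camera rotated 90 degrees about z and raised 1.5 m above the base.
Rz90 = np.array([[0.0, -1.0, 0.0],
                 [1.0,  0.0, 0.0],
                 [0.0,  0.0, 1.0]])
T_base_cam = make_transform(Rz90, np.array([0.0, 0.0, 1.5]))
R_base, t_base = camera_to_base(T_base_cam, np.eye(3), np.array([1.0, 0.0, 0.0]))
print(t_base)  # [0.  1.  1.5]
```

The resulting base-frame pose is what an IK solver would take as its end-effector target.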

At inference, a dedicated MLP (four layers, [64, 128, 64] hidden sizes) predicts 12-DOF robot hand commands, trained on retargeted human–robot demonstration pairs (Yang et al., 16 Jul 2025).
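To make the stated layer sizes concrete, the sketch below builds an untrained MLP with hidden widths 64, 128, and 64 and a 12-dimensional output, which gives four linear layers in total. The 15-dimensional input (MANO hand parameters), the ReLU activations, and the random initialization are assumptions; the source specifies only the hidden sizes and the 12-DOF output.

```python
import numpy as np

SIZES = [15, 64, 128, 64, 12]  # input dim 15 is an assumption; output is 12-DOF

def init_mlp(sizes, seed=0):
    """Create (weight, bias) pairs for consecutive layer sizes."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_forward(params, x):
    """Apply linear layers with ReLU between them (no activation on the output)."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

params = init_mlp(SIZES)
hand_cmd = mlp_forward(params, np.zeros((2, 15)))  # batch of two hand states
print(len(params), hand_cmd.shape)  # 4 (2, 12)
```

In practice such a network would be trained on the retargeted human–robot pairs mentioned above; the weights here are random and serve only to show the shapes.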

5. Fine-Tuning and Policy Adaptation on Robot Demonstrations

Following extensive human egocentric pretraining, the VLA is fine-tuned on moderate-scale robot teleoperation data from the Ego Humanoid Manipulation Benchmark, which is built in Isaac Sim. Each of the 12 manipulation tasks is associated with 100 successful teleoperation episodes, collected in simulation via Open-TeleVision with physical hand controllers.

The entire model (vision–language backbone and action head) is fine-tuned for 115 epochs on robot data, following 20 epochs of human-only pretraining, with a batch configuration of 16 videos × 8 action chunks × 4 GPUs. The learning rate is decayed from its initial value to a lower value after 100 epochs (Yang et al., 16 Jul 2025).

6. Evaluation: Ego Humanoid Manipulation Benchmark

The Ego Humanoid Manipulation Benchmark comprises 12 tasks subdivided into short-horizon (atomic) and long-horizon (multi-stage) manipulation scenarios:

Short-horizon: Push-Box, Flip-Mug, Pour-Balls, Close-Drawer, Open-Drawer, Open-Laptop, Stack-Can
Long-horizon: Sort-Cans, Insert-Cans, Unload-Cans, Insert-And-Unload-Cans, Stack-Can-Into-Drawer

Observations encompass egocentric RGB-D, end-effector states, hand joint configurations, and language instruction. Each action is a 36-dimensional vector spanning both hands, covering wrist rotation, wrist translation, and hand actuation; the control frequency is 30 Hz.

The evaluation protocol rigorously tests generalization by exposing agents to both seen and 22 held-out (unseen) visual configurations, with randomized object positions and backgrounds. Key metrics:

Success Rate (SR): fraction of episodes where the final goal is reached
Progress Rate (PSR): fraction of subtasks completed (for composite/hierarchical tasks) (Yang et al., 16 Jul 2025)
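The two metrics can be computed from per-episode subtask counts. This is a small sketch under the assumption that an episode succeeds when all of its subtasks are completed and that PSR is the per-episode fraction of completed subtasks averaged over episodes; the benchmark's exact bookkeeping may differ, and the episode records below are made up.

```python
def success_rate(episodes):
    """Fraction of episodes in which every subtask was completed."""
    return sum(ep["completed"] == ep["total"] for ep in episodes) / len(episodes)

def progress_rate(episodes):
    """Mean fraction of subtasks completed per episode."""
    return sum(ep["completed"] / ep["total"] for ep in episodes) / len(episodes)

# Hypothetical rollouts: two full successes and two partial completions.
episodes = [
    {"completed": 4, "total": 4},
    {"completed": 2, "total": 4},
    {"completed": 2, "total": 2},
    {"completed": 1, "total": 2},
]
print(success_rate(episodes), progress_rate(episodes))  # 0.5 0.75
```

PSR is always at least as large as SR under these definitions, which is why it is informative for long-horizon tasks where complete success is rare.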

7. Empirical Outcomes, Ablations, and Limitations

On seen backgrounds, EgoVLA achieves a short-horizon SR of 77.8% (vs. 64.6% for EgoVLA-NoPretrain and 24.9% for the ACT baseline) and a long-horizon SR of 45.9% (vs. 26.7% for EgoVLA-NoPretrain and 2.2% for ACT). On unseen backgrounds, the short-horizon SR is 76.3% (vs. 62.6% for EgoVLA-NoPretrain), and the long-horizon SR is 28.8% (vs. 11.2%).
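The absolute gains from human-video pretraining follow directly from the success rates quoted above. The snippet below only restates that arithmetic; the dictionary layout is an illustrative choice.

```python
# (with pretraining, without pretraining) success rates in percent, from the text above.
success_rates = {
    ("seen", "short"): (77.8, 64.6),
    ("seen", "long"): (45.9, 26.7),
    ("unseen", "short"): (76.3, 62.6),
    ("unseen", "long"): (28.8, 11.2),
}

# Absolute improvement in percentage points, rounded to one decimal place.
gains = {key: round(with_pt - without_pt, 1)
         for key, (with_pt, without_pt) in success_rates.items()}

for key, gain in gains.items():
    print(key, gain)
# ('seen', 'short') 13.2
# ('seen', 'long') 19.2
# ('unseen', 'short') 13.7
# ('unseen', 'long') 17.6
```

In both settings the gain is larger on long-horizon tasks than on short-horizon ones.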

Ablations reveal that halving the robot demonstration scale causes significant SR degradation (short: 77.8%→48.2%; long: 45.9%→7.4%). Increasing human video scale or diversity directly improves downstream SR/PSR. The PSR drop on unseen scenes is minor, suggesting failures cluster at the final subgoals rather than throughout the episode (Yang et al., 16 Jul 2025).

Key findings:

Pretraining on egocentric human videos imparts transferable manipulation priors and advances both in-domain and out-of-distribution generalization.
The unified MANO-based action space enables seamless robot–human alignment and efficient fine-tuning.
EgoVLA outperforms the specialist ACT baseline on both short-horizon and long-horizon tasks.

Strengths include scalability via abundant human video, robust generalization with limited robot data, and a unified representation. Limitations include the need for precise human hand/wrist annotations and the dependence on modest robotic teleoperation data for final adaptation. Zero-shot robot performance is negligible without such adaptation (Yang et al., 16 Jul 2025).

Summary Table: EgoVLA Key Characteristics

Aspect	Details	Source
Architecture	NVILA-2B VLM, cross-modal fusion, proprioception, 6-layer action head transformer	(Yang et al., 16 Jul 2025)
Data	egocentric human video (500K samples), 4 datasets, MANO pose unification	(Yang et al., 16 Jul 2025)
Retargeting	MANO-based 15D actuation space, IK wrist alignment, 12-DOF hand MLP	(Yang et al., 16 Jul 2025)
Robotic Fine-Tuning	Isaac Sim demos, full model finetuning, moderate-scale data requirement	(Yang et al., 16 Jul 2025)
Benchmarks/Eval	Diverse bi-manual tasks, SR/PSR metrics, seen/unseen scene evaluation	(Yang et al., 16 Jul 2025)
Performance	77.8% short, 45.9% long horizon SR (seen); strong generalization; outperforms baselines	(Yang et al., 16 Jul 2025)

EgoVLA represents an end-to-end, generalist framework for robotic manipulation derived from egocentric human demonstrations, demonstrating how large-scale vision-language-action pretraining paired with principled retargeting can enable sample-efficient and robust policy learning for dexterous robot control (Yang et al., 16 Jul 2025).


References (1)

EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos (2025)
