- The paper introduces a multi-stage pretraining pipeline that transfers human manipulation priors to enhance robot vision-language-action manipulation.
- It integrates an autoregressive I2V model, trajectory-aware video modeling, and a robot-centric adaptation to predict coherent action sequences.
- Experimental results on three real-world tasks show success rates exceeding 90% in most settings, demonstrating robust performance even in cluttered, distractor-rich environments.
RynnVLA-001: Leveraging Human Demonstrations for Enhanced Vision-Language-Action Robot Manipulation
Introduction and Motivation
RynnVLA-001 addresses the persistent challenge of data scarcity in Vision-Language-Action (VLA) models for robotic manipulation. While large-scale datasets have propelled advances in LLMs and VLMs, the collection of robot manipulation data remains labor-intensive and limited in scale. RynnVLA-001 proposes a multi-stage pretraining pipeline that exploits the abundance of ego-centric human manipulation videos to transfer manipulation priors to robotic agents. The approach is characterized by a curriculum that transitions from visual prediction to action-oriented modeling, culminating in a robot-centric VLA model with strong generalization and instruction-following capabilities.
Figure 1: The RynnVLA-001 training data pipeline integrates ego-centric video pretraining, human-centric trajectory-aware modeling, and robot-centric VLA modeling.
Multi-Stage Pretraining Pipeline
Ego-Centric Video Generative Pretraining
The first stage involves training an autoregressive transformer-based Image-to-Video (I2V) model on 12 million ego-centric human manipulation videos. The model is conditioned on an initial frame and a language instruction, and is tasked with predicting future frames. This stage is designed to capture the physical dynamics of manipulation from a first-person perspective, aligning the pretraining task with the downstream requirements of VLA models.
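The paper does not reproduce its training code here, but the objective it describes maps naturally onto next-token prediction over discretized video. The PyTorch sketch below illustrates this under stated assumptions: `model` is a decoder-only transformer returning logits over a joint text-and-visual vocabulary, frames have already been converted to discrete tokens by a frozen image tokenizer (e.g. a VQGAN), and all shapes are illustrative rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def i2v_pretraining_loss(model, text_ids, frame_tokens):
    """Autoregressive next-token loss for image-to-video pretraining (sketch).

    text_ids:      (B, T_text)            discrete language-instruction tokens
    frame_tokens:  (B, N_frames, T_frame) discrete visual tokens from a frozen
                   image tokenizer; frame 0 is the conditioning frame.
    """
    B, N, T = frame_tokens.shape
    visual = frame_tokens.reshape(B, N * T)        # flatten frames in temporal order
    seq = torch.cat([text_ids, visual], dim=1)     # [instruction | frame_0 | frame_1 ...]

    logits = model(seq[:, :-1])                    # predict token t+1 from prefix <= t
    targets = seq[:, 1:].clone()

    # Supervise only the future-frame tokens: mask the instruction and the
    # conditioning frame so the model learns to predict rather than copy.
    prefix_len = text_ids.shape[1] + T             # text + frame_0
    targets[:, : prefix_len - 1] = -100            # ignored by cross_entropy

    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )
```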
Human-Centric Trajectory-Aware Video Modeling
To bridge the gap between visual prediction and action generation, the second stage introduces joint prediction of future frames and human keypoint trajectories. Using datasets such as EgoDex, the model is finetuned to predict both visual tokens and compact trajectory embeddings (wrist keypoints) generated by a domain-specific ActionVAE. This multi-task objective enables the model to associate visual changes with underlying motion, facilitating transfer to robot action spaces.
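A minimal sketch of how such a multi-task objective could be wired is shown below, assuming the shared transformer backbone keeps its next-visual-token head from stage one and gains a small regression head over the ActionVAE trajectory latent. Module names, dimensions, and the loss weighting are illustrative assumptions, not details from the paper.

```python
import torch.nn as nn
import torch.nn.functional as F

class TrajectoryAwareHead(nn.Module):
    """Regression head that predicts the ActionVAE embedding of the human
    wrist-keypoint trajectory from the backbone's final hidden state (sketch)."""

    def __init__(self, hidden_dim=2048, traj_latent_dim=32):
        super().__init__()
        self.traj_head = nn.Linear(hidden_dim, traj_latent_dim)

    def forward(self, hidden_states):
        # hidden_states: (B, L, hidden_dim) from the shared transformer backbone.
        return self.traj_head(hidden_states[:, -1])


def stage2_loss(visual_logits, visual_targets, traj_pred, traj_latent, traj_weight=0.1):
    """Joint objective: cross-entropy on future visual tokens plus a regression
    term toward the trajectory latent produced by the (frozen) human ActionVAE."""
    ce = F.cross_entropy(
        visual_logits.reshape(-1, visual_logits.size(-1)),
        visual_targets.reshape(-1),
        ignore_index=-100,
    )
    traj = F.l1_loss(traj_pred, traj_latent)
    return ce + traj_weight * traj
```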
Robot-Centric Vision-Language-Action Modeling
In the final stage, the pretrained model is adapted to robot-centric data. The architecture is extended to process dual-view visual observations (front and wrist cameras), robot state embeddings, and language instructions. The model predicts action embeddings, which are decoded by a robot-specific ActionVAE into executable action sequences. The action head is re-initialized to accommodate the kinematic differences between human and robot embodiments.
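The stage-3 wiring described above can be summarized with the following illustrative module. It is not the authors' implementation: the backbone is assumed to accept pre-embedded tokens, the robot state is assumed to be a 7-DoF vector, and all layer sizes are placeholders. It does, however, capture the key choices named in the paper: dual-view inputs, a re-initialized action head, and decoding through a robot-specific ActionVAE.

```python
import torch
import torch.nn as nn

class RobotVLAPolicy(nn.Module):
    """Illustrative stage-3 policy: dual-view visual tokens, a projected robot
    state, and instruction tokens feed a shared transformer backbone; a freshly
    initialised head predicts one action embedding that a robot-specific
    ActionVAE decoder expands into an executable action chunk."""

    def __init__(self, backbone, action_decoder, state_dim=7,
                 hidden_dim=2048, action_latent_dim=32):
        super().__init__()
        self.backbone = backbone                    # pretrained stage-2 transformer
        self.state_proj = nn.Linear(state_dim, hidden_dim)
        self.action_head = nn.Linear(hidden_dim, action_latent_dim)  # re-initialised
        self.action_decoder = action_decoder        # robot ActionVAE decoder (frozen)

    def forward(self, text_emb, front_emb, wrist_emb, robot_state):
        # All *_emb inputs are (B, L_i, hidden_dim); robot_state is (B, state_dim).
        state_emb = self.state_proj(robot_state).unsqueeze(1)        # (B, 1, H)
        tokens = torch.cat([text_emb, front_emb, wrist_emb, state_emb], dim=1)
        hidden = self.backbone(tokens)                               # (B, L, H)
        action_latent = self.action_head(hidden[:, -1])              # (B, latent_dim)
        return self.action_decoder(action_latent)                    # (B, chunk, dof)
```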
Figure 2: The RynnVLA-001 architecture and training stages, illustrating the progressive transfer from video prediction to trajectory-aware modeling and finally to robot-centric VLA.
ActionVAE: Compact Action Representation
A key innovation is the use of ActionVAE to encode action chunks into low-dimensional latent embeddings. This design choice addresses two issues: (1) it avoids the inefficiency and instability of single-step action prediction, and (2) it provides a smooth, temporally coherent representation of action sequences. Separate ActionVAEs are trained for human and robot domains, ensuring embodiment-specific encoding. During inference, the VLA model predicts an action embedding, which is decoded into a chunk of low-level actions for execution.
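The sketch below shows a minimal ActionVAE of the kind described, assuming a fixed chunk length, a 7-dimensional action space, and MLP encoder/decoder; the paper specifies none of these hyperparameters, so they are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionVAE(nn.Module):
    """Minimal ActionVAE sketch: compresses a chunk of consecutive low-level
    actions into one latent vector and reconstructs the chunk from it."""

    def __init__(self, chunk_len=20, action_dim=7, latent_dim=32, hidden=256):
        super().__init__()
        in_dim = chunk_len * action_dim
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(), nn.Linear(hidden, in_dim)
        )
        self.chunk_len, self.action_dim = chunk_len, action_dim

    def encode(self, actions):                  # actions: (B, chunk_len, action_dim)
        h = self.encoder(actions.flatten(1))
        return self.mu(h), self.logvar(h)

    def decode(self, z):                        # z: (B, latent_dim)
        out = self.decoder(z)
        return out.view(-1, self.chunk_len, self.action_dim)

    def forward(self, actions, kl_weight=1e-3):
        mu, logvar = self.encode(actions)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterisation
        recon = self.decode(z)
        recon_loss = F.mse_loss(recon, actions)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon, recon_loss + kl_weight * kl
```

At inference only the decoder side is exercised: the VLA model predicts a latent and `decode` expands it into a chunk of low-level actions, with separate instances trained for the human and robot domains to keep the encoding embodiment-specific.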
Data Curation and Annotation
The ego-centric video dataset is curated via a multi-stage pipeline: pose estimation is used to extract keypoints, videos are filtered for ego-centric perspectives (absence of facial keypoints, presence of hands), and concise language instructions are generated using Qwen2-VL-7B. This ensures that the pretraining data is both relevant and aligned with the downstream VLA task.
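The filtering logic can be summarized as follows. This is a schematic rendering of the pipeline described above: `detect_keypoints` and `caption_model` stand in for an unspecified pose estimator and the Qwen2-VL-7B captioner, and the clip attributes and prompt text are assumptions made for illustration.

```python
def filter_egocentric_clips(clips, detect_keypoints, caption_model):
    """Sketch of the curation filter: keep clips that look ego-centric
    (no visible face, at least one hand) and attach a concise instruction."""
    curated = []
    for clip in clips:
        kp = detect_keypoints(clip.first_frame)
        # Ego-centric heuristic: absence of facial keypoints, presence of hands.
        if kp.face_visible or not kp.hands_visible:
            continue
        instruction = caption_model(
            clip, prompt="Describe the manipulation in one concise imperative sentence."
        )
        curated.append({"video": clip.path, "instruction": instruction})
    return curated
```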
Experimental Evaluation
Task Suite and Baselines
RynnVLA-001 is evaluated on a suite of real-world manipulation tasks using the LeRobot SO100 arm: (1) pick up and place green blocks, (2) pick up and place strawberries, and (3) grab a pen and put it into a holder. Each task is tested under single-target, multi-target, and instruction-following-with-distractors settings.
Figure 3: Evaluation tasks for RynnVLA-001, covering diverse manipulation scenarios and instruction-following with distractors.
Baselines include GR00T N1.5 and Pi0, both finetuned on the same robot data. RynnVLA-001 achieves substantially higher success rates across all tasks and settings, with average success rates exceeding 90% in most cases. Notably, the model maintains robust performance in the presence of distractors, indicating strong language grounding and visual discrimination.
Effectiveness of Pretraining
Ablation studies demonstrate that both stages of pretraining are critical. Models trained from scratch or initialized from generic image-text models (e.g., Chameleon T2I) perform significantly worse. The addition of trajectory-aware pretraining yields further gains, confirming the importance of explicitly modeling the transition from visual dynamics to action generation.
Model Design Ablations
- Image Resolution: Lowering the input resolution degrades performance due to VQGAN reconstruction artifacts, highlighting the importance of high-fidelity visual tokens.
- Action Representation: Predicting VAE-encoded action embeddings outperforms direct raw action prediction, yielding smoother and more consistent behaviors.
- Action Head Complexity: A single linear layer suffices for action decoding; deeper MLP heads introduce overfitting and reduce performance.
Video Generation and Instruction-Following
The pretrained I2V model generates plausible motion sequences from static images and text prompts, serving as an effective backbone for downstream VLA adaptation.
Figure 4: The I2V model generates temporally consistent video frames conditioned on an image and text prompt.
Instruction-following robustness is directly linked to the diversity of training data. Models trained without distractors fail to generalize in cluttered scenes, while the full RynnVLA-001 model achieves high success rates in such settings.
Sensor Configuration and Spatial Reasoning
The dual-camera setup (front and wrist cameras) is essential for robust manipulation. The front camera provides coarse localization and 3D projective context, while the wrist camera enables fine-grained adjustments. Disabling the front camera or altering its viewpoint leads to catastrophic failures in tasks that require spatial reasoning or in which targets lie outside the wrist camera's field of view.
Figure 5: The front camera is critical for coarse localization; masking it leads to failure when targets are not visible to the wrist camera.
Figure 6: The front camera's 3D perspective is necessary for precise spatial reasoning; altering its geometry impairs task success.
Implications and Future Directions
RynnVLA-001 demonstrates that large-scale human demonstration data, when properly curated and integrated via a multi-stage curriculum, can significantly enhance the generalization and robustness of VLA models for robotic manipulation. The explicit modeling of the transition from visual prediction to action generation, combined with compact action representations, yields a system that outperforms prior state-of-the-art approaches in both success rate and instruction-following fidelity.
The results suggest several avenues for future research:
- Extending the approach to a broader range of robot embodiments and unstructured environments to assess generalization.
- Investigating the impact of more diverse camera configurations and sensor modalities.
- Exploring the integration of richer proprioceptive and tactile feedback for fine-grained manipulation.
- Scaling the approach to more complex, long-horizon tasks and multi-agent settings.
Conclusion
RynnVLA-001 establishes a new paradigm for VLA model pretraining by leveraging large-scale ego-centric human demonstrations and trajectory-aware modeling. The architecture, data curation, and action representation strategies collectively yield a model with superior manipulation capabilities and instruction-following robustness. The findings underscore the value of curriculum-based pretraining and embodiment-specific action encoding for advancing generalist robotic agents.