
RynnVLA-001: VLA Model for Robotic Manipulation

Updated 21 September 2025
  • RynnVLA-001 is a vision-language-action model that advances robotic manipulation by leveraging two-stage pretraining from human demonstrations.
  • The model integrates language, vision, state, and action tokens within a transformer backbone and employs ActionVAE to compress complex action sequences.
  • Evaluation on real-world tasks shows enhanced precision, multi-object handling, and robust long-horizon instruction-following performance.

RynnVLA-001 is a vision-language-action (VLA) model developed to advance robot manipulation by leveraging large-scale video generative pretraining from human demonstrations. It features a two-stage pretraining methodology—first, ego-centric video generative pretraining, and subsequently, human-centric trajectory-aware modeling—enabling the model to transition from high-level visual understanding to low-level action prediction. The architecture interleaves language, vision, and action embeddings within a transformer backbone and introduces ActionVAE, a variational autoencoder that compresses action sequences into latent representations, effectively reducing output complexity. RynnVLA-001 demonstrates superior performance over state-of-the-art baselines on real-world robotic manipulation benchmarks.

1. Model Architecture

RynnVLA-001's architecture is centered on a multi-modal autoregressive transformer adapted from Chameleon, a text-to-image generation model. Its input sequence is structured to interleave language tokens and discrete visual tokens during early pretraining. For robot-specific tasks, state embeddings (such as proprioceptive wrist positions) and action placeholders are interleaved with language and visual tokens.

A lightweight action head, implemented as a single linear projection, is attached to the transformer at positions corresponding to the special <ACTION_PLACEHOLDER> token. This design facilitates direct mapping from hidden transformer representations to continuous action embedding space. In the final, robot-centric stage, the input sequence incorporates dual camera views (front and wrist) to capture both coarse and fine spatial manipulations.
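
For concreteness, the following is a minimal PyTorch sketch of such a read-out head: a single linear projection applied to hidden states at <ACTION_PLACEHOLDER> positions. All dimensions, variable names, and the masking convention are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class ActionHead(nn.Module):
    """Single linear projection from transformer hidden states to continuous action latents."""
    def __init__(self, hidden_dim: int, action_latent_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, action_latent_dim)

    def forward(self, hidden_states: torch.Tensor, placeholder_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) from the autoregressive transformer
        # placeholder_mask: (batch, seq_len) bool, True at <ACTION_PLACEHOLDER> positions
        selected = hidden_states[placeholder_mask]   # (num_placeholders, hidden_dim)
        return self.proj(selected)                   # (num_placeholders, action_latent_dim)

# Example with dummy shapes (dimensions are illustrative, not from the paper):
if __name__ == "__main__":
    head = ActionHead(hidden_dim=4096, action_latent_dim=32)
    h = torch.randn(2, 16, 4096)
    mask = torch.zeros(2, 16, dtype=torch.bool)
    mask[:, -1] = True                               # one placeholder per sequence
    print(head(h, mask).shape)                       # torch.Size([2, 32])
```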

2. Two-Stage Pretraining Methodology

The pretraining process of RynnVLA-001 comprises two sequential stages, each targeting a different aspect of manipulation learning:

  1. Ego-Centric Video Generative Pretraining: The model is trained as an Image-to-Video (I2V) transformer on 12 million ego-centric human manipulation videos. Conditioned on a single initial frame and a language instruction, the model learns to predict subsequent frames. The input format for this stage is:

$$[\text{language tokens}, \text{visual tokens}_t, \text{language tokens}, \text{visual tokens}_{t+1}, \ldots]$$

Supervision uses a cross-entropy loss over the predicted discrete visual tokens.

  2. Human-Centric Trajectory-Aware Video Modeling: In this stage, the pretrained model is further fine-tuned with human demonstration videos that include keypoint (especially wrist) annotations. Instead of directly predicting keypoint trajectories, RynnVLA-001 is trained to predict compact latent embeddings of these trajectories, produced by ActionVAE. The sequence expands to include state embeddings and action placeholders as:

$$[\text{language tokens}, \text{visual tokens}_t, \text{state embedding}_t, \langle\text{ACTION\_PLACEHOLDER}\rangle, \ldots]$$

The objective combines a cross-entropy loss for visual token prediction with an L1 loss applied to the continuous action embeddings at <ACTION_PLACEHOLDER> positions (a code sketch follows the formula below):

$$L_\text{total} = L_\text{cross-entropy} + \lambda \cdot L_1(\text{action prediction})$$
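
As a hedged illustration of this objective (tensor names, shapes, and the value of λ are assumptions, not values from the paper), the combined loss could be computed as:

```python
import torch
import torch.nn.functional as F

def stage2_loss(visual_logits: torch.Tensor,          # (N_vis, vocab_size) logits for visual tokens
                visual_targets: torch.Tensor,         # (N_vis,) ground-truth discrete token ids
                pred_action_latents: torch.Tensor,    # (N_act, latent_dim) from the action head
                target_action_latents: torch.Tensor,  # (N_act, latent_dim) from the ActionVAE encoder
                lam: float = 1.0) -> torch.Tensor:
    # Cross-entropy on discrete visual tokens plus L1 on continuous action embeddings.
    ce = F.cross_entropy(visual_logits, visual_targets)
    l1 = F.l1_loss(pred_action_latents, target_action_latents)
    return ce + lam * l1
```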

3. ActionVAE: Latent Action Representation

ActionVAE is a domain-specific variational autoencoder constructed to learn compact, smooth representations of action sequences—referred to as "action chunks". It is trained to encode both human and robot trajectories into low-dimensional latent spaces, enabling the model to efficiently predict temporally coherent, consistent actions.

Two separate ActionVAEs are trained: one for human demonstration trajectories, used in Stage 2, and another for robot actions, used in downstream tasks. By replacing direct prediction of raw action sequences with prediction of latent embeddings, the complexity of the output space is reduced, facilitating smoother trajectory generation and more robust execution.
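
Below is a minimal sketch of such an action-chunk VAE under assumed dimensions (7-DoF actions, 20-step chunks, 32-dimensional latents); the actual ActionVAE architecture and training losses may differ.

```python
import torch
import torch.nn as nn

class ActionVAE(nn.Module):
    """Encodes a fixed-length action chunk into one latent vector and decodes it back."""
    def __init__(self, action_dim: int = 7, chunk_len: int = 20,
                 latent_dim: int = 32, hidden: int = 256):
        super().__init__()
        self.action_dim, self.chunk_len = action_dim, chunk_len
        in_dim = action_dim * chunk_len
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, in_dim))

    def encode(self, chunk: torch.Tensor) -> torch.Tensor:
        # chunk: (batch, chunk_len, action_dim) -> latent z: (batch, latent_dim)
        h = self.encoder(chunk.flatten(1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, latent_dim) -> reconstructed chunk: (batch, chunk_len, action_dim)
        return self.decoder(z).view(-1, self.chunk_len, self.action_dim)
```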

4. Evaluation and Performance

RynnVLA-001 is benchmarked on multiple real-world manipulation tasks using the LeRobot SO100 arm. Tasks include pick-and-place operations with green blocks and strawberries, as well as precision insertion of a pen into a holder. Evaluation metrics include task-specific success rates, overall average success rate, and SR@1—the percentage of tasks completed in a single trial.
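
To make the metrics concrete, here is an illustrative computation from per-task trial outcomes; the data layout and the reading of SR@1 as first-trial success are assumptions, not the authors' evaluation harness.

```python
from typing import Dict, List

def summarize(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """results maps task name -> ordered trial outcomes (True = success)."""
    per_task = {task: sum(trials) / len(trials) for task, trials in results.items()}
    avg_success = sum(per_task.values()) / len(per_task)
    sr_at_1 = sum(trials[0] for trials in results.values()) / len(results)  # first-trial success rate
    return {"average_success_rate": avg_success, "SR@1": sr_at_1, **per_task}

# Illustrative outcomes only (not reported numbers):
print(summarize({"pick_place_block": [True, True, False],
                 "pen_insertion": [False, True, True]}))
```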

Performance comparisons are provided against baseline models GR00T N1.5 and Pi0, demonstrating that RynnVLA-001 yields notably higher success rates, particularly in multi-target manipulation and instruction-following tasks with distractor objects. Detailed result tables in the source material show marked improvements, especially in challenging, long-horizon scenarios. This suggests that the two-stage pretraining pipeline and ActionVAE embedding significantly improve the model's ability to generalize from human demonstrations to robot control.

5. Applications in Robotic Manipulation

RynnVLA-001 is purpose-built for complex manipulation scenarios requiring precision and flexibility. The model is deployed for tasks such as pick-and-place, precision insertion, and multi-object handling with distractors. Notably, its ability to decode high-level natural language instructions and to plan and execute long-horizon manipulation strategies enables its application in unstructured environments that demand generalized, robust policy learning.

This versatility arises from the architecture's ability to integrate multimodal sensory inputs and compress rich, sequential action semantics into effective control policies. Task success rates underscore its efficacy in both coarse object localization (via the front camera) and fine-grained manipulation (via the wrist camera).

6. Input Sequencing and Loss Formulations

The model utilizes interleaved input sequencing to marshal language, visual, state, and action information:

  • Ego-centric pretraining:

$$[\text{language tokens}, \text{visual tokens}_t, \text{language tokens}, \text{visual tokens}_{t+1}, \ldots]$$

  • Trajectory-aware stage:

$$[\text{language tokens}, \text{visual tokens}_t, \text{state embedding}_t, \langle\text{ACTION\_PLACEHOLDER}\rangle, \ldots]$$

Training losses consist of a cross-entropy component for discrete visual token prediction and an L1 regression term supervising the continuous action embeddings:

$$L_\text{total} = L_\text{cross-entropy} + \lambda \cdot L_1(\text{action prediction})$$

These interleaved representations and hybrid loss functions are integral to the model’s ability to jointly optimize for visual prediction consistency and action trajectory coherence.
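
The interleaving can be sketched schematically as follows; the token conventions and types are assumptions for illustration, whereas the real model operates on tokenized images and learned embeddings.

```python
from dataclasses import dataclass
from typing import List, Union

ACTION_PLACEHOLDER = "<ACTION_PLACEHOLDER>"

@dataclass
class StateEmbedding:
    values: List[float]  # e.g. proprioceptive wrist position

SeqElem = Union[int, str, StateEmbedding]  # discrete token id, special token, or embedding

def stage1_sequence(lang: List[int], frames: List[List[int]]) -> List[SeqElem]:
    """[language tokens, visual tokens_t, language tokens, visual tokens_{t+1}, ...]"""
    seq: List[SeqElem] = []
    for frame_tokens in frames:
        seq.extend(lang)
        seq.extend(frame_tokens)
    return seq

def stage2_step(lang: List[int], frame_tokens: List[int], state: StateEmbedding) -> List[SeqElem]:
    """[language tokens, visual tokens_t, state embedding_t, <ACTION_PLACEHOLDER>, ...]"""
    return [*lang, *frame_tokens, state, ACTION_PLACEHOLDER]

print(stage2_step([11, 42], [7, 8, 9], StateEmbedding([0.10, 0.25, 0.40])))
```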

7. Limitations and Future Directions

Several avenues for further development are highlighted:

  • Broadening the range of robotic embodiments and the diversity of environmental settings in evaluation to enhance generalization.
  • Addressing the dependence on fixed camera positions, which currently facilitate coarse localization from the front camera but may hinder adaptability to variable viewpoints.
  • Improving performance on long-horizon tasks and refining object localization, as lower SR@1 scores across all methods indicate ongoing challenges with sustained planning and fine spatial discrimination.
  • Investigating more sophisticated action prediction paradigms and alternative action head architectures without incurring excessive architectural complexity.

A plausible implication is that deeper integration of multimodal input sources and more expressive action encoding schemes may further bridge the gap between human demonstration and robotic control, especially in unstructured and dynamic environments.
