
RynnVLA-002: Unified Vision-Language-Action Model

Updated 24 November 2025
  • The paper introduces a unified transformer model that jointly generates action policies and simulates environmental dynamics, achieving 97.4% on the LIBERO benchmark.
  • RynnVLA-002 is a vision–language–action model that integrates text, image, state, and action modalities through a shared 65,536-token vocabulary for robust robotic decision-making.
  • It innovates with techniques such as action-attention masking and a continuous action head, which enhance training efficiency and improve real-world robotic performance.

RynnVLA-002 is a unified Vision–Language–Action (VLA) and world model that leverages a single transformer backbone for joint action policy generation and environmental dynamics modeling. Built atop the Chameleon architecture, RynnVLA-002 processes text, images, proprioceptive states, and robot actions using a unified vocabulary of 65,536 discrete tokens, enabling multi-modal integration at the token level. The model demonstrates strong simulation and real-world performance, establishes the empirical state of the art (SOTA) on the LIBERO benchmark, and offers new techniques for action chunking, world-model integration, and action decoding. Its architecture, training methodology, loss formulation, and empirical ablations define a comprehensive framework for robotic decision-making and planning in vision–language–action settings (Cen et al., 21 Nov 2025).

1. Unified Architecture and Tokenization

RynnVLA-002 is structured around the Chameleon LLM backbone, extended to accept and generate robot action and proprioceptive state tokens in addition to text and images. The model supports both VLA policy execution and world-model prediction within a single architecture. The four tokenized modalities—text, image, proprioceptive state, and action—share a single, discrete vocabulary, allowing all modalities to be processed by the same transformer model.

  • Vision–Language–Action Policy (VLA): Given a natural language instruction $l$, a window of image observations $o_{t-h:t}$, and the previous proprioceptive state $s_{t-1}$, the model generates a chunk of $K$ future actions $a_{t:t+K-1}$ either autoregressively or in parallel.
  • World Model: Conditioned on the history of images and actions, the world model predicts future image tokens, effectively simulating the scene’s evolution under hypothetical action sequences.
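The shared-vocabulary design above can be illustrated with a minimal sketch in Python: each modality's tokenizer is assigned a disjoint id range inside the unified 65,536-token vocabulary, so text, image, state, and action tokens can be packed into a single transformer input sequence. All offsets, sizes, and names here are illustrative assumptions, not the paper's actual layout.

```python
# Hypothetical sketch: packing text, image, state, and action tokens into one
# shared 65,536-token vocabulary by giving each modality a disjoint id range.
# Offsets/sizes are illustrative; remaining ids could hold special tokens.
TEXT_SIZE   = 32_000   # BPE text tokens (Chameleon tokenizer)
IMAGE_SIZE  = 8_192    # VQ-GAN codebook entries
STATE_SIZE  = 256      # 256 uniform bins per proprioceptive dimension
ACTION_SIZE = 256      # 256 uniform bins per action dimension

TEXT_OFF   = 0
IMAGE_OFF  = TEXT_OFF + TEXT_SIZE
STATE_OFF  = IMAGE_OFF + IMAGE_SIZE
ACTION_OFF = STATE_OFF + STATE_SIZE
VOCAB_SIZE = 65_536    # total unified vocabulary

def pack_sequence(text_ids, image_ids, state_ids, action_ids):
    """Concatenate per-modality token ids into one transformer input sequence."""
    return (
        [TEXT_OFF + t for t in text_ids]
        + [IMAGE_OFF + i for i in image_ids]
        + [STATE_OFF + s for s in state_ids]
        + [ACTION_OFF + a for a in action_ids]
    )
```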

Action Generation Heads

RynnVLA-002 features two action generation heads:

  1. Discrete, Autoregressive Head: Generates quantized action tokens (using 256 bins per action dimension) jointly with text and image tokens, sharing the main transformer context.
  2. Continuous Action Transformer Head: A smaller transformer module that, given the fused context and learnable action queries, outputs $K$ continuous actions in parallel, supporting smoother and faster trajectory inference.
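A hedged sketch of what such a continuous head could look like: a small transformer decoder whose $K$ learnable query embeddings cross-attend to the fused backbone context and are projected to continuous actions in one parallel pass. Module sizes, names, and layer counts are assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class ContinuousActionHead(nn.Module):
    """Hypothetical sketch: K learnable queries cross-attend to the backbone
    context and are decoded to K continuous actions in one parallel pass."""
    def __init__(self, d_model=1024, action_dim=7, chunk_size=10, n_layers=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(chunk_size, d_model) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.proj = nn.Linear(d_model, action_dim)

    def forward(self, context):          # context: (B, T, d_model) backbone states
        B = context.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)   # (B, K, d_model)
        h = self.decoder(tgt=q, memory=context)           # queries attend to context
        return self.proj(h)                               # (B, K, action_dim)
```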

A key architectural novelty is the action-attention mask, which restricts the discrete action chunk generator such that each predicted action cannot attend to previous actions within the chunk, breaking the typical autoregressive error accumulation found in standard transformers (Cen et al., 21 Nov 2025).
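The paper does not publish the mask construction; the following is a plausible sketch in which each action position within the chunk can attend to the full instruction/image/state prefix and to itself, but not to earlier actions in the same chunk. Sequence layout and sizes are assumptions.

```python
import torch

def action_attention_mask(prefix_len: int, chunk_len: int) -> torch.Tensor:
    """Boolean attention mask (True = blocked). Each discrete action token sees
    the full prefix (text/image/state) but no other action token in its chunk.
    The [prefix | action-chunk] layout is an assumption for illustration."""
    total = prefix_len + chunk_len
    # Start from a standard causal mask over the whole sequence.
    mask = torch.ones(total, total).triu(diagonal=1).bool()
    # Additionally block attention from each action to all other chunk actions.
    act = slice(prefix_len, total)
    intra = ~torch.eye(chunk_len, dtype=torch.bool)   # off-diagonal action pairs
    mask[act, act] = mask[act, act] | intra
    return mask

# Example: 6 prefix tokens followed by a chunk of 4 action tokens.
print(action_attention_mask(6, 4).int())
```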

2. Mathematical Formulation

The model’s objectives unify policy, world-modeling, and action generation in a shared framework. The formalism adopts standard notation:

  • $l$: language instruction
  • $o_{t-h:t}$: sequence of image tokens (historical context)
  • $s_{t-1}$: proprioceptive state at the prior step
  • $a_{t:t+K-1}$: future actions in a chunk

Core Operations

  • VLA Policy:

$$a_t \sim \pi(a_t \mid l,\, s_{t-1},\, o_{t-h:t})$$

  • World-Model Prediction:

$$\hat o_t \sim f\bigl(o_t \mid o_{t-h:t-1},\, a_{t-h:t-1}\bigr)$$
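A hypothetical sketch of how the world-model branch could be queried at inference time: condition on the packed history of image and action tokens and autoregressively decode the next frame's image tokens. The `model` interface and the number of image tokens per frame are assumptions.

```python
import torch

@torch.no_grad()
def rollout_next_frame(model, history_ids, n_image_tokens=256):
    """Autoregressively sample the next frame's VQ-GAN token ids, conditioned
    on a packed history of image + action tokens. `model` is any causal LM
    returning logits of shape (B, T, vocab); this is an illustrative sketch."""
    seq = history_ids                                    # (B, T) packed history
    for _ in range(n_image_tokens):
        logits = model(seq)[:, -1, :]                    # next-token logits
        next_tok = torch.multinomial(logits.softmax(-1), num_samples=1)
        seq = torch.cat([seq, next_tok], dim=1)
    return seq[:, -n_image_tokens:]                      # predicted frame tokens
```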

Loss Functions

  • Discrete Action Loss:

$$\mathcal{L}_{\mathrm{dis\_action}} = -\sum_{t=1}^{K}\sum_{i=1}^{D_a}\log p\bigl(a_t^{(i)}\bigr)$$

  • Image Prediction Loss:

$$\mathcal{L}_{\mathrm{img}} = -\sum_{u=1}^{N}\sum_{j=1}^{D_x}\log p\bigl(x_{t+u}^{(j)}\bigr)$$

  • Continuous Action Loss:

$$\mathcal{L}_{\mathrm{conti\_action}} = \sum_{t=1}^{K} \|\hat a_t - a_t\|_1$$

  • Unified Objective:

$$\mathcal{L} = \mathcal{L}_{\mathrm{dis\_action}} + \mathcal{L}_{\mathrm{img}} + \alpha\,\mathcal{L}_{\mathrm{conti\_action}}$$

with $\alpha = 10$.
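Under these definitions, the combined objective could be computed as in the sketch below. The per-term helpers, tensor shapes, and normalization (mean over tokens and batch rather than raw sums) are assumptions; the only quantity fixed by the paper is the weight $\alpha = 10$.

```python
import torch
import torch.nn.functional as F

ALPHA = 10.0   # loss weight alpha from the paper

def unified_loss(action_logits, action_targets, image_logits, image_targets,
                 pred_actions, gt_actions):
    """Sketch of L = L_dis_action + L_img + alpha * L_conti_action.
    Assumed shapes: logits (B, T, vocab), integer targets (B, T),
    continuous actions (B, K, D_a). Normalization is a modeling choice."""
    l_dis = F.cross_entropy(action_logits.flatten(0, 1), action_targets.flatten())
    l_img = F.cross_entropy(image_logits.flatten(0, 1), image_targets.flatten())
    # L1 summed over the chunk and action dims, averaged over the batch.
    l_conti = (pred_actions - gt_actions).abs().sum(dim=(1, 2)).mean()
    return l_dis + l_img + ALPHA * l_conti
```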

For alternate cases, the framework allows a mean-squared error (MSE) world-model loss and policy cross-entropy, as in:

$$\mathcal{L}_{\mathrm{world}} = \mathbb{E}_t \,\|\hat x_{t+1} - x_{t+1}\|^2,$$

$$\mathcal{L}_{\mathrm{VLA}} = -\sum_t \log p(a_t \mid x_t),$$

with aggregate objective

$$\mathcal{L} = \mathcal{L}_{\mathrm{world}} + \lambda\,\mathcal{L}_{\mathrm{VLA}}.$$

3. Training Procedure and Data Recipe

The model is initialized from Chameleon, already pretrained on large-scale image–text corpora. Modality-specific tokenizers include:

  • Image: VQ-GAN (compression ratio 16, codebook size 8192)
  • Text: BPE tokenizer from Chameleon
  • State/Action: Uniform 256-bin quantization per dimension
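A minimal sketch of the uniform 256-bin quantization applied per state/action dimension. The per-dimension bounds and the bin-center dequantization are assumptions about how such a scheme is typically implemented, not the paper's exact normalization.

```python
import numpy as np

def quantize(values, low, high, n_bins=256):
    """Map continuous per-dimension values into uniform bin indices [0, n_bins)."""
    norm = (np.asarray(values) - low) / (high - low)          # -> [0, 1]
    return np.clip((norm * n_bins).astype(int), 0, n_bins - 1)

def dequantize(bins, low, high, n_bins=256):
    """Recover bin-center values from indices (approximate inverse of quantize)."""
    return low + (np.asarray(bins) + 0.5) / n_bins * (high - low)

# Example: a 7-DoF action with hypothetical per-dimension bounds [-1, 1].
low, high = np.full(7, -1.0), np.full(7, 1.0)
tokens = quantize(np.zeros(7), low, high)        # mid-range values -> bin 128
```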

For each training batch, the data loader samples at random either a VLA instance (text + state + image history → action chunk) or a world-model instance (prompt + image + action → next image). Hyperparameters include a context history of $M=2$, an action chunk size of $K=10$ (Spatial/Long) or $K=5$ (Object/Goal), $N=1$ autoregressive prediction steps, and loss coefficient $\alpha=10$. The optimizer and scheduler are inherited from the Chameleon training code.
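A hedged sketch of the mixed-task sampling implied by this recipe: each training instance is drawn at random as either a VLA sample or a world-model sample. The mixing probability and the dataset interfaces are assumptions, not stated values.

```python
import random

# Hypothetical mixed sampler over the two training objectives described above.
# `vla_dataset` holds (instruction, state, image history) -> action-chunk pairs;
# `world_dataset` holds (prompt, image, action) -> next-image pairs.
def sample_training_instance(vla_dataset, world_dataset, p_vla=0.5):
    """Pick a VLA or world-model instance at random (the 0.5 ratio is assumed)."""
    if random.random() < p_vla:
        return "vla", random.choice(vla_dataset)
    return "world", random.choice(world_dataset)
```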

No reinforcement learning or explicit curriculum is used in the canonical protocol, though “cold-start” world-model pretraining yields a ~6% gain on Goal tasks. Efficient chunk-based action generation with the attention mask and optional inclusion of wrist camera frames or proprioceptive state further enhances training effectiveness.

4. Empirical Results and Benchmark Performance

Simulation: LIBERO Benchmark

Evaluation covers four suites: Spatial, Object, Goal, and Long (each with 50 distinct resets per task). No external pretraining on robot data is employed.

| Variant | Spatial | Object | Goal | Long | Average |
|---|---|---|---|---|---|
| Discrete actions | 94.2% | 96.8% | 94.6% | 87.6% | 93.3% |
| Continuous actions | 99.0% | 99.8% | 96.4% | 94.4% | 97.4% |

Continuous actions converge more rapidly and generalize better. Ablation studies reveal:

  • World-model inclusion boosts discrete avg from 62.8% → 67.2% and continuous from 91.6% → 94.6%.
  • The chunking attention mask lifts discrete avg from 54.0% → 76.6%.
  • Aggregate (world + chunking + mask) lifts performance to 93.3%.
  • Adding wrist-camera/proprioceptive input to the continuous variant further increases success from 94.6% → 97.4%.

Real-world Robotics: LeRobot SO100

Tasks include Block→Circle and Strawberries→Cup, tested in single/multi-target and distractor scenarios:

| Task (single / multi-target / distractor) | RynnVLA-002 | GR00T N1.5 | $\pi_0$ |
|---|---|---|---|
| Block task | 90/90/80% | 90/60/50% | 100/70/50% |
| Strawberry task | 80/80/50% | 50/50/70% | 80/70/40% |

Ablations show that removal of world-model training data results in a steep drop (to 0–30%) in success, while world-model inclusion yields >80% even in challenging distractor settings. The continuous action transformer is notably more efficient (15–48 Hz) compared to discrete modes (~2.5–3.7 Hz, depending on chunking and attention masks).

5. Innovations and Contributions

RynnVLA-002 delivers several key advances:

  1. Full Modality Unification: The first model to merge a VLA policy and a world model on a single transformer, supporting shared planning and imagination in robot control.
  2. Action-attention Mask: A custom attention mechanism for discrete action chunking that eliminates intra-chunk error propagation, sharply increasing simulated task success.
  3. Continuous Action Transformer Head: Provides smoother, real-robot-friendly trajectories, higher generalization, and accelerated batch inference.
  4. Mutual Enhancement: World-model data enhances VLA performance (by 5–10 points in simulation, ~50% in real) and VLA training reciprocally improves world-model metrics (FVD decreases, SSIM increases, cross-view consistency).
  5. Empirical SOTA: Achieves 97.4% mean on LIBERO without external pretraining, and 90%+ success in real-world robotic manipulation tasks.

6. Limitations and Open Problems

Known limitations include:

  • Action chunk length is bounded: performance degrades beyond 10 steps, indicating limited replanning capability with very long action sequences.
  • Training depends on high-quality, human-provided demonstrations, which constrains scalability to broader, noisier, in-the-wild data.
  • Dedicated world-model pretraining appears promising but is only partially explored.

A plausible implication is that the tight integration of planning and world-modeling in a single architecture may be essential for next-generation embodied AI; however, data collection, action-tokenization granularity, and long-horizon reasoning remain active challenges (Cen et al., 21 Nov 2025).

7. Context and Impact

RynnVLA-002 represents a significant consolidation of trends toward unified, multi-modal robot learning. By combining multi-modal transformer strategies with both explicit world-modeling and direct policy optimization, it sets new standards for empirical robustness and practical deployment without reliance on large-scale robot-specific pretraining. Its mutual enhancement principle implies that the synergy between simulated “imagination” (future state prediction) and action planning is critical for efficient, scalable, and generalizable visuo-motor agents. The approach charts a path toward comprehensive vision–language–action–world models for robotics applications.

References

  • Cen et al. (21 Nov 2025). RynnVLA-002: Unified Vision-Language-Action Model.