RD-VLA: Recurrent-Depth Vision-Language-Action
- RD-VLA is a model architecture that utilizes recurrent latent iterative refinement to enable adaptive, multi-step reasoning in vision-language-action tasks.
- It leverages a shared, weight-tied recurrent core to maintain constant memory during inference, overcoming the linear scaling challenges of token-based methods.
- Empirical results on the LIBERO benchmark demonstrate significant performance gains and efficiency improvements over traditional chain-of-thought approaches in robotic manipulation.
Recurrent-Depth Vision-Language-Action (RD-VLA) is a model architecture designed for adaptive computational allocation in vision-language-action learning, achieving controllable test-time reasoning depth via latent iterative refinement rather than explicit token-based reasoning. RD-VLA enables efficient execution and adaptivity in robotic manipulation tasks, supporting arbitrary recurrent depth at inference with constant memory and leveraging a single shared recurrent core for internal latent state updates. This design contrasts sharply with previous approaches such as chain-of-thought (CoT) prompting, which rely on token-level autoregression and incur linear memory scaling, impeding their scalability and efficiency in high-dimensional, continuous action spaces (Tur et al., 8 Feb 2026).
1. Architecture and Design Principles
RD-VLA employs a backbone-agnostic vision-language encoder and a recurrent, weight-tied action head. In the reported experiments, the model uses a Qwen2.5-0.5B LLM and a frozen DINOv2+SigLIP vision encoder. Each image yields 256 vision tokens (512 with dual cameras), projected into the LLM via cross-modal LoRA finetuning. The input is prepended with 64 learned latent tokens. After processing by the 24-layer LLM, embeddings are partitioned into mid-layer ($h^{(12)}$) and final-layer ($h^{(24)}$) features, which condition the downstream modules.
The action head comprises three primary modules:
- Prelude ($P_\phi$): accepts $K=8$ learned queries, applies cross-attention over the mid-layer features $h^{(12)}$, and outputs $S_{\text{pre}}$ as a grounded latent foundation.
- Scratchpad Initialization: initializes a latent scratchpad $S_0$ by sampling from a truncated normal distribution, enforcing iterative refinement over mere shortcutting or memorization.
- Recurrent Core ($R_\theta$): at each recurrent iteration $k$, $S_{k-1}$ and $S_{\text{pre}}$ are concatenated, projected, and RMS-normalized to yield $x_k = \mathrm{RMSNorm}(\gamma_{\text{adapt}} W_{\text{adapt}}[S_{k-1}; S_{\text{pre}}])$. The core (a standard Transformer block) updates the scratchpad as $S_k = R_\theta(x_k; h^{(24)}, p)$, where $p$ denotes robot proprioception.
- Coda ($C_\psi$): after $R$ iterations, $S_R$ is decoded to actions as $a = W_{\text{out}}\,\mathrm{RMSNorm}(C_\psi(S_R; h^{(24)}, p))$.
Weight tying across all iterations ensures memory requirements scale only with a single instance of $R_\theta$, irrespective of unroll depth.
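The data flow above can be sketched with toy stand-ins. All dimensions, weight matrices, and nonlinearities below are illustrative placeholders (the paper's Prelude uses cross-attention and the core is a full Transformer block), but the module structure and the reuse of the same core weights at every iteration mirror the description:

```python
import numpy as np

rng = np.random.default_rng(0)

D = 32        # latent width (toy; the real model uses the LLM hidden size)
K = 8         # learned prelude queries
R = 4         # unroll depth

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

# Hypothetical stand-ins for the learned modules: one matrix each.
W_prelude = rng.normal(0, 0.1, (D, D))    # Prelude P_phi (cross-attention in the paper)
W_adapt   = rng.normal(0, 0.1, (2 * D, D))  # adapter over [S; S_pre]
W_core    = rng.normal(0, 0.1, (D, D))    # recurrent core R_theta (Transformer block in the paper)
W_coda    = rng.normal(0, 0.1, (D, D))    # coda C_psi
W_out     = rng.normal(0, 0.1, (D, 7))    # action projection (7-DoF toy action)

h12 = rng.normal(size=(K, D))             # mid-layer LLM features h^(12)
h24 = rng.normal(size=(K, D))             # final-layer LLM features h^(24)

S_pre = np.tanh(h12 @ W_prelude)          # grounded latent foundation
S = rng.normal(0, 0.02, size=(K, D))      # scratchpad S_0 (toy stand-in for TruncNormal)

for k in range(R):                        # the SAME weights are reused at every step
    x = rms_norm(np.concatenate([S, S_pre], axis=-1) @ W_adapt)
    S = np.tanh((x + h24) @ W_core)       # core update, conditioned on final-layer features
a = (rms_norm(np.tanh((S + h24) @ W_coda)) @ W_out).mean(axis=0)
print(a.shape)  # (7,)
```

Because `W_core` is the only per-iteration parameter set, deepening the unroll adds compute but no parameters or activations beyond one step's worth under truncated backpropagation.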
2. Mathematical Formalization
RD-VLA's iterative refinement of latent state and action is defined by:
- Initialization: $S_{\text{pre}} = P_\phi(h^{(12)})$, $S_0 \sim \mathrm{TruncNormal}$, $a_0 = \mathbf{0}$.
- Iterative Update: for $k = 1, \dots, R$: $x_k = \mathrm{RMSNorm}(\gamma_{\text{adapt}} W_{\text{adapt}}[S_{k-1}; S_{\text{pre}}])$, $S_k = R_\theta(x_k; h^{(24)}, p)$.
- Action Decoding: $a_k = W_{\text{out}}\,\mathrm{RMSNorm}(C_\psi(S_k; h^{(24)}, p))$.
In shorthand, letting $F(S) = R_\theta(\mathrm{RMSNorm}(\gamma_{\text{adapt}} W_{\text{adapt}}[S; S_{\text{pre}}]); h^{(24)}, p)$ and $G(S) = W_{\text{out}}\,\mathrm{RMSNorm}(C_\psi(S; h^{(24)}, p))$: $S_k = F(S_{k-1})$ and $a_k = G(S_k)$.
- Training Loss: the number of unrolled steps $R$ is sampled from a heavy-tailed lognormal-Poisson distribution. At each step, predictions are supervised using mean-squared error, $\mathcal{L}_k = \lVert a_k - a^{\star} \rVert_2^2$. Memory is controlled using truncated backpropagation through time (TBPTT) over only the last $\tau$ steps.
- Adaptive Stopping: at inference, iteration stops when $\lVert a_k - a_{k-1} \rVert^2 < \delta$ for a small threshold $\delta$. Alternatives include normalized latent-space deltas.
3. Training and Inference Methodology
RD-VLA is trained by randomly varying recurrence lengths (sampled from a lognormal-Poisson distribution), ensuring robust convergence from diverse initial latent states and mitigating overfitting to any fixed number of steps. TBPTT restricts gradient computation to the final few unrolled steps, decoupling unroll length from memory usage. No explicit curriculum is employed beyond recurrence sampling; the model self-organizes, converging via extended latent refinement on complex multi-step problems and near one-shot responses on simple ones.
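A minimal sketch of the heavy-tailed recurrence-length sampling: the mixture form (a Poisson draw whose rate is itself lognormal) follows the text, while the hyperparameter values and the floor at one step are illustrative assumptions, not values from the paper:

```python
import numpy as np

def sample_recurrence(mean_log=1.0, sigma=0.5, r_min=1, rng=None):
    """Sample an unroll length R from a lognormal-Poisson mixture.

    A lognormal rate is drawn first, then R ~ Poisson(rate), floored at
    r_min so at least one refinement step always runs. Hyperparameters
    here are illustrative placeholders.
    """
    rng = rng or np.random.default_rng()
    rate = rng.lognormal(mean=mean_log, sigma=sigma)
    return max(r_min, rng.poisson(rate))

rng = np.random.default_rng(0)
samples = [sample_recurrence(rng=rng) for _ in range(10_000)]
print(min(samples), max(samples), float(np.mean(samples)))
```

The lognormal rate gives the distribution its heavy right tail, so training occasionally sees very deep unrolls, which is what lets the model remain stable when inference allocates many iterations.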
At test time, dynamic computational allocation is performed by setting an iteration cap $R_{\max}$ (e.g., 32), initializing $S_0$ and $a_0$, and iterating until the action-delta criterion is met or $R_{\max}$ is reached. Compute is automatically concentrated on samples requiring extended reasoning, aligning inference effort with problem complexity.
The inference procedure is summarized by the following pseudo-code:
```
function RD-VLA-Inference(obs, p, δ, R_max):
    h_vis+lat = VLM(obs)               # backbone pass
    S_pre = Prelude(h_vis+lat^{(12)})
    S = TruncNormal(...)               # scratchpad S₀
    a_prev = zeros(...)                # dummy previous action
    for k in 1…R_max:
        x = RMSNorm(γ_adapt · W_adapt [S; S_pre])
        S = R_θ(x; h_vis+lat^{(24)}, p)
        a = W_out · RMSNorm( C_ψ(S; h_vis+lat^{(24)}, p) )
        if k > 1 and ||a − a_prev||² < δ:
            return a                   # converged early
        a_prev = a
    return a
```
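As a runnable illustration of this control flow, the sketch below replaces the trained backbone and core with a fixed linear contraction (so successive action estimates are guaranteed to converge); only the loop structure and stopping criterion, not the model itself, correspond to RD-VLA:

```python
import numpy as np

def rd_vla_infer_toy(R_max=32, delta=1e-6, seed=0):
    """Toy adaptive-stopping loop.

    The 'core' is a contraction (spectral radius 0.5), standing in for one
    latent refinement step, so ||a_k - a_{k-1}||^2 shrinks geometrically and
    the loop exits well before R_max.
    """
    rng = np.random.default_rng(seed)
    A = 0.5 * np.eye(4)          # contraction standing in for R_theta
    b = rng.normal(size=4)       # conditioning (plays the role of h^(24), p)
    S = rng.normal(size=4)       # scratchpad S_0
    a_prev = np.zeros(4)         # dummy previous action
    for k in range(1, R_max + 1):
        S = A @ S + b            # weight-tied update: same A at every step
        a = np.tanh(S)           # 'coda' decode to a bounded action
        if k > 1 and np.sum((a - a_prev) ** 2) < delta:
            return a, k          # converged early
        a_prev = a
    return a, R_max

a, steps = rd_vla_infer_toy()
print(steps)
```

Because the update is a contraction, the latent settles to a fixed point and the action delta falls below `delta` after roughly a dozen steps here; with the trained model, harder inputs simply take more iterations to settle.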
4. Empirical Performance
RD-VLA has demonstrated state-of-the-art results on the LIBERO manipulation benchmark. Performance scales sharply with increased recurrent depth, as reflected in the following results:
| Recurrence | Spatial | Object | Goal | Long | Avg |
|---|---|---|---|---|---|
| 1 | 9.0 | 12.2 | 11.4 | 1.0 | 8.4 |
| 2 | 38.0 | 61.2 | 47.6 | 15.0 | 40.5 |
| 4 | 79.2 | 93.0 | 89.2 | 74.8 | 84.1 |
| 8 | 93.0 | 97.8 | 94.2 | 85.2 | 92.6 |
| 12 | 92.0 | 99.0 | 96.0 | 84.8 | 93.0 |
| 24 | 92.4 | 99.2 | 94.2 | 86.6 | 93.1 |
A pronounced performance jump occurs between $R=1$ (8.4%) and $R=4$ (84.1%). Certain long-horizon tasks exhibit 0% success at the lowest recurrence depths but over 90% at higher ones. With adaptive stopping, the architecture achieves 92.5% average success using 7.9 iterations on average, saving 34% of computation relative to fixed $R=12$ at negligible performance cost.
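The reported compute saving follows directly from the averaged iteration counts:

```python
# Adaptive stopping averages 7.9 iterations vs. a fixed unroll of 12.
avg_iters, fixed_R = 7.9, 12
saving = 1 - avg_iters / fixed_R
print(round(saving * 100))  # → 34
```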
Comparisons to previous VLA approaches show that RD-VLA (with 0.5B parameters) surpasses 3B parameter end-to-end and token-reasoning models, while achieving up to 80-fold inference speed-up over autoregressive chain-of-thought methods.
| Method | Params | Avg Success (%) |
|---|---|---|
| π₀-FAST (E2E) | 3B | 85.5 |
| Fast-ThinkAct (Token) | 3B | 89.7 |
| RD-VLA (fixed 12) | 0.5B | 93.0 |
| RD-VLA (adaptive) | 0.5B | 92.5 |
Because latent reasoning is continuous and does not require token emission, memory footprint is independent of reasoning depth at test time.
5. Implications for Continuous Action Reasoning
RD-VLA's reliance on latent iterative refinement in a continuous, non-tokenized space eliminates discretization artifacts and information bottlenecks imposed by chain-of-thought or diffusion-policy approaches. The design's weight-tied recurrent core imparts an "anytime" property: the plan can be refined to arbitrary quality by simply allocating more computational steps, constrained by a simple convergence criterion.
A plausible implication is that RD-VLA may be especially advantageous in real-world or safety-critical robotics, where action discretization or inefficiencies arising from linear scaling of memory with reasoning steps would be prohibitive. The exposure of convergence metrics further allows for interventions when latent divergence signifies uncertainty.
6. Limitations and Prospects
Beyond the empirically optimal recurrence depth (8–12 steps, task-dependent), further unrolling may result in "state saturation" or minor performance decay. Current adaptive stopping relies on action-space mean squared error; enhanced latent metrics such as cosine distance or KL divergence could refine convergence detection. Scaling to larger vision-LLM backbones (e.g., 7B or 50B) is expected to further improve out-of-distribution generalization. Hybrid systems that combine latent recurrence with occasional tokenized subgoals, as well as safety interventions triggered by divergence measures, are potential extensions.
RD-VLA demonstrates that purely latent, recurrent reasoning is sufficient for high success rates in long-horizon robotic control. The architecture enables test-time compute scaling, constant memory operation, and substantial acceleration relative to token-based reasoning policies, thus establishing a new paradigm for adaptive and efficient robotic policy synthesis (Tur et al., 8 Feb 2026).