StereoVLA: Stereo Vision-Language Action Model

Updated 2 January 2026
  • StereoVLA is a vision-language-action model that integrates stereo-derived geometric cues with semantic features to boost robotic manipulation accuracy.
  • The model fuses dual-camera stereo cost volumes with pretrained semantic signals via a Geometric-Semantic Feature Extraction module, ensuring precise depth estimation in interaction regions.
  • Its auxiliary IRDE task and robust design deliver notable gains—up to 33% improvement and resilience against camera pose variations—over monocular and multi-view baselines.

StereoVLA is a vision-language-action (VLA) model that explicitly leverages stereo vision, mimicking human binocular perception, to enhance spatial awareness and precision in robotic manipulation. The methodology strategically integrates advanced geometric feature processing from stereo image pairs with semantic signals from foundation models, supporting robust performance in diverse manipulation tasks, particularly where depth or fine-grained 3D structure is critical. Key contributions of StereoVLA include a Geometric-Semantic Feature Extraction module, a targeted Interaction-Region Depth Estimation (IRDE) auxiliary task, and systematic evaluations establishing strong improvements over leading monocular and multi-view baselines (Deng et al., 26 Dec 2025).

1. Geometric-Semantic Feature Extraction in StereoVLA

At the core of StereoVLA is the Geometric-Semantic Feature Extraction module, which processes synchronized left and right camera images

$I_L,\; I_R \in \mathbb{R}^{H\times W\times 3}$

to produce a fused representation encoding both stereo-derived geometric structure and monocular semantic context.

Geometric Feature Extraction

StereoVLA employs a shared-weight unary extractor $\Phi$:

$f_L = \Phi(I_L),\quad f_R = \Phi(I_R);\quad f_L, f_R \in \mathbb{R}^{C\times \tfrac{H}{4}\times \tfrac{W}{4}}$

to generate spatially reduced feature maps. Concatenating the left features with the correspondingly disparity-shifted right features at each candidate disparity forms a 4D stereo cost volume:

$V_c = \mathrm{concat}(f_L,\, f_R) \;\in\; \mathbb{R}^{2C\times \tfrac{D}{4}\times \tfrac{H}{4}\times \tfrac{W}{4}}$

with $D$ as the maximum disparity range. A hybrid cost-filtering module $\Psi$ (attention plus convolution) produces filtered features:

$V_c' = \Psi(V_c) \;\in\; \mathbb{R}^{C'\times \tfrac{D}{4}\times \tfrac{H}{4}\times \tfrac{W}{4}}$

defining the dense geometric tensor $g = f_{\mathrm{geom}}(I_L, I_R) = V_c'$.
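
A minimal PyTorch sketch of this stage, assuming a PSMNet-style concatenation volume; the toy convolutional unary extractor and the stacked 3D convolutions are stand-ins for StereoVLA's actual $\Phi$ and hybrid attention-plus-convolution filter $\Psi$, whose internals are not detailed above.

import torch
import torch.nn as nn

def build_concat_cost_volume(f_left, f_right, max_disp):
    """f_left, f_right: (B, C, H/4, W/4); returns V_c of shape (B, 2C, D/4, H/4, W/4)."""
    B, C, H, W = f_left.shape
    volume = f_left.new_zeros(B, 2 * C, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            volume[:, :C, d] = f_left
            volume[:, C:, d] = f_right
        else:
            # pair each left pixel with the right pixel shifted by disparity d
            volume[:, :C, d, :, d:] = f_left[:, :, :, d:]
            volume[:, C:, d, :, d:] = f_right[:, :, :, :-d]
    return volume

class ToyCostFilter(nn.Module):
    """Stand-in for the hybrid filter: plain 3D convolutions instead of attention + convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
    def forward(self, v):
        return self.net(v)

phi = nn.Conv2d(3, 32, kernel_size=7, stride=4, padding=3)   # toy shared-weight unary extractor
I_L, I_R = torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224)
f_L, f_R = phi(I_L), phi(I_R)                                # (1, 32, 56, 56) each
V_c = build_concat_cost_volume(f_L, f_R, max_disp=48)        # (1, 64, 48, 56, 56), i.e. D/4 = 48
g = ToyCostFilter(64, 16)(V_c)                               # filtered volume V_c' = g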

Semantic Feature Extraction

To supplement depth cues with rich semantics, StereoVLA applies two pretrained semantic "foundation" heads to $I_L$:

  • SigLIP: $s_{\mathrm{sig}} = f_{\mathrm{sig}}(I_L) \in \mathbb{R}^{K_1\times H_s\times W_s}$
  • DINOv2: $s_{\mathrm{dino}} = f_{\mathrm{dino}}(I_L) \in \mathbb{R}^{K_2\times H_s\times W_s}$

Combined, $s = f_{\mathrm{sem}}(I_L) = [\,s_{\mathrm{sig}};\, s_{\mathrm{dino}}\,] \in \mathbb{R}^{(K_1+K_2)\times H_s\times W_s}$.
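
As an illustration, the publicly released SigLIP and DINOv2 encoders can supply these feature maps; the specific checkpoints, the omitted per-encoder preprocessing, and the resampling of the DINOv2 grid onto SigLIP's are assumptions of this sketch rather than details from the paper.

import torch
import torch.nn.functional as F
from transformers import SiglipVisionModel

siglip = SiglipVisionModel.from_pretrained("google/siglip-base-patch16-224").eval()
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()

@torch.no_grad()
def semantic_features(I_L):
    """I_L: (B, 3, 224, 224). Proper per-encoder normalization is skipped in this sketch."""
    sig_tokens = siglip(pixel_values=I_L).last_hidden_state             # (B, 196, 768), 14x14 grid
    s_sig = sig_tokens.transpose(1, 2).reshape(-1, 768, 14, 14)
    dino_tokens = dinov2.forward_features(I_L)["x_norm_patchtokens"]    # (B, 256, 768), 16x16 grid
    s_dino = dino_tokens.transpose(1, 2).reshape(-1, 768, 16, 16)
    # the two grids differ, so resample the DINOv2 map onto SigLIP's 14x14 grid (an assumption)
    s_dino = F.interpolate(s_dino, size=(14, 14), mode="bilinear", align_corners=False)
    return torch.cat([s_sig, s_dino], dim=1)                            # s: (B, 1536, 14, 14)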

Feature Fusion

Pooling along the disparity axis aligns $g$ with the semantic feature-map size:

$g_p = \mathrm{Pool}(g) \;\in\; \mathbb{R}^{C''\times H_s\times W_s}$

The final fusion concatenates the pooled geometric and semantic maps along the channel dimension:

$u = [\,g_p;\, s_{\mathrm{sig}};\, s_{\mathrm{dino}}\,] \in \mathbb{R}^{(C''+K_1+K_2)\times H_s\times W_s}$

The result is flattened and projected:

$h = f_{\mathrm{fuse}}(g, s) = \mathrm{MLP}(\mathrm{Flatten}(u)) \in \mathbb{R}^{N\times D_h}$

where $N = H_s W_s$ and $D_h$ is the token dimension. This token sequence forms the unified visual embedding for subsequent vision-language processing.
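
A minimal sketch of this fusion step, assuming the filtered cost volume $g$ and the two semantic maps are already computed; the mean-pooling over disparity, the bilinear resize, and the MLP width are assumptions rather than reported design details.

import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_geometric_semantic(g, s_sig, s_dino, mlp):
    """g: (B, C'', D/4, H/4, W/4); s_sig: (B, K1, Hs, Ws); s_dino: (B, K2, Hs, Ws)."""
    Hs, Ws = s_sig.shape[-2:]
    g_p = g.mean(dim=2)                                        # pool over the disparity axis
    g_p = F.interpolate(g_p, size=(Hs, Ws), mode="bilinear",
                        align_corners=False)                   # align to the semantic grid if needed
    u = torch.cat([g_p, s_sig, s_dino], dim=1)                 # (B, C''+K1+K2, Hs, Ws)
    tokens = u.flatten(2).transpose(1, 2)                      # (B, N = Hs*Ws, C''+K1+K2)
    return mlp(tokens)                                         # h: (B, N, D_h)

B, C2, K1, K2, Hs, Ws, Dh = 1, 16, 768, 768, 14, 14, 1024
mlp = nn.Sequential(nn.Linear(C2 + K1 + K2, Dh), nn.GELU(), nn.Linear(Dh, Dh))
g = torch.randn(B, C2, 48, 56, 56)
s_sig, s_dino = torch.randn(B, K1, Hs, Ws), torch.randn(B, K2, Hs, Ws)
h = fuse_geometric_semantic(g, s_sig, s_dino, mlp)             # (1, 196, 1024) visual tokens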

2. Auxiliary Interaction-Region Depth Estimation (IRDE)

StereoVLA introduces the Interaction-Region Depth Estimation (IRDE) as an auxiliary training task, focusing depth supervision on the areas most relevant for manipulation.

Task Definition and Loss

During training, a point $(x, y)$ is sampled within the "interaction region"

$R = \mathrm{BBox}_{\mathrm{object}} \cup \mathrm{BBox}_{\mathrm{gripper}}$

as computed on $I_L$. The system predicts a depth value $\hat d(x, y)$ using the fused visual-language features, with ground-truth supervision $d_{\mathrm{gt}}(x, y)$ available from simulation.
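
A small sketch of this sampling step; rejection sampling over the enclosing rectangle of the two boxes is an implementation choice assumed here (it yields a uniform draw over the union), not a detail taken from the paper.

import random

def sample_point_in_interaction_region(bbox_object, bbox_gripper):
    """Boxes are (x_min, y_min, x_max, y_max) in left-image pixel coordinates."""
    def contains(box, x, y):
        return box[0] <= x <= box[2] and box[1] <= y <= box[3]

    x_lo = min(bbox_object[0], bbox_gripper[0]); x_hi = max(bbox_object[2], bbox_gripper[2])
    y_lo = min(bbox_object[1], bbox_gripper[1]); y_hi = max(bbox_object[3], bbox_gripper[3])
    while True:  # accept only points that fall inside at least one of the two boxes
        x, y = random.uniform(x_lo, x_hi), random.uniform(y_lo, y_hi)
        if contains(bbox_object, x, y) or contains(bbox_gripper, x, y):
            return x, y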

Depth prediction is formulated as a classification over discretized bins, using cross-entropy:

$\mathcal{L}_{\mathrm{IRDE}} = -\sum_{k=1}^{K} y_d^{(k)} \log p(\hat d = k)$

where $y_d$ is the one-hot target vector and $(x, y) \sim \mathrm{Uniform}(R)$. The IRDE loss is incorporated into the global objective:

$\mathcal{L} = \mathcal{L}_{\mathrm{action}} + \lambda_d \mathcal{L}_{\mathrm{IRDE}} + \lambda_b \mathcal{L}_{\mathrm{bbox}} + \lambda_p \mathcal{L}_{\mathrm{pose}}$

with weighting ratios (flow:depth:bbox:pose = 5:2:2:1) set for joint optimization.

This auxiliary depth supervision is claimed to focus the model's geometric attention on task-relevant locations, thereby accelerating convergence and improving spatial selectivity.
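
The objective can be sketched as follows; the number of depth bins and the metric depth range are assumptions, and the weights simply encode the 5:2:2:1 ratio stated above.

import torch
import torch.nn.functional as F

def irde_loss(depth_logits, d_gt, d_min=0.1, d_max=2.0):
    """depth_logits: (B, K) for the sampled point; d_gt: (B,) ground-truth depth in meters."""
    K = depth_logits.shape[-1]
    edges = torch.linspace(d_min, d_max, K + 1, device=d_gt.device)
    target = torch.bucketize(d_gt.clamp(d_min, d_max), edges[1:-1])  # bin index y_d in [0, K-1]
    return F.cross_entropy(depth_logits, target)                     # -sum_k y_d^(k) log p(d_hat = k)

def total_loss(l_action, l_irde, l_bbox, l_pose):
    # flow : depth : bbox : pose = 5 : 2 : 2 : 1, i.e. lambda_d = lambda_b = 0.4, lambda_p = 0.2
    return l_action + 0.4 * l_irde + 0.4 * l_bbox + 0.2 * l_pose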

3. Model Architecture and Training Regimen

Dataflow and Modules

The model pipeline consists of visual tokenization, multi-modal token joint encoding, and hierarchical action prediction:

function FORWARD(I_L, I_R, instruction):
    g = f_geom(I_L, I_R)             # Stereo geometry
    s = f_sem(I_L)                   # Monocular semantics
    h = f_fuse(g, s)                 # Fused tokens

    v_tokens = h
    l_tokens = tokenize_text(instruction)
    hlm_out = InternLM_1.8B(v_tokens, l_tokens)

    action_chunk = ActionExpert(hlm_out)

    if training:
        (x, y) = sample_point_in_interaction_region()
        depth_pred = DepthHead(hlm_out, (x, y))
        bbox_pred = BBoxHead(hlm_out)
        pose_pred = PoseHead(hlm_out)
        # weights follow the 5:2:2:1 flow:depth:bbox:pose ratio
        L = L_action(action_chunk) + lambda_d * L_depth(depth_pred, d_gt(x, y)) + \
            lambda_b * L_bbox(bbox_pred, bbox_gt) + lambda_p * L_pose(pose_pred, pose_gt)
        return action_chunk, L
    else:
        return action_chunk

Data Sources and Optimization

Training leverages:

  • Synthetic MuJoCo and Isaac-Sim dataset: 5 million 224×224 stereo trajectories (10 Hz, ±5% stereo baseline variation).
  • GRIT (internet-scale) for 2D auxiliary grounding.
  • Optimization via AdamW (learning rate $1.6 \times 10^{-4}$), batch size 384, 32 NVIDIA H800 GPUs, 160k steps (see the sketch after this list).
  • Progressive action generation: box prediction → keyframe → trajectory.
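
For reference, a minimal optimizer setup under these settings; the weight-decay value and any warmup or decay schedule are assumptions, as they are not reported above.

import torch

def build_optimizer(model, lr=1.6e-4, weight_decay=0.01):
    # AdamW at the reported learning rate; batch size 384 and the 160k-step budget
    # are handled by the surrounding training loop (not shown).
    return torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)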

This regimen facilitates both robust geometric grounding and effective language-conditioned policy learning in complex manipulation domains.

4. Empirical Performance and Ablation Analyses

Real-Robot Results

StereoVLA demonstrated substantial performance improvements over baselines on more than 450 real-robot trials across general pick-and-place and geometric alignment tasks:

Task | StereoVLA | Baselines
Bar @ 0°/45°/90° | ~100% / 95% / 100% | 60–80%
Small objects (1–2 cm) | ~30% (1 try) | 0%
Overall gain: +33% absolute.

This suggests the stereo-geometric approach directly addresses the spatial ambiguities found in monocular and depth-augmented VLA systems.

Ablations

  • Feature Source: The filtered cost volume $V_c'$ with semantic fusion increased simulated success from 51% to 77% (compared to 27% for a raw correlation volume $V_{\mathrm{corr}}$ without semantics); adding semantics consistently yielded about +26 points.
  • Fusion Method: Channel concatenation outperformed sequence concatenation (+3% success at roughly half the compute).
  • IRDE Sampling: Interaction-region depth sampling achieved 77% success versus 64% for uniform and 58% for no depth auxiliary.

Qualitative Insights

StereoVLA more reliably aligns the gripper with challenging object geometries (e.g., bar orientation) and attends to precise 2D locations necessary for manipulating small and medium objects.

5. Robustness to Camera Pose Variation

StereoVLA was evaluated under camera-pose perturbations using datasets with small, medium, and large spherical-shell-based pose changes.

Model (camera setup) | Small perturbation | Medium perturbation | Large perturbation
SpatialVLA-D (1-view) | 24.6% | 13.7% | 6.8%
π₀.₅ (front + wrist) | 64.3% | 56.5% | 51.6%
GraspVLA (front + wrist) | 71.3% | 63.4% | 54.8%
GraspVLA (front + side) | 82.5% | 55.7% | 24.1%
StereoVLA | 79.3% | 71.9% | 61.3%

StereoVLA exhibited the highest robustness under medium and large extrinsic perturbations. The results indicate that stereo-derived parallax cues remain reliable under camera/object pose variation, whereas multi-view approaches deteriorate once their cross-view geometric consistency breaks down.

A plausible implication is that stereo fusion provides a consistent geometric reference irrespective of modest to large baseline shifts, maintaining action reliability.

6. Theoretical and Practical Implications

Binocular disparity, as implemented in StereoVLA, delivers dense spatial gradients crucial for resolving depth ambiguities, especially in cluttered or visually complex environments. Integration with strong semantic priors from pretrained foundation models allows precise alignment between geometric cues and language referents. The IRDE auxiliary task further focuses supervision on task-relevant regions, improving the convergence rate and robustness of spatial reasoning.

Overall, StereoVLA demonstrates that mature stereo representations, when coupled with vision-language modeling and auxiliary region-focused depth supervision, yield state-of-the-art manipulation accuracy and reliability for generalist robotic systems, particularly under real-world sensor and pose variations (Deng et al., 26 Dec 2025).
