StereoVLA: Stereo Vision-Language Action Model

Updated 2 January 2026
  • StereoVLA is a vision-language-action model that integrates stereo-derived geometric cues with semantic features to boost robotic manipulation accuracy.
  • The model fuses dual-camera stereo cost volumes with pretrained semantic signals via a Geometric-Semantic Feature Extraction module, ensuring precise depth estimation in interaction regions.
  • Its auxiliary IRDE task and robust design deliver notable gains—up to 33% improvement and resilience against camera pose variations—over monocular and multi-view baselines.

StereoVLA is a vision-language-action (VLA) model that explicitly leverages stereo vision, mimicking human binocular perception, to enhance spatial awareness and precision in robotic manipulation. The methodology strategically integrates advanced geometric feature processing from stereo image pairs with semantic signals from foundation models, supporting robust performance in diverse manipulation tasks, particularly where depth or fine-grained 3D structure is critical. Key contributions of StereoVLA include a Geometric-Semantic Feature Extraction module, a targeted Interaction-Region Depth Estimation (IRDE) auxiliary task, and systematic evaluations establishing strong improvements over leading monocular and multi-view baselines (Deng et al., 26 Dec 2025).

1. Geometric-Semantic Feature Extraction in StereoVLA

At the core of StereoVLA is the Geometric-Semantic Feature Extraction module, which processes synchronized left and right camera images

$I_L,\; I_R \in \mathbb{R}^{H\times W\times 3}$

to produce a fused representation encoding both stereo-derived geometric structure and monocular semantic context.

Geometric Feature Extraction

StereoVLA employs a shared-weight unary extractor $\Phi$:

$f_L = \Phi(I_L),\quad f_R = \Phi(I_R);\quad f_L, f_R \in \mathbb{R}^{C\times \tfrac{H}{4}\times \tfrac{W}{4}}$

to generate spatially reduced feature maps. Concatenating the left features with the correspondingly disparity-shifted right features at each candidate disparity forms a 4D stereo cost volume:

$V_c = \mathrm{concat}(f_L,\, f_R) \;\in\; \mathbb{R}^{2C\times \tfrac{D}{4}\times \tfrac{H}{4}\times \tfrac{W}{4}}$

with $D$ as the maximum disparity range. A hybrid cost-filtering module $\Psi$ (attention plus convolution) produces filtered features:

$V_c' = \Psi(V_c) \;\in\; \mathbb{R}^{C'\times \tfrac{D}{4}\times \tfrac{H}{4}\times \tfrac{W}{4}}$

defining the dense geometric tensor $g = f_{\mathrm{geom}}(I_L, I_R) = V_c'$.
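
A minimal PyTorch sketch of this stage, assuming a PSMNet-style concatenation volume; the toy convolutional unary extractor and the stacked 3D convolutions are stand-ins for StereoVLA's actual $\Phi$ and hybrid attention-plus-convolution filter $\Psi$, whose internals are not detailed above.

import torch
import torch.nn as nn

def build_concat_cost_volume(f_left, f_right, max_disp):
    """f_left, f_right: (B, C, H/4, W/4); returns V_c of shape (B, 2C, D/4, H/4, W/4)."""
    B, C, H, W = f_left.shape
    volume = f_left.new_zeros(B, 2 * C, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            volume[:, :C, d] = f_left
            volume[:, C:, d] = f_right
        else:
            # pair each left pixel with the right pixel shifted by disparity d
            volume[:, :C, d, :, d:] = f_left[:, :, :, d:]
            volume[:, C:, d, :, d:] = f_right[:, :, :, :-d]
    return volume

class ToyCostFilter(nn.Module):
    """Stand-in for the hybrid filter: plain 3D convolutions instead of attention + convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
    def forward(self, v):
        return self.net(v)

phi = nn.Conv2d(3, 32, kernel_size=7, stride=4, padding=3)   # toy shared-weight unary extractor
I_L, I_R = torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224)
f_L, f_R = phi(I_L), phi(I_R)                                # (1, 32, 56, 56) each
V_c = build_concat_cost_volume(f_L, f_R, max_disp=48)        # (1, 64, 48, 56, 56), i.e. D/4 = 48
g = ToyCostFilter(64, 16)(V_c)                               # filtered volume V_c' = g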

Semantic Feature Extraction

To supplement depth cues with rich semantics, StereoVLA applies two pretrained semantic "foundation" heads to $I_L$:

  • SigLIP: $s_{\mathrm{sig}} = f_{\mathrm{sig}}(I_L) \in \mathbb{R}^{K_1\times H_s\times W_s}$
  • DINOv2: $s_{\mathrm{dino}} = f_{\mathrm{dino}}(I_L) \in \mathbb{R}^{K_2\times H_s\times W_s}$

Combined, $s = f_{\mathrm{sem}}(I_L) = [\,s_{\mathrm{sig}};\, s_{\mathrm{dino}}\,] \in \mathbb{R}^{(K_1+K_2)\times H_s\times W_s}$.
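
As an illustration, the publicly released SigLIP and DINOv2 encoders can supply these feature maps; the specific checkpoints, the omitted per-encoder preprocessing, and the resampling of the DINOv2 grid onto SigLIP's are assumptions of this sketch rather than details from the paper.

import torch
import torch.nn.functional as F
from transformers import SiglipVisionModel

siglip = SiglipVisionModel.from_pretrained("google/siglip-base-patch16-224").eval()
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()

@torch.no_grad()
def semantic_features(I_L):
    """I_L: (B, 3, 224, 224). Proper per-encoder normalization is skipped in this sketch."""
    sig_tokens = siglip(pixel_values=I_L).last_hidden_state             # (B, 196, 768), 14x14 grid
    s_sig = sig_tokens.transpose(1, 2).reshape(-1, 768, 14, 14)
    dino_tokens = dinov2.forward_features(I_L)["x_norm_patchtokens"]    # (B, 256, 768), 16x16 grid
    s_dino = dino_tokens.transpose(1, 2).reshape(-1, 768, 16, 16)
    # the two grids differ, so resample the DINOv2 map onto SigLIP's 14x14 grid (an assumption)
    s_dino = F.interpolate(s_dino, size=(14, 14), mode="bilinear", align_corners=False)
    return torch.cat([s_sig, s_dino], dim=1)                            # s: (B, 1536, 14, 14)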

Feature Fusion

Pooling along the disparity axis aligns $g$ with the semantic feature-map size:

$g_p = \mathrm{Pool}(g) \;\in\; \mathbb{R}^{C''\times H_s\times W_s}$

The final fusion concatenates the pooled geometric and semantic maps along the channel dimension:

$u = [\,g_p;\, s_{\mathrm{sig}};\, s_{\mathrm{dino}}\,] \in \mathbb{R}^{(C''+K_1+K_2)\times H_s\times W_s}$

The result is flattened and projected:

$h = f_{\mathrm{fuse}}(g, s) = \mathrm{MLP}(\mathrm{Flatten}(u)) \in \mathbb{R}^{N\times D_h}$

where $N = H_s W_s$ and $D_h$ is the token dimension. This token sequence forms the unified visual embedding for subsequent vision-language processing.
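
A minimal sketch of this fusion step, assuming the filtered cost volume $g$ and the two semantic maps are already computed; the mean-pooling over disparity, the bilinear resize, and the MLP width are assumptions rather than reported design details.

import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_geometric_semantic(g, s_sig, s_dino, mlp):
    """g: (B, C'', D/4, H/4, W/4); s_sig: (B, K1, Hs, Ws); s_dino: (B, K2, Hs, Ws)."""
    Hs, Ws = s_sig.shape[-2:]
    g_p = g.mean(dim=2)                                        # pool over the disparity axis
    g_p = F.interpolate(g_p, size=(Hs, Ws), mode="bilinear",
                        align_corners=False)                   # align to the semantic grid if needed
    u = torch.cat([g_p, s_sig, s_dino], dim=1)                 # (B, C''+K1+K2, Hs, Ws)
    tokens = u.flatten(2).transpose(1, 2)                      # (B, N = Hs*Ws, C''+K1+K2)
    return mlp(tokens)                                         # h: (B, N, D_h)

B, C2, K1, K2, Hs, Ws, Dh = 1, 16, 768, 768, 14, 14, 1024
mlp = nn.Sequential(nn.Linear(C2 + K1 + K2, Dh), nn.GELU(), nn.Linear(Dh, Dh))
g = torch.randn(B, C2, 48, 56, 56)
s_sig, s_dino = torch.randn(B, K1, Hs, Ws), torch.randn(B, K2, Hs, Ws)
h = fuse_geometric_semantic(g, s_sig, s_dino, mlp)             # (1, 196, 1024) visual tokens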

2. Auxiliary Interaction-Region Depth Estimation (IRDE)

StereoVLA introduces the Interaction-Region Depth Estimation (IRDE) as an auxiliary training task, focusing depth supervision on the areas most relevant for manipulation.

Task Definition and Loss

During training, a point $(x, y)$ is sampled within the "interaction region"

$R = \mathrm{BBox}_{\mathrm{object}} \cup \mathrm{BBox}_{\mathrm{gripper}}$

as computed on $I_L$. The system predicts a depth value $\hat d(x, y)$ using the fused visual-language features, with ground-truth supervision $d_{\mathrm{gt}}(x, y)$ available from simulation.
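
A small sketch of this sampling step; rejection sampling over the enclosing rectangle of the two boxes is an implementation choice assumed here (it yields a uniform draw over the union), not a detail taken from the paper.

import random

def sample_point_in_interaction_region(bbox_object, bbox_gripper):
    """Boxes are (x_min, y_min, x_max, y_max) in left-image pixel coordinates."""
    def contains(box, x, y):
        return box[0] <= x <= box[2] and box[1] <= y <= box[3]

    x_lo = min(bbox_object[0], bbox_gripper[0]); x_hi = max(bbox_object[2], bbox_gripper[2])
    y_lo = min(bbox_object[1], bbox_gripper[1]); y_hi = max(bbox_object[3], bbox_gripper[3])
    while True:  # accept only points that fall inside at least one of the two boxes
        x, y = random.uniform(x_lo, x_hi), random.uniform(y_lo, y_hi)
        if contains(bbox_object, x, y) or contains(bbox_gripper, x, y):
            return x, y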

Depth prediction is formulated as a classification over discretized bins, using cross-entropy:

$\mathcal{L}_{\mathrm{IRDE}} = -\sum_{k=1}^{K} y_d^{(k)} \log p(\hat d = k)$

where $y_d$ is the one-hot target vector and $(x, y) \sim \mathrm{Uniform}(R)$. The IRDE loss is incorporated into the global objective:

$\mathcal{L} = \mathcal{L}_{\mathrm{action}} + \lambda_d \mathcal{L}_{\mathrm{IRDE}} + \lambda_b \mathcal{L}_{\mathrm{bbox}} + \lambda_p \mathcal{L}_{\mathrm{pose}}$

with weighting ratios (flow:depth:bbox:pose = 5:2:2:1) set for joint optimization.

This auxiliary depth supervision is claimed to focus the model's geometric attention on task-relevant locations, thereby accelerating convergence and improving spatial selectivity.
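
The objective can be sketched as follows; the number of depth bins and the metric depth range are assumptions, and the weights simply encode the 5:2:2:1 ratio stated above.

import torch
import torch.nn.functional as F

def irde_loss(depth_logits, d_gt, d_min=0.1, d_max=2.0):
    """depth_logits: (B, K) for the sampled point; d_gt: (B,) ground-truth depth in meters."""
    K = depth_logits.shape[-1]
    edges = torch.linspace(d_min, d_max, K + 1, device=d_gt.device)
    target = torch.bucketize(d_gt.clamp(d_min, d_max), edges[1:-1])  # bin index y_d in [0, K-1]
    return F.cross_entropy(depth_logits, target)                     # -sum_k y_d^(k) log p(d_hat = k)

def total_loss(l_action, l_irde, l_bbox, l_pose):
    # flow : depth : bbox : pose = 5 : 2 : 2 : 1, i.e. lambda_d = lambda_b = 0.4, lambda_p = 0.2
    return l_action + 0.4 * l_irde + 0.4 * l_bbox + 0.2 * l_pose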

3. Model Architecture and Training Regimen

Dataflow and Modules

The model pipeline consists of visual tokenization, multi-modal token joint encoding, and hierarchical action prediction:

function FORWARD(I_L, I_R, instruction):
    g = f_geom(I_L, I_R)             # Stereo geometry
    s = f_sem(I_L)                   # Monocular semantics
    h = f_fuse(g, s)                 # Fused tokens

    v_tokens = h
    l_tokens = tokenize_text(instruction)
    hlm_out = InternLM_1.8B(v_tokens, l_tokens)

    action_chunk = ActionExpert(hlm_out)

    if training:
        (x, y) = sample_point_in_interaction_region()
        depth_pred = DepthHead(hlm_out, (x, y))
        bbox_pred = BBoxHead(hlm_out)
        pose_pred = PoseHead(hlm_out)
        # weights follow the 5:2:2:1 flow:depth:bbox:pose ratio
        L = L_action(action_chunk) + lambda_d * L_depth(depth_pred, d_gt(x, y)) + \
            lambda_b * L_bbox(bbox_pred, bbox_gt) + lambda_p * L_pose(pose_pred, pose_gt)
        return action_chunk, L
    else:
        return action_chunk

Data Sources and Optimization

Training leverages:

  • Synthetic MuJoCo and Isaac-Sim dataset: 5 million 224×224 stereo trajectories (10 Hz, ±5% stereo baseline variation).
  • GRIT (internet-scale) for 2D auxiliary grounding.
  • Optimization via AdamW (learning rate $1.6 \times 10^{-4}$), batch size 384, 32 NVIDIA H800 GPUs, 160k steps (see the sketch after this list).
  • Progressive action generation: box prediction → keyframe → trajectory.
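
For reference, a minimal optimizer setup under these settings; the weight-decay value and any warmup or decay schedule are assumptions, as they are not reported above.

import torch

def build_optimizer(model, lr=1.6e-4, weight_decay=0.01):
    # AdamW at the reported learning rate; batch size 384 and the 160k-step budget
    # are handled by the surrounding training loop (not shown).
    return torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)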

This regimen facilitates both robust geometric grounding and effective language-conditioned policy learning in complex manipulation domains.

4. Empirical Performance and Ablation Analyses

Real-Robot Results

StereoVLA demonstrated substantial performance improvements over baselines on more than 450 real-robot trials across general pick-and-place and geometric alignment tasks:

Task | StereoVLA | Baselines
Bar @ 0°/45°/90° | ~100% / 95% / 100% | 60–80%
Small objects (1–2 cm) | ~30% (1 try) | 0%
Overall gain: +33% absolute.

This suggests the stereo-geometric approach directly addresses the spatial ambiguities found in monocular and depth-augmented VLA systems.

Ablations

  • Feature Source: The filtered cost volume $V_c'$ with semantic fusion increased simulated success from 51% to 77% (compared to 27% for a raw correlation volume $V_{\mathrm{corr}}$ without semantics); adding semantics consistently yielded about +26 points.
  • Fusion Method: Channel concatenation outperformed sequence concatenation (+3% success at roughly half the compute).
  • IRDE Sampling: Interaction-region depth sampling achieved 77% success versus 64% for uniform and 58% for no depth auxiliary.

Qualitative Insights

StereoVLA more reliably aligns the gripper with challenging object geometries (e.g., bar orientation) and attends to precise 2D locations necessary for manipulating small and medium objects.

5. Robustness to Camera Pose Variation

StereoVLA was evaluated under camera-pose perturbations using datasets with small, medium, and large spherical-shell-based pose changes.

Model (camera setup) | Small perturbation | Medium perturbation | Large perturbation
SpatialVLA-D (1-view) | 24.6% | 13.7% | 6.8%
π₀.₅ (front + wrist) | 64.3% | 56.5% | 51.6%
GraspVLA (front + wrist) | 71.3% | 63.4% | 54.8%
GraspVLA (front + side) | 82.5% | 55.7% | 24.1%
StereoVLA | 79.3% | 71.9% | 61.3%

StereoVLA exhibited the highest robustness under medium and large extrinsic perturbations. The results indicate that stereo-derived parallax cues remain reliable under camera/object pose variation, whereas multi-view approaches deteriorate once their cross-view geometric consistency breaks down.

A plausible implication is that stereo fusion provides a consistent geometric reference irrespective of modest to large baseline shifts, maintaining action reliability.

6. Theoretical and Practical Implications

Binocular disparity, as implemented in StereoVLA, delivers dense spatial gradients crucial for resolving depth ambiguities, especially in cluttered or visually complex environments. Integration with strong semantic priors from pretrained foundation models allows precise alignment between geometric cues and language referents. The IRDE auxiliary task further focuses supervision on task-relevant regions, improving the convergence rate and robustness of spatial reasoning.

Overall, StereoVLA demonstrates that mature stereo representations, when coupled with vision-language modeling and auxiliary region-focused depth supervision, yield state-of-the-art manipulation accuracy and reliability for generalist robotic systems, particularly under real-world sensor and pose variations (Deng et al., 26 Dec 2025).
