StereoVLA: Stereo Vision-Language Action Model
- StereoVLA is a vision-language-action model that integrates stereo-derived geometric cues with semantic features to boost robotic manipulation accuracy.
- The model fuses dual-camera stereo cost volumes with pretrained semantic signals via a Geometric-Semantic Feature Extraction module, ensuring precise depth estimation in interaction regions.
- Its auxiliary IRDE task and robust design deliver notable gains—up to 33% improvement and resilience against camera pose variations—over monocular and multi-view baselines.
StereoVLA is a vision-language-action (VLA) model that explicitly leverages stereo vision, mimicking human binocular perception, to enhance spatial awareness and precision in robotic manipulation. The methodology strategically integrates advanced geometric feature processing from stereo image pairs with semantic signals from foundation models, supporting robust performance in diverse manipulation tasks, particularly where depth or fine-grained 3D structure is critical. Key contributions of StereoVLA include a Geometric-Semantic Feature Extraction module, a targeted Interaction-Region Depth Estimation (IRDE) auxiliary task, and systematic evaluations establishing strong improvements over leading monocular and multi-view baselines (Deng et al., 26 Dec 2025).
1. Geometric-Semantic Feature Extraction in StereoVLA
At the core of StereoVLA is the Geometric-Semantic Feature Extraction module, which processes synchronized left and right camera images $(I_L, I_R)$ to produce a fused representation encoding both stereo-derived geometric structure and monocular semantic context.
Geometric Feature Extraction
StereoVLA employs a shared-weight unary feature extractor applied to each image,
$f_L,\,f_R\;\in\;\mathbb{R}^{C\times \tfrac{H}{4}\times \tfrac{W}{4}},$
to generate spatially-reduced feature maps. Channel-wise concatenation across candidate disparities forms a 4D stereo cost volume:
$V_c = \mathrm{concat}(f_L,\,f_R)\;\in\;\mathbb{R}^{2C\times \tfrac{D}{4}\times \tfrac{H}{4}\times \tfrac{W}{4}}$
with $D$ as the maximum disparity range. A hybrid cost-filtering module $\Psi$ (attention plus convolution) produces filtered features:
$V_c' = \Psi(V_c)\;\in\;\mathbb{R}^{C'\times \tfrac{D}{4}\times \tfrac{H}{4}\times \tfrac{W}{4}}$
The filtered volume $V_c'$ serves as the dense geometric feature tensor.
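A minimal PyTorch sketch of this construction, assuming a concatenation-style cost volume built by shifting the right-image features over candidate disparities; the small 3D-convolution stack merely stands in for the paper's hybrid attention-plus-convolution filter $\Psi$, and all channel/disparity sizes are illustrative.

```python
import torch
import torch.nn as nn

def build_concat_cost_volume(f_left, f_right, max_disp):
    """Channel-concatenation cost volume V_c of shape (B, 2C, D, H, W).

    f_left, f_right: quarter-resolution unary features (B, C, H, W) from a
    shared-weight extractor; max_disp is the disparity range at that scale.
    """
    B, C, H, W = f_left.shape
    volume = f_left.new_zeros(B, 2 * C, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            volume[:, :C, d] = f_left
            volume[:, C:, d] = f_right
        else:
            # Shift right-image features by d pixels before pairing.
            volume[:, :C, d, :, d:] = f_left[:, :, :, d:]
            volume[:, C:, d, :, d:] = f_right[:, :, :, :-d]
    return volume

class CostFilter(nn.Module):
    """Stand-in for the hybrid attention+convolution filter Psi (illustrative)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1),
        )
    def forward(self, v):
        return self.net(v)

# Example: C=32 unary features at 1/4 resolution, D/4 = 48 disparity levels.
f_L = torch.randn(1, 32, 56, 56)
f_R = torch.randn(1, 32, 56, 56)
V_c = build_concat_cost_volume(f_L, f_R, max_disp=48)   # (1, 64, 48, 56, 56)
V_c_filtered = CostFilter(64, 16)(V_c)                  # (1, 16, 48, 56, 56)
```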
Semantic Feature Extraction
To supplement depth cues with rich semantics, StereoVLA applies two pretrained semantic "foundation" heads to the left image $I_L$:
- SigLIP: $f_{\mathrm{SigLIP}} = \mathrm{SigLIP}(I_L)$
- DINOv2: $f_{\mathrm{DINO}} = \mathrm{DINOv2}(I_L)$
Combined, $f_{\mathrm{sem}} = \mathrm{concat}(f_{\mathrm{SigLIP}},\,f_{\mathrm{DINO}})$.
Feature Fusion
Spatial pooling along the disparity axis aligns the filtered cost volume to the semantic map size:
$\bar{V} = \mathrm{pool}_D(V_c')\;\in\;\mathbb{R}^{C'\times \tfrac{H}{4}\times \tfrac{W}{4}}$
Final fusion concatenates along the channel dimension:
$F = \mathrm{concat}(\bar{V},\,f_{\mathrm{sem}})$
The result is flattened and projected into visual tokens:
$z = \mathrm{proj}(\mathrm{flatten}(F))\;\in\;\mathbb{R}^{N\times d}$
where $N$ is the number of visual tokens and $d$ is the token dimension. This token sequence forms the unified visual embedding for subsequent vision-language processing.
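A hedged sketch of this fusion step under the assumptions above: the filtered cost volume is mean-pooled along the disparity axis, concatenated channel-wise with the semantic features (assumed to be SigLIP and DINOv2 patch features already resampled to the same spatial grid), then flattened and linearly projected. Pooling operator, channel counts, and token dimension are illustrative choices, not the paper's exact values.

```python
import torch
import torch.nn as nn

class GeoSemFusion(nn.Module):
    """Illustrative fusion head: pool V_c' over disparity, concatenate with
    semantic features along channels, then flatten and project to visual tokens."""

    def __init__(self, geo_ch, sem_ch, token_dim):
        super().__init__()
        self.proj = nn.Linear(geo_ch + sem_ch, token_dim)

    def forward(self, v_filtered, f_sem):
        # v_filtered: (B, C', D, H, W) filtered stereo cost volume
        # f_sem:      (B, C_sem, H, W) concatenated SigLIP/DINOv2 features,
        #             assumed already resampled to the same H x W grid
        geo = v_filtered.mean(dim=2)              # pool along disparity -> (B, C', H, W)
        fused = torch.cat([geo, f_sem], dim=1)    # channel concat -> (B, C'+C_sem, H, W)
        tokens = fused.flatten(2).transpose(1, 2) # (B, H*W, C'+C_sem)
        return self.proj(tokens)                  # (B, H*W, token_dim)

# Example with illustrative sizes: 16 geometric + 1792 semantic channels -> 2048-d tokens.
fuse = GeoSemFusion(geo_ch=16, sem_ch=1792, token_dim=2048)
v_tokens = fuse(torch.randn(1, 16, 48, 16, 16), torch.randn(1, 1792, 16, 16))
print(v_tokens.shape)  # torch.Size([1, 256, 2048])
```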
2. Auxiliary Interaction-Region Depth Estimation (IRDE)
StereoVLA introduces the Interaction-Region Depth Estimation (IRDE) as an auxiliary training task, focusing depth supervision on the areas most relevant for manipulation.
Task Definition and Loss
During training, a point $(x, y)$ is sampled within the "interaction region" identified on the left image $I_L$. The system predicts the depth at that point from the fused visual-language features, with ground-truth supervision available from simulation.
Depth prediction is formulated as classification over $K$ discretized depth bins, using cross-entropy:
$\mathcal{L}_{\mathrm{depth}} = -\sum_{k=1}^{K} y_k \log p_k$
where $y$ is the one-hot target vector over bins and $p$ is the predicted bin distribution. The IRDE loss is incorporated into the global objective:
$\mathcal{L} = \lambda_{\mathrm{flow}}\,\mathcal{L}_{\mathrm{flow}} + \lambda_{\mathrm{depth}}\,\mathcal{L}_{\mathrm{depth}} + \lambda_{\mathrm{bbox}}\,\mathcal{L}_{\mathrm{bbox}} + \lambda_{\mathrm{pose}}\,\mathcal{L}_{\mathrm{pose}}$
with weighting ratios $\lambda_{\mathrm{flow}} : \lambda_{\mathrm{depth}} : \lambda_{\mathrm{bbox}} : \lambda_{\mathrm{pose}} = 5:2:2:1$ set for joint optimization.
This auxiliary depth supervision is claimed to focus the model's geometric attention on task-relevant locations, thereby accelerating convergence and improving spatial selectivity.
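A minimal sketch of the bin-classification formulation of IRDE; the bin range, bin count, and the way a per-point feature is read out of the fused tokens are assumptions, only the cross-entropy-over-discretized-bins structure follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def depth_to_bin(d_gt, d_min=0.1, d_max=2.0, num_bins=128):
    """Map ground-truth metric depth (meters) to a discrete bin index.
    Bin range and count are illustrative assumptions."""
    d = d_gt.clamp(d_min, d_max)
    idx = (d - d_min) / (d_max - d_min) * (num_bins - 1)
    return idx.round().long()

class DepthHead(nn.Module):
    """Predicts a categorical distribution over depth bins for one query point."""
    def __init__(self, token_dim=2048, num_bins=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(token_dim, 512), nn.GELU(),
                                 nn.Linear(512, num_bins))
    def forward(self, point_feat):
        return self.mlp(point_feat)                 # (B, num_bins) logits

# Example: feature read out at the sampled interaction-region point, plus its GT depth.
head = DepthHead()
point_feat = torch.randn(4, 2048)                   # (B, token_dim), assumed readout
d_gt = torch.tensor([0.35, 0.62, 0.48, 1.10])       # meters, from simulation
logits = head(point_feat)
L_depth = F.cross_entropy(logits, depth_to_bin(d_gt))  # CE against one-hot bin target
```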
3. Model Architecture and Training Regimen
Dataflow and Modules
The model pipeline consists of visual tokenization, multi-modal token joint encoding, and hierarchical action prediction:
```python
def forward(I_L, I_R, instruction, training=False):
    g = f_geom(I_L, I_R)                   # stereo geometry (filtered cost volume)
    s = f_sem(I_L)                         # monocular semantics (SigLIP + DINOv2)
    h = f_fuse(g, s)                       # fused geometric-semantic features
    v_tokens = h
    l_tokens = tokenize_text(instruction)
    hlm_out = InternLM_1_8B(v_tokens, l_tokens)   # multi-modal backbone
    action_chunk = ActionExpert(hlm_out)          # hierarchical action prediction
    if training:
        (x, y) = sample_point_in_interaction_region()
        depth_pred = DepthHead(hlm_out, (x, y))   # IRDE auxiliary head
        bbox_pred = BBoxHead(hlm_out)
        pose_pred = PoseHead(hlm_out)
        L = (L_action(action_chunk)
             + L_depth(depth_pred, d_gt(x, y))
             + L_bbox(bbox_pred, bbox_gt)
             + L_pose(pose_pred, pose_gt))
        return action_chunk, L
    return action_chunk
```
Data Sources and Optimization
Training leverages:
- Synthetic MuJoCo and Isaac-Sim dataset: 5 million 224x224 stereo trajectories (10 Hz, ±5% stereo baseline variation).
- GRIT (internet-scale) for 2D auxiliary grounding.
- Optimization via AdamW, batch size 384, 32 NVIDIA H800 GPUs, 160k steps (a weighting and optimizer sketch follows below).
- Progressive action generation: box prediction → keyframe → trajectory.
This regimen facilitates both robust geometric grounding and effective language-conditioned policy learning in complex manipulation domains.
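A brief sketch of how the 5:2:2:1 weighted objective and the AdamW setup might be wired together; the learning rate and the stand-in model are placeholders, since only the optimizer, batch size, hardware, and step count are given above.

```python
import torch

# Illustrative 5:2:2:1 weighting of the joint objective (flow:depth:bbox:pose).
LOSS_WEIGHTS = {"flow": 5.0, "depth": 2.0, "bbox": 2.0, "pose": 1.0}

def total_loss(losses):
    """Combine per-task losses (L_action/L_flow, L_depth, L_bbox, L_pose)."""
    return sum(LOSS_WEIGHTS[k] * v for k, v in losses.items())

# AdamW setup; the learning rate is a placeholder, not the paper's value,
# and `model` stands in for the full StereoVLA network.
model = torch.nn.Linear(16, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

# One illustrative training step with dummy per-task losses.
out = model(torch.randn(8, 16))
losses = {"flow": out[:, 0].pow(2).mean(), "depth": out[:, 1].pow(2).mean(),
          "bbox": out[:, 2].pow(2).mean(), "pose": out[:, 3].pow(2).mean()}
optimizer.zero_grad()
total_loss(losses).backward()
optimizer.step()
```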
4. Empirical Performance and Ablation Analyses
Real-Robot Results
StereoVLA demonstrated substantial performance improvements over baselines on more than 450 real-robot trials across general pick-and-place and geometric alignment tasks:
| Task | StereoVLA | Baselines |
|---|---|---|
| Bar @ 0°/45°/90° | ~100%/95%/100% | 60–80% |
| Small objects (1–2 cm) | ~30% (1 try) | 0% |
| Overall gain | +33% abs. | — |
This suggests the stereo-geometric approach directly addresses the spatial ambiguities found in monocular and depth-augmented VLA systems.
Ablations
- Feature Source: The filtered cost volume with semantic fusion reached 77% simulated success, versus 51% for the filtered cost volume alone and 27% for the raw, unfiltered cost volume without semantics. Adding semantics consistently yielded roughly +26 points.
- Fusion Method: Channel concatenation outperformed sequence concatenation (+3%, half compute).
- IRDE Sampling: Interaction-region depth sampling achieved 77% success versus 64% for uniform and 58% for no depth auxiliary.
Qualitative Insights
StereoVLA more reliably aligns the gripper with challenging object geometries (e.g., bar orientation) and attends to precise 2D locations necessary for manipulating small and medium objects.
5. Robustness to Camera Pose Variation
StereoVLA was evaluated under camera-pose perturbations using datasets with small, medium, and large spherical-shell-based pose changes.
| Model Type | Small Pose | Medium Pose | Large Pose |
|---|---|---|---|
| SpatialVLA-D (1-view) | 24.6% | 13.7% | 6.8% |
| Front+wrist π₀.₅ | 64.3% | 56.5% | 51.6% |
| Front+wrist GraspVLA | 71.3% | 63.4% | 54.8% |
| Front+side GraspVLA | 82.5% | 55.7% | 24.1% |
| StereoVLA | 79.3% | 71.9% | 61.3% |
StereoVLA exhibited the highest robustness under medium and large extrinsic perturbations. The results indicate that stereo-derived parallax cues remain resilient to camera/object pose variation, whereas multi-view approaches deteriorate as the cross-view geometry they depend on is perturbed.
A plausible implication is that stereo fusion provides a consistent geometric reference irrespective of modest to large camera-pose shifts, maintaining action reliability.
6. Theoretical and Practical Implications
Binocular disparity, as implemented in StereoVLA, delivers dense spatial gradients crucial for resolving depth ambiguities, especially in cluttered or visually complex environments. Integration with strong semantic priors (semantic-rich foundation models) allows precise alignment between geometric cues and language referents. The IRDE auxiliary further focuses supervised attention on task-relevant regions, improving the convergence rate and robustness of spatial reasoning.
Overall, StereoVLA demonstrates that mature stereo representations, when coupled with vision-language modeling and auxiliary region-focused depth supervision, yield state-of-the-art manipulation accuracy and reliability for generalist robotic systems, particularly under real-world sensor and pose variations (Deng et al., 26 Dec 2025).