Any3D-VLA: 3D Fusion for VLA Models

Updated 7 February 2026
  • The paper introduces a modular approach that infuses explicit 3D geometric reasoning into VLA models to overcome 2D spatial limitations.
  • It employs a hybrid point-cloud strategy from simulated, real, and estimated sources to achieve robust sim-to-real transfer.
  • Empirical results demonstrate significant improvements in zero-shot robustness, manipulation accuracy, and generalization across diverse environments.

Any3D-VLA is a modular approach for integrating explicit 3D geometric reasoning into Vision-Language-Action (VLA) models through diverse point-cloud representations. It is designed to address the spatial reasoning limitations of 2D-only VLA architectures in domains requiring precise manipulation, occlusion handling, and viewpoint invariance. Any3D-VLA achieves robust sim-to-real transfer by training on hybrid point clouds from simulated, real-sensor, and model-estimated sources, and by performing lightweight patch-wise fusion into strong pre-trained 2D ViT token representations. This yields a domain-agnostic 3D–2D fused perceptual pipeline and delivers tangible gains in zero-shot robustness, manipulation accuracy, and generalization across diverse environments (Fan et al., 31 Jan 2026).

1. Motivation and Problem Statement

While modern VLAs have demonstrated impressive capabilities in following natural language instructions and executing manipulation tasks, most models rely exclusively on 2D image backbones. This design causes notable failure cases in scenes involving fine objects, substantial occlusion, complex depth-scale variations, or challenging viewpoints. Standard workarounds, such as appending a monocular depth channel or leveraging implicit multi-view priors, do not effectively preserve metric information or spatial topology.

Explicit injection of 3D geometry via sparse point clouds is proposed as a solution. Compared with 2D (or implicit 3D) representations, point clouds preserve true metric scale and occlusion and topology cues, and they make spatial constraints easy to apply. However, scaling 3D-augmented VLAs is challenged by the scarcity of perfect point clouds across all settings and by the domain gap among simulator-derived, sensor-derived, and model-estimated depth (Fan et al., 31 Jan 2026).

2. Architecture and Methodology

Any3D-VLA operates as a plug-in module for transformer-based VLA backbones. Its core pipeline comprises three stages: (i) lifting RGB(D) data into point clouds, (ii) encoding these clouds with a domain-agnostic 3D backbone, and (iii) patch-wise fusion into the 2D representation.

2.1 Point-Cloud Acquisition

  • Simulator-ground-truth: Depth rendered in simulation (e.g., IsaacSim, MuJoCo) is back-projected for perfect metric point clouds (256 × 256 resolution).
  • Real-sensor: Raw depth from hardware sensors (Intel RealSense D435) is used, typically noisier and with missing data.
  • Model-estimated: State-of-the-art single-frame depth estimators (UniDepthV2, Depth Anything 3) and 3D reconstruction models (MapAnything) provide estimated depths when real or simulated data are insufficient.
  • The back-projection formula for a pixel $(u, v, d)$ is:

x = \frac{(u - c_x)\,d}{f_x},\quad y = \frac{(v - c_y)\,d}{f_y},\quad z = d
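
A minimal NumPy sketch of this back-projection, assuming a metric depth map and pinhole intrinsics $f_x, f_y, c_x, c_y$ (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def backproject_depth(depth, fx, fy, cx, cy):
    """Lift an (H, W) metric depth map to an (N, 3) point cloud via the pinhole model."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel grid: u = column, v = row
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # discard invalid (zero-depth) pixels
```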

2.2 3D Compression and Domain-Agnostic Encoding

  • Raw point clouds (30–60k points) are voxel-downsampled (1 cm grid) to 3–8k points, each carrying attributes $(x, y, z,\; r, g, b,\; n_x, n_y, n_z)$, where the surface normals are fit locally; a minimal downsampling sketch follows this list.
  • Compressed clouds are encoded using a pre-trained 3D backbone (Concerto), freezing most weights and finetuning only the last 4 sparse-conv blocks, to produce per-point features $\mathbf{f}_i^{\rm 3D} \in \mathbb{R}^{1728}$.
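
A minimal NumPy sketch of the voxel downsampling step, assuming per-point color and normal attributes have already been computed; the 1 cm grid matches the text, everything else is illustrative:

```python
import numpy as np

def voxel_downsample(points, attrs, voxel_size=0.01):
    """Average XYZ coordinates and attached attributes of all points sharing a voxel."""
    keys = np.floor(points / voxel_size).astype(np.int64)            # integer voxel indices
    _, inverse, counts = np.unique(keys, axis=0,
                                   return_inverse=True, return_counts=True)
    inverse = inverse.reshape(-1)

    def pool(values):
        pooled = np.zeros((counts.size, values.shape[1]), dtype=np.float64)
        np.add.at(pooled, inverse, values)                            # scatter-sum per voxel
        return pooled / counts[:, None]                               # voxel-wise mean

    return pool(points), pool(attrs)                                  # attrs = (r, g, b, nx, ny, nz)
```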

2.3 Patch-Wise Alignment and Multi-Modal Fusion

  • 2D images are encoded with a frozen dual-ViT encoder (DINOv2, 1024-d; SigLIP, 1152-d), concatenated to give 2176-d patch tokens.
  • Each 3D point is projected to its corresponding 2D ViT patch index, and for each patch $j$ the associated 3D features are averaged (or replaced by a learnable empty token if none are assigned).
  • Features are linearly projected to match the 2D token dimension, yielding $\mathbf{h}_j^{\rm 3D}$.
  • A gated residual fusion combines modalities:

\mathbf{h}_j^{\rm fused} = \mathbf{h}_j^{\rm 2D} + \sigma(g)\,\mathrm{LayerNorm}\big(\mathrm{MLP}([\mathbf{h}_j^{\rm 2D};\, \mathbf{h}_j^{\rm 3D}])\big)

where $g$ is a learnable gate (initialized so that $\sigma(g) \ll 1$), controlling the 3D corrective offset to the high-quality 2D representations; an illustrative sketch of the patch-wise averaging and fusion follows this list.

  • The fused sequence, along with language and proprioception tokens, is processed by the VLA transformer for action policy inference.
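
An illustrative PyTorch sketch of the patch-wise averaging and gated residual fusion described above, assuming 2176-d fused tokens as in the text; module and argument names are assumptions rather than the paper's implementation:

```python
import torch
import torch.nn as nn

class GatedResidualFusion(nn.Module):
    """Add projected 3D patch features to 2D ViT tokens as a gated corrective offset."""
    def __init__(self, dim_2d=2176, dim_3d=2176):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim_2d + dim_3d, dim_2d), nn.GELU(),
                                 nn.Linear(dim_2d, dim_2d))
        self.norm = nn.LayerNorm(dim_2d)
        self.gate = nn.Parameter(torch.tensor(-4.0))  # sigmoid(-4) ~ 0.018, so sigma(g) << 1 at init

    def forward(self, h2d, h3d):
        # h2d, h3d: (num_patches, dim) per image
        offset = self.norm(self.mlp(torch.cat([h2d, h3d], dim=-1)))
        return h2d + torch.sigmoid(self.gate) * offset

def average_per_patch(point_feats, patch_idx, num_patches, empty_token):
    """Mean-pool projected per-point 3D features into their 2D patch slots.
    point_feats: (num_points, dim); patch_idx: (num_points,) long tensor of patch indices."""
    dim = point_feats.shape[1]
    sums = torch.zeros(num_patches, dim).index_add_(0, patch_idx, point_feats)
    counts = torch.zeros(num_patches).index_add_(0, patch_idx, torch.ones(patch_idx.shape[0]))
    h3d = sums / counts.clamp(min=1).unsqueeze(1)
    # patches with no projected points receive the learnable empty token
    return torch.where((counts == 0).unsqueeze(1), empty_token.expand_as(h3d), h3d)
```

Initializing the gate at a negative value keeps the fused tokens close to the pre-trained 2D representation early in training, which matches the gated-offset intent described above.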

3. Optimization and Training Objectives

Any3D-VLA adopts standard sequence-generation and flow-matching losses for end-to-end optimization, without introducing explicit domain adaptation or 3D reconstruction losses.

3.1 Vision-Language Model (VLM) Loss

  • For each sample, the VLM predicts $N_{\rm bbox}$ bounding-box tokens (and $N_{\rm gpose}$ grasp-pose tokens on synthetic data):

\mathcal{L}_{S2} = -\sum_{n=1}^{N_{\rm bbox}} \log P_\theta\big(y_{{\rm bbox},n} \mid \mathbf{h}_{\rm fused}, x_{\rm text}, y_{{\rm bbox},<n}\big) - \mathbb{I}_{\rm synth} \sum_{n=1}^{N_{\rm gpose}} \log P_\theta\big(y_{{\rm gpose},n} \mid \mathbf{h}_{\rm fused}, x_{\rm text}, y_{\rm bbox}, y_{{\rm gpose},<n}\big)

where $\mathbb{I}_{\rm synth}$ indicates synthetic data.
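
A minimal PyTorch sketch of this objective, assuming the VLM exposes per-token logits over its vocabulary (tensor shapes and names are illustrative):

```python
import torch.nn.functional as F

def vlm_loss(bbox_logits, bbox_targets, gpose_logits, gpose_targets, is_synth):
    """Autoregressive NLL over bounding-box tokens, plus grasp-pose tokens on synthetic samples.
    *_logits: (num_tokens, vocab_size); *_targets: (num_tokens,) token ids; is_synth: bool."""
    loss = F.cross_entropy(bbox_logits, bbox_targets, reduction="sum")
    if is_synth:
        loss = loss + F.cross_entropy(gpose_logits, gpose_targets, reduction="sum")
    return loss
```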

3.2 Flow-Matching Loss

  • On synthetic data, a conditional action expert matches its learned vector field $v_t$ to the ground-truth flow field $u_t$:

\mathcal{L}_{S1} = \mathbb{I}_{\rm synth}\, \mathbb{E}_{t \sim \mathcal{U}[0,1]} \big\| v_t(A_t, \mathbf{h}_{\rm fused}, y_{\rm bbox}, y_{\rm gpose}) - u_t(A_t, A_0) \big\|_F^2

  • The full objective is

\mathcal{L}_{\rm total} = \lambda_1 \mathcal{L}_{S2} + \lambda_2 \mathcal{L}_{S1}

with $\lambda_1 = \lambda_2 = 1.0$.
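
A sketch of how the flow-matching term could be computed, assuming a rectified-flow-style linear path between a noise sample and the ground-truth action chunk; the paper's exact path parameterization and the action expert's signature are treated as assumptions here:

```python
import torch

def flow_matching_loss(action_expert, h_fused, y_bbox, y_gpose, a_gt):
    """Conditional flow matching on synthetic data: regress the predicted vector field v_t
    onto the linear-path target velocity at a random time t ~ U[0, 1]."""
    eps = torch.randn_like(a_gt)                             # noise endpoint
    t = torch.rand(a_gt.shape[0], 1, 1, device=a_gt.device)  # one t per action chunk in the batch
    a_t = (1.0 - t) * eps + t * a_gt                         # interpolated action A_t
    u_t = a_gt - eps                                         # target velocity along the linear path
    v_t = action_expert(a_t, t, h_fused, y_bbox, y_gpose)    # predicted vector field
    return ((v_t - u_t) ** 2).sum(dim=(-2, -1)).mean()       # squared Frobenius norm, batch mean

# Total objective with lambda_1 = lambda_2 = 1.0 (flow-matching term only on synthetic batches):
# loss_total = vlm_loss(...) + flow_matching_loss(...)
```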

Training achieves domain robustness by exposing the model to hybrid point clouds with varying noise, scale, and source characteristics during pre-training (Fan et al., 31 Jan 2026).

4. Training Datasets and Experimental Protocol

4.1 Multisource Data Regimen

  • Synthetic Pre-training: Objaverse LVIS subset (290 categories, 10,680 instances) in IsaacSim/MuJoCo with randomized clutter; expert trajectories; four interleaved depth sources: Simulator (30%), UniDepthV2 (30%), Depth Anything 3 (20%), MapAnything (20%).
  • Real-World Fine-Tuning: Franka Panda + Intel RealSense D435, 100 demonstrations per task for previously unseen real tasks (e.g., flower into vase, condiment cup placement).

4.2 Evaluation Metrics and Baselines

  • Baselines include π₀.₅ and GraspVLA (2D-only VLAs) and SpatialVLA (implicit 3D via depth-informed backbones).
  • Evaluation metrics:
    • Single-Trial Success Rate (SR)
    • Test SR (up to three attempts)
    • Grasp SR (any object)
    • Zero-shot real-world SR across standard, scale/shape, viewpoint, and appearance-deprived challenges
    • Task success after post-training
    • LIBERO and CALVIN benchmark metrics

4.3 Empirical Performance

  • In simulation with perfect depth: point-cloud–2D fusion reaches a Single-Trial SR of 61.1%, surpassing the best baseline (56.8%).
  • Zero-shot real: Any3D-VLA attains 62.5% overall SR (Setting 2 + Depth Anything 3) versus 33.3% for SpatialVLA.
  • Post-train on new real tasks: 93.3% success rate versus 53.3% for the strongest baseline.
  • LIBERO: +13.9% over GraspVLA; CALVIN: +0.71 tasks in sequence.
  • These results indicate superior spatial robustness, improved handling of appearance-deprived and occluded scenes, and enhanced viewpoint invariance (Fan et al., 31 Jan 2026).

5. Technical Innovations and Distinctions

Any3D-VLA introduces several notable techniques that distinguish it from prior work:

  1. Hybrid Point-Cloud Pre-training: By interleaving simulator, sensor, and model-estimated point clouds, Any3D-VLA enables robust generalization and reduces overfitting to a specific domain or sensing modality.
  2. Domain-Agnostic 3D Backbones: The use of Concerto as a unified 3D encoder (frozen with last layers finetuned) allows for shared 3D feature extraction across variable input sources.
  3. Patch-Wise Gated Residual Fusion: Each 3D patch feature is treated as a corrective offset to the corresponding 2D token, modulated by a learnable gate $\sigma(g)$. This minimizes adverse impacts on strong 2D features while actively correcting failure modes in challenging scenes.
  4. Lightweight Integration: The architecture introduces minimal overhead and is broadly applicable to standard transformer-based VLAs. The backbone policy and autoregressive action inference remain unchanged, streamlining adoption and integration.
  5. No Explicit Domain Adaptation Losses: Robustness is achieved through input diversity rather than adversarial or reconstruction-based domain-adaptation objectives.

6. Comparative Analysis and Implications

When compared to 2D-only or implicitly 3D VLAs, Any3D-VLA consistently outperforms baselines in both synthetic and real settings, particularly in spatially ambiguous or out-of-distribution scenes. Exposure to hybrid point clouds during training acts as strong sim-to-real data augmentation, reducing the need for expensive sensor hardware at test time.

A plausible implication is that future VLA systems in robotics, AR, and related embodied reasoning applications could benefit from modular, domain-agnostic 3D fusion designs even with partial or noisy depth. This approach offers a principled mechanism to endow foundation models with improved spatial generalization, topological awareness, and viewpoint robustness.

7. Broader Impact and Future Directions

Any3D-VLA sets a precedent for practical and scalable integration of explicit 3D representations in VLA models. Its framework is extensible to additional scene modalities, including tactile or force sensing, and potentially to active 3D reconstruction in long-horizon tasks. Applications are anticipated across household and warehouse robotics, digital-twin environments, and any setting where compact, robust spatial reasoning is required under wide domain variability (Fan et al., 31 Jan 2026). Future research may build on Any3D-VLA's approach to enable closed-loop control, direct dense scene reconstruction, or even fusion with generative world-model components.
