Any3D-VLA: 3D Fusion for VLA Models

Updated 7 February 2026
  • The paper introduces a modular approach that infuses explicit 3D geometric reasoning into VLA models to overcome 2D spatial limitations.
  • It employs a hybrid point-cloud strategy from simulated, real, and estimated sources to achieve robust sim-to-real transfer.
  • Empirical results demonstrate significant improvements in zero-shot robustness, manipulation accuracy, and generalization across diverse environments.

Any3D-VLA is a modular approach for integrating explicit 3D geometric reasoning into Vision-Language-Action (VLA) models through diverse point-cloud representations. It is designed to address the spatial reasoning limitations of 2D-only VLA architectures in domains requiring precise manipulation, occlusion handling, and viewpoint invariance. Any3D-VLA achieves robust sim-to-real transfer by training on hybrid point clouds from simulated, real-sensor, and model-estimated sources, and by performing lightweight patch-wise fusion into strong pre-trained 2D ViT token representations. This yields a domain-agnostic 3D–2D fused perceptual pipeline and delivers tangible gains in zero-shot robustness, manipulation accuracy, and generalization across diverse environments (Fan et al., 31 Jan 2026).

1. Motivation and Problem Statement

While modern VLAs have demonstrated impressive capabilities in following natural language instructions and executing manipulation tasks, most models rely exclusively on 2D image backbones. This design causes notable failure cases in scenes involving fine objects, substantial occlusion, complex depth-scale variations, or challenging viewpoints. Standard workarounds, such as appending a monocular depth channel or leveraging implicit multi-view priors, do not effectively preserve metric information or spatial topology.

Explicit injection of 3D geometry via sparse point clouds is proposed as a solution. Compared with 2D (or implicit 3D) representations, point clouds preserve true metric scale and occlusion and topology cues, and they make spatial constraints easy to apply. However, scaling 3D-augmented VLAs is challenged by the scarcity of perfect point clouds across all settings and by the domain gap among simulator-derived, sensor-derived, and model-estimated depth (Fan et al., 31 Jan 2026).

2. Architecture and Methodology

Any3D-VLA operates as a plug-in module for transformer-based VLA backbones. Its core pipeline comprises three stages: (i) lifting RGB(D) data into point clouds, (ii) encoding these clouds with a domain-agnostic 3D backbone, and (iii) patch-wise fusion into the 2D representation.

2.1 Point-Cloud Acquisition

  • Simulator-ground-truth: Depth rendered in simulation (e.g., IsaacSim, MuJoCo) is back-projected for perfect metric point clouds (256 × 256 resolution).
  • Real-sensor: Raw depth from hardware sensors (Intel RealSense D435) is used, typically noisier and with missing data.
  • Model-estimated: State-of-the-art single-frame depth estimators (UniDepthV2, Depth Anything 3) and 3D reconstruction models (MapAnything) provide estimated depths when real or simulated data are insufficient.
  • The back-projection formula for a pixel $(u, v, d)$ is:

x = \frac{(u - c_x)\,d}{f_x},\quad y = \frac{(v - c_y)\,d}{f_y},\quad z = d
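
A minimal NumPy sketch of this back-projection, assuming a metric depth map and pinhole intrinsics $f_x, f_y, c_x, c_y$ (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def backproject_depth(depth, fx, fy, cx, cy):
    """Lift an (H, W) metric depth map to an (N, 3) point cloud via the pinhole model."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel grid: u = column, v = row
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # discard invalid (zero-depth) pixels
```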

2.2 3D Compression and Domain-Agnostic Encoding

  • Raw point clouds (30–60k points) are voxel-downsampled (1 cm grid) to 3–8k points, each carrying attributes $(x, y, z,\; r, g, b,\; n_x, n_y, n_z)$, where the surface normals are fit locally; a minimal downsampling sketch follows this list.
  • Compressed clouds are encoded using a pre-trained 3D backbone (Concerto), freezing most weights and finetuning only the last 4 sparse-conv blocks, to produce per-point features $\mathbf{f}_i^{\rm 3D} \in \mathbb{R}^{1728}$.
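
A minimal NumPy sketch of the voxel downsampling step, assuming per-point color and normal attributes have already been computed; the 1 cm grid matches the text, everything else is illustrative:

```python
import numpy as np

def voxel_downsample(points, attrs, voxel_size=0.01):
    """Average XYZ coordinates and attached attributes of all points sharing a voxel."""
    keys = np.floor(points / voxel_size).astype(np.int64)            # integer voxel indices
    _, inverse, counts = np.unique(keys, axis=0,
                                   return_inverse=True, return_counts=True)
    inverse = inverse.reshape(-1)

    def pool(values):
        pooled = np.zeros((counts.size, values.shape[1]), dtype=np.float64)
        np.add.at(pooled, inverse, values)                            # scatter-sum per voxel
        return pooled / counts[:, None]                               # voxel-wise mean

    return pool(points), pool(attrs)                                  # attrs = (r, g, b, nx, ny, nz)
```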

2.3 Patch-Wise Alignment and Multi-Modal Fusion

  • 2D images are encoded with a frozen dual-ViT encoder (DINOv2, 1024-d; SigLIP, 1152-d), concatenated to give 2176-d patch tokens.
  • Each 3D point is projected to its corresponding 2D ViT patch index, and for each patch $j$ the associated 3D features are averaged (or replaced by a learnable empty token if none are assigned).
  • Features are linearly projected to match the 2D token dimension, yielding $\mathbf{h}_j^{\rm 3D}$.
  • A gated residual fusion combines modalities:

\mathbf{h}_j^{\rm fused} = \mathbf{h}_j^{\rm 2D} + \sigma(g)\,\mathrm{LayerNorm}\big(\mathrm{MLP}([\mathbf{h}_j^{\rm 2D};\, \mathbf{h}_j^{\rm 3D}])\big)

where $g$ is a learnable gate (initialized so that $\sigma(g) \ll 1$), controlling the 3D corrective offset to the high-quality 2D representations; an illustrative sketch of the patch-wise averaging and fusion follows this list.

  • The fused sequence, along with language and proprioception tokens, is processed by the VLA transformer for action policy inference.
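
An illustrative PyTorch sketch of the patch-wise averaging and gated residual fusion described above, assuming 2176-d fused tokens as in the text; module and argument names are assumptions rather than the paper's implementation:

```python
import torch
import torch.nn as nn

class GatedResidualFusion(nn.Module):
    """Add projected 3D patch features to 2D ViT tokens as a gated corrective offset."""
    def __init__(self, dim_2d=2176, dim_3d=2176):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim_2d + dim_3d, dim_2d), nn.GELU(),
                                 nn.Linear(dim_2d, dim_2d))
        self.norm = nn.LayerNorm(dim_2d)
        self.gate = nn.Parameter(torch.tensor(-4.0))  # sigmoid(-4) ~ 0.018, so sigma(g) << 1 at init

    def forward(self, h2d, h3d):
        # h2d, h3d: (num_patches, dim) per image
        offset = self.norm(self.mlp(torch.cat([h2d, h3d], dim=-1)))
        return h2d + torch.sigmoid(self.gate) * offset

def average_per_patch(point_feats, patch_idx, num_patches, empty_token):
    """Mean-pool projected per-point 3D features into their 2D patch slots.
    point_feats: (num_points, dim); patch_idx: (num_points,) long tensor of patch indices."""
    dim = point_feats.shape[1]
    sums = torch.zeros(num_patches, dim).index_add_(0, patch_idx, point_feats)
    counts = torch.zeros(num_patches).index_add_(0, patch_idx, torch.ones(patch_idx.shape[0]))
    h3d = sums / counts.clamp(min=1).unsqueeze(1)
    # patches with no projected points receive the learnable empty token
    return torch.where((counts == 0).unsqueeze(1), empty_token.expand_as(h3d), h3d)
```

Initializing the gate at a negative value keeps the fused tokens close to the pre-trained 2D representation early in training, which matches the gated-offset intent described above.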

3. Optimization and Training Objectives

Any3D-VLA adopts standard sequence-generation and flow-matching losses for end-to-end optimization, without introducing explicit domain adaptation or 3D reconstruction losses.

3.1 Vision-Language Model (VLM) Loss

  • For each sample, the VLM predicts $N_{\rm bbox}$ bounding-box tokens (and $N_{\rm gpose}$ grasp-pose tokens on synthetic data):

\mathcal{L}_{S2} = -\sum_{n=1}^{N_{\rm bbox}} \log P_\theta\big(y_{{\rm bbox},n} \mid \mathbf{h}_{\rm fused}, x_{\rm text}, y_{{\rm bbox},<n}\big) - \mathbb{I}_{\rm synth} \sum_{n=1}^{N_{\rm gpose}} \log P_\theta\big(y_{{\rm gpose},n} \mid \mathbf{h}_{\rm fused}, x_{\rm text}, y_{\rm bbox}, y_{{\rm gpose},<n}\big)

where $\mathbb{I}_{\rm synth}$ indicates synthetic data.
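
A minimal PyTorch sketch of this objective, assuming the VLM exposes per-token logits over its vocabulary (tensor shapes and names are illustrative):

```python
import torch.nn.functional as F

def vlm_loss(bbox_logits, bbox_targets, gpose_logits, gpose_targets, is_synth):
    """Autoregressive NLL over bounding-box tokens, plus grasp-pose tokens on synthetic samples.
    *_logits: (num_tokens, vocab_size); *_targets: (num_tokens,) token ids; is_synth: bool."""
    loss = F.cross_entropy(bbox_logits, bbox_targets, reduction="sum")
    if is_synth:
        loss = loss + F.cross_entropy(gpose_logits, gpose_targets, reduction="sum")
    return loss
```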

3.2 Flow-Matching Loss

  • On synthetic data, a conditional action expert matches its learned vector field $v_t$ to the ground-truth flow field $u_t$:

\mathcal{L}_{S1} = \mathbb{I}_{\rm synth}\, \mathbb{E}_{t \sim \mathcal{U}[0,1]} \big\| v_t(A_t, \mathbf{h}_{\rm fused}, y_{\rm bbox}, y_{\rm gpose}) - u_t(A_t, A_0) \big\|_F^2

  • The full objective is

\mathcal{L}_{\rm total} = \lambda_1 \mathcal{L}_{S2} + \lambda_2 \mathcal{L}_{S1}

with $\lambda_1 = \lambda_2 = 1.0$.
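
A sketch of how the flow-matching term could be computed, assuming a rectified-flow-style linear path between a noise sample and the ground-truth action chunk; the paper's exact path parameterization and the action expert's signature are treated as assumptions here:

```python
import torch

def flow_matching_loss(action_expert, h_fused, y_bbox, y_gpose, a_gt):
    """Conditional flow matching on synthetic data: regress the predicted vector field v_t
    onto the linear-path target velocity at a random time t ~ U[0, 1]."""
    eps = torch.randn_like(a_gt)                             # noise endpoint
    t = torch.rand(a_gt.shape[0], 1, 1, device=a_gt.device)  # one t per action chunk in the batch
    a_t = (1.0 - t) * eps + t * a_gt                         # interpolated action A_t
    u_t = a_gt - eps                                         # target velocity along the linear path
    v_t = action_expert(a_t, t, h_fused, y_bbox, y_gpose)    # predicted vector field
    return ((v_t - u_t) ** 2).sum(dim=(-2, -1)).mean()       # squared Frobenius norm, batch mean

# Total objective with lambda_1 = lambda_2 = 1.0 (flow-matching term only on synthetic batches):
# loss_total = vlm_loss(...) + flow_matching_loss(...)
```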

Training achieves domain robustness by exposing the model to hybrid point clouds with varying noise, scale, and source characteristics during pre-training (Fan et al., 31 Jan 2026).

4. Training Datasets and Experimental Protocol

4.1 Multisource Data Regimen

  • Synthetic Pre-training: Objaverse LVIS subset (290 categories, 10,680 instances) in IsaacSim/MuJoCo with randomized clutter; expert trajectories; four interleaved depth sources: Simulator (30%), UniDepthV2 (30%), Depth Anything 3 (20%), MapAnything (20%).
  • Real-World Fine-Tuning: Franka Panda + Intel RealSense D435, 100 demonstrations per task for previously unseen real tasks (e.g., flower into vase, condiment cup placement).

4.2 Evaluation Metrics and Baselines

  • Baselines include π₀.₅ and GraspVLA (2D-only VLAs) and SpatialVLA (implicit 3D via depth-informed backbones).
  • Evaluation metrics:
    • Single-Trial Success Rate (SR)
    • Test SR (up to three attempts)
    • Grasp SR (any object)
    • Zero-shot real-world SR across standard, scale/shape, viewpoint, and appearance-deprived challenges
    • Task success after post-training
    • LIBERO and CALVIN benchmark metrics

4.3 Empirical Performance

  • In simulation with perfect depth: point-cloud–2D fusion reaches a Single-Trial SR of 61.1%, surpassing the best baseline (56.8%).
  • Zero-shot real: Any3D-VLA attains 62.5% overall SR (Setting 2 + Depth Anything 3) versus 33.3% for SpatialVLA.
  • Post-train on new real tasks: 93.3% success rate versus 53.3% for the strongest baseline.
  • LIBERO: +13.9% over GraspVLA; CALVIN: +0.71 tasks in sequence.
  • These results indicate superior spatial robustness, improved handling of appearance-deprived and occluded scenes, and enhanced viewpoint invariance (Fan et al., 31 Jan 2026).

5. Technical Innovations and Distinctions

Any3D-VLA introduces several notable techniques that distinguish it from prior work:

  1. Hybrid Point-Cloud Pre-training: By interleaving simulator, sensor, and model-estimated point clouds, Any3D-VLA enables robust generalization and reduces overfitting to a specific domain or sensing modality.
  2. Domain-Agnostic 3D Backbones: The use of Concerto as a unified 3D encoder (frozen with last layers finetuned) allows for shared 3D feature extraction across variable input sources.
  3. Patch-Wise Gated Residual Fusion: Each 3D patch feature is treated as a corrective offset to the corresponding 2D token, modulated by a learnable gate $\sigma(g)$. This minimizes adverse impacts on strong 2D features while actively correcting failure modes in challenging scenes.
  4. Lightweight Integration: The architecture introduces minimal overhead and is broadly applicable to standard transformer-based VLAs. The backbone policy and autoregressive action inference remain unchanged, streamlining adoption and integration.
  5. No Explicit Domain Adaptation Losses: Robustness is achieved through input diversity rather than adversarial or reconstruction-based domain-adaptation objectives.

6. Comparative Analysis and Implications

When compared to 2D-only or implicitly 3D VLAs, Any3D-VLA consistently outperforms baselines in both synthetic and real settings, particularly in spatially ambiguous or out-of-distribution scenes. Exposure to hybrid point clouds during training acts as strong sim-to-real data augmentation, reducing the need for expensive sensor hardware at test time.

A plausible implication is that future VLA systems in robotics, AR, and related embodied reasoning applications could benefit from modular, domain-agnostic 3D fusion designs even with partial or noisy depth. This approach offers a principled mechanism to endow foundation models with improved spatial generalization, topological awareness, and viewpoint robustness.

7. Broader Impact and Future Directions

Any3D-VLA sets a precedent for practical and scalable integration of explicit 3D representations in VLA models. Its framework is extensible to additional scene modalities, including tactile or force sensing, and potentially to active 3D reconstruction in long-horizon tasks. Applications are anticipated across household and warehouse robotics, digital-twin environments, and any setting where compact, robust spatial reasoning is required under wide domain variability (Fan et al., 31 Jan 2026). Future research may build on Any3D-VLA's approach to enable closed-loop control, direct dense scene reconstruction, or even fusion with generative world-model components.
