RaySt3R: Predicting Novel Depth Maps for Zero-Shot Object Completion (2506.05285v1)

Published 5 Jun 2025 in cs.CV

Abstract: 3D shape completion has broad applications in robotics, digital twin reconstruction, and extended reality (XR). Although recent advances in 3D object and scene completion have achieved impressive results, existing methods lack 3D consistency, are computationally expensive, and struggle to capture sharp object boundaries. Our work (RaySt3R) addresses these limitations by recasting 3D shape completion as a novel view synthesis problem. Specifically, given a single RGB-D image and a novel viewpoint (encoded as a collection of query rays), we train a feedforward transformer to predict depth maps, object masks, and per-pixel confidence scores for those query rays. RaySt3R fuses these predictions across multiple query views to reconstruct complete 3D shapes. We evaluate RaySt3R on synthetic and real-world datasets, and observe it achieves state-of-the-art performance, outperforming the baselines on all datasets by up to 44% in 3D chamfer distance. Project page: https://rayst3r.github.io

Summary

  • The paper introduces a novel transformer-based approach to predict depth maps and masks for zero-shot 3D shape completion from a single RGB-D image.
  • It leverages a large synthetic dataset and DINOv2 features, achieving up to 44% lower Chamfer Distance and improved F1 scores compared to state-of-the-art baselines.
  • Its efficient view merging strategy aggregates predictions from multiple novel viewpoints, enabling robust, real-time 3D reconstruction in robotics and XR applications.

"RaySt3R: Predicting Novel Depth Maps for Zero-Shot Object Completion" (2506.05285) presents a novel method for 3D shape completion from a single foreground-masked RGB-D image, particularly targeting multi-object cluttered real-world scenes relevant to robotics and extended reality (XR) applications. The core idea is to reframe 3D shape completion as a novel view synthesis problem, where a model predicts depth maps and masks for new viewpoints, and these predictions are then aggregated to reconstruct the full 3D geometry.

Existing methods for 3D shape completion often struggle with 3D consistency, high computational cost, capturing sharp details, or handling complex multi-object scenes in a zero-shot manner. RaySt3R addresses these by training a feedforward transformer to predict view-specific depth, object mask, and per-pixel confidence scores.

Approach and Architecture

The RaySt3R network takes as input a single RGB-D image with a foreground mask and a query viewpoint (encoded as a ray map). The input depth map is unprojected to a point map $X^{\text{input}}$. Both the input point map (transformed to the query camera frame as $X^{\text{context}}$) and the query ray map $R$ are processed through self-attention layers to obtain features $F^{\text{point\_map}}$ and $F^{\text{ray}}$. Visual features $F^{\text{DINO}}$ are extracted from the masked input RGB image using a frozen DINOv2 (2304.07193) encoder, concatenating features from multiple intermediate layers (specifically 4, 11, 17, 23) and projecting them to the ViT token size. Cross-attention layers combine the ray features $F^{\text{ray}}$ (as queries) with the concatenated point map and DINOv2 features $\text{concat}(F^{\text{point\_map}}, F^{\text{DINO}})$ (as keys). Finally, two separate DPT heads (2103.13413) predict the depth map together with a per-pixel confidence score, and the object mask for the queried view.
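This description maps onto a standard encoder/cross-attention-decoder layout. The following is a minimal PyTorch sketch of that structure, not the authors' implementation: the module depths, token dimensions, the `mlp_head` stand-in for the DPT heads, and all interface names are assumptions, and DINOv2 feature extraction is assumed to happen outside the module.

```python
import torch
import torch.nn as nn

def mlp_head(dim, out_channels):
    # stand-in for the paper's DPT heads (2103.13413): a simple per-token MLP
    return nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, out_channels))

class RaySt3RSketch(nn.Module):
    def __init__(self, dim=1024, nhead=16, dino_dim=1024, n_dino_layers=4):
        super().__init__()
        enc = lambda n: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead, batch_first=True), num_layers=n)
        self.point_encoder = enc(6)   # self-attention over context point-map tokens
        self.ray_encoder = enc(6)     # self-attention over query ray-map tokens
        # project concatenated intermediate DINOv2 features to the ViT token size
        self.dino_proj = nn.Linear(dino_dim * n_dino_layers, dim)
        self.decoder = nn.TransformerDecoder(  # ray tokens attend to point-map + DINO tokens
            nn.TransformerDecoderLayer(dim, nhead, batch_first=True), num_layers=12)
        self.depth_head = mlp_head(dim, 2)    # depth + raw confidence C'
        self.mask_head = mlp_head(dim, 1)     # object-mask logits

    def forward(self, context_points, ray_map, dino_feats):
        f_point = self.point_encoder(context_points)   # F^{point_map}
        f_ray = self.ray_encoder(ray_map)               # F^{ray}
        f_dino = self.dino_proj(dino_feats)             # F^{DINO}
        memory = torch.cat([f_point, f_dino], dim=1)    # keys/values for cross-attention
        tokens = self.decoder(f_ray, memory)            # queries come from the ray map
        depth, raw_conf = self.depth_head(tokens).unbind(dim=-1)
        conf = 1.0 + torch.exp(raw_conf)                # C = 1 + exp(C') > 1
        mask_logits = self.mask_head(tokens).squeeze(-1)
        return depth, conf, mask_logits
```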

The architecture draws inspiration from multi-view transformers such as DUSt3R (2403.15707), but adapts the design to single-view shape completion by replacing the second input image with a query ray map.

Training

RaySt3R is trained on a large-scale synthetic dataset comprising 1.1 million scenes and 12 million views, curated from FoundationPose (2403.17823) and OctMAE (2402.16952) data. The training objective involves predicting confidence-aware depth maps and object masks for novel views. The total loss is a weighted sum of a confidence-aware depth loss $\mathcal{L}_\text{depth}$ and a binary cross-entropy mask loss $\mathcal{L}_\text{mask}$:

$\mathcal{L}_\text{total} = \mathcal{L}_\text{depth} + \lambda_\text{mask} \mathcal{L}_\text{mask}$

The depth loss, inspired by DUSt3R (2403.15707), incorporates a confidence score $C_{i,j}$ predicted for each pixel $(i,j)$:

$\mathcal{L}_\text{depth} = \sum_{i,j} M^\text{gt}_{i,j} \left( C_{i,j} \left\lVert d_{i,j} - d_{i,j}^\text{gt} \right\rVert_2 - \alpha \log C_{i,j} \right)$

The confidence $C_{i,j}$ is computed as $1 + \exp(C'_{i,j})$, where $C'_{i,j}$ is the raw network output, ensuring positivity and enabling unsupervised confidence learning. The mask loss $\mathcal{L}_\text{mask}$ is a standard binary cross-entropy loss on the predicted mask $m_{i,j}$ against the ground truth $m^\text{gt}_{i,j}$. Data augmentation, including Gaussian noise, holes, and pixel shifts for depth, and color/noise variations for RGB, is applied to bridge the sim-to-real gap. The model was trained for 18 epochs on 8x 80GB A100 GPUs.
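A compact sketch of this objective is given below, assuming flattened per-view tensors and illustrative values for $\alpha$ and $\lambda_\text{mask}$ (the paper's exact hyperparameter values are not reproduced here):

```python
import torch
import torch.nn.functional as F

def rayst3r_loss(depth, raw_conf, mask_logits, depth_gt, mask_gt,
                 alpha=0.2, lambda_mask=1.0):
    # mask_gt: float tensor of 0/1 foreground labels, same shape as depth
    conf = 1.0 + torch.exp(raw_conf)             # C_{i,j} = 1 + exp(C'_{i,j}) > 1
    # for scalar depth the L2 norm reduces to an absolute difference
    per_pixel = conf * (depth - depth_gt).abs() - alpha * torch.log(conf)
    loss_depth = (mask_gt * per_pixel).sum()     # supervised only on ground-truth foreground
    loss_mask = F.binary_cross_entropy_with_logits(mask_logits, mask_gt)
    return loss_depth + lambda_mask * loss_mask
```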

Inference and Prediction Merging

During inference, RaySt3R predicts 3D points by querying novel views sampled on a sphere around the bounding box of the input object. A tunable radius parameter $\lambda_\text{bb}$ and clipping based on camera distance $\lambda_\text{cam}$ control the sampling. The predicted depth maps from these novel views are unprojected to 3D points. To create a complete 3D shape, these point predictions are merged based on several criteria (a sketch of the sampling and per-view filtering follows the list):

  1. Occlusion Handling: Points visible in a novel view are filtered if they would have been occluded by the foreground objects in the input view according to the input depth map $D^\text{input}$ and mask $M^\text{input}$. A point $q$ from a novel view $n$ is masked if its projection $p$ into the input view has a depth $(p)_z$ greater than $D^\text{input}$ at the corresponding pixel and lies within the input mask $M^\text{input}$.
  2. Predicted Masks: The binary object mask predicted by RaySt3R for the novel view is thresholded (at 0.5) to filter points likely not belonging to the foreground object.
  3. Confidence Scores: The per-pixel confidence scores predicted by RaySt3R are used to further filter out unreliable points by thresholding $C_{i,j}$ at a value $\tau$ (set to 5 in experiments). This helps reduce noise and edge bleeding.

The final reconstruction is the aggregate of all points from all novel views that pass these filtering steps. Inference takes less than 1.2 seconds on a single RTX 4090 GPU when querying 22 views.
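The inference procedure can be sketched as follows. The Fibonacci spacing of the query cameras, the $\lambda_\text{bb}$ radius scaling, the pinhole-projection math, and all variable names are assumptions (camera-distance clipping via $\lambda_\text{cam}$ is omitted); the occlusion test implements criterion 1 as stated above.

```python
import numpy as np

def sample_query_views(bbox_min, bbox_max, n_views=22, lambda_bb=1.5):
    """Place query camera centers on a sphere around the object bounding box."""
    center = 0.5 * (bbox_min + bbox_max)
    radius = lambda_bb * 0.5 * np.linalg.norm(bbox_max - bbox_min)
    i = np.arange(n_views)
    phi = np.arccos(1.0 - 2.0 * (i + 0.5) / n_views)   # polar angles of a Fibonacci sphere
    theta = np.pi * (1.0 + 5 ** 0.5) * i               # golden-angle azimuths
    dirs = np.stack([np.sin(phi) * np.cos(theta),
                     np.sin(phi) * np.sin(theta),
                     np.cos(phi)], axis=-1)
    return center + radius * dirs                      # (n_views, 3) camera centers looking at `center`

def filter_view(points_world, pred_mask, pred_conf,
                T_world_to_input, K_input, depth_input, mask_input, tau=5.0):
    """Keep the points of one query view that pass the three merging criteria.

    points_world, pred_mask, pred_conf are flattened per-pixel arrays of the query view.
    """
    keep = (pred_mask > 0.5) & (pred_conf > tau)       # criteria 2 and 3: predicted mask + confidence
    pts = points_world[keep]
    # criterion 1: project into the input view and apply the stated occlusion test
    pts_cam = pts @ T_world_to_input[:3, :3].T + T_world_to_input[:3, 3]
    uvw = pts_cam @ K_input.T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    z = pts_cam[:, 2]
    h, w = depth_input.shape
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    occluded = np.zeros(len(pts), dtype=bool)
    occluded[inside] = (mask_input[v[inside], u[inside]].astype(bool)
                        & (z[inside] > depth_input[v[inside], u[inside]]))
    return pts[~occluded]

# the final reconstruction aggregates the surviving points of all query views, e.g.:
# shape = np.concatenate([filter_view(p, m, c, T, K, D, M) for p, m, c in view_predictions])
```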

Evaluation and Results

RaySt3R was evaluated zero-shot on synthetic (OctMAE (2402.16952)) and real-world (YCB-Video (1805.07427), HOPE (2207.07128), HomebrewedDB (1910.04020)) datasets using Chamfer Distance (CD) and F1-Score@10mm (F1) metrics.
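For reference, these two metrics can be computed between a predicted and a ground-truth point cloud roughly as follows; the exact correspondence search and averaging conventions used in the paper's evaluation may differ.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_and_f1(pred, gt, threshold=0.01):
    """Symmetric Chamfer distance and F1 at `threshold` (0.01 m = 10 mm if inputs are in meters)."""
    d_pred_to_gt, _ = cKDTree(gt).query(pred)   # nearest-neighbor distances pred -> gt
    d_gt_to_pred, _ = cKDTree(pred).query(gt)   # nearest-neighbor distances gt -> pred
    chamfer = d_pred_to_gt.mean() + d_gt_to_pred.mean()
    precision = (d_pred_to_gt < threshold).mean()
    recall = (d_gt_to_pred < threshold).mean()
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return chamfer, f1
```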

Quantitatively, RaySt3R significantly outperforms state-of-the-art baselines, including volumetric methods like OctMAE (2402.16952) and view-based methods like LaRI (2504.18424), Unique3D (2405.20343), and TRELLIS (2412.01506), as well as modular pipelines like SceneComplete (2410.23643). RaySt3R achieves the lowest CD and highest F1 across all evaluated datasets, with up to 44% lower CD compared to the best baseline. It also demonstrates the lowest standard deviation in CD across real-world datasets, indicating more consistent performance.

Qualitative results show that RaySt3R produces sharp, geometrically accurate, and complete 3D shapes, recovering the geometry of full objects in cluttered scenes despite only being trained on synthetic data. Baselines often exhibit oversmoothed predictions (OctMAE (2402.16952)), struggle with object placement and aspect ratios (LaRI (2504.18424), Unique3D (2405.20343), TRELLIS (2412.01506)), or are brittle to input mask quality (SceneComplete (2410.23643)).

Implementation Considerations and Ablations

  • Computational Requirements: Training is computationally intensive, requiring multiple high-end GPUs (8x 80GB A100). Inference is fast enough for real-time robotics/XR applications on a single modern GPU (< 1.2s).
  • Data Dependency: Training on a large, diverse synthetic dataset with appropriate augmentation is crucial for zero-shot generalization to real-world scenes. Ablations confirm the importance of data scale, diversity, and data augmentation.
  • Network Architecture: Using a larger ViT model and incorporating DINOv2 (2304.07193) features improve performance.
  • Merging Strategy: Each component of the view merging strategy (querying input view, occlusion masking, predicted masks, confidence filtering) contributes to performance, with predicted masks having the largest impact.
  • Confidence Filtering: The learned confidence scores provide a practical way to trade off accuracy (higher threshold reduces outliers) and completeness (lower threshold includes more points).
  • Input Mask Sensitivity: RaySt3R is more robust to false positives than false negatives in the input foreground mask. Obtaining high-quality input masks is important, though the method shows some resilience.
  • Baseline Alignment: The paper details the process of aligning canonical-space predictions of baselines (Unique3D (2405.20343), LaRI (2504.18424), TRELLIS (2412.01506)) for evaluation, showing that alignment parameters (like rotation search steps and initial scaling) impact their reported CD. RaySt3R predicts directly in the input camera frame, avoiding this step.

Conclusion

RaySt3R demonstrates that recasting single-view 3D shape completion as novel view synthesis and employing a transformer-based architecture trained on large-scale synthetic data is highly effective. The confidence-aware predictions and sophisticated merging strategy enable state-of-the-art zero-shot performance on challenging real-world multi-object scenes, making it a promising approach for applications requiring robust 3D perception. Future work includes exploring real-world data training and alternative architectures like diffusion transformers.
