RaySt3R: Zero-Shot 3D Shape Completion

Updated 30 June 2025
  • RaySt3R is a transformer-based framework for zero-shot 3D shape completion, synthesizing novel depth views from a single RGB-D input for robotics, digital twin reconstruction, and extended reality.
  • It reformulates geometric completion as a novel-view synthesis problem by leveraging cross-attention between query rays and fused RGB-D and point cloud features.
  • The method achieves state-of-the-art results, reducing Chamfer distance by up to 44% relative to prior baselines and producing high-fidelity, sharp-boundary reconstructions in both synthetic and real-world scenarios.

RaySt3R is a transformer-based framework for zero-shot 3D object shape completion from a single RGB-D view, developed to address challenges in robotics, digital twin reconstruction, and extended reality (XR). The method casts geometric completion as a novel-view synthesis problem, predicting depth maps, segmentation masks, and per-pixel confidences for virtual camera rays sampling the unobserved portions of an object. RaySt3R achieves state-of-the-art geometric accuracy and boundary sharpness on both synthetic and real-world datasets, reducing Chamfer distance by up to 44% relative to prior baselines. Its design enables efficient, high-fidelity 3D completion suitable for scenarios demanding sharp object boundaries, 3D consistency, fast inference, and broad generalization.

1. Model Architecture

RaySt3R employs a modular vision transformer (ViT) pipeline combining pretrained RGB-D features with dense geometric reasoning:

  • Inputs: The model receives (a) a foreground-masked RGB-D image with camera intrinsics, and (b) a set of query views, each parameterized by a regular 2D grid of camera rays (ray maps).
  • Feature Extraction: The foreground-masked RGB is encoded with a frozen DINOv2 ViT-L backbone, extracting intermediate features from multiple layers (e.g., 4, 11, 17, 23), aggregating rich contextual cues.
  • World-to-Camera Transformation: The input depth map is unprojected to a 3D point cloud in world space and reprojected into each query camera frame to form a point map. In parallel, each query view is encoded as a ray map giving the ray direction of every pixel (a preprocessing sketch follows this list).
  • Self-Attention Blocks: Self-attention is applied separately to the point-map tokens and the query ray-map tokens, letting each aggregate spatial and directional context. Foreground–background masking is handled via learnable tokens.
  • Cross-Attention: For each query view, a transformer block uses the query ray features as queries and concatenated point map plus DINOv2 features as keys/values, supporting long-range context propagation between observed and novel viewpoints.
  • Heads: The output is decoded by a pair of DPT heads: one predicts per-pixel depth and associated confidence, the other predicts a per-pixel object mask.
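
A minimal numpy sketch of this geometric preprocessing, assuming a standard pinhole camera model: the input depth map is lifted to world-space points, expressed in a query camera's frame as a point map, and each query view is described by a per-pixel ray map. Function names and conventions are illustrative, not the authors' implementation.

```python
import numpy as np

def unproject_depth(depth, K, cam_to_world):
    """Lift an (H, W) depth map to a world-space point cloud using pinhole intrinsics K."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T                                   # camera-frame directions (z = 1)
    pts_cam = rays * depth.reshape(-1, 1)                             # scale by per-pixel depth
    return pts_cam @ cam_to_world[:3, :3].T + cam_to_world[:3, 3]     # (H*W, 3) world points

def point_map(pts_world, world_to_query, H, W):
    """Express the input points in a query camera's frame, keeping the (H, W) pixel layout."""
    pts_q = pts_world @ world_to_query[:3, :3].T + world_to_query[:3, 3]
    return pts_q.reshape(H, W, 3)

def ray_map(K, H, W):
    """Unit ray direction for every pixel of a query view, shape (H, W, 3)."""
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
    return rays.reshape(H, W, 3)
```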

The trainable transformer typically follows a ViT-B configuration (patch size 16, embedding dimension 768, 12 attention heads), with 12 cross-attention layers and 4 self-attention layers each for the ray and context tokens; the DINOv2 feature extractor itself remains frozen.
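
A compact PyTorch sketch of the cross-attention stage: tokens derived from the query ray map act as queries, while the concatenated point-map and DINOv2 tokens act as keys and values. The token counts, pre-norm layout, and module structure below are assumptions consistent with the configuration above, not the released code.

```python
import torch
import torch.nn as nn

class RayCrossAttnBlock(nn.Module):
    """One decoder block: ray tokens (queries) attend to context tokens (keys/values)."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                 nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, ray_tokens, context_tokens):
        q = self.norm_q(ray_tokens)
        kv = self.norm_kv(context_tokens)
        attended, _ = self.attn(q, kv, kv)     # rays query the observed geometry and appearance
        x = ray_tokens + attended
        return x + self.mlp(x)

# toy token counts: ray tokens come from the query-view ray map, context tokens from the
# concatenated point map and DINOv2 features of the input view
rays = torch.randn(2, 196, 768)
context = torch.randn(2, 392, 768)
out = RayCrossAttnBlock()(rays, context)       # (2, 196, 768), later decoded by the DPT heads
```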

2. Methodology: Shape Completion as View Synthesis

RaySt3R recasts 3D shape completion as predicting depth images in novel camera views:

  • Task Definition: Given a masked RGB-D input and an arbitrary set of virtual camera rays (the “novel view(s)”), infer the depth, object mask, and confidence along each ray, thus predicting what would be observed from new perspectives.
  • Fusion for 3D Completion: By synthesizing depth maps and object masks (with confidences) for multiple query views sampled around the object, and merging visible, high-confidence points, RaySt3R reconstructs a globally consistent 3D shape.
  • Sampling Strategy: Query cameras are distributed uniformly across a viewing sphere around the object, excluding viewpoints that are near-degenerate with the input view (a sampling sketch follows this list).
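
One common way to realize this sampling, sketched below, is a Fibonacci lattice over the viewing sphere with a look-at construction and an angular rejection threshold against the input viewpoint; the lattice, the threshold, and the camera convention are assumptions for illustration.

```python
import numpy as np

def look_at(eye, target=np.zeros(3), up=np.array([0.0, 0.0, 1.0])):
    """Camera-to-world pose for a camera at `eye` looking at `target` (OpenCV axes)."""
    fwd = target - eye
    fwd = fwd / np.linalg.norm(fwd)
    right = np.cross(fwd, up)
    right = right / np.linalg.norm(right)      # degenerate if fwd is parallel to `up` (poles)
    down = np.cross(fwd, right)
    T = np.eye(4)
    T[:3, :3] = np.stack([right, down, fwd], axis=1)
    T[:3, 3] = eye
    return T

def sample_query_cameras(n, radius, input_dir, min_angle_deg=15.0):
    """Fibonacci-lattice viewpoints on a sphere around the object, skipping ones
    whose viewing direction is nearly degenerate with the input camera's."""
    golden = np.pi * (3.0 - np.sqrt(5.0))
    cams = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n
        r = np.sqrt(max(1.0 - z * z, 0.0))
        d = np.array([r * np.cos(golden * i), r * np.sin(golden * i), z])
        angle = np.degrees(np.arccos(np.clip(d @ input_dir, -1.0, 1.0)))
        if angle > min_angle_deg:
            cams.append(look_at(radius * d))
    return cams
```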

Loss Functions:

  • Depth loss is confidence-weighted:

$$\mathcal{L}_\text{depth} = \sum_{i,j} M^{\mathrm{gt}}_{i,j} \left[ C_{i,j} \, \bigl\| d_{i,j} - d^{\mathrm{gt}}_{i,j} \bigr\|_2 - \alpha \log C_{i,j} \right]$$

  • Mask loss is binary cross-entropy:

$$\mathcal{L}_\text{mask} = \sum_{i,j} \Bigl[ -m^{\mathrm{gt}}_{i,j} \log m_{i,j} - \bigl(1 - m^{\mathrm{gt}}_{i,j}\bigr) \log \bigl(1 - m_{i,j}\bigr) \Bigr]$$

  • Total loss: Weighted combination of both.
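
A minimal PyTorch sketch of these losses, assuming the mask head outputs logits and that α and the mask weight are hyperparameters (the specific values below are placeholders, not the paper's settings):

```python
import torch
import torch.nn.functional as F

def rayst3r_losses(d_pred, conf, m_logits, d_gt, m_gt, alpha=0.2, lambda_mask=1.0):
    """Confidence-weighted depth loss plus BCE mask loss; alpha/lambda_mask are assumed values."""
    # depth term: supervise only pixels inside the ground-truth object mask
    depth_err = torch.abs(d_pred - d_gt)                           # per-pixel |d - d_gt|
    loss_depth = (m_gt * (conf * depth_err - alpha * torch.log(conf))).sum()
    # mask term: standard binary cross-entropy over all pixels
    loss_mask = F.binary_cross_entropy_with_logits(m_logits, m_gt, reduction="sum")
    return loss_depth + lambda_mask * loss_mask
```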

View merging: Candidate points are retained for the final 3D completion only if they satisfy all of the following (a filtering sketch follows this list):

  • Not occluded by the original view (tested via visibility and input mask),
  • Pass the binary mask threshold ($m_{i,j} > 0.5$),
  • Have high confidence ($c_{i,j} > \tau$).
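
A hedged numpy sketch of this filtering, assuming per-point arrays flattened from one query view; the occlusion test is realized here as a z-buffer comparison against the input depth map (an assumption about how the visibility check is implemented), and the thresholds τ and z_margin are placeholders.

```python
import numpy as np

def merge_view(pts_world, mask_prob, conf, input_depth, input_mask, K, world_to_input,
               tau=1.5, z_margin=5e-3):
    """Filter one query view's predictions before fusing them into the global point cloud.
    pts_world: (N, 3) predicted points; mask_prob, conf: (N,) per-point scores."""
    keep = (mask_prob > 0.5) & (conf > tau)                  # mask and confidence thresholds
    p = pts_world[keep] @ world_to_input[:3, :3].T + world_to_input[:3, 3]
    z = np.maximum(p[:, 2], 1e-9)
    u = np.round(p[:, 0] * K[0, 0] / z + K[0, 2]).astype(int)
    v = np.round(p[:, 1] * K[1, 1] / z + K[1, 2]).astype(int)
    H, W = input_depth.shape
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H) & (p[:, 2] > 0)
    consistent = np.ones(len(p), dtype=bool)
    d_obs = input_depth[v[inside], u[inside]]
    fg = input_mask[v[inside], u[inside]] > 0
    # assumed realization of the visibility test: discard points the input camera would have
    # seen floating in front of its observed foreground surface (known free space)
    consistent[np.where(inside)[0]] = ~fg | (p[inside, 2] >= d_obs - z_margin)
    return pts_world[keep][consistent]
```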

3. Performance Evaluation and Results

RaySt3R achieves state-of-the-art accuracy and consistency on real and synthetic datasets (OctMAE, YCB-Video, HomebrewedDB, HOPE):

  • Metrics Used: Chamfer distance (symmetric average nearest-neighbor distance between predicted and ground-truth point clouds) and F1 score at a 10 mm threshold (harmonic mean of precision and recall, counting points matched within 10 mm); a sketch of both metrics follows this list.
  • Quantitative Results: On all major benchmarks, RaySt3R reduces Chamfer distance by 20–44% over the prior best method (OctMAE); its F1 scores also lead all baselines.
Dataset        Chamfer Distance (mm, ↓)   F1 Score (↑)
OctMAE         5.21                       0.893
YCB-Video      3.56                       0.930
HomebrewedDB   4.75                       0.889
HOPE           3.92                       0.926
  • Qualitative Results: Reconstructions display sharp, well-aligned boundaries and preserve fine details; prior baselines typically yield over-smoothed or incomplete geometry.
  • Efficiency: Inference time is under 1.2 seconds on a single GPU.
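
The two evaluation metrics can be made concrete with a short sketch. Note that Chamfer-distance conventions (mean vs. sum, squared vs. unsquared) differ between papers, so the symmetric unsquared mean below is an assumption rather than the paper's exact definition.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_and_f1(pred, gt, thresh=0.010):
    """Symmetric Chamfer distance and F1@thresh for two (N, 3) point clouds.
    `thresh` is in the same units as the points (0.010 = 10 mm if points are in meters)."""
    d_pred_to_gt = cKDTree(gt).query(pred)[0]    # nearest-GT distance for each predicted point
    d_gt_to_pred = cKDTree(pred).query(gt)[0]    # nearest-prediction distance for each GT point
    chamfer = 0.5 * (d_pred_to_gt.mean() + d_gt_to_pred.mean())
    precision = (d_pred_to_gt < thresh).mean()   # predicted points close to the true surface
    recall = (d_gt_to_pred < thresh).mean()      # true surface covered by predictions
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return chamfer, f1
```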

4. Application Domains

RaySt3R supports diverse use cases:

  • Robotics: Enables reliable grasping and manipulation by predicting full object surfaces from partial/occluded observations in cluttered environments. Benefits downstream tasks such as motion planning and mechanical search.
  • Digital Twins: Accurately reconstructs high-fidelity object or scene twins for remote inspection, simulation, or asset creation.
  • Extended Reality (XR): Real-time, sharp geometry completion enhances AR/VR overlays, interaction realism, and occlusion effects.
  • General 3D Modeling: Fast, consistent geometric completion is valuable for survey, mapping, and reconstruction tasks in both academic and industrial settings.

Sharp boundary prediction and confidence-aware fusion are explicitly highlighted as critical differentiators for practical deployment.

5. Limitations and Remaining Challenges

The method addresses several shortcomings of prior shape completion approaches but acknowledges key challenges:

  • Foreground Mask Dependency: Performance strongly depends on the quality of the foreground mask in the input. High false-negative rates (excluding true object points) can severely degrade completion quality; the system tolerates false positives more robustly.
  • Generalization to Real Data: Despite strong synthetic-to-real transfer, the absence of real-data fine-tuning may limit robustness to sensor noise, unmodeled artifacts, or out-of-distribution objects.
  • Handling Occlusion: While designed to avoid merging occluded predictions (by visibility and mask tests), heavy clutter or erroneous masks can cause residual artifacts.
  • Failure Modes: The approach may yield poor results if the input mask is inadequate or if the object has minimal visible area from the input view.

6. Future Directions

Several avenues are identified for improvement and extension:

  • Real-Data Training: Incorporating real-world RGB-D scans in training may improve noise robustness and adaptation to practical settings.
  • Model Scaling: Expanding backbone size or training corpus (e.g., with larger ViT architectures or diffusion-based transformers) could further enhance accuracy and generalization.
  • Reduction of Mask Reliance: Novel methods to infer shape without precise input masking—such as attention-based foreground identification—could increase applicability.
  • Optimized View Fusion: Improving the confidence-based selection and fusion of predictions across query views may yield even sharper and more complete reconstructions.
  • Expanded Task Scope: Extending the approach to scene-scale completion, more complex occlusion, or semantic segmentation tasks remains an open research question.

RaySt3R constitutes a significant development in transformer-based, geometry-aware vision, advancing the state of the art in single-view 3D shape completion. Its architectural focus on cross-attention between query rays and both visual and geometric evidence, combined with efficient per-ray prediction and robust multi-view fusion, provides a framework applicable to a broad set of real-world tasks in robotics, XR, and digital twinning. Its performance and architecture set a reference point for future research toward sharper, more reliable, and computationally efficient 3D object recovery from minimal observations.