ShapeR: Robust 3D Shape Generation
- ShapeR is a conditional 3D shape generation framework that integrates sparse SLAM, multi-view imaging, and language cues to reconstruct objects from casual captures.
- The system employs a rectified flow transformer and a 3D VAE to generate high-fidelity shapes, outperforming existing methods by over 2.7× in Chamfer distance.
- A two-stage curriculum with extensive data augmentations ensures robustness to occlusion and clutter, validated through quantitative metrics and user studies.
ShapeR is a conditional 3D shape generation framework designed for robust reconstruction from casually captured image sequences. It leverages visual-inertial SLAM, 3D detection, and multimodal vision-LLMs to overcome challenges endemic to real-world data, such as occlusion, clutter, and poorly segmented inputs. At the core of ShapeR is a rectified flow transformer conditioned on sparse geometric, visual, and linguistic information, enabling high-fidelity metric 3D shape synthesis directly from noisy, unconstrained inputs (Siddiqui et al., 16 Jan 2026).
1. Input Modalities and Conditioning Pipeline
ShapeR conditions on a tri-modal set of input representations per object: sparse SLAM points, posed multi-view images with implicit object masks, and machine-generated text captions.
- Sparse SLAM points are extracted using a visual-inertial SLAM pipeline (Direct Sparse Odometry, as used on Project Aria) to obtain a metric semi-dense point cloud with per-frame visibility associations. A 3D instance detector (EFM3D) predicts axis-aligned bounding boxes, and outlier points within each box are removed using SAM2. Each object's points $P_i$ are voxelized and passed through a 3D sparse ConvNet, yielding a stream of point tokens.
- Posed multi-view images $I_i$ are selected based on object visibility along the SLAM trajectory. Known extrinsic and intrinsic parameters $\Pi_i$ enable projection of $P_i$ into the image frames to generate binary object masks $M_i$ (see the projection sketch at the end of this subsection). Each image crop is processed by a frozen DINOv2 backbone to produce patch tokens, which are concatenated with Plücker-encoded ray directions; mask features are fused with the DINO tokens to form the image token stream.
- Vision-language captions $T_i$ are produced by prompting a frozen multimodal LLM (Llama 4, Meta 2025) to describe the object from a representative view. These captions are embedded with frozen T5 and CLIP text encoders, providing the text conditioning tokens.
The complete conditioning set for each object is $C_i = \{P_i, I_i, \Pi_i, M_i, T_i\}$.
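As an illustration of the point-mask construction above, the following minimal sketch projects an object's SLAM points into one view to obtain a binary mask; the helper name, the pinhole intrinsics `K`, and the world-to-camera extrinsics `T_wc` are assumptions for illustration, not the paper's exact conventions.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def project_point_mask(points_world, K, T_wc, image_hw, dilate_px=2):
    """Project an object's SLAM points into one view to build a binary mask.

    points_world: (N, 3) metric SLAM points belonging to the object.
    K:            (3, 3) pinhole intrinsics of the view.
    T_wc:         (4, 4) world-to-camera extrinsics of the view.
    image_hw:     (H, W) output mask resolution.
    """
    H, W = image_hw
    # Transform to camera coordinates and keep points in front of the camera.
    pts_h = np.concatenate([points_world, np.ones((len(points_world), 1))], axis=1)
    pts_cam = (T_wc @ pts_h.T).T[:, :3]
    pts_cam = pts_cam[pts_cam[:, 2] > 1e-6]

    # Pinhole projection onto the image plane.
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)

    # Rasterize the sparse projections; a small dilation compensates for the
    # sparsity of semi-dense SLAM points.
    mask = np.zeros((H, W), dtype=bool)
    mask[v[valid], u[valid]] = True
    if dilate_px > 0:
        mask = binary_dilation(mask, iterations=dilate_px)
    return mask
```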
2. Rectified Flow Transformer Architecture
ShapeR’s shape generation pipeline consists of two principal components: a VecSet-based 3D VAE and a rectified flow-matching transformer.
- 3D VAE (VecSet/Dora): The encoder samples uniform surface points and edge-salient points from the target mesh $S_i$, applies cross-attention, downsampling, and self-attention, and outputs a fixed-length latent sequence $z$. The VAE decoder cross-attends from $z$ to arbitrary spatial queries $x \in \mathbb{R}^3$ to predict SDF values $\hat{s}(x)$. The VAE training loss combines SDF regression over the sampled queries with a KL regularizer on the latent sequence,
$$\mathcal{L}_{\text{VAE}} = \mathbb{E}_{x}\big[\lVert \hat{s}(x) - s(x) \rVert\big] + \lambda_{\text{KL}}\,\mathcal{L}_{\text{KL}},$$
where $s(x)$ denotes the ground-truth signed distance.
- Rectified Flow Matching: Generation is formulated as solving a latent ODE that transports Gaussian noise $z_1 = \varepsilon \sim \mathcal{N}(0, I)$ at $t = 1$ to a mesh latent at $t = 0$ along straight-line paths
$$z_t = (1 - t)\,z_0 + t\,\varepsilon,$$
where $z_0$ is the ground-truth mesh latent. The flow-matching loss regresses the constant path velocity,
$$\mathcal{L}_{\text{FM}} = \mathbb{E}_{t,\varepsilon}\Big[\big\lVert f_\theta(z_t, t, C_i) - (z_0 - \varepsilon)\big\rVert_2^2\Big],$$
with $t \sim \mathcal{U}(0, 1)$ and $\varepsilon \sim \mathcal{N}(0, I)$; at inference the ODE is integrated with Euler steps from $t = 1$ to $t = 0$ (a minimal training-step sketch follows this list).
- Transformer Backbone: The architecture features dual-stream cross-attention: the initial four layers cross-attend to the sparse-point tokens, subsequent layers cross-attend to the image tokens, and the remaining single-stream layers jointly process the shape latents together with the conditioning tokens. Conditioning is further modulated by a learned timestep embedding and a CLIP caption embedding. No absolute positional embeddings are employed.
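Below is a minimal sketch of the flow-matching training step described above, assuming a generic `model(z_t, t, cond)` that predicts a velocity; the function and argument names are illustrative, not the released implementation.

```python
import torch

def flow_matching_step(model, z0, cond, optimizer):
    """One rectified-flow training step on a batch of ground-truth mesh latents.

    model: predicts a velocity from (noisy latent, timestep, conditioning tokens).
    z0:    (B, L, d) mesh latents produced by the frozen 3D VAE encoder.
    cond:  conditioning tokens (SLAM points, images, text) for the batch.
    """
    B = z0.shape[0]
    eps = torch.randn_like(z0)                  # Gaussian endpoint of the path (t = 1)
    t = torch.rand(B, device=z0.device)         # t ~ U(0, 1)
    t_b = t.view(B, 1, 1)
    z_t = (1.0 - t_b) * z0 + t_b * eps          # straight-line interpolation z_t
    v_target = z0 - eps                         # constant velocity pointing noise -> data

    v_pred = model(z_t, t, cond)
    loss = torch.mean((v_pred - v_target) ** 2)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```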
3. Data Augmentation and Curriculum Training
ShapeR achieves its robustness through a two-stage training curriculum and extensive on-the-fly augmentation.
- Stage 1 (object-centric pretraining): Utilizes approximately 600,000 artist-modeled meshes. Augmentations applied per sample include background compositing, occlusion overlays, visibility fog, motion blur, resolution degradation, and photometric jitter for images; point dropout, trajectory subsampling, Gaussian noise, and occlusions for SLAM points. All augmentations are applied online.
- Stage 2 (scene-centric fine-tuning): Operates on synthetic Aria Environments (SceneScript), capturing the complexities of real-world clutter, inter-object occlusions, and SLAM uncertainties. Fine-tuning on such crops narrows the domain gap between idealized datasets and genuinely casual scenes.
This curriculum instills both strong global shape priors and the capacity for segmentation and completion in difficult sensing conditions.
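To make the Stage 1 point augmentations concrete, here is a minimal sketch of on-the-fly corruptions (dropout, Gaussian noise, synthetic occlusion); the parameter values are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def augment_slam_points(points, drop_frac=0.3, noise_std=0.01, occlude_prob=0.5):
    """Corrupt an object's sparse SLAM points on the fly (Stage 1 style).

    points: (N, 3) metric point cloud of a single object.
    """
    pts = points.copy()

    # Random dropout simulates incomplete semi-dense SLAM coverage.
    pts = pts[np.random.rand(len(pts)) > drop_frac]

    # Gaussian jitter simulates triangulation noise.
    pts = pts + np.random.normal(scale=noise_std, size=pts.shape)

    # Removing points beyond a random half-space cut simulates occlusion
    # by nearby furniture or clutter.
    if len(pts) > 0 and np.random.rand() < occlude_prob:
        normal = np.random.randn(3)
        normal /= np.linalg.norm(normal)
        offsets = (pts - pts.mean(axis=0)) @ normal
        pts = pts[offsets < np.quantile(offsets, 0.8)]   # drop the farthest 20%

    return pts
```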
4. Strategies for Handling Background Clutter
ShapeR addresses background clutter and occlusion without explicit mask supervision or dedicated loss terms for background rejection.
- Implicit segmentation emerges via the joint encoding of the image crops and the projected point masks $M_i$; the model learns to focus on the foreground object intrinsically.
- Point-mask prompting involves supplementing DINO image features with mask tokens, guiding attention toward the object. Ablation experiments indicate that removing this feature increases confusion with nearby objects.
- Robust pre-processing via 3D instance detection (EFM3D) and post-hoc SAM2 filtering allows isolation of objects in highly cluttered or occluded contexts.
The multimodal pipeline ensures that attention is directed toward valid object regions even in visually ambiguous scenes.
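As a sketch of the point-mask prompting idea from this section, the snippet below pools a projected binary mask to the DINOv2 patch grid and adds a learned mask feature to each patch token; the fusion scheme and names are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def fuse_mask_with_patch_tokens(dino_tokens, mask, mask_embed):
    """Fuse a projected point mask with frozen DINOv2 patch tokens.

    dino_tokens: (B, N, d) patch tokens from the frozen image backbone.
    mask:        (B, H, W) binary mask from SLAM points projected into the view.
    mask_embed:  torch.nn.Linear(1, d) mapping per-patch mask coverage to a feature.
    """
    B, N, d = dino_tokens.shape
    hp = wp = int(N ** 0.5)  # assumes a square patch grid

    # Average-pool the mask down to per-patch coverage in [0, 1].
    coverage = F.adaptive_avg_pool2d(mask.float().unsqueeze(1), (hp, wp))
    coverage = coverage.flatten(2).transpose(1, 2)       # (B, N, 1)

    # Adding the mask feature to each patch token lets attention be steered
    # toward patches covered by the object's projected points.
    return dino_tokens + mask_embed(coverage)
```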
5. Evaluation Benchmark and Quantitative Results
A new “in-the-wild” evaluation benchmark is introduced, featuring 178 manually annotated indoor objects spanning seven Project Aria sequences (furniture, appliances, tools, etc.). The annotation process aligns an idealized mesh acquired ex-situ with its original scene via 2D silhouettes and SLAM points, yielding ground truth suitable for metric evaluation.
Evaluation employs three principal metrics (in normalized space):
- Chamfer distance (CD, lower better)
- Normal consistency (NC, higher better)
- F₁ score at 1% threshold (higher better)
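For reference, here is a minimal sketch of how Chamfer distance and the F₁ score can be computed between sampled surface point sets; the sampling density, normalization, and reporting scale used in the paper are not specified here and are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_and_f1(pred_pts, gt_pts, tau=0.01):
    """Chamfer distance and F1 between point sets sampled from two surfaces.

    pred_pts, gt_pts: (N, 3) points sampled in the normalized evaluation space.
    tau:              distance threshold for F1 (here 1% of the normalized extent).
    """
    d_pred_to_gt = cKDTree(gt_pts).query(pred_pts)[0]    # nearest-neighbor distances
    d_gt_to_pred = cKDTree(pred_pts).query(gt_pts)[0]

    chamfer = d_pred_to_gt.mean() + d_gt_to_pred.mean()  # symmetric Chamfer distance
    precision = (d_pred_to_gt < tau).mean()
    recall = (d_gt_to_pred < tau).mean()
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return chamfer, f1
```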
Empirical results demonstrate the following:
| Method | CD ↓ | NC ↑ | F₁ (1%) ↑ |
|---|---|---|---|
| EFM3D (scene fusion) | 13.82 | – | – |
| FoundationStereo fusion | 6.48 | – | – |
| LIRM (segmentation-based) | 8.05 | – | – |
| DP-Recon | 8.36 | – | – |
| ShapeR | 2.375 | 0.810 | 0.722 |
ShapeR outperforms existing multi-view and segmentation-based baselines by approximately 2.7× in Chamfer distance. In user studies (Table 2), ShapeR's outputs are preferred over those of leading image-to-3D models at rates of roughly 85–88%, without reliance on interactive masks.
6. Ablation Studies and Component Analysis
Ablation experiments isolate the impact of each principal component by measuring Chamfer distance on the in-the-wild benchmark:
| Component removed | CD ↓ |
|---|---|
| None (full model) | 2.375 |
| No SLAM points | 4.514 |
| No point augmentations | 3.276 |
| No image augmentations | 3.397 |
| No two-stage training | 3.053 |
| No 2D point-mask prompt | 2.568 |
SLAM points act as a global geometric anchor, while both image and point augmentations are critical for noise robustness. Scene-level fine-tuning reduces domain adaptation challenges, and point-mask prompting sharpens focus on the correct object under occlusion or clutter.
7. Inference Process and Synthesis Pipeline
The per-object synthesis pipeline is expressed as follows:
```
P_i, I_i, Pi_i, M_i, T_i = SLAM_extract(sequence)   # conditioning set C_i
P_i = normalize_points(P_i)                         # rescale object points to [-1, 1]^3
z = N(0, I)                                         # initial noise latent z_1
for t in [1.0, …, Δt]:                              # Euler steps from t = 1 toward 0
    z = z + Δt * f_theta(z, t, C_i)
sdf_grid = D(z)                                     # query the VAE decoder on a 3D grid
S_i = marching_cubes(sdf_grid)
rescale_mesh(S_i, P_i)                              # back to metric coordinates
```
This streamlined workflow generates metric-accurate 3D meshes directly aligned to the real-world coordinate frame, without explicit segmentation or post-hoc mesh refinement.
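A runnable sketch of the Euler sampling loop from the pseudocode above, assuming a trained velocity model with signature `model(z, t, cond)`; the step count and names are illustrative.

```python
import torch

@torch.no_grad()
def sample_latent(model, cond, latent_shape, num_steps=50, device="cuda"):
    """Euler integration of the rectified flow from noise (t = 1) to a mesh latent (t ≈ 0)."""
    z = torch.randn(latent_shape, device=device)        # z_1 ~ N(0, I)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((latent_shape[0],), 1.0 - i * dt, device=device)
        z = z + dt * model(z, t, cond)                  # step along the predicted velocity
    return z
```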
ShapeR’s integration of metric SLAM anchors, multimodal image and caption conditioning, a flow-matching transformer, extensive multimodal augmentations, and a two-stage curriculum yields a system capable of predictable, high-fidelity 3D object generation from casual scene captures—demonstrably surpassing prior approaches in both quantitative and user study evaluations (Siddiqui et al., 16 Jan 2026).