
ShapeR: Robust 3D Shape Generation

Updated 19 January 2026
  • ShapeR is a conditional 3D shape generation framework that integrates sparse SLAM, multi-view imaging, and language cues to reconstruct objects from casual captures.
  • The system employs a rectified flow transformer and a 3D VAE to generate high-fidelity shapes, outperforming existing methods by over 2.7× in Chamfer distance.
  • A two-stage curriculum with extensive data augmentations ensures robustness to occlusion and clutter, validated through quantitative metrics and user studies.

ShapeR is a conditional 3D shape generation framework designed for robust reconstruction from casually captured image sequences. It leverages visual-inertial SLAM, 3D detection, and multimodal vision-LLMs to overcome challenges endemic to real-world data, such as occlusion, clutter, and poorly segmented inputs. At the core of ShapeR is a rectified flow transformer conditioned on sparse geometric, visual, and linguistic information, enabling high-fidelity metric 3D shape synthesis directly from noisy, unconstrained inputs (Siddiqui et al., 16 Jan 2026).

1. Input Modalities and Conditioning Pipeline

ShapeR conditions on a tri-modal set of input representations per object: sparse SLAM points, posed multi-view images with implicit object masks, and machine-generated text captions.

  • Sparse SLAM points are extracted using a visual-inertial SLAM pipeline (Direct Sparse Odometry, as in Project Aria) to obtain a metric semi-dense point cloud $P$ with per-frame visibility associations $P_{I^k}$. A 3D instance detector (EFM3D) predicts axis-aligned bounding boxes, and outlier points within each box are removed by SAM2. Each object's points $P_i \subset P$ are voxelized and passed through a 3D sparse-ConvNet, yielding a token stream $C_{\text{pts}}$.
  • Posed multi-view images $I_i = \{I_i^1,\dots,I_i^N\}$ are selected based on object visibility along the SLAM trajectory. Known extrinsic and intrinsic parameters enable projection of $P_i$ into each image frame to generate binary object masks $M_i^j$. Each image crop is processed by a frozen DINOv2 backbone to produce patch tokens, which are concatenated with Plücker-encoded ray directions. Mask features are fused with the DINO tokens to form $C_{\text{img}}$.
  • Vision-language captions are produced by prompting a frozen multimodal LLM (LLaMA 4, Meta 2025) to describe the object from a representative view, yielding $T_i$. These captions are embedded using frozen T5 and CLIP text encoders, providing $C_{\text{txt}}$.

The complete conditioning set for each object is

$$C_i = \bigl\{\,C_{\text{pts}},\;C_{\text{img}},\;C_{\text{txt}}\bigr\}.$$
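
To make the data flow concrete, the following minimal sketch assembles the three token streams into a single conditioning dictionary. The linear projections are lightweight placeholders for the sparse-ConvNet, DINOv2-plus-Plücker, and T5/CLIP pathways described above; the class name, feature dimensions, and tensor shapes are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class ConditioningEncoder(nn.Module):
    """Illustrative stand-in for ShapeR's tri-modal conditioning pipeline."""

    def __init__(self, d_model=512, dino_dim=768, text_dim=768):
        super().__init__()
        self.pts_proj = nn.Linear(3, d_model)             # placeholder for the 3D sparse-ConvNet
        self.img_proj = nn.Linear(dino_dim + 6, d_model)  # DINO patch features + 6-D Plücker rays
        self.txt_proj = nn.Linear(text_dim, d_model)      # placeholder for T5/CLIP caption embeddings

    def forward(self, object_points, patch_tokens, ray_dirs, caption_emb):
        # object_points: (B, N_pts, 3); patch_tokens: (B, N_patch, dino_dim);
        # ray_dirs: (B, N_patch, 6); caption_emb: (B, N_txt, text_dim)
        c_pts = self.pts_proj(object_points)
        c_img = self.img_proj(torch.cat([patch_tokens, ray_dirs], dim=-1))
        c_txt = self.txt_proj(caption_emb)
        return {"pts": c_pts, "img": c_img, "txt": c_txt}
```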

2. Rectified Flow Transformer Architecture

ShapeR’s shape generation pipeline consists of two principal components: a VecSet-based 3D VAE and a rectified flow-matching transformer.

  • 3D VAE (VecSet/Dora): The encoder $E$ samples uniform surface and edge-salient points from the target mesh $S$, applies cross-attention, downsampling, and self-attention, and outputs a latent sequence $z \sim q(z|S)$ with $z \in \mathbb{R}^{L \times d}$. The VAE decoder $D$ applies cross-attention from $z$ to arbitrary spatial queries $x \in \mathbb{R}^3$ to predict SDF values $s(x) = D(z, x)$. The VAE training loss is

$$\mathcal{L}_{\mathrm{VAE}} = \mathbb{E}_{x}\,\|s(x)-s_{GT}(x)\|^2 + \beta\,\mathrm{KL}\bigl(q(z|S)\,\|\,\mathcal{N}(0,I)\bigr).$$
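
As a minimal sketch of this objective (assuming a diagonal-Gaussian posterior parameterized by mean and log-variance, and SDF values already evaluated at a batch of query points; the function name and β value are assumptions):

```python
import torch

def vae_loss(pred_sdf, gt_sdf, mu, logvar, beta=1e-3):
    """L_VAE = E_x ||s(x) - s_GT(x)||^2 + beta * KL(q(z|S) || N(0, I)).

    pred_sdf, gt_sdf: (B, Q) SDF values at the query points x.
    mu, logvar:       (B, L, d) parameters of the latent posterior q(z|S).
    beta:             KL weight (the value here is an assumption).
    """
    recon = ((pred_sdf - gt_sdf) ** 2).mean()
    # Closed-form KL divergence between a diagonal Gaussian and the standard normal.
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()
    return recon + beta * kl
```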

  • Rectified Flow Matching: Generation is formulated as solving a latent ODE,

$$\dot z_t = f_\theta(z_t, t, C), \quad t \in [0,1],$$

where $z_1 \sim \mathcal{N}(0,I)$ and $z_0$ is the ground-truth mesh latent. The flow-matching loss is

$$\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t,\,z_0,\,z_1,\,C}\,\bigl\|\,f_\theta(z_t, t, C) - (z_0 - z_1)\,\bigr\|^2,$$

with $z_t = (1-t)\,z_0 + t\,z_1$.
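
A sketch of one training step under these equations is shown below; f_theta stands for the conditional transformer, and its call signature here is an assumption:

```python
import torch

def flow_matching_loss(f_theta, z0, C):
    """Rectified-flow objective L_FM for a batch of ground-truth latents z0.

    z0: (B, L, d) mesh latents from the VAE encoder; C: conditioning set.
    """
    z1 = torch.randn_like(z0)                             # noise endpoint z_1 ~ N(0, I)
    t = torch.rand(z0.shape[0], 1, 1, device=z0.device)   # one timestep per sample
    zt = (1 - t) * z0 + t * z1                            # z_t = (1 - t) z_0 + t z_1
    target = z0 - z1                                      # velocity target (z_0 - z_1)
    pred = f_theta(zt, t.flatten(), C)                    # predicted velocity
    return ((pred - target) ** 2).mean()
```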

  • Transformer Backbone: The architecture uses dual-stream cross-attention: the initial four layers attend to $C_{\text{txt}}$, subsequent layers attend to $C_{\text{img}}$, and the remaining single-stream layers jointly process $z$ and $C_{\text{pts}}$. Conditioning is modulated by a learned timestep embedding $\psi(t)$ and a CLIP embedding. No absolute positional embeddings are employed; a simplified layer-ordering sketch follows.
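
The sketch below illustrates only this layer ordering: plain multi-head attention stands in for the actual dual-stream DiT-style blocks, the CLIP-based modulation is omitted, and all layer counts, widths, and the timestep-embedding scheme are assumptions.

```python
import torch
import torch.nn as nn

class BackboneSketch(nn.Module):
    """Layer-ordering illustration only; not the paper's architecture."""

    def __init__(self, d_model=512, n_heads=8, n_txt=4, n_img=8, n_single=12):
        super().__init__()
        make_attn = lambda: nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.txt_layers = nn.ModuleList([make_attn() for _ in range(n_txt)])        # cross-attend to C_txt
        self.img_layers = nn.ModuleList([make_attn() for _ in range(n_img)])        # cross-attend to C_img
        self.single_layers = nn.ModuleList([make_attn() for _ in range(n_single)])  # self-attend over [z; C_pts]
        self.t_embed = nn.Linear(1, d_model)  # simplified stand-in for the learned timestep modulation psi(t)

    def forward(self, z, C, t):
        # z: (B, L, d) latent tokens; C: dict of conditioning streams; t: (B,) timesteps.
        z = z + self.t_embed(t.view(-1, 1, 1))               # timestep conditioning (CLIP modulation omitted)
        for layer in self.txt_layers:                        # first: text cross-attention
            z = z + layer(z, C["txt"], C["txt"])[0]
        for layer in self.img_layers:                        # then: image cross-attention
            z = z + layer(z, C["img"], C["img"])[0]
        h = torch.cat([z, C["pts"]], dim=1)                  # finally: single stream over latents + point tokens
        for layer in self.single_layers:
            h = h + layer(h, h, h)[0]
        return h[:, : z.shape[1]]                            # latent tokens serve as the velocity prediction
```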

3. Data Augmentation and Curriculum Training

ShapeR achieves its robustness through a two-stage training curriculum and extensive on-the-fly augmentation.

  • Stage 1 (object-centric pretraining): Utilizes approximately 600,000 artist-modeled meshes. Augmentations applied per sample include background compositing, occlusion overlays, visibility fog, motion blur, resolution degradation, and photometric jitter for images; point dropout, trajectory subsampling, Gaussian noise, and synthetic occlusions for SLAM points. All augmentations are applied online; a minimal sketch of the point-level augmentations follows this list.
  • Stage 2 (scene-centric fine-tuning): Operates on synthetic Aria Environments (SceneScript), capturing the complexities of real-world clutter, inter-object occlusions, and SLAM uncertainties. Fine-tuning on such crops narrows the domain gap between idealized datasets and genuinely casual scenes.
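
Below is the referenced sketch of the Stage 1 point-level augmentations (dropout, Gaussian noise, synthetic occlusion). The probabilities and magnitudes are illustrative assumptions, and trajectory subsampling is omitted.

```python
import numpy as np

def augment_slam_points(points, drop_prob=0.3, noise_std=0.01, occlude_prob=0.5):
    """On-the-fly augmentation of an object's sparse SLAM points.

    points: (N, 3) array in the normalized [-1, 1]^3 object frame.
    All probabilities and magnitudes here are illustrative assumptions.
    """
    pts = points.copy()

    # Random point dropout.
    keep = np.random.rand(len(pts)) > drop_prob
    pts = pts[keep]

    # Additive Gaussian noise.
    pts = pts + np.random.normal(scale=noise_std, size=pts.shape)

    # Synthetic occlusion: discard all points on one side of a random plane.
    if np.random.rand() < occlude_prob:
        normal = np.random.randn(3)
        normal /= np.linalg.norm(normal)
        offset = np.random.uniform(-0.3, 0.3)
        pts = pts[pts @ normal > offset]

    return pts
```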

This curriculum instills both strong global shape priors and the capacity for segmentation and completion in difficult sensing conditions.

4. Strategies for Handling Background Clutter

ShapeR addresses background clutter and occlusion without explicit mask supervision or dedicated loss terms for background rejection.

  • Implicit segmentation emerges via the joint encoding of $C_{\text{pts}}$ and the projected point masks $M_i$. The model learns to focus on the foreground object intrinsically.
  • Point-mask prompting involves supplementing DINO image features with mask tokens, guiding attention toward the object. Ablation experiments indicate that removing this feature increases confusion with nearby objects.
  • Robust pre-processing via 3D instance detection (EFM3D) and post-hoc SAM2 filtering allows isolation of objects in highly cluttered or occluded contexts.

The multimodal pipeline ensures that attention is directed toward valid object regions even in visually ambiguous scenes.
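
The point-mask prompting described above relies on projecting an object's SLAM points into each posed view. The routine below is a minimal sketch: a pinhole projection (no distortion model) rasterized into a binary mask with a small dilation; the function name and dilation radius are assumptions.

```python
import numpy as np

def project_point_mask(points_world, K, T_world_to_cam, image_hw, radius=4):
    """Rasterize an object's sparse SLAM points into a binary mask for one view.

    points_world:   (N, 3) object points P_i in world coordinates.
    K:              (3, 3) camera intrinsics.
    T_world_to_cam: (4, 4) world-to-camera extrinsics.
    radius:         per-point dilation in pixels (an assumption).
    """
    h, w = image_hw
    pts_h = np.concatenate([points_world, np.ones((len(points_world), 1))], axis=1)
    cam = (T_world_to_cam @ pts_h.T).T[:, :3]
    cam = cam[cam[:, 2] > 0]                      # keep points in front of the camera
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                   # perspective divide

    mask = np.zeros((h, w), dtype=bool)
    for u, v in uv:
        u, v = int(round(u)), int(round(v))
        if 0 <= u < w and 0 <= v < h:
            mask[max(0, v - radius): v + radius + 1,
                 max(0, u - radius): u + radius + 1] = True
    return mask
```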

5. Evaluation Benchmark and Quantitative Results

A new “in-the-wild” evaluation benchmark is introduced, featuring 178 manually annotated indoor objects (furniture, appliances, tools, etc.) spanning seven Project Aria sequences. The annotation process aligns an idealized mesh acquired ex situ with its original scene via 2D silhouettes and SLAM points, yielding ground truth suitable for metric evaluation.

Evaluation employs three principal metrics, computed in normalized space (a minimal computation sketch follows the list):

  • Chamfer $\ell_2$ distance (CD, lower is better)
  • Normal consistency (NC, higher is better)
  • F₁ score at a 1% threshold (higher is better)
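
Chamfer distance and the F₁ score can be computed from dense surface samples with nearest-neighbor queries, as in the sketch below (normal consistency additionally requires per-point normals and is omitted). The sampling protocol and threshold convention here are assumptions rather than the benchmark's exact recipe.

```python
import numpy as np
from scipy.spatial import cKDTree

def shape_metrics(pred_pts, gt_pts, f1_threshold=0.01):
    """Chamfer l2 distance and F1 score between predicted and ground-truth surface samples.

    pred_pts, gt_pts: (N, 3) and (M, 3) points sampled in the normalized evaluation space.
    """
    d_pred_to_gt, _ = cKDTree(gt_pts).query(pred_pts)   # nearest-neighbor distances, prediction -> GT
    d_gt_to_pred, _ = cKDTree(pred_pts).query(gt_pts)   # and GT -> prediction

    # Symmetric Chamfer distance over squared distances (convention is an assumption).
    chamfer = (d_pred_to_gt ** 2).mean() + (d_gt_to_pred ** 2).mean()

    precision = (d_pred_to_gt < f1_threshold).mean()
    recall = (d_gt_to_pred < f1_threshold).mean()
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return chamfer, f1
```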

Empirical results demonstrate the following:

| Method | Chamfer CD (×10⁻²) | NC | F₁ (1%) |
|---|---|---|---|
| EFM3D (scene fusion) | 13.82 | – | – |
| FoundationStereo fusion | 6.48 | – | – |
| LIRM (segmentation-based) | 8.05 | – | – |
| DP-Recon | 8.36 | – | – |
| ShapeR | 2.375 | 0.810 | 0.722 |

ShapeR outperforms existing multi-view and segmentation-based baselines by approximately 2.7× in Chamfer distance. In user studies (Table 2 of the paper), its reconstructions are preferred at rates of ≈85–88% over leading image-to-3D models, without reliance on interactive masks.

6. Ablation Studies and Component Analysis

Ablation experiments isolate the impact of each principal component by measuring Chamfer distance ($\times 10^{-2}$):

| Component Removed | Chamfer CD |
|---|---|
| None (full model) | 2.375 |
| No SLAM points | 4.514 |
| No point augmentations | 3.276 |
| No image augmentations | 3.397 |
| No two-stage training | 3.053 |
| No 2D point-mask prompt | 2.568 |

SLAM points act as a global geometric anchor, while both image and point augmentations are critical for noise robustness. Scene-level fine-tuning reduces domain adaptation challenges, and point-mask prompting sharpens focus on the correct object under occlusion or clutter.

7. Inference Process and Synthesis Pipeline

The per-object synthesis pipeline can be summarized in the following pseudocode (helper names are schematic):

P_i, I_i, Pi_i, M_i, T_i = SLAM_extract(sequence)   # per-object SLAM points, views, poses, masks, caption
C_i = encode_conditions(P_i, I_i, Pi_i, M_i, T_i)   # conditioning set {C_pts, C_img, C_txt}
normalize_points(P_i)                               # to [-1, 1]^3

z = sample_gaussian()                               # initial noise latent z_1 ~ N(0, I)
for t in arange(1.0, 0.0, -dt):                     # integrate the rectified-flow ODE from t = 1 to t = 0
    z = z + dt * f_theta(z, t, C_i)

sdf_grid = D(z)                                     # query the VAE decoder on a dense 3D grid
S_i = marching_cubes(sdf_grid)
rescale_mesh(S_i, P_i)                              # back to metric scene coordinates

This streamlined workflow generates metric-accurate 3D meshes directly aligned to the real-world coordinate frame, without explicit segmentation or post-hoc mesh refinement.


ShapeR’s integration of metric SLAM anchors, multimodal image and caption conditioning, a flow-matching transformer, extensive multimodal augmentations, and a two-stage curriculum yields a system capable of predictable, high-fidelity 3D object generation from casual scene captures—demonstrably surpassing prior approaches in both quantitative and user study evaluations (Siddiqui et al., 16 Jan 2026).
