
ShapeR: Robust 3D Shape Generation

Updated 19 January 2026
  • ShapeR is a conditional 3D shape generation framework that integrates sparse SLAM, multi-view imaging, and language cues to reconstruct objects from casual captures.
  • The system employs a rectified flow transformer and a 3D VAE to generate high-fidelity shapes, outperforming existing methods by over 2.7× in Chamfer distance.
  • A two-stage curriculum with extensive data augmentations ensures robustness to occlusion and clutter, validated through quantitative metrics and user studies.

ShapeR is a conditional 3D shape generation framework designed for robust reconstruction from casually captured image sequences. It leverages visual-inertial SLAM, 3D detection, and multimodal vision-LLMs to overcome challenges endemic to real-world data, such as occlusion, clutter, and poorly segmented inputs. At the core of ShapeR is a rectified flow transformer conditioned on sparse geometric, visual, and linguistic information, enabling high-fidelity metric 3D shape synthesis directly from noisy, unconstrained inputs (Siddiqui et al., 16 Jan 2026).

1. Input Modalities and Conditioning Pipeline

ShapeR conditions on a tri-modal set of input representations per object: sparse SLAM points, posed multi-view images with implicit object masks, and machine-generated text captions.

  • Sparse SLAM points are extracted using a visual-inertial SLAM pipeline (Direct Sparse Odometry, as in Project Aria) to obtain a metric semi-dense point cloud $P$ with per-frame visibility associations $P_{I^k}$. A 3D instance detector (EFM3D) predicts axis-aligned bounding boxes, and outlier points within each box are removed by SAM2. Each object's points $P_i \subset P$ are voxelized and passed through a 3D sparse-ConvNet, yielding a token stream $C_{\text{pts}}$.
  • Posed multi-view images $I_i = \{I_i^1,\dots,I_i^N\}$ are selected based on object visibility along the SLAM trajectory. Known extrinsic and intrinsic parameters enable projection of $P_i$ into each image frame to generate binary object masks $M_i^j$. Each image crop is processed by a frozen DINOv2 backbone to produce patch tokens, which are concatenated with Plücker-encoded ray directions. Mask features are fused with the DINO tokens to form $C_{\text{img}}$.
  • Vision-language captions are produced by prompting a frozen multimodal LLM (LLaMA 4, Meta 2025) to describe the object from a representative view, yielding $T_i$. These captions are embedded using frozen T5 and CLIP text encoders, providing $C_{\text{txt}}$.

The complete conditioning set for each object is

$$C_i = \bigl\{\,C_{\text{pts}},\;C_{\text{img}},\;C_{\text{txt}}\bigr\}.$$
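
To make the data flow concrete, the following minimal sketch assembles the three token streams into a single conditioning dictionary. The linear projections are lightweight placeholders for the sparse-ConvNet, DINOv2-plus-Plücker, and T5/CLIP pathways described above; the class name, feature dimensions, and tensor shapes are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class ConditioningEncoder(nn.Module):
    """Illustrative stand-in for ShapeR's tri-modal conditioning pipeline."""

    def __init__(self, d_model=512, dino_dim=768, text_dim=768):
        super().__init__()
        self.pts_proj = nn.Linear(3, d_model)             # placeholder for the 3D sparse-ConvNet
        self.img_proj = nn.Linear(dino_dim + 6, d_model)  # DINO patch features + 6-D Plücker rays
        self.txt_proj = nn.Linear(text_dim, d_model)      # placeholder for T5/CLIP caption embeddings

    def forward(self, object_points, patch_tokens, ray_dirs, caption_emb):
        # object_points: (B, N_pts, 3); patch_tokens: (B, N_patch, dino_dim);
        # ray_dirs: (B, N_patch, 6); caption_emb: (B, N_txt, text_dim)
        c_pts = self.pts_proj(object_points)
        c_img = self.img_proj(torch.cat([patch_tokens, ray_dirs], dim=-1))
        c_txt = self.txt_proj(caption_emb)
        return {"pts": c_pts, "img": c_img, "txt": c_txt}
```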

2. Rectified Flow Transformer Architecture

ShapeR’s shape generation pipeline consists of two principal components: a VecSet-based 3D VAE and a rectified flow-matching transformer.

  • 3D VAE (VecSet/Dora): The encoder $E$ samples uniform surface and edge-salient points from the target mesh $S$, applies cross-attention, downsampling, and self-attention, and outputs a latent sequence $z \sim q(z|S)$ with $z \in \mathbb{R}^{L \times d}$. The VAE decoder $D$ applies cross-attention from $z$ to arbitrary spatial queries $x \in \mathbb{R}^3$ to predict SDF values $s(x) = D(z, x)$. The VAE training loss is

$$\mathcal{L}_{\mathrm{VAE}} = \mathbb{E}_{x}\,\|s(x)-s_{GT}(x)\|^2 + \beta\,\mathrm{KL}\bigl(q(z|S)\,\|\,\mathcal{N}(0,I)\bigr).$$
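
As a minimal sketch of this objective (assuming a diagonal-Gaussian posterior parameterized by mean and log-variance, and SDF values already evaluated at a batch of query points; the function name and β value are assumptions):

```python
import torch

def vae_loss(pred_sdf, gt_sdf, mu, logvar, beta=1e-3):
    """L_VAE = E_x ||s(x) - s_GT(x)||^2 + beta * KL(q(z|S) || N(0, I)).

    pred_sdf, gt_sdf: (B, Q) SDF values at the query points x.
    mu, logvar:       (B, L, d) parameters of the latent posterior q(z|S).
    beta:             KL weight (the value here is an assumption).
    """
    recon = ((pred_sdf - gt_sdf) ** 2).mean()
    # Closed-form KL divergence between a diagonal Gaussian and the standard normal.
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()
    return recon + beta * kl
```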

  • Rectified Flow Matching: Generation is formulated as solving a latent ODE,

$$\dot z_t = f_\theta(z_t, t, C), \quad t \in [0,1],$$

where $z_1 \sim \mathcal{N}(0,I)$ and $z_0$ is the ground-truth mesh latent. The flow-matching loss is

$$\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t,\,z_0,\,z_1,\,C}\,\bigl\|\,f_\theta(z_t, t, C) - (z_0 - z_1)\,\bigr\|^2,$$

with $z_t = (1-t)\,z_0 + t\,z_1$.
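
A sketch of one training step under these equations is shown below; f_theta stands for the conditional transformer, and its call signature here is an assumption:

```python
import torch

def flow_matching_loss(f_theta, z0, C):
    """Rectified-flow objective L_FM for a batch of ground-truth latents z0.

    z0: (B, L, d) mesh latents from the VAE encoder; C: conditioning set.
    """
    z1 = torch.randn_like(z0)                             # noise endpoint z_1 ~ N(0, I)
    t = torch.rand(z0.shape[0], 1, 1, device=z0.device)   # one timestep per sample
    zt = (1 - t) * z0 + t * z1                            # z_t = (1 - t) z_0 + t z_1
    target = z0 - z1                                      # velocity target (z_0 - z_1)
    pred = f_theta(zt, t.flatten(), C)                    # predicted velocity
    return ((pred - target) ** 2).mean()
```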

  • Transformer Backbone: The architecture uses dual-stream cross-attention: the initial four layers attend to $C_{\text{txt}}$, subsequent layers attend to $C_{\text{img}}$, and the remaining single-stream layers jointly process $z$ and $C_{\text{pts}}$. Conditioning is modulated by a learned timestep embedding $\psi(t)$ and a CLIP embedding. No absolute positional embeddings are employed; a simplified layer-ordering sketch follows.
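
The sketch below illustrates only this layer ordering: plain multi-head attention stands in for the actual dual-stream DiT-style blocks, the CLIP-based modulation is omitted, and all layer counts, widths, and the timestep-embedding scheme are assumptions.

```python
import torch
import torch.nn as nn

class BackboneSketch(nn.Module):
    """Layer-ordering illustration only; not the paper's architecture."""

    def __init__(self, d_model=512, n_heads=8, n_txt=4, n_img=8, n_single=12):
        super().__init__()
        make_attn = lambda: nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.txt_layers = nn.ModuleList([make_attn() for _ in range(n_txt)])        # cross-attend to C_txt
        self.img_layers = nn.ModuleList([make_attn() for _ in range(n_img)])        # cross-attend to C_img
        self.single_layers = nn.ModuleList([make_attn() for _ in range(n_single)])  # self-attend over [z; C_pts]
        self.t_embed = nn.Linear(1, d_model)  # simplified stand-in for the learned timestep modulation psi(t)

    def forward(self, z, C, t):
        # z: (B, L, d) latent tokens; C: dict of conditioning streams; t: (B,) timesteps.
        z = z + self.t_embed(t.view(-1, 1, 1))               # timestep conditioning (CLIP modulation omitted)
        for layer in self.txt_layers:                        # first: text cross-attention
            z = z + layer(z, C["txt"], C["txt"])[0]
        for layer in self.img_layers:                        # then: image cross-attention
            z = z + layer(z, C["img"], C["img"])[0]
        h = torch.cat([z, C["pts"]], dim=1)                  # finally: single stream over latents + point tokens
        for layer in self.single_layers:
            h = h + layer(h, h, h)[0]
        return h[:, : z.shape[1]]                            # latent tokens serve as the velocity prediction
```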

3. Data Augmentation and Curriculum Training

ShapeR achieves its robustness through a two-stage training curriculum and extensive on-the-fly augmentation.

  • Stage 1 (object-centric pretraining): Utilizes approximately 600,000 artist-modeled meshes. Augmentations applied per sample include background compositing, occlusion overlays, visibility fog, motion blur, resolution degradation, and photometric jitter for images; point dropout, trajectory subsampling, Gaussian noise, and synthetic occlusions for SLAM points. All augmentations are applied online; a minimal sketch of the point-level augmentations follows this list.
  • Stage 2 (scene-centric fine-tuning): Operates on synthetic Aria Environments (SceneScript), capturing the complexities of real-world clutter, inter-object occlusions, and SLAM uncertainties. Fine-tuning on such crops narrows the domain gap between idealized datasets and genuinely casual scenes.
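
Below is the referenced sketch of the Stage 1 point-level augmentations (dropout, Gaussian noise, synthetic occlusion). The probabilities and magnitudes are illustrative assumptions, and trajectory subsampling is omitted.

```python
import numpy as np

def augment_slam_points(points, drop_prob=0.3, noise_std=0.01, occlude_prob=0.5):
    """On-the-fly augmentation of an object's sparse SLAM points.

    points: (N, 3) array in the normalized [-1, 1]^3 object frame.
    All probabilities and magnitudes here are illustrative assumptions.
    """
    pts = points.copy()

    # Random point dropout.
    keep = np.random.rand(len(pts)) > drop_prob
    pts = pts[keep]

    # Additive Gaussian noise.
    pts = pts + np.random.normal(scale=noise_std, size=pts.shape)

    # Synthetic occlusion: discard all points on one side of a random plane.
    if np.random.rand() < occlude_prob:
        normal = np.random.randn(3)
        normal /= np.linalg.norm(normal)
        offset = np.random.uniform(-0.3, 0.3)
        pts = pts[pts @ normal > offset]

    return pts
```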

This curriculum instills both strong global shape priors and the capacity for segmentation and completion in difficult sensing conditions.

4. Strategies for Handling Background Clutter

ShapeR addresses background clutter and occlusion without explicit mask supervision or dedicated loss terms for background rejection.

  • Implicit segmentation emerges via the joint encoding of $C_{\text{pts}}$ and the projected point masks $M_i$. The model learns to focus on the foreground object intrinsically.
  • Point-mask prompting involves supplementing DINO image features with mask tokens, guiding attention toward the object. Ablation experiments indicate that removing this feature increases confusion with nearby objects.
  • Robust pre-processing via 3D instance detection (EFM3D) and post-hoc SAM2 filtering allows isolation of objects in highly cluttered or occluded contexts.

The multimodal pipeline ensures that attention is directed toward valid object regions even in visually ambiguous scenes.
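
The point-mask prompting described above relies on projecting an object's SLAM points into each posed view. The routine below is a minimal sketch: a pinhole projection (no distortion model) rasterized into a binary mask with a small dilation; the function name and dilation radius are assumptions.

```python
import numpy as np

def project_point_mask(points_world, K, T_world_to_cam, image_hw, radius=4):
    """Rasterize an object's sparse SLAM points into a binary mask for one view.

    points_world:   (N, 3) object points P_i in world coordinates.
    K:              (3, 3) camera intrinsics.
    T_world_to_cam: (4, 4) world-to-camera extrinsics.
    radius:         per-point dilation in pixels (an assumption).
    """
    h, w = image_hw
    pts_h = np.concatenate([points_world, np.ones((len(points_world), 1))], axis=1)
    cam = (T_world_to_cam @ pts_h.T).T[:, :3]
    cam = cam[cam[:, 2] > 0]                      # keep points in front of the camera
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                   # perspective divide

    mask = np.zeros((h, w), dtype=bool)
    for u, v in uv:
        u, v = int(round(u)), int(round(v))
        if 0 <= u < w and 0 <= v < h:
            mask[max(0, v - radius): v + radius + 1,
                 max(0, u - radius): u + radius + 1] = True
    return mask
```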

5. Evaluation Benchmark and Quantitative Results

A new “in-the-wild” evaluation benchmark is introduced, featuring 178 manually annotated indoor objects (furniture, appliances, tools, etc.) spanning seven Project Aria sequences. The annotation process aligns an idealized mesh acquired ex situ with its original scene via 2D silhouettes and SLAM points, yielding ground truth suitable for metric evaluation.

Evaluation employs three principal metrics, computed in normalized space (a minimal computation sketch follows the list):

  • Chamfer $\ell_2$ distance (CD, lower is better)
  • Normal consistency (NC, higher is better)
  • F₁ score at a 1% threshold (higher is better)
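
Chamfer distance and the F₁ score can be computed from dense surface samples with nearest-neighbor queries, as in the sketch below (normal consistency additionally requires per-point normals and is omitted). The sampling protocol and threshold convention here are assumptions rather than the benchmark's exact recipe.

```python
import numpy as np
from scipy.spatial import cKDTree

def shape_metrics(pred_pts, gt_pts, f1_threshold=0.01):
    """Chamfer l2 distance and F1 score between predicted and ground-truth surface samples.

    pred_pts, gt_pts: (N, 3) and (M, 3) points sampled in the normalized evaluation space.
    """
    d_pred_to_gt, _ = cKDTree(gt_pts).query(pred_pts)   # nearest-neighbor distances, prediction -> GT
    d_gt_to_pred, _ = cKDTree(pred_pts).query(gt_pts)   # and GT -> prediction

    # Symmetric Chamfer distance over squared distances (convention is an assumption).
    chamfer = (d_pred_to_gt ** 2).mean() + (d_gt_to_pred ** 2).mean()

    precision = (d_pred_to_gt < f1_threshold).mean()
    recall = (d_gt_to_pred < f1_threshold).mean()
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return chamfer, f1
```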

Empirical results demonstrate the following:

| Method | Chamfer CD (×10⁻²) | NC | F₁ (1%) |
|---|---|---|---|
| EFM3D (scene fusion) | 13.82 | – | – |
| FoundationStereo fusion | 6.48 | – | – |
| LIRM (segmentation-based) | 8.05 | – | – |
| DP-Recon | 8.36 | – | – |
| ShapeR | 2.375 | 0.810 | 0.722 |

ShapeR outperforms existing multi-view and segmentation-based baselines by approximately 2.7× in Chamfer distance. In user studies (Table 2 of the paper), its reconstructions are preferred at rates of ≈85–88% over leading image-to-3D models, without reliance on interactive masks.

6. Ablation Studies and Component Analysis

Ablation experiments isolate the impact of each principal component by measuring Chamfer distance ($\times 10^{-2}$):

| Component Removed | Chamfer CD |
|---|---|
| None (full model) | 2.375 |
| No SLAM points | 4.514 |
| No point augmentations | 3.276 |
| No image augmentations | 3.397 |
| No two-stage training | 3.053 |
| No 2D point-mask prompt | 2.568 |

SLAM points act as a global geometric anchor, while both image and point augmentations are critical for noise robustness. Scene-level fine-tuning reduces domain adaptation challenges, and point-mask prompting sharpens focus on the correct object under occlusion or clutter.

7. Inference Process and Synthesis Pipeline

The per-object synthesis pipeline can be summarized in the following pseudocode (helper names are schematic):

P_i, I_i, Pi_i, M_i, T_i = SLAM_extract(sequence)   # per-object SLAM points, views, poses, masks, caption
C_i = encode_conditions(P_i, I_i, Pi_i, M_i, T_i)   # conditioning set {C_pts, C_img, C_txt}
normalize_points(P_i)                               # to [-1, 1]^3

z = sample_gaussian()                               # initial noise latent z_1 ~ N(0, I)
for t in arange(1.0, 0.0, -dt):                     # integrate the rectified-flow ODE from t = 1 to t = 0
    z = z + dt * f_theta(z, t, C_i)

sdf_grid = D(z)                                     # query the VAE decoder on a dense 3D grid
S_i = marching_cubes(sdf_grid)
rescale_mesh(S_i, P_i)                              # back to metric scene coordinates

This streamlined workflow generates metric-accurate 3D meshes directly aligned to the real-world coordinate frame, without explicit segmentation or post-hoc mesh refinement.


ShapeR’s integration of metric SLAM anchors, multimodal image and caption conditioning, a flow-matching transformer, extensive multimodal augmentations, and a two-stage curriculum yields a system capable of predictable, high-fidelity 3D object generation from casual scene captures—demonstrably surpassing prior approaches in both quantitative and user study evaluations (Siddiqui et al., 16 Jan 2026).
