ShapeR: Robust Conditional 3D Shape Generation from Casual Captures

Published 16 Jan 2026 in cs.CV and cs.LG | (2601.11514v1)

Abstract: Recent advances in 3D shape generation have achieved impressive results, but most existing methods rely on clean, unoccluded, and well-segmented inputs. Such conditions are rarely met in real-world scenarios. We present ShapeR, a novel approach for conditional 3D object shape generation from casually captured sequences. Given an image sequence, we leverage off-the-shelf visual-inertial SLAM, 3D detection algorithms, and vision-language models to extract, for each object, a set of sparse SLAM points, posed multi-view images, and machine-generated captions. A rectified flow transformer trained to effectively condition on these modalities then generates high-fidelity metric 3D shapes. To ensure robustness to the challenges of casually captured data, we employ a range of techniques including on-the-fly compositional augmentations, a curriculum training scheme spanning object- and scene-level datasets, and strategies to handle background clutter. Additionally, we introduce a new evaluation benchmark comprising 178 in-the-wild objects across 7 real-world scenes with geometry annotations. Experiments show that ShapeR significantly outperforms existing approaches in this challenging setting, achieving an improvement of 2.7x in Chamfer distance compared to state of the art.

Summary

  • The paper introduces a large-scale rectified flow model for robust, metric, object-level 3D reconstruction from casual, multi-view captures.
  • It fuses geometric cues from SLAM points, multi-view images, camera poses, and language captions to denoise latent representations within a VAE framework.
  • Extensive data augmentation and two-stage training, supported by a novel evaluation dataset, demonstrate its superior performance in challenging scenarios.

ShapeR: Robust Conditional 3D Shape Generation from Casual Captures

Motivation and Context

Existing 3D generative models achieve high-fidelity shape reconstruction from clean, segmented inputs, but their reliability degrades significantly in real-world settings with occlusions, clutter, poor viewpoints, and sensor noise. These conditions are typical of human-captured sequences recorded without careful scanning protocols. Scene-level models often deliver incomplete object geometry under occlusion, while object-level methods, especially those relying on off-the-shelf segmentation, struggle to maintain fidelity and consistency when presented with real, casual, in-the-wild image sequences (Figure 1).

Figure 1: Scene examples reveal the major challenges casual captures introduce—clutter, occlusion, low resolution, noise, and ambiguous object boundaries all degrade existing SoTA model performance, while ShapeR reconstructs robustly.

ShapeR is introduced as a large-scale rectified flow model explicitly designed for metrically accurate, robust, and automatic object-level 3D reconstruction from casually captured, posed multi-view sequences. The method targets the core limitations of current approaches by deeply integrating geometric signals (sparse metric SLAM points), multi-view imagery, and multimodal cues (captions), and by specifically augmenting and training to reject degradations in input quality.

Methodology

ShapeR’s backbone is a multimodally conditioned rectified flow matching network operating in the latent space of a variational autoencoder (VAE) on 3D geometry. Input sequences are preprocessed using visual-inertial SLAM and 3D object detection to extract sparse metric point clouds, object-centric local image crops, camera poses, and vision-language captions per instance.

The denoising architecture is built atop the FLUX DiT paradigm, handling latent “VecSets” representing each shape. It conditions explicitly and jointly on:

  • SLAM point tokens (encoded via a 3D ResNet),
  • multi-view images (processed by a frozen DINOv2 vision backbone),
  • camera poses (Plücker encoding; see the sketch after this list),
  • projection masks (2D convolutional encoding of object point projections into the images),
  • and captions (T5 and CLIP embeddings).
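
To make two of these encodings concrete, the sketch below (not the authors' code) shows how per-pixel Plücker ray coordinates can be computed for a posed pinhole camera, and how an object's sparse SLAM points can be projected into an image to form a coarse 2D point mask. The function names, the pinhole camera model, and the array shapes are illustrative assumptions.

```python
import numpy as np

def plucker_rays(K, R, t, H, W):
    """Per-pixel Plucker coordinates (d, o x d) for a pinhole camera.

    K: 3x3 intrinsics; R, t: world-to-camera rotation and translation.
    Returns an (H, W, 6) ray map, a common 6D encoding for pose conditioning.
    """
    cam_center = -R.T @ t                                   # camera origin in world frame
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)        # homogeneous pixel coords (H, W, 3)
    dirs = pix @ np.linalg.inv(K).T @ R                     # back-project, rotate into world frame
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    moment = np.cross(np.broadcast_to(cam_center, dirs.shape), dirs)
    return np.concatenate([dirs, moment], axis=-1)          # (H, W, 6)

def point_projection_mask(points_w, K, R, t, H, W):
    """Splat an object's sparse SLAM points into a binary image mask.

    The mask acts as a rough localization prompt instead of a segmentation mask.
    """
    cam = points_w @ R.T + t                                # world -> camera coordinates
    cam = cam[cam[:, 2] > 1e-6]                             # keep points in front of the camera
    uv = cam @ K.T
    uv = uv[:, :2] / uv[:, 2:3]                             # perspective divide
    mask = np.zeros((H, W), dtype=np.float32)
    cols = np.clip(uv[:, 0].astype(int), 0, W - 1)
    rows = np.clip(uv[:, 1].astype(int), 0, H - 1)
    mask[rows, cols] = 1.0
    return mask
```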

The transformer fuses all modalities and robustly denoises Gaussian-noise-perturbed latents toward the true instance’s latent code; each denoised latent sequence is then decoded into a signed-distance field (SDF), from which a mesh is extracted via marching cubes (Figure 2).

Figure 2: The model architecture combines posed images, SLAM points, captions, and projected 2D masks to robustly condition SDF mesh denoising and decoding.
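
The following is a minimal sketch of the sampling procedure described above, assuming a velocity-prediction transformer `v_theta` and a VAE decoder `D` with the interfaces shown; the Euler step count, latent dimensions, and grid resolution are illustrative choices, not the paper's settings.

```python
import torch
from skimage.measure import marching_cubes  # standard SDF -> mesh extraction

@torch.no_grad()
def sample_shape(v_theta, D, cond_tokens, latent_len=512, latent_dim=64,
                 num_steps=50, grid_res=128, device="cuda"):
    """Rectified-flow sampling of a latent VecSet, then SDF decoding to a mesh.

    Assumed interfaces (illustrative, not the paper's API):
      v_theta(z_t, t, cond_tokens) -> predicted velocity, same shape as z_t
      D(z, query_points)           -> signed distances at the query points
    """
    # Integrate the learned velocity field from t=1 (pure noise) to t=0 (data)
    # with simple Euler steps along the (approximately) straight flow.
    z = torch.randn(1, latent_len, latent_dim, device=device)
    ts = torch.linspace(1.0, 0.0, num_steps + 1, device=device)
    for i in range(num_steps):
        t, t_next = ts[i], ts[i + 1]
        v = v_theta(z, t, cond_tokens)
        z = z + (t_next - t) * v

    # Query the decoder on a dense grid inside the normalized [-1, 1]^3 cube.
    lin = torch.linspace(-1.0, 1.0, grid_res, device=device)
    grid = torch.stack(torch.meshgrid(lin, lin, lin, indexing="ij"), dim=-1)
    sdf = D(z, grid.reshape(1, -1, 3)).reshape(grid_res, grid_res, grid_res)

    # Extract the zero level set as a triangle mesh.
    spacing = (2.0 / (grid_res - 1),) * 3
    verts, faces, _, _ = marching_cubes(sdf.cpu().numpy(), level=0.0, spacing=spacing)
    verts = verts - 1.0   # shift from [0, 2] back to the [-1, 1]^3 cube
    return verts, faces
```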

Conditioning on the SLAM point cloud supplies aggregated geometric priors for each object, which are critical for accurate metric scale, for disambiguation under noise, and for compensating for the weak or ambiguous object boundaries often present in the images (Figure 3).

Figure 3: Explicit conditioning on SLAM point clouds enhances reconstruction robustness and captures persistent shape cues unavailable in images alone.

Data Augmentation and Training

Training proceeds in two major stages. First, pretraining uses 600k+ synthetic object meshes with heavy compositional and modality-specific augmentations that simulate clutter, occlusion, masking errors, and degradations across all modalities, deliberately targeting the corruptions observed in natural captures. The second stage fine-tunes on synthetic scene crops (from Aria Synthetic Environments) that increase realism through occlusion, point-cloud sparsity and noise, and inter-object interactions (Figure 4).

Figure 4: The training regimen: aggressive augmentation for pretraining (left) and scene-level, occlusion-rich crops for fine-tuning (right).
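
As an illustration of the kind of on-the-fly corruption described above (not the authors' exact augmentations), the sketch below combines background compositing, random occlusion overlays, and SLAM-point dropout with noise, the components the ablations later isolate; all probabilities and magnitudes are placeholder values.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_image(obj_rgb, obj_alpha, background, occluder=None):
    """Composite an object crop onto a cluttered background, optionally paste a
    random occluder on top, and add mild photometric noise.

    obj_rgb: (H, W, 3), obj_alpha: (H, W) in [0, 1];
    background/occluder: RGB arrays (background assumed at least H x W).
    """
    H, W, _ = obj_rgb.shape
    bg = background[:H, :W].astype(np.float32)
    img = obj_alpha[..., None] * obj_rgb + (1.0 - obj_alpha[..., None]) * bg
    if occluder is not None and rng.random() < 0.5:
        oh, ow = occluder.shape[:2]
        y = int(rng.integers(0, max(1, H - oh + 1)))
        x = int(rng.integers(0, max(1, W - ow + 1)))
        patch = occluder[: H - y, : W - x]
        img[y:y + patch.shape[0], x:x + patch.shape[1]] = patch   # hard occlusion overlay
    img = np.clip(img + rng.normal(0.0, 5.0, img.shape), 0, 255)  # sensor-noise stand-in
    return img.astype(np.uint8)

def augment_points(points, drop_prob=0.3, sigma=0.01):
    """Randomly drop SLAM points and jitter the survivors to mimic sparse,
    noisy casual-capture reconstructions."""
    keep = rng.random(len(points)) > drop_prob
    pts = points[keep]
    return pts + rng.normal(0.0, sigma, pts.shape)
```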

The inference pipeline is fully automatic. Per-object conditioning sets are constructed by associating SLAM points with views and generating point projections; sequences are then processed without explicit image segmentation or manual intervention. Final outputs are made metric by rescaling with the detected point set’s real-world bounds.
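
A minimal sketch of this metric rescaling step, under the assumption (consistent with the NDC normalization and the Rescale formula quoted in the glossary) that each object is generated in the normalized [-1, 1]^3 cube and mapped back to metric space using its detected point set's bounds; the helper names are illustrative.

```python
import numpy as np

def ndc_normalize(points):
    """Map an object's metric SLAM points into the [-1, 1]^3 cube and return
    the transform so generated geometry can be mapped back to metric space."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    center = 0.5 * (lo + hi)
    scale = 0.5 * float((hi - lo).max())     # isotropic scale preserves aspect ratio
    return (points - center) / scale, center, scale

def rescale_mesh(verts_ndc, center, scale):
    """Inverse transform: bring generated vertices from the normalized cube back
    to real-world metric coordinates (the Rescale(., P_i) step)."""
    return verts_ndc * scale + center
```

Normalizing with a single isotropic scale keeps the object's aspect ratio, so applying the stored (center, scale) to the marching-cubes vertices restores true metric size.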

Empirical Evaluation and Benchmarks

A key contribution is the ShapeR Evaluation Dataset: 178 annotated 3D object shapes across 7 scenes, captured casually with SLAM point clouds and ground-truth rigidly aligned meshes, filling a gap between studio control and wild-scene incompleteness.

Quantitative Metrics: Core geometry metrics (normalized Chamfer $\ell_2$ distance, normal consistency, and F1 at a 1% threshold) are compared across nine baselines, including EFM3D, FoundationStereo, LIRM, and DP-Recon. Foundation image-to-3D models (TripoSG, Direct3DS2, Hunyuan3D-2.0, Amodal3R) are also evaluated, but always with high-quality mask and view selection as a fair upper-bound comparison (Figure 5).

Figure 5: Qualitative comparisons show ShapeR reconstructs finely detailed, metrically accurate object shapes in the presence of occlusion and background clutter where other methods either fail or hallucinate interpolations.
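
For reference, here is a minimal sketch of the core geometry metrics reported above, computed on points sampled from the predicted and ground-truth surfaces; the exact sampling, normalization, and the meaning of the 1% threshold (taken here as tau = 0.01 in normalized coordinates) follow common practice and may differ from the paper's protocol.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_and_f1(pred_pts, gt_pts, tau=0.01):
    """Symmetric Chamfer-L2 distance and F1 score at threshold tau between point
    sets sampled from the predicted and reference surfaces (both (N, 3) arrays)."""
    d_pred_to_gt, _ = cKDTree(gt_pts).query(pred_pts)    # nearest-GT distance per predicted point
    d_gt_to_pred, _ = cKDTree(pred_pts).query(gt_pts)    # nearest-prediction distance per GT point

    chamfer = np.mean(d_pred_to_gt ** 2) + np.mean(d_gt_to_pred ** 2)

    precision = np.mean(d_pred_to_gt < tau)              # predicted surface close to GT
    recall = np.mean(d_gt_to_pred < tau)                 # GT surface covered by prediction
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return chamfer, f1
```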

User Study: Human evaluators consistently prefer ShapeR’s reconstructions (>80% win rate over leading image-to-3D methods), citing its consistency, metric accuracy, and completeness (Figure 6).

Figure 6: Foundation models can produce reasonable instance geometry when manually curated, but ShapeR operates end-to-end, robustly separating and reconstructing objects without any manual masking or view selection.

ShapeR also excels at reconstructing complex scene arrangements, automatically producing consistent metric layouts without interactive segmentation, surpassing MIDI3D and SceneGen in both utility and physical plausibility (Figure 7).

Figure 7: Scene-level assembly: ShapeR’s metric outputs remain consistent across objects and layouts, whereas image-to-scene approaches frequently fail in clutter.

Ablation Studies

Ablations reveal the importance of each pipeline component. SLAM points significantly improve metric accuracy and robustness compared to image-only conditioning. On-the-fly modality augmentations (especially point-cloud and image corruptions) are vital for downstream generalization. Removing the two-stage curriculum, background compositing, or 2D point-mask prompting all degrade performance, confirming that each is necessary for challenging, casual scenes (Figure 8).

Figure 8: Augmentation and conditioning studies demonstrate improvements in robustness, avoidance of overfitting to incomplete point observations, and reductions in object confusion.

Dataset Construction and Limitations

The ShapeR Evaluation Dataset is carefully constructed via both casual, in-situ capture and controlled meshing to produce complete, metrically valid ground-truth geometry, enabling fair quantitative and qualitative evaluation against prior work. Comparison with prior datasets explicitly demonstrates the added complexity of wild-capture scenarios and the corresponding need for robust reconstruction (Figure 9).

Figure 9: Comparison demonstrates that existing datasets either lack realism or object completeness, whereas ShapeR’s dataset fills this critical evaluation gap.

Figure 10: Distribution and examples of object annotations in the ShapeR benchmark; coverage spans categories, sizes, and occlusion scenarios.

Ground-truth meshes are created by temporarily moving each object for high-quality imaging, performing image-to-3D modeling, and then reinserting, verifying, and refining the meshes in the original sequence, which supports reliable geometric and pose alignment (Figure 11).

Figure 11: Pseudo-ground truth mesh creation: isolate the object (left), reconstruct in ideal conditions (mid), then validate and realign in the original context (right).

Further experiments indicate the metric stability of ShapeR on other datasets, resilience to increasingly casual capture configurations, and flexibility to operate with monocular reconstructions using external metric estimators such as MapAnything.

Failure Cases and Limitations

ShapeR’s limitations are primarily failures under severe image degradation or highly limited viewpoints (yielding rough or incomplete shapes), difficulty cleanly separating objects that are physically stacked or attached, and propagation of upstream 3D detection errors, since an object that is not detected cannot be reconstructed (Figure 12).

Figure 12: Failure cases in limited-fidelity input, adjoining object contamination, and dependence on accurate detection.

Implications and Future Directions

ShapeR advances large-scale, multimodal conditional 3D generative modeling by demonstrating that robust, metric, per-object geometry is achievable from highly imperfect, casual inputs. The architecture serves as a unified solution bridging generative per-object 3D modeling and metric scene reconstruction, with the robustness to act as the core of scalable scene understanding pipelines. By relying on SLAM points instead of segmentation masks, it greatly reduces human interaction and annotation requirements.

Practically, ShapeR enables AR/VR and robotics pipelines to automatically produce trustworthy geometry in unconstrained environments, driving progress toward real-time, reliable 3D digital twins. Theoretically, ShapeR’s results suggest the value of tightly integrating geometric signals and vision-language models in 3D generation, as well as the power of aggressive augmentation and curriculum-based training regimens. Future work may address upstream detection bottlenecks, strengthen instance separation, and extend to fully scene-wide interaction and relighting-ready assets.

Conclusion

ShapeR establishes a new state-of-the-art for automatic, metric, object-level 3D reconstruction in casual, real-world scenes, leveraging multimodal conditioning and aggressive, curriculum-driven training to overcome segmentation and input degradations. Through both the algorithm and the benchmark dataset, ShapeR provides a foundation for robust, large-scale, multimodal 3D generative modeling and sets an empirical standard for practical deployment in unconstrained environments (2601.11514).

Explain it Like I'm 14

Overview

This paper introduces ShapeR, a computer system that can build accurate 3D models of everyday objects from short, casual phone videos. Unlike many previous methods that only work with clean, studio-like images or perfectly cut-out objects, ShapeR is designed for real-life, messy situations with clutter, occlusions, motion blur, and weird camera angles. It also keeps “metric” scale, meaning the size of the 3D model matches the real-world size.

Key Objectives

The paper sets out to answer simple but important questions:

  • How can we turn a casual video of a scene (like a phone walk-around) into complete, true-to-scale 3D shapes of the individual objects in it?
  • Can we make this work even when objects are partly hidden, the camera moves unpredictably, and the background is messy?
  • Can we avoid requiring perfect object cut-outs in every image and still get great results?

How ShapeR Works (Methods)

Think of ShapeR as a smart tool that combines multiple clues to figure out an object’s 3D shape—like a detective using photos, dots, and labels to reconstruct a model.

The clues ShapeR uses

To make sense of a casual video, ShapeR gathers these inputs:

  • Posed images: frames from the video where the system knows where the camera was for each shot.
  • Sparse 3D points: a “connect-the-dots” set of points in space found by SLAM (software that tracks the camera and maps some 3D points from the video).
  • Captions: short text descriptions of the object (like “a red toaster”) generated by a vision-language model.
  • Point projections: for the images it picks, it projects those 3D points onto the 2D pictures to roughly indicate where the object is in each view (no perfect cut-outs needed).

Together, these form a “multimodal” set of clues: images, 3D dots, and text.

The core idea, in everyday terms

  • Shape code: ShapeR learns a compact “code” that describes a 3D shape (imagine a super-compressed blueprint).
  • From noise to shape: It starts from random noise and learns how to “flow” that noise into the correct shape code by following small “arrows” that point in the right direction. This is called a rectified flow model. You can think of it as sculpting a statue from a messy clay blob using guidance from the images, dots, and caption.
  • Decoding the shape: Once the code is ready, a decoder turns it into a full 3D mesh (the surface of the object). The mesh is then scaled back to match the object’s real size using the 3D points.

Training to handle messy videos

To make ShapeR tough and reliable, the authors train it in two stages and add lots of realistic “noise” during training:

  • Stage 1: Pretraining on a huge variety of individual 3D objects. They deliberately mess up the inputs during training (add clutter, blur, low resolution, fake occlusions, point noise) so the model learns to be robust.
  • Stage 2: Fine-tuning on more realistic synthetic scenes where multiple objects interact and occlude each other, better matching real-life videos.

This “curriculum” (easy first, then more realistic) helps ShapeR generalize to real-world captures.

A few helpful translations of terms

  • SLAM: A method that figures out where the camera is and maps points in 3D, like leaving breadcrumb dots in space as you walk around.
  • Sparse point cloud: Those breadcrumb dots—spread out, not dense, but still helpful.
  • Metric: True-to-scale; a 1-meter table in real life is 1 meter in the 3D model.
  • Mesh: The surface of the 3D object (like a net of tiny triangles).
  • Flow model: A process that transforms noise into a meaningful shape code step-by-step, guided by the clues.

Main Findings

The authors built a new evaluation set with 178 objects across 7 real, cluttered scenes (images, camera poses, SLAM points, and high-quality shape references). On this tough benchmark:

  • ShapeR significantly outperforms previous state-of-the-art methods in how close the reconstructed shape is to the ground truth, improving Chamfer distance by about 2.7× (lower is better; it measures how close two shapes’ surfaces are).
  • It works even when:
    • Objects are partly hidden or mixed in with other stuff,
    • The views are poor,
    • The image quality isn’t great, and
    • No perfect object cut-outs are provided.
  • It keeps real-world scale, so objects are the right size and fit together consistently in the scene.
  • In user preferences against strong single-image 3D methods, people chose ShapeR’s results far more often, even though ShapeR runs automatically without manual help.

Why this matters: Many existing methods break down in real scenes because they depend on clean, segmented images or don’t preserve real scale. ShapeR solves both problems by using multiple clues and robust training.

Implications and Impact

  • Practical 3D scanning: ShapeR brings us closer to simply walking around with a phone and getting reliable, to-scale 3D models of all objects in a room—no studio setup or manual tracing needed.
  • Better AR/VR and robotics: Knowing accurate object shapes and sizes helps virtual experiences feel realistic and helps robots understand and navigate real spaces.
  • Scene editing and asset creation: Designers and developers can extract high-quality, metrically accurate 3D assets from everyday videos.
  • Research and benchmarking: The authors provide a new, realistic dataset and plan to release code and models, which can accelerate progress in object-centric 3D reconstruction.

In short, ShapeR is a big step toward automatic, reliable, and true-to-scale 3D object reconstruction from the kind of casual videos people actually capture in the real world.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

The following points identify what is missing, uncertain, or left unexplored in the paper, framed as concrete, actionable directions for future work:

  • Quantitative metric grounding: validate “metric consistency” by reporting absolute scale and pose errors in physical units (e.g., mm/cm) rather than only shape metrics computed in normalized coordinates; add scene-level metrics (e.g., contact alignment, collision rate, scale drift).
  • Scene composition evaluation: provide quantitative evaluation of assembled scenes (object-to-object and object-to-support relationships), including spatial consistency, collisions, and alignment with background surfaces.
  • Modality contribution analysis: ablate text captions, pose encoding (Plücker rays), and image features (frozen DINOv2) to quantify each modality’s incremental benefit; test caption removal and random/incorrect captions to assess sensitivity.
  • Dependence on 3D detection and SAM2 refinement: measure pipeline robustness to detection recall/precision errors and SAM2 failures; evaluate end-to-end variants that reduce or remove reliance on external instance detectors and mask-based point filtering.
  • SLAM dependence and sensor diversity: test across different SLAM/SfM algorithms and sensors (e.g., smartphone monocular/RGB-only, stereo, different IMU qualities); quantify performance degradation under varying pose accuracy and point cloud sparsity.
  • Sensitivity to capture conditions: systematically study the effect of number of views, view selection strategy, image resolution, motion blur, and persistent occlusions on reconstruction quality.
  • Robustness to sparse or unreliable points: characterize performance as a function of SLAM point density, distribution, and noise; define minimum visibility/point thresholds for acceptable reconstructions.
  • Dynamic and non-rigid objects: extend and evaluate on moving, articulated, and deformable objects to understand limitations of the current static, rigid assumption.
  • Topology and thin-structure fidelity: measure watertightness, topological correctness, and recovery of thin parts (e.g., handles, cables); report SDF decoding resolution and its effect on topology.
  • Uncertainty estimation: add shape uncertainty/confidence maps or probabilistic outputs to flag regions reconstructed mostly from priors versus observed evidence.
  • Caption quality and semantics: evaluate the impact of caption accuracy, granularity, and noise; analyze failure modes when captions contradict visual/point signals or are generic.
  • Generalization beyond Aria captures: benchmark on public, non-Aria datasets (smartphone casual videos, outdoor scenes, varied lighting) to test cross-domain robustness.
  • Dataset scale and diversity: expand beyond 178 objects in 7 scenes to cover more categories (including reflective/translucent/textureless objects), environments (outdoors, industrial), and extreme occlusion/clutter cases.
  • Ground-truth provenance and bias: quantify accuracy of “reference meshes” created with internal methods; report alignment and refinement procedures and their residual error to avoid benchmark bias.
  • Fairness in baselines: compare against multi-view baselines under matched conditions (posed views, realistic segmentation/no segmentation) and report performance when GT-based cropping is not available.
  • Runtime and scalability: report training and inference time/memory, throughput per scene, and scaling behavior with number of objects; characterize practicality for large scenes and on-device deployment.
  • Latent length policy: study how variable VecSet length L affects quality, efficiency, and failure modes; develop criteria or predictors to select L per object.
  • Failure mode analysis: provide systematic taxonomy and quantitative rates (e.g., adjacent-object confusion, over-completion, missing occluded regions), and correlate failures with input conditions.
  • Augmentation efficacy breakdown: quantify the individual contribution of augmentation components (background compositing, occlusion overlays, SLAM point dropout/noise) with targeted ablations and controlled stress tests.
  • Pose error robustness: inject controlled extrinsic/intrinsic noise during evaluation to measure tolerance and potential correction mechanisms (e.g., joint pose-shape refinement).
  • Scene-level integration: investigate joint reconstruction across objects (shared context priors, global consistency constraints) versus purely per-object generation, and quantify benefits/costs.
  • Materials and appearance: extend from geometry-only to materials/texture and evaluate how multimodal conditioning supports appearance reconstruction under casual capture.
  • End-to-end learnability: explore training the detector/point-refinement modules jointly with the generative model to reduce brittleness and improve object localization under clutter.
  • Ethical, privacy, and safety considerations: assess implications of reconstructing personal objects from casual captures and propose safeguards (e.g., on-device processing, anonymization).

Practical Applications

Immediate Applications

Below are concrete use cases that can be deployed today, leveraging ShapeR’s ability to produce metrically accurate, complete object meshes from casually captured, posed multi-view sequences without explicit segmentation.

  • E-commerce and Retail (software, logistics)
    • Auto-generation of 3D product models for marketplace listings from short phone videos; includes real-world scale for AR try-on and accurate shipping dimension estimation.
    • Assumptions/dependencies: requires posed multiview capture (ARKit/ARCore or SfM/SLAM), sufficient views with object visibility, GPU or cloud inference; current model outputs geometry (texture workflows may need an additional pipeline).
  • Interior Design and Real Estate (AR/VR, construction)
    • Room scans decomposed into per-object meshes (sofas, tables, appliances) with consistent metric scale for space planning, virtual staging, and fit checks.
    • Assumptions/dependencies: reliable 3D instance detection on cluttered scenes, minimal object motion during capture, integration with CAD/BIM tools for downstream layout.
  • Robotics Perception Modules (robotics, software)
    • Object-centric mapping for manipulation planning: robots build accurate meshes of target objects from egocentric video, improving grasp planning and collision avoidance.
    • Assumptions/dependencies: egocentric SLAM availability (e.g., VIO), inference latency acceptable for offline/near-real-time use, scene dynamics limited during capture.
  • Insurance and Claims (finance, risk)
    • Rapid household or asset inventories with metrically accurate 3D meshes to document condition and dimensions; supports claim verification and fraud reduction.
    • Assumptions/dependencies: consent and privacy controls for in-home capture, cloud processing governance, texture capture optional but helpful for damage documentation.
  • Manufacturing and QA (industry, CAD)
    • Quick geometry checks: reconstruct parts or assemblies from casual captures and compare meshes to CAD for dimensional deviations and fit assessments.
    • Assumptions/dependencies: adequate view coverage, sufficient image resolution for small details, mesh-to-CAD alignment workflow; texture/material not required for geometry checks.
  • Cultural Heritage and Museums (heritage, education)
    • Digitization of stored collections in cluttered environments without controlled photogrammetry setups; object-centric meshes assembled into scene layouts.
    • Assumptions/dependencies: multi-view access, stable lighting or robust augmentations, export formats compatible with archival pipelines.
  • Warehouse and Field Asset Cataloging (logistics, energy)
    • Object-level digital twins of tools, components, and panels from casual walkthroughs; supports inventory tracking, maintenance planning, and spatial audits.
    • Assumptions/dependencies: repeatable capture routes, object motion limited, integration with asset management systems for IDs/metadata.
  • Game/Film Prop Acquisition (media, software)
    • Rapid mesh acquisition of real props from cluttered sets; reduces manual segmentation and clean-scene requirements in content pipelines.
    • Assumptions/dependencies: pipeline-ready mesh export (OBJ/GLB), optional decimation/retopology steps, texture capture handled by existing tools.
  • Education and STEM Labs (education)
    • Classroom labs where students scan everyday objects to study geometry, scale, and spatial reasoning; supports project-based learning in 3D content creation.
    • Assumptions/dependencies: mobile capture with ARKit/ARCore, institutional cloud/GPU access; minimal need for high-end textures.
  • Research Benchmarking (academia)
    • The ShapeR Evaluation Dataset enables systematic testing of object-centric reconstruction under occlusions and clutter; facilitates ablation studies and method comparisons.
    • Assumptions/dependencies: dataset licensing and availability, code and weights release; alignment with common metrics (Chamfer, F-score, normals).
  • Developer Tools and Workflows (software)
    • “Video-to-Mesh” microservice API and Unity/Unreal plugins that take short multiview captures and return per-object meshes with real-world scale; optional scene composer for assembling objects.
    • Assumptions/dependencies: SDK integration with SLAM/VIO, cloud inference endpoints, standard 3D formats, scene coordinate normalization and rescaling.

Long-Term Applications

These ideas require additional research, scaling, optimization, or productization (e.g., real-time performance, broader device support, texturing/materials, dynamic scenes), but are realistic extensions of ShapeR’s methods and findings.

  • Real-Time On-Device Reconstruction (software, mobile, robotics)
    • Interactive object capture on phones/headsets where meshes update live as users move, enabling immediate AR placement and measurement.
    • Assumptions/dependencies: model distillation/quantization, hardware acceleration (Neural Engines/NPUs), efficient SLAM and detection on-device.
  • Full Scene Understanding and Assembly (AR/VR, robotics, construction)
    • End-to-end scene reconstruction that composes object meshes and layout with semantic labels, ready for digital twin creation and robot task planning.
    • Assumptions/dependencies: robust scene graph inference, improved open-vocabulary detection, temporal consistency, layout optimization across objects.
  • Textured, Material-Aware Assets (media, e-commerce)
    • Coupling ShapeR’s geometry with PBR materials and textures for production-ready assets; supports lighting and relightable content.
    • Assumptions/dependencies: integrated appearance modeling, consistent UVs/retopology, multi-view reflectance estimation.
  • Dynamic Objects and 4D Reconstruction (robotics, media)
    • Handling articulated and deformable objects over time (e.g., doors, tools, soft goods), enabling temporal meshes for manipulation and animation.
    • Assumptions/dependencies: motion segmentation, temporal shape priors, robust tracking under occlusions.
  • AR Occlusion and Physics-Ready World Models (AR/VR, robotics)
    • Accurate object meshes with collision properties and physical parameters for realistic AR occlusion and robot simulation.
    • Assumptions/dependencies: material inference for mass/rigidity, scene-scale calibration, physics engine integration.
  • Construction/BIM Automation (construction, policy)
    • Automatic as-built extraction of furniture, appliances, and fixtures from site walkthroughs; reconciles with BIM to track progress and compliance.
    • Assumptions/dependencies: large-object capture at scale, alignment to building coordinates, change detection, BIM interoperability standards.
  • Smart Home Inventory and Valuation (finance, consumer)
    • Continuous in-home object catalogs with estimated replacement value and warranty linkage; assists insurance underwriting and claims.
    • Assumptions/dependencies: privacy-preserving on-device processing, secure metadata standards, valuation models tied to geometry and brand recognition.
  • City-Scale Object Indexing (geospatial, policy)
    • Object-level maps (benches, signage, infrastructure components) from casual citizen captures; aids maintenance, accessibility audits, and urban planning.
    • Assumptions/dependencies: crowd-sourced data governance, georeferencing, deduplication, municipal privacy and consent frameworks.
  • Assistive Robotics for Elder Care and Accessibility (healthcare, robotics)
    • Object-centric meshes for home assistance tasks (finding and retrieving everyday items) with robust occlusion handling in cluttered environments.
    • Assumptions/dependencies: on-device perception, safety certifications, user consent and data protections, integration with manipulation controllers.
  • Standardization and Policy (policy, cross-sector)
    • Metadata standards for 3D object assets (units, provenance, capture method, confidence scores) and guidelines for privacy-preserving in-home scanning.
    • Assumptions/dependencies: multi-stakeholder collaboration (industry, regulators, academia), audit trails for dataset/model usage, accessible consent UX.
  • Marketplace of User-Generated 3D Assets (software, media)
    • Platforms where users sell or license accurate, metric 3D models of their objects; supports sustainable reuse and digital content creation.
    • Assumptions/dependencies: content moderation, IP/rights management, quality validation pipelines, interoperability with major 3D tools.
  • Cross-Modal Search and Retrieval (software)
    • Search engines that index meshes generated from casual captures and allow retrieval via text, images, or spatial queries (e.g., “find all toasters with width < 30 cm”).
    • Assumptions/dependencies: scalable mesh indexing, semantic captioning quality, unit-consistent metadata, privacy compliance.
  • Safety-Critical Robotics Verification (policy, robotics)
    • Certification pipelines that verify metric accuracy of reconstructed objects used by robots in safety-critical environments (hospitals, factories).
    • Assumptions/dependencies: standardized accuracy benchmarks, reproducible capture protocols, third-party audits.

Notes across applications:

  • ShapeR’s strength is robust, metric geometry under occlusion and clutter, using multimodal conditioning (posed images, sparse SLAM points, captions). It currently focuses on geometry (SDF-to-mesh), with textures/materials requiring complementary modules.
  • Performance depends on: quality of SLAM/VIO and 3D instance detection, enough diverse viewpoints, limited motion blur, and compute availability. Extension to dynamic scenes, real-time operation, and appearance modeling are active development areas.

Glossary

  • 2D Point Mask Prompting: Using 2D projections of 3D points as masks to indicate which pixels belong to an object, guiding image features during conditioning. "Effect of 2D Point Mask Prompting."
  • 3D Gaussian Splatting (3DGS): A rendering technique that represents scenes with 3D Gaussian primitives for real-time view synthesis. "Recent approaches such as NeRF~\cite{mildenhall2021nerf}, 3DGS~\cite{kerbl20233d}, and their extensions~\cite{chen2022tensorf, muller2022instant, barron2021mip, barron2023zip, yu2024mip} achieve high-fidelity view synthesis but prioritize appearance over geometric accuracy."
  • 3D instance detection: Detecting and localizing individual object instances in 3D space, often via bounding boxes. "Next, we apply 3D instance detection~\cite{straub2024efm3d} to extract object-centric crops from both images and point clouds"
  • Amodal3R: A generative approach that reasons about occluded object parts to produce complete shapes beyond visible regions. "Amodal3R~\cite{wu2025amodal3r}, which extends TRELLIS~\cite{xiang2025structured}, improves robustness by reasoning about occluded regions and generating amodal completions."
  • Aria Synthetic Environments: A synthetic dataset of realistic scenes used to simulate real-world occlusions, interactions, and noise patterns. "we then fine-tune on object-centric crops from Aria Synthetic Environment scenes, which feature realistic image occlusions, SLAM point cloud noise, and inter-object interactions."
  • Chamfer distance: A geometry metric measuring the average closest-point distance between two shapes’ point sets. "Experiments show that ShapeR significantly outperforms existing approaches in this challenging setting, achieving an improvement of 2.7× in Chamfer distance compared to SoTA."
  • CLIP: A multimodal model that embeds text and images into a joint space; used here for text conditioning. "Both dual and single stream blocks are modulated with timestep and CLIP~\cite{radford2021learning} text embeddings."
  • Cross-attention: An attention mechanism that relates tokens from one modality/stream to another, used for conditioning and decoding. "The decoder $D$ predicts signed distance values $s = D(z, x)$ for a grid of query points $x \in \mathbb{R}^3$ through cross-attention with the processed latent sequence."
  • Curriculum learning: A training strategy that stages data complexity to improve generalization and robustness. "Training is conducted in a two stage curriculum learning setup"
  • DINOv2: A pretrained vision transformer used as a frozen backbone to extract robust image tokens. "a frozen DINOv2~\cite{oquab2023dinov2} backbone extracts image tokens"
  • Direct3DS2: A foundation image-to-3D model baseline used for comparison. "Direct3DS2~\cite{wu2025direct3d}"
  • Dora (VecSets variant): A specific VecSets autoencoder variant for learning structured 3D latent representations. "We adopt the Dora~\cite{chen2025dora} variant of VecSets~\cite{zhang20233dshape2vecset} as our latent autoencoder."
  • Dual-single-stream transformer: An architecture combining dual-stream cross-attention stages with single-stream self-attention for multimodal conditioning. "We employ a FLUX.1-like dual-single-stream transformer~\cite{batifol2025flux}"
  • EFM3D: A benchmark and model reference for egocentric foundation models in 3D; used as a scene-centric baseline. "EFM3D~\cite{straub2024efm3d}"
  • FLUX DiT architecture: A diffusion transformer architecture variant used as the backbone for the denoising model. "The ShapeR denoising transformer, built on the FLUX DiT architecture, denoises latent VecSets"
  • Flow matching: A generative training paradigm learning velocity fields to transport noise to data distributions. "These multimodal cues condition a flow-matching~\cite{lipman2022flow} transformer"
  • FoundationStereo: A large-scale stereo depth estimation model used to generate depths for TSDF fusion. "TSDF fusion with FoundationStereo depths~\cite{wen2025foundationstereo}"
  • F-score (F1): A metric combining precision and recall for geometric accuracy at a distance threshold. "F-score (F1) at 1% threshold."
  • Hunyuan3D-2.0: A scaled diffusion-based model for textured 3D asset generation; used as a competitive baseline. "Hunyuan3D-2.0~\cite{zhao2025hunyuan3d}"
  • MarchingCubes: An algorithm that extracts polygonal meshes from scalar fields like SDFs. "$\hat{S}_i = \text{Rescale}\big(\text{MarchingCubes}(D(z_0)), P_i\big),$"
  • MIDI3D: A scene-level generative model that predicts object geometry and layout from imagery. "We also compare against scene-level reconstruction methods, MIDI3D~\cite{huang2025midi}"
  • NeRF: Neural Radiance Fields; a method representing scenes for high-quality view synthesis. "Recent approaches such as NeRF~\cite{mildenhall2021nerf}, 3DGS~\cite{kerbl20233d}, and their extensions"
  • Normal Consistency (NC): A metric evaluating alignment of surface normals between reconstructed and reference shapes. "Chamfer $\ell_2$ Distance (CD), Normal Consistency (NC) and F-score (F1) at 1% threshold."
  • Normalized device coordinate (NDC) cube: The canonical coordinate range used for normalization in graphics, typically $[-1,1]^3$. "each object’s point cloud $P_i$ is normalized to the normalized device coordinate cube $[-1,1]^3$."
  • Plücker encodings: A representation of rays (lines) in 3D used to encode camera poses and viewing geometry. "poses using Plücker encodings"
  • Rectified flow: A variant of flow-based generative modeling that simplifies training by rectifying trajectories. "ShapeR formulates object-centric shape generation as a rectified flow process"
  • ResNet (3D sparse-convolutional): A residual network adapted with sparse 3D convolutions for point cloud encoding. "a ResNet~\cite{he2016deep} style 3D sparse-convolutional encoder"
  • SAM2: A segmentation model used to refine object point sets by removing spurious samples. "refined within its bounding box using SAM2~\cite{ravi2024sam} to remove spurious samples"
  • SceneGen: A multi-view scene-level generative model producing object geometry and spatial layout. "SceneGen~\cite{meng2025scenegen}"
  • SDF (Signed Distance Function): A scalar field giving the distance to a surface, with sign indicating inside/outside; used for implicit shape representation. "The denoised latent is decoded into a SDF, from which the final object shape is extracted using marching cubes."
  • SfM (Structure from Motion): A method that reconstructs 3D structure and camera poses from images alone. "The point cloud is derived from images using SLAM or SfM;"
  • SLAM (Simultaneous Localization and Mapping): A method estimating camera trajectory and building a map (e.g., sparse points) from sensor data. "off-the-shelf visual-inertial SLAM~\cite{engel2017direct}"
  • Sparse-convolutional encoder (3D): A network using sparse 3D convolutions to efficiently process point clouds. "a ResNet style 3D sparse-convolutional encoder downscales the point features into a token stream."
  • TSDF fusion: Integrating depth maps into a volumetric truncated signed distance field to reconstruct geometry. "TSDF fusion with FoundationStereo depths~\cite{wen2025foundationstereo}"
  • Variational Autoencoder (VAE): A probabilistic autoencoder that learns latent distributions for generative modeling. "3D Variational Autoencoder."
  • VecSets: A latent 3D shape representation as variable-length sequences of tokens encoding geometry. "VecSets~\cite{zhang20233dshape2vecset}"
  • Vision-language model (VLM): A model jointly processing images and text to produce captions or embeddings. "generate text captions using vision-language models~\cite{meta2025llama}"
