
SAM 3D: 3Dfy Anything in Images (2511.16624v1)

Published 20 Nov 2025 in cs.CV and cs.AI

Abstract: We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale. We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D "data barrier". We obtain significant gains over recent work, with at least a 5:1 win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.

Summary

  • The paper introduces a novel end-to-end model that generates detailed 3D object geometry, textures, and layouts directly from single images using a two-stage architecture.
  • It employs a multi-stage generative training pipeline with synthetic pretraining, render-paste augmentation, and MITL alignment to overcome challenges like occlusion and clutter.
  • The model significantly outperforms baselines in metrics such as F1, vIoU, Chamfer distance, and joint shape+pose accuracy, advancing general-purpose 3D perception.

SAM 3D: End-to-End 3D Object Reconstruction from Images


Overview

"SAM 3D: 3Dfy Anything in Images" (2511.16624) introduces SAM 3D, a foundation model for comprehensive 3D object reconstruction directly from single 2D images. The model predicts full 3D geometry, texture, and scene layout, handling extreme occlusion, clutter, and complex context within natural scenes. SAM 3D is enabled by a novel multi-stage data engine and human/model-in-the-loop (MITL) annotation pipeline, facilitating scalable acquisition of 3D supervision at unprecedented volume and diversity. The model leverages curriculum-inspired, multi-stage generative training that mirrors recent advances in LLM alignment and scaling.


Model Architecture and Pipeline

SAM 3D employs a two-stage architecture built on latent flow matching and a mixture-of-transformers (MoT) design:

  • Input Encoding: DINOv2 extracts multi-scale features from object crops, corresponding segmentation masks, and full-scene context; optional pointmap conditioning (e.g., LiDAR, monocular depth estimation) is also supported.
  • Geometry Model: A 1.2B-parameter flow transformer in MoT configuration jointly predicts pose and coarse shape (voxel-based, $64^3$ resolution), capturing global and local spatial correlations.
  • Texture Refinement: A 600M-parameter sparse latent flow transformer processes active voxels for high-fidelity geometric detail and texture synthesis, with outputs decoded via mesh or 3D Gaussian splat VAEs.
  • Multi-Modal Prediction: The system learns to invert the 3D-to-2D mapping, generatively modeling $p(S, T, R, t, s \mid I, M)$ for shape $S$, texture $T$, pose $R$, translation $t$, and scale $s$, given image $I$ and mask $M$. (A minimal sketch of how the geometry-stage outputs compose into scene space follows this list.)
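
To make the geometry-stage outputs concrete, the following is a minimal numpy sketch of how a coarse $64^3$ occupancy grid and the predicted layout $(R, t, s)$ compose into scene-space geometry. This is an illustration rather than released code: the unit-cube canonicalization and the occupancy threshold are assumptions.

```python
# Hedged sketch: compose a coarse 64^3 occupancy grid S with layout (R, t, s)
# into scene-space points. Grid conventions are illustrative assumptions.

import numpy as np

def voxels_to_scene_points(occupancy, R, t, s, threshold=0.5):
    """Convert occupied cells of a cubic grid into 3D points in scene coordinates."""
    res = occupancy.shape[0]                       # 64 in the paper's coarse stage
    idx = np.argwhere(occupancy > threshold)       # (N, 3) occupied cells
    # Voxel centers in a canonical unit cube centered at the origin.
    canonical = (idx + 0.5) / res - 0.5            # (N, 3) in [-0.5, 0.5]^3
    # Apply predicted scale, rotation, and translation: x_scene = s * R @ x + t.
    return (s * canonical @ R.T) + t

# Toy usage: a small occupied blob, identity rotation, 2x scale, shifted in z.
occ = np.zeros((64, 64, 64))
occ[30:34, 30:34, 30:34] = 1.0
pts = voxels_to_scene_points(occ, R=np.eye(3), t=np.array([0.0, 0.0, 3.0]), s=2.0)
print(pts.shape, pts.mean(axis=0))                 # (64, 3), centered near z≈3
```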

Data Engine and Annotation Strategy

The MITL data engine addresses the core bottleneck of 3D data scarcity by combining synthetic pretraining, semi-synthetic render-paste augmentation, and iterative human/model preference alignment:

  • Synthetic Pretraining (Iso-3DO): 2.7M Objaverse-XL object meshes rendered from 24 viewpoints, covering 2.5T training tokens, yield object-centric crop data to bootstrap shape/texture priors.
  • Render-Paste Mid-Training (RP-3DO): 61M samples (occluders, occludees, object replacements) blend synthetic assets into real scenes with precise mask/pose mapping, augmenting the model’s robustness to layout and occlusion.
  • Human/Model-in-the-Loop Post-Training: Preference-based selection covers 3.14M shapes (and 100K textures), with the hardest cases routed to expert 3D artists when all model candidates fail (Art-3DO). The resulting SA-3DAO benchmark contains 1,000 high-fidelity artist meshes paired with complex natural images.
  • Expert Iteration and Data Flywheel: Annotators operate on best-of-$N$ candidate sets, continuously improving model proficiency through staged SFT and DPO on quality-controlled data, analogous to RLHF/RAFT/ExIT for policy amplification. (A minimal alpha-compositing sketch of the render-paste stage follows this list.)
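
The render-paste stage can be pictured as standard alpha compositing of a rendered RGBA asset over a real photograph, with the alpha channel doubling as the mask label and the known render pose as layout supervision. The numpy sketch below illustrates this; the image conventions (float RGB in [0, 1], a pre-placed HxWx4 render) are assumptions, not the paper's exact pipeline.

```python
# Hedged sketch of the render-paste (RP-3DO) idea: alpha-composite a rendered
# synthetic object into a real image and keep its mask as supervision.

import numpy as np

def render_paste(real_rgb, render_rgba):
    """Alpha-composite a rendered object over a real image; return image + mask."""
    rgb, alpha = render_rgba[..., :3], render_rgba[..., 3:4]
    composite = alpha * rgb + (1.0 - alpha) * real_rgb   # standard "over" operator
    mask = alpha[..., 0] > 0.5                           # binary mask supervision
    return composite, mask

# Toy usage with random data standing in for a photo and an object render.
h, w = 240, 320
photo = np.random.rand(h, w, 3)
render = np.zeros((h, w, 4))
render[80:160, 120:200, :3] = 0.7        # object color
render[80:160, 120:200, 3] = 1.0         # opaque where the object is
image, mask = render_paste(photo, render)
# `image`, `mask`, and the known pose of the pasted asset form one
# semi-synthetic training sample with realistic background context.
```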

Training Paradigm

The multi-stage training mirrors efficient LLM pre/post-training protocols:

  1. Pretraining: Isolated synthetic objects ($S$, $T$, $R$) on object-centric crops.
  2. Mid-Training: Render-paste variants ($S$, $R$, $t$, $s$) on whole images and pointmaps, addressing occlusion, mask-following, and layout estimation.
  3. Post-Training: SFT on MITL/Art-3DO data; DPO aligns the generative model to human preference, leveraging negative candidate sets for reward modeling; additional distillation reduces inference steps (shortcut models via consistency objectives). A sketch of the DPO objective follows this list.
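
For reference, the standard DPO loss on preference pairs is sketched below in PyTorch. How per-sample log-likelihoods are obtained for a flow-matching 3D generator is not shown, and the paper's exact formulation may differ; this only illustrates the objective applied to chosen vs. rejected candidates from the human-in-the-loop selection.

```python
# Hedged sketch of the DPO preference loss on generic per-sample log-likelihoods.

import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO: push the policy toward chosen samples relative to a frozen reference."""
    chosen_margin = logp_chosen - ref_logp_chosen        # policy vs. reference on winners
    rejected_margin = logp_rejected - ref_logp_rejected  # policy vs. reference on losers
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy usage with stand-in log-probabilities for a batch of preference pairs.
logp_c = torch.randn(8, requires_grad=True)
logp_r = torch.randn(8, requires_grad=True)
ref_c, ref_r = torch.randn(8), torch.randn(8)
loss = dpo_loss(logp_c, logp_r, ref_c, ref_r)
loss.backward()
```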

Conditional rectified flow matching serves as the training objective across modalities. All stages were large-scale: pretraining and mid-training consumed roughly 2.5T and 2.7T tokens, respectively. Post-training iteratively refines alignment using preference datasets, expert annotations, and high-threshold quality selection.
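
The rectified flow objective pairs each training latent with Gaussian noise and regresses the straight-line velocity between them, conditioned on image/mask features. A minimal PyTorch sketch of one such conditional training step is below; the tiny MLP, latent and conditioning sizes, and optimizer settings are placeholders rather than the paper's 1.2B-parameter flow transformer configuration.

```python
# Hedged sketch of a conditional rectified flow matching training step.

import torch
import torch.nn as nn

latent_dim, cond_dim = 32, 16
model = nn.Sequential(nn.Linear(latent_dim + cond_dim + 1, 64), nn.SiLU(),
                      nn.Linear(64, latent_dim))            # stand-in velocity network
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

def flow_matching_step(x1, cond):
    """One training step: regress the velocity that moves noise toward data."""
    x0 = torch.randn_like(x1)                               # noise sample
    t = torch.rand(x1.shape[0], 1)                          # time in [0, 1]
    xt = (1 - t) * x0 + t * x1                              # linear noise-to-data path
    target_v = x1 - x0                                      # rectified-flow velocity target
    pred_v = model(torch.cat([xt, cond, t], dim=-1))        # conditional prediction
    loss = ((pred_v - target_v) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage: x1 stands in for shape/pose latents, cond for image+mask features.
x1 = torch.randn(8, latent_dim)
cond = torch.randn(8, cond_dim)
print(flow_matching_step(x1, cond))
```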


Evaluation and Empirical Results

Quantitative Performance

SAM 3D substantially outperforms contemporary baselines (Trellis, Hunyuan3D, Direct3D-S2, Hi3DGen, TripoSG) on real-world shape and layout reconstruction (a sketch of the point-cloud metrics follows the list):

  • SA-3DAO F1@0.01: 0.2344 (SAM 3D) vs. 0.1475–0.1629 (others)
  • vIoU: 0.2311 (SAM 3D) vs. 0.1266–0.1531 (others)
  • Chamfer: 0.0400 (SAM 3D) vs. 0.0844–0.1126 (others)
  • EMD: 0.1211 (SAM 3D) vs. 0.2049–0.2432 (others)
  • Human Preference Win-Rate: At least 5:1 (object), 6:1 (scene) in pairwise preference tests across several benchmarks (SA-3DAO, LVIS, MetaCLIP).
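
The shape metrics above are typically computed on sampled point clouds with nearest-neighbour queries; the sketch below shows the standard Chamfer and F1-at-threshold formulations. The benchmark's exact normalization and sampling protocol may differ from this illustration.

```python
# Hedged sketch of Chamfer distance and F1@tau between two point clouds.

import numpy as np
from scipy.spatial import cKDTree

def chamfer_and_f1(pred_pts, gt_pts, tau=0.01):
    """Symmetric Chamfer distance and F1@tau between two (N, 3) point clouds."""
    d_pred_to_gt = cKDTree(gt_pts).query(pred_pts)[0]   # nearest GT for each predicted point
    d_gt_to_pred = cKDTree(pred_pts).query(gt_pts)[0]   # nearest prediction for each GT point
    chamfer = d_pred_to_gt.mean() + d_gt_to_pred.mean()
    precision = (d_pred_to_gt < tau).mean()             # predicted points near the GT surface
    recall = (d_gt_to_pred < tau).mean()                # GT points covered by the prediction
    f1 = 2 * precision * recall / (precision + recall + 1e-8)
    return chamfer, f1

# Toy usage: a slightly noisy copy of a random cloud scores near-perfect F1 at tau=0.01.
gt = np.random.rand(2000, 3)
pred = gt + 0.001 * np.random.randn(2000, 3)
print(chamfer_and_f1(pred, gt))
```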

For 3D scene layout on SA-3DAO and ADT, SAM 3D improves on baselines by clear margins:

  • Joint Shape+Pose Prediction: ADD-S @ 0.1 jumps from <2% (pipeline baselines) to >77% (SAM 3D).
  • ICP-Rotation Error and IoU: Markedly lower rotation error and higher IoU after joint optimization and post-hoc render-and-compare refinement (a sketch of this refinement idea follows).
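
Render-and-compare refinement can be approximated as a search over candidate layouts scored by silhouette-to-mask IoU. The sketch below splats projected object points instead of rendering a full mesh and perturbs only the translation; both are simplifications of the paper's post-hoc optimization, and the pinhole intrinsics convention is an assumption.

```python
# Hedged sketch of render-and-compare layout refinement via silhouette IoU.
# `mask` is assumed to be a boolean HxW instance mask; K is a 3x3 intrinsics matrix.

import numpy as np

def silhouette(points, R, t, s, K, hw):
    """Project posed object points into a binary HxW silhouette (point splatting)."""
    h, w = hw
    cam = (s * points @ R.T) + t                    # object -> camera coordinates
    uvz = cam @ K.T                                 # pinhole projection before division
    z = uvz[:, 2]
    uv = (uvz[:, :2] / uvz[:, 2:3]).astype(int)     # perspective division to pixels
    sil = np.zeros((h, w), dtype=bool)
    ok = (z > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    sil[uv[ok, 1], uv[ok, 0]] = True
    return sil

def refine_translation(points, mask, K, R, t0, s, n_candidates=64, sigma=0.05):
    """Keep the perturbed translation whose silhouette best overlaps the instance mask."""
    best_t, best_iou = t0, -1.0
    for _ in range(n_candidates):
        t = t0 + sigma * np.random.randn(3)         # sample a candidate translation
        sil = silhouette(points, R, t, s, K, mask.shape)
        iou = (sil & mask).sum() / max((sil | mask).sum(), 1)
        if iou > best_iou:
            best_t, best_iou = t, iou
    return best_t, best_iou
```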

Ablations

Cascade training ablation demonstrates monotonic improvement with each additional data and alignment stage; removing MITL, Art-3DO, or DPO significantly degrades final accuracy. Texture ablations highlight the necessity of RP-3DO data, high-aesthetics selection, and lighting augmentation. Shortcut-model distillation enables sub-second generation with minimal fidelity loss.


Practical and Theoretical Implications

SAM 3D's robust single-image-to-3D reconstruction enables several actionable developments:

  • General-Purpose 3D Perception: The foundation model paradigm unlocks direct, layout-aware, textured 3D asset creation for robotics, AR/VR, gaming, digital twin, and embodied AI applications—without reliance on multi-view, SLAM, or hand-crafted priors.
  • Data Engine as Scalable Supervision: The MITL alignment and amplification loop establishes a template for bootstrapping supervision in sparse domains (e.g., 3D physical understanding, scene graph construction, physics modeling), leveraging synthetic data, human preference, and automated reward modeling.
  • Model Scaling Laws: Methodological adoption of LLM training recipes corroborates scaling laws for transfer in generative perception, suggesting further gains should be attainable with continued data expansion and iterative alignment.
  • Emergent Properties: Multi-modal, mixture-of-transformers architectures flexibly capture compositional scene structure, permitting modular extension to part-based, hierarchical, or implicit 3D representations.

Limitations and Future Directions

SAM 3D’s fixed output resolution ($64^3$ voxel grid; 32 splats per voxel) constrains fidelity for highly detailed, small-scale objects; architectural super-resolution, implicit field modeling, or parts-based approaches could remedy this. Object interactions, physical contact, and global scene consistency remain open challenges, since layout estimation is per-object and does not enforce physical constraints across multiple objects. Texture prediction remains agnostic to pose, occasionally producing rotationally inconsistent output for symmetric objects. Expanding domain coverage, enabling physics-aware reasoning, and integrating cross-modality inference would further enhance generalization and applicability.


Conclusion

SAM 3D establishes a new state-of-the-art for visually grounded, single-image 3D object and scene reconstruction. By integrating a scalable, preference-aligned data engine and a modern generative modeling pipeline, it sets a new bar for 3D perception in naturalistic environments. The model, data, and benchmarks are positioned to accelerate research across robotics, embodied AI, and content creation, and to catalyze further theoretical advances in general-purpose 3D foundation models.


Explain it Like I'm 14

SAM 3D: Turning a Single Photo into a 3D Scene

Overview

This paper introduces SAM 3D, an AI system that can take one regular photo and turn the objects in it into full 3D models. It figures out each object’s shape, colors and textures, and where it sits in the scene (its position, size, and rotation). The goal is to make believable 3D scenes from everyday pictures—even when objects are partly hidden or surrounded by clutter.

What questions does the paper try to answer?

  • Can we build a model that creates useful 3D objects from just one image, not a whole video or many photos?
  • Can it handle real-life scenes where objects overlap, are far away, or partly hidden?
  • How can we train such a model when there isn’t much real, labeled 3D data connected to everyday photos?
  • Will this approach beat the best existing methods on both quality and usefulness?

How does SAM 3D work?

Think of the system like a two-step art process: first sculpting, then painting.

  • Step 1: Geometry (the “sculptor”)
    • The model looks at your photo twice: once zoomed in on the object and once as the whole scene. This helps it use both fine details and big-picture context.
    • It predicts a rough 3D shape and how the object is placed in the scene: its rotation (which way it’s facing), how big it is, and where it is in 3D.
  • Step 2: Texture and detail (the “painter”)
    • Starting from the rough shape, a second model adds finer details and realistic textures and colors.
    • The result can be turned into different 3D formats, like standard meshes used in games or “Gaussian splats,” which are another way to render 3D.

To make this work well in the real world, they used a training recipe in three phases, similar to how LLMs are trained:

  • Pretraining (practice with “toy worlds”)
    • The model first learns with many synthetic 3D objects (computer-made). These are easy to label and help the model learn a “vocabulary” of shapes and materials.
  • Mid-training (mixing real and fake)
    • They paste these synthetic 3D objects into real photos. This teaches the model tricky skills, like:
      • Mask-following: “Only rebuild the object outlined here.”
      • Occlusion: finish shapes even when parts are hidden by other objects.
      • Layout: estimate position, size, and rotation in a real scene.
  • Post-training (real-world alignment with people in the loop)
    • Model-in-the-loop: The AI suggests several 3D guesses. Human annotators pick the best one and adjust it if needed. If none are good, professional 3D artists make a correct version.
    • Preference tuning: The model learns not just from “right answers” but also from which outputs people prefer. This nudges it toward more realistic, symmetric, and complete shapes.
    • Speed-up: They “distill” the model so it can make results much faster, in just a few steps.

Helpful terms:

  • Mask: the outline of the object in the image.
  • Pose/layout: where an object is, how big it is, and which way it’s turned.
  • Occlusion: when parts of an object are hidden behind something else.
  • Point map (optional): a rough 3D map from a phone or depth estimator that helps place the object correctly.

What did they find, and why does it matter?

  • Much higher quality on real photos: In head-to-head tests with people judging the results, SAM 3D’s 3D objects were preferred about 5 times as often as other leading methods.
  • Better numbers on tough benchmarks: On their new real-world test set (SA-3DAO), SAM 3D beat other methods by a large margin on shape accuracy.
  • Full scenes, not just single objects: It reconstructs several objects in a scene and places them correctly, doing better than both “pipeline” approaches and other all-in-one models.
  • Works in messy, real-life images: Because it uses both close-up details and whole-scene context, it handles clutter and hidden parts better than past systems.

Why this matters:

  • You can revisit or reuse objects from a single photo in 3D: spin them around, move them, or place them in new scenes.
  • This can speed up creating 3D content for games, AR/VR, movies, and design.
  • Robots and AR apps can better understand the real world from a single picture.

What’s the bigger impact?

  • Breaking the 3D data barrier: There aren’t many real photos paired with correct 3D models. SAM 3D works around this by:
    • Practicing with synthetic data,
    • Learning from semi-real scenes,
    • And finally using humans to pick and refine the best outputs, then feeding that back into training.
  • A virtuous cycle: As the model improves, it produces better candidates. Humans then need to fix less, which creates more high-quality training data faster.
  • Community resources: The team will release code, models, an online demo, and a new real-world benchmark, so others can build on this work.

In short, SAM 3D is a step toward turning everyday photos into usable 3D worlds, with a training approach that cleverly combines computers and people to overcome the lack of real 3D-labeled data.


Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of gaps, limitations, and unresolved questions that future work could address.

  • Multi-object evaluation and scene consistency: SA-3DAO is mostly single-object; there is no rigorous benchmark for multi-object scenes measuring global consistency (e.g., inter-object occlusion reasoning, collisions, support relations, and scene-level physical plausibility).
  • Absolute scale and camera calibration: The model predicts translation/scale in normalized camera coordinates; it is unclear how reliably SAM 3D recovers metric scale without known intrinsics or high-quality depth. Quantify performance under unknown/incorrect intrinsics and camera distortion.
  • Depth/pointmap sensitivity: Layout depends on pointmaps (monocular or sensor-based). Provide sensitivity analyses to depth errors, noise, sparsity, and misalignment, including failure modes and fallback strategies for RGB-only.
  • Mask dependency and robustness: Reconstruction hinges on a supplied mask M; evaluate robustness to mask errors (inaccurate boundaries, missing parts, over-segmentation), and the performance of an end-to-end pipeline that must detect and segment objects automatically.
  • Diversity and uncertainty in occluded regions: The model provides a single reconstruction; characterize and expose uncertainty or multi-modal predictions for occluded/invisible geometry and pose (e.g., via sampling, ensembles, or probabilistic outputs).
  • Texture data scarcity and materials: Only ~100K textured meshes vs ~3.14M untextured. Assess how texture scarcity limits generalization; expand and evaluate physically-based materials (albedo/normal/roughness/metalness) rather than baked diffuse textures.
  • Illumination modeling: The method reconstructs textures but not scene lighting or reflectance. Investigate joint estimation of illumination/materials to enable physically correct re-rendering under novel lighting.
  • Challenging material classes: Evaluate and improve handling of transparent/reflective objects, emissive surfaces, thin structures (e.g., wires), and complex micro-geometry (e.g., fabric).
  • Deformable/non-rigid objects: The approach targets static shape; quantify performance on deformable objects and propose mechanisms for modeling non-rigidity from a single image.
  • Output representation trade-offs: Mesh vs. 3D Gaussian decoders share latents, but comparative quality, speed, memory, and editability are not analyzed. Provide systematic evaluations and guidelines for representation choice across tasks.
  • Physical plausibility metrics: Beyond geometric errors, introduce metrics for physical realism (no interpenetration, correct ground contact, gravity-consistent placement) and evaluate SAM 3D against them.
  • Scene layout optimization: The paper notes sample-then-opt improves layout; quantify gains, runtime, and failure cases across diverse scenes, and standardize post-optimization protocols.
  • Encoder dependence: The model relies on DINOv2 features; ablate alternative encoders or 3D-aware pretraining, and study fine-tuning strategies that best transfer to 3D reconstruction.
  • Domain generalization: Evaluate and adapt to domains beyond web photos (e.g., industrial scenes, medical imagery, underwater, nighttime, extreme weather), including OOD detection and robustness.
  • Data distribution and bias: The “3D-oriented taxonomy” and licensed sources are not characterized. Audit object/scene distribution, cultural/geographic biases, and the impact on downstream fairness.
  • Human preference alignment risks: DPO and best-of-N human selection could encode aesthetic biases; report annotator demographics, rubric versions, inter-rater agreement, and analyze potential reward hacking or regressions in non-preferred yet correct outputs.
  • Scalability and cost of MITL pipeline: Quantify annotation time, cost, and throughput, especially for artist-routed “hard cases,” and provide guidance for reproducing the pipeline outside large organizations.
  • SA-3DAO ground truth reliability: Artist meshes are subjective and time-intensive; measure inter-artist agreement, annotation variance, and how GT imperfections affect metric validity and training.
  • Texture evaluation metrics: Texture is primarily judged via human preference; develop standardized quantitative metrics (e.g., render-based perceptual scores under controlled lighting) and correlate them with human judgments.
  • Multi-view/video inputs: The system targets single images; explore how limited additional views or short video clips could improve geometry/texture/layout while maintaining efficiency.
  • Large structures and multi-scale modeling: Churches and large assets are mentioned, but multi-scale accuracy, memory constraints, and scene extent limits are not characterized. Establish benchmarks and techniques for very large objects.
  • Runtime, efficiency, and deployment: Distillation reduces NFE from 25→4, but end-to-end latency, GPU/CPU requirements, and on-device feasibility for robotics/AR are not reported. Provide detailed performance profiles and model size variants.
  • Downstream robotics metrics: Evaluate impact on grasping/manipulation, navigation, and AR placement success rates, not just geometric/layout metrics, and study failure cases relevant to safety-critical tasks.
  • Preference dataset evolution: The quality threshold α and rubric evolve; quantify the effect of changing selection criteria on model behavior, data drift, and reproducibility across iterations.
  • Best-of-N search scaling: The data engine relies on candidate filtering and human selection; analyze the trade-off between N, quality gains, and cost, and develop principled stopping criteria or automatic candidate pruning.
  • Handling partial/ambiguous masks: Investigate strategies for objects split across masks, overlapping instances, or ambiguous boundaries (e.g., transparency), and their effects on layout and shape completion.
  • Camera pose and intrinsics recovery: Propose and evaluate methods to infer camera intrinsics/extrinsics from a single image to reduce ambiguity in metric layout and enable consistent AR insertion.
  • Failure mode cataloging: Provide quantitative rates and typologies of common failures (e.g., “floaters,” bottomless meshes, symmetry errors) post-alignment, and targeted training or constraints to mitigate them.
  • Data licensing and release: Clarify licensing of SA-3DAO and other datasets, redistribution restrictions, and how researchers can legally obtain comparable training/evaluation data.
  • Environmental and compute costs: Report training token counts (2.5T+2.7T+0.5T) alongside compute resources, energy use, and carbon footprint; study scaling laws and efficient training recipes for smaller budgets.

Practical Applications

Immediate Applications

The following applications can be deployed now, leveraging the released demo, code, and model weights, along with standard tooling in 3D content creation, robotics, and AR.

  • Photo-to-3D asset creation for creative industries — sectors: software, gaming, film, advertising
    • Tools/workflow: ingest an image and mask (from SAM), run SAM 3D to produce a textured mesh or 3D Gaussian splats, import into Blender/Maya/Unity/Unreal (see the export sketch after this list); optional human-in-the-loop QC (artists or reviewers) for aesthetic alignment.
    • Assumptions/dependencies: clear object segmentation masks; rights clearance for source photos; basic scene context to avoid incorrect completion for transparent/specular objects; compute for inference (distilled geometry model supports sub-second shape/layout).
  • E-commerce product pages and AR “view in your space” — sectors: retail, software
    • Tools/workflow: convert seller photos into 3D assets for 360° viewers; enable AR previews (e.g., furniture “place in room”); batch-processing pipeline with mask selection and automatic texture generation; optional pointmap/depth for better scale/placement.
    • Assumptions/dependencies: clean masks of the target object; optional calibration for accurate scale (reference objects or device LiDAR); brand/legal approvals for generated geometry and textures.
  • Interior design and real estate staging from room photos — sectors: real estate, construction, software
    • Tools/workflow: reconstruct per-object shape and layout from a single interior photo, place assets in a composable scene, let users move/replace objects virtually; link to BIM or staging tools; benefit from iPhone LiDAR or monocular depth to improve layout accuracy.
    • Assumptions/dependencies: availability of pointmaps or reliable monocular depth for better translation/scale; texture realism may need human QC for premium listings.
  • Robotics perception for picking and manipulation in clutter — sectors: robotics, logistics, manufacturing
    • Tools/workflow: capture RGB(+depth), use SAM 3D to produce object meshes and 6D pose in context; optionally run sample-then-optimize (render-and-compare) to refine layout; pass meshes to grasp planners and motion planners in ROS.
    • Assumptions/dependencies: inference latency within action budget; reflective/transparent objects may still require sensor fusion; safety validation; calibration of camera intrinsics/extrinsics; constrained rollout in pilot cells.
  • Enterprise asset catalogs and digital twins from photos — sectors: software, enterprise IT
    • Tools/workflow: batch convert field photos into 3D assets; attach metadata (SKU, brand, dimensions) and store in a DAM/PIM; surface assets in internal configurators, training simulators, or AR field tools.
    • Assumptions/dependencies: scale alignment (dimension inference) may need references or device depth; governance of model outputs and provenance; versioning of generated assets.
  • Academic benchmarking and 3D research — sectors: academia, software
    • Tools/workflow: adopt SA-3DAO for real-world single-image 3D benchmarking; evaluate single-view shape, texture, and layout; replicate model-in-the-loop data engines; study preference alignment (DPO) for 3D.
    • Assumptions/dependencies: agree on metrics (e.g., Chamfer/EMD/ADD-S, preference protocols); consistent masking and evaluation protocols; community buy-in and reproducibility.
  • Curriculum development and teaching materials on single-image 3D — sectors: education
    • Tools/workflow: create interactive notebooks/labs demonstrating pictorial cues, occlusion handling, and layout estimation using SAM 3D; compare synthetic pretraining vs. real-world post-training; visualize multi-stage training impacts.
    • Assumptions/dependencies: access to GPUs for classroom demos; curated examples to avoid failure modes (e.g., extreme transparency).
  • Insurance claims assistance (assistive triage) — sectors: finance/insurance
    • Tools/workflow: reconstruct damaged items from claimant photos to support adjusters with 3D views; flag ambiguous geometry for human review; enrich reports with exploded 3D views of impacted components.
    • Assumptions/dependencies: non-deterministic reconstructions must be treated as assistive evidence only; human oversight; disclaimers on accuracy; compliance with jurisdictional regulations.
  • AR creative mobile apps — sectors: software, consumer
    • Tools/workflow: integrate the SAM 3D API into mobile apps; turn a photo into an interactive 3D object instantly; enable social sharing and remixing; optionally leverage device LiDAR for better placement in AR scenes.
    • Assumptions/dependencies: on-device optimization and/or cloud inference; content moderation for public sharing; battery/latency constraints.
  • Synthetic data generation for CV/robotics training — sectors: academia, software, robotics
    • Tools/workflow: reuse render-paste and model-in-the-loop paradigms to manufacture occlusion-heavy training sets for detection/segmentation/pose; leverage preference-aligned assets to reduce undesirable artifacts.
    • Assumptions/dependencies: careful domain mixing to avoid overfitting; labeling/QC to ensure realism; dataset governance and licensing of assets.

Long-Term Applications

These opportunities likely require further research, productization, scaling, validation, or hardware advances before broad deployment.

  • Real-time, on-device scene-level 3D for AR/MR glasses — sectors: software, hardware, consumer
    • Tools/workflow: continuous single-image-to-3D reconstruction with temporal coherence; world-locked object meshes for persistent AR; lightweight shape/layout refinement on edge devices.
    • Assumptions/dependencies: substantial model compression; efficient incremental updates; robust handling of dynamic scenes and extreme occlusion; privacy-friendly on-device processing.
  • Warehouse automation and retail inventory from shelf photos — sectors: robotics, logistics, retail
    • Tools/workflow: monocular capture of aisles; reconstruct product shapes and poses at scale; plan picking/replenishment; integrate with WMS/ERP; automatic mismatch detection vs. planograms.
    • Assumptions/dependencies: reliable scale recovery and pose accuracy across lighting/packaging; safety certification; long-tail generalization (rare SKUs, packaging variants); consistent camera placement and calibration.
  • Autonomous navigation and embodied AI with richer single-view 3D — sectors: robotics, mobility
    • Tools/workflow: fuse single-view 3D reconstructions with SLAM/NeRF/VO pipelines; inform cost maps and affordance maps; fast mesh-based collision models.
    • Assumptions/dependencies: rigorous validation under motion blur, low light, weather; multi-sensor fusion (LiDAR, radar); probabilistic uncertainty modeling; standards for safety-critical use.
  • Medical and orthotics workflows (patient-specific 3D from photos) — sectors: healthcare
    • Tools/workflow: preliminary shape modeling for casts/prosthetics from 2D photos in clinics or at home; accelerate custom fitting and simulation.
    • Assumptions/dependencies: clinical validation against ground truth (CT/MRI/structured light scans); accuracy thresholds; regulatory approval; patient privacy; multi-view requirements for sensitive anatomy.
  • City-scale digital twins through crowd-sourced photos — sectors: smart cities, construction, GIS
    • Tools/workflow: aggregate citizen images to reconstruct urban objects (street furniture, facades) as 3D assets; integrate into municipal GIS, planning simulators, and accessibility audits.
    • Assumptions/dependencies: deduplication/merging across views; public-space privacy policies; IP/licensing for images; photogrammetry hybridization; quality control for safety use.
  • Forensics and accident reconstruction from single frames — sectors: public safety, law
    • Tools/workflow: reconstruct key objects/vehicles from surveillance stills; simulate alternative viewpoints; support expert testimony and investigation.
    • Assumptions/dependencies: strict validation and uncertainty quantification; chain-of-custody; standards to avoid overinterpretation of hallucinated geometry; court admissibility frameworks.
  • Wildlife morphology and ecological monitoring from field photos — sectors: environment, research
    • Tools/workflow: estimate 3D morphology of animals/plants from single images to track growth or health; generate synthetic observations for training detectors.
    • Assumptions/dependencies: species-specific validation; bias assessments (occlusion, camouflage); careful use where texture priors may mislead; multi-view augmentation.
  • Industrial inspection and retrofits (pipes, valves, fixtures) — sectors: energy, manufacturing
    • Tools/workflow: reconstruct hard-to-measure components from photos to aid maintenance and retrofits; pre-plan interventions with approximate 3D.
    • Assumptions/dependencies: accuracy under specular/metallic surfaces; scale calibration; safety constraints; combining with structured-light or depth sensors.
  • Generalist 3D search and marketplace (“search by photo → 3D asset”) — sectors: software, creator economy
    • Tools/workflow: blend retrieval with generation: index meshes; when retrieval is insufficient, generate and refine; enable licensing and asset provenance tracking.
    • Assumptions/dependencies: rights management; asset similarity metrics; scalable preference alignment to reduce failure cases; content provenance/watermarking.
  • Policy frameworks for 3D content provenance, privacy, and safety — sectors: policy, standards
    • Tools/workflow: define disclosure standards for generated 3D (watermarks, model cards), privacy rules for scanning public spaces, and accuracy disclaimers in high-stakes domains; promote benchmarks (e.g., SA-3DAO) for accountability.
    • Assumptions/dependencies: cross-industry cooperation; regulatory engagement; open evaluation protocols; mechanisms to flag uncertainty and limit harmful use.

Glossary

  • 2.5D: A representation capturing partial 3D information (e.g., visible surfaces) from a single viewpoint, lacking full volumetric detail. "not just of the visible 2.5D surface"
  • 3D Gaussian splats: A 3D representation using collections of Gaussian primitives for fast rendering and reconstruction. "either mesh or 3D Gaussian splats via a pair of VAE decoders $\mathcal D_m$, $\mathcal D_g$."
  • 3D IoU: Intersection over Union computed in 3D space to measure overlap between predicted and ground-truth volumes. "3D IoU (\uparrow)"
  • 6D rotation: A continuous rotation parameterization using six values to avoid singularities common in Euler angles. "the 6D rotation~\citep{zhou2019continuity}"
  • ADD-S: Average Distance of model points (symmetric), a pose accuracy metric measuring distances between points under predicted and ground-truth transforms. "ADD-S (\downarrow)"
  • ADD-S @ 0.1: Thresholded ADD-S metric indicating the fraction of predictions below 10% of object diameter; higher is better. "ADD-S @ 0.1 $2\% \rightarrow 77\%$"
  • alpha compositing: A graphics technique for blending a rendered object into an image using transparency (alpha) values. "We construct our data by rendering textured meshes into natural images using alpha compositing."
  • Aria Digital Twin (ADT): A real-world dataset with depth/pointmaps used to evaluate object layout and pose. "We also include ISO3D from 3D Arena~\citep{ebert20253d} for quantitatively evaluating shape and texture, and Aria Digital Twin (ADT)~\citep{pan2023aria} for layout."
  • Art-3DO: A curated subset of hard cases annotated directly by professional 3D artists for high-quality supervision. "We route a small percentage of these hardest cases to professional 3D artists for direct annotation, and we denote this set Art-3DO."
  • best-of-N search: A selection strategy where multiple candidates are generated and the best is chosen to maximize output quality. "a form of best-of-$N$ search~\citep{ouyang2022training} using humans."
  • Chamfer: The Chamfer distance, a metric comparing two point sets (e.g., shapes) by average nearest-neighbor distances. "Chamfer (\downarrow)"
  • cross-entropy method: An optimization technique that iteratively focuses sampling on the best-performing candidates to improve solutions. "similar to the cross-entropy method for optimization~\citep{2005crossentropymethod}."
  • DINOv2: A self-supervised vision transformer used as an image encoder to extract robust visual features. "We use DINOv2~\citep{oquab2023dinov2} as an encoder"
  • Direct Preference Optimization (DPO): A post-training method aligning model outputs with human preferences using pairwise comparisons, without reinforcement learning. "direct preference optimization (DPO)~\citep{rafailov2023direct}"
  • distillation: A training process transferring knowledge from a slower teacher to a faster student to reduce inference steps. "we finish a short distillation stage to reduce the number of function evaluations (NFE) required during inference from $25 \rightarrow 4$."
  • Earth Mover’s Distance (EMD): A metric measuring the minimal cost to transform one distribution (e.g., point cloud) into another. "EMD (\downarrow)"
  • Elo: A rating system quantifying comparative performance; differences map to win probabilities in preference tests. "a 400 point Elo difference corresponds to 10:1 odds in a preference test."
  • F1@0.01: F1 score computed at a specific threshold (0.01), balancing precision and recall for geometric evaluation. "F1@0.01 (\uparrow)"
  • flow transformer: A transformer model trained with flow matching to learn generative distributions (e.g., over shapes/poses). "we employ a 1.2B parameter flow transformer with the Mixture-of-Transformers (MoT) architecture"
  • FoundationPose: A pose estimation method used in pipeline comparisons for 3D object alignment. "Pipeline & HY3D-2.0 + FoundationPose"
  • ICP-Rot.: Rotation error derived from Iterative Closest Point alignment, assessing angular discrepancy between poses. "ICP-Rot. (\downarrow)"
  • Iso-3DO: A large dataset of isolated synthetic 3D objects rendered from multiple views for pretraining. "We call this dataset Iso-3DO and train for 2.5 trillion training tokens."
  • LiDAR: A sensor that measures distance using laser pulses to produce depth/pointmaps aiding layout estimation. "via hardware sensors (e.g, LiDAR on an iPhone)"
  • Megapose: A pose estimation framework used in pipeline baselines for scene layout. "Pipeline & Trellis + Megapose"
  • MIDI: A joint generative model for multi-object alignment used in layout comparisons. "Joint & MIDI"
  • Mixture-of-Transformers (MoT): An architecture that combines multiple transformer experts, often with specialized attention masks. "with the Mixture-of-Transformers (MoT) architecture~\citep{liang2025mixtureoftransformers,deng2025Bagel}"
  • model-in-the-loop (MITL): A data collection paradigm where model proposals guide human annotation, enabling iterative improvement. "using both a novel model-in-the-loop (MITL) pipeline and human 3D artists"
  • Objaverse-XL: A large-scale dataset of 3D assets used to render training pairs for pretraining. "Objaverse-XL~\citep{deitke2023objaverse}"
  • pictorial cues: Visual hints (e.g., shading, texture) in a single image that inform 3D shape perception. "then refines the shapes by integrating pictorial cues (see~\Cref{fig:method})."
  • pointmap: A per-pixel map of 3D points (depth) providing geometric scene context for pose estimation. "We conduct experiments with a Geometry model that is trained to condition on pointmaps."
  • ProcThor: A simulated environment/dataset used to train layout estimation capabilities. "ProcThor, RP-3DO^\ddagger"
  • rectified conditional flow matching: A training objective for generative models that matches conditional flows to target distributions. "rectified conditional flow matching~\citep{liu2022flow}"
  • render-and-compare: A pose refinement strategy that renders predictions and compares them to the input to improve alignment. "as in the render-and-compare approaches~\citep{labbe2022megapose,wen2024foundationpose}"
  • render-paste (RP-3DO): A semi-synthetic dataset produced by compositing rendered 3D assets into real images with masks and occlusions. "We call this data RP-3DO; it contains $61$ million samples with $2.8$ million unique meshes"
  • SA-3DAO: A benchmark of expert-annotated real-world image–3D pairs to evaluate shape and layout. "SAM 3D Artist Objects (SA-3DAO)."
  • SLAM: Simultaneous Localization and Mapping; a classical approach to reconstruct scenes and camera trajectories. "SLAM~\citep{smith1990estimating,castellanos1999spmap}."
  • sparse latent flow transformer: A transformer variant operating on sparse latent tokens trained via flow matching for detailed reconstruction. "A 600M parameter sparse latent flow transformer~\citep{xiang2025structured,peebles2023scalable} refines geometric details and synthesizes object texture."
  • structured latent space: A learned latent representation shared across decoders (e.g., mesh and splats) to enable multi-format outputs. "share the same structured latent space~\citep{xiang2025structured}"
  • Trellis: A state-of-the-art image-to-3D baseline model used for comparison. "We compare to the recent Trellis~\citep{xiang2025structured}"
  • ULIP: A perceptual similarity metric relating 3D assets and images via unified representation learning. "ULIP (\uparrow)"
  • Uni3D: A perceptual metric evaluating similarity between 3D reconstructions and input images. "Uni3D (\uparrow)"
  • VAE decoders: Variational Autoencoder decoders that map latents to explicit 3D representations (mesh or splats). "via a pair of VAE decoders $\mathcal D_m$, $\mathcal D_g$."
  • vIoU: Volumetric Intersection over Union, measuring overlap between volumetric reconstructions. "vIoU (\uparrow)"

Open Problems

We found no open problems mentioned in this paper.
