SpaceControl: Introducing Test-Time Spatial Control to 3D Generative Modeling (2512.05343v1)

Published 5 Dec 2025 in cs.CV and cs.AI

Abstract: Generative methods for 3D assets have recently achieved remarkable progress, yet providing intuitive and precise control over the object geometry remains a key challenge. Existing approaches predominantly rely on text or image prompts, which often fall short in geometric specificity: language can be ambiguous, and images are cumbersome to edit. In this work, we introduce SpaceControl, a training-free test-time method for explicit spatial control of 3D generation. Our approach accepts a wide range of geometric inputs, from coarse primitives to detailed meshes, and integrates seamlessly with modern pre-trained generative models without requiring any additional training. A controllable parameter lets users trade off between geometric fidelity and output realism. Extensive quantitative evaluation and user studies demonstrate that SpaceControl outperforms both training-based and optimization-based baselines in geometric faithfulness while preserving high visual quality. Finally, we present an interactive user interface that enables online editing of superquadrics for direct conversion into textured 3D assets, facilitating practical deployment in creative workflows. Find our project page at https://spacecontrol3d.github.io/

Summary

  • The paper introduces a training-free test-time method that injects user-specified geometric structures into latent spaces to control 3D asset generation.
  • The method integrates geometric signals with pretrained 3D generative models, balancing spatial fidelity and visual realism via a tunable control parameter (τ₀).
  • Empirical evaluations demonstrate state-of-the-art performance in spatial alignment and realism, with an interactive UI that supports real-time editing and prototyping.

SpaceControl: Test-Time Spatial Control in 3D Generative Modeling

Motivation and Context

The generation of 3D assets using generative models has advanced rapidly, notably through diffusion and flow-based architectures. However, existing 3D generative models generally lack fine-grained, intuitive spatial control over geometry at inference. Predominant conditioning modalities—text or image prompts—are inherently ambiguous or unnatural for specifying explicit geometry, limiting the applicability of automatic 3D creation tools for design and interactive workflows.

SpaceControl introduces a training-free, test-time controllable mechanism to bridge this gap. It injects user-specified geometric structures—ranging from coarse primitives (notably superquadrics) to detailed meshes—directly into the latent space of a state-of-the-art pre-trained 3D generative model (e.g., Trellis), enabling explicit spatial guidance of generated assets without any re-training (Figure 1).

Figure 1: SpaceControl achieves spatially controlled 3D asset generation from geometric primitives, enabling both rapid concept creation and fine-grained shape editing.

Methodology

Geometric Conditioning and Model Integration

SpaceControl acts entirely at inference: given a geometric signal (e.g., a superquadric decomposition or a mesh), it encodes this structure using Trellis' encoder to obtain a latent geometric representation. This geometric latent can be injected into the denoising trajectory of the pretrained rectified flow model at an arbitrary step, parametrized by a control variable τ₀, biasing the model’s generation toward the user-specified geometry. Language and (optionally) image prompts are also incorporated, supporting multi-modal conditioning for semantics and appearance.

This scheme adapts the SDEdit concept from 2D image diffusion to 3D: rather than relying on fine-tuning or post-hoc optimization, a test-time latent intervention steers the model. Because the approach is independent of the backbone's architecture and training, it can be reused with new geometric conditions or model backbones.
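
To make this test-time intervention concrete, the following is a minimal sketch of SDEdit-style latent injection into a rectified flow. It is an illustration under stated assumptions, not the authors' implementation: encode_structure and velocity_field are random stand-ins for Trellis' pretrained structure encoder and flow network, and the mapping between the paper's τ₀ knob and the injection time t_inject used here is assumed.

```python
# Illustrative sketch only: the encoder and velocity network are random
# stand-ins (not Trellis' pretrained components), and `t_inject` is a
# hypothetical proxy for the paper's tau_0 control parameter.
import torch
import torch.nn.functional as F

def encode_structure(occupancy: torch.Tensor) -> torch.Tensor:
    # Stand-in for the structure encoder: (B, 64, 64, 64) occupancy -> latent z_c.
    return F.avg_pool3d(occupancy.float().unsqueeze(1), kernel_size=4).flatten(1)

def velocity_field(z: torch.Tensor, t: float) -> torch.Tensor:
    # Stand-in for the pretrained rectified-flow velocity network v_theta(z_t, t).
    return torch.randn_like(z)

def inject_and_denoise(x_c: torch.Tensor, t_inject: float = 0.6,
                       num_steps: int = 25) -> torch.Tensor:
    """Blend the control latent with noise at time t_inject, then denoise to t = 0.

    With the rectified-flow interpolation z_t = (1 - t) * z_0 + t * eps, a smaller
    t_inject keeps the sample closer to the control geometry, while a larger
    t_inject leaves more denoising steps (and hence more freedom) to the model.
    """
    z_c = encode_structure(x_c)                   # latent of the control geometry
    eps = torch.randn_like(z_c)                   # Gaussian noise
    z = (1.0 - t_inject) * z_c + t_inject * eps   # injected latent z_{t_inject}
    ts = torch.linspace(t_inject, 0.0, num_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        # One Euler step along the learned flow from t_cur toward t_next.
        z = z + (t_next - t_cur) * velocity_field(z, float(t_cur))
    return z  # structure latent, decoded into an occupancy grid downstream

# Example: a voxelized control shape as a 64^3 occupancy grid.
x_c = (torch.rand(1, 64, 64, 64) > 0.5).float()
z = inject_and_denoise(x_c, t_inject=0.5)
print(z.shape)  # torch.Size([1, 4096]) with this stand-in encoder
```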

Controlling the Realism–Faithfulness Tradeoff

The parameter τ₀ exposes a continuous tradeoff between fidelity (alignment with the input geometry) and realism (adherence to the learned manifold of 3D assets). A small τ₀ allows more denoising and thus yields more realistic but less constrained assets; a higher τ₀ induces greater geometric faithfulness, at a potential cost in visual plausibility (Figure 2).


Figure 3: Qualitative comparison across baseline methods and SpaceControl; only SpaceControl achieves both spatial alignment and visual realism across diverse object categories and geometric controls.

Interactive User Interface

A novel interactive UI is presented, supporting online editing of superquadrics and direct preview and tuning of the τ₀ control parameter, enabling designers to manipulate geometry and observe the resulting 3D assets in real time (Figure 4).


Figure 5: Interactive user interface for real-time superquadric editing, strength selection, and multi-modal prompting.

Empirical Evaluation

Benchmarks and Baselines

SpaceControl is evaluated against leading baselines:

  • Spice-E: a Shap-E-based method with class-specific fine-tuning for primitive- or mesh-conditioned 3D generation.
  • Spice-E-T: an adaptation of Spice-E to the Trellis architecture.
  • Coin3D: a guidance-based approach combining multi-view diffusion and volumetric score distillation.

Benchmarks encompass ShapeNet classes (chairs, tables) and the Toys4K dataset for generalization and out-of-domain scenarios.

Metrics

  • Chamfer Distance (CD): shape fidelity to the input geometry (see the sketch after this list).
  • CLIP-I: semantic alignment to text prompts via CLIP image-text similarity.
  • FID/P-FID: Fréchet Inception Distance on rendered images and on point-cloud features, measuring realism.
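
As a concrete reference for the first metric, here is a minimal NumPy sketch of the symmetric L2 Chamfer Distance between two point sets. It is a simplified stand-in: practical evaluations sample many points from the control primitives and the generated mesh and typically use a KD-tree or GPU implementation, and the exact averaging convention used by the paper may differ.

```python
import numpy as np

def chamfer_distance(p: np.ndarray, q: np.ndarray) -> float:
    """Symmetric L2 Chamfer Distance between point sets p (N, 3) and q (M, 3)."""
    d2 = ((p[:, None, :] - q[None, :, :]) ** 2).sum(axis=-1)  # (N, M) squared distances
    # Mean squared nearest-neighbor distance in each direction, summed.
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

# Random point clouds standing in for points sampled from the input
# superquadrics and from the generated mesh.
p = np.random.rand(2048, 3)
q = np.random.rand(2048, 3)
print(chamfer_distance(p, q))
```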

Results

SpaceControl achieves state-of-the-art geometric faithfulness (lowest CD) while maintaining high realism (comparable or superior FID/P-FID) across seen and unseen categories. It does so without any model- or dataset-specific fine-tuning, demonstrating the effectiveness of pure test-time geometric guidance, and it retains semantic control via language and image prompts.

User studies further validate these results: participants overwhelmingly preferred SpaceControl generations for overall quality, spatial faithfulness, and visual realism.

Ablations and Analysis

  • Parameter τ₀: Increasing τ₀ improves spatial adherence monotonically but degrades realism beyond a moderate threshold. The optimal setting (e.g., τ₀ ∈ [4, 6]) depends on the use case.
  • Image Conditioning: Used primarily for appearance in the second stage of Trellis, enabling style transfer from 2D images to 3D models.
  • Spatial Alignment: SpaceControl maintains fine-grained spatial alignment even with arbitrarily oriented or non-axis-aligned control shapes, a setting that remains a limitation for all other baselines.

Practical and Theoretical Implications

SpaceControl provides a general, scalable paradigm for geometry-aware 3D generative modeling, decoupling spatial control from model retraining or optimization-based inference. From a practical perspective, designers and artists gain efficient, direct manipulation tools for prototyping and asset creation, supporting workflows from coarse sketching to precise, component-wise editing.

Theoretically, this work demonstrates that test-time latent space interventions can mediate structural controllability in high-capacity pre-trained 3D generative models without harming their realism. The methodology invites further exploration into latent space geometry, compositional controllability (e.g., local, part-aware, or semantic region guidance), and hybrid multimodal interfaces for content creation.

Future Directions

Future research may focus on:

  • Automated or adaptive control strength estimation to select optimal τ₀ for varying user intent or object categories.
  • Part-aware and region-specific spatial conditioning for fine-grained editing and hierarchical modeling.
  • Integrating hierarchical or learned geometric primitives beyond superquadrics for enhanced expressivity.
  • End-to-end differentiable editing pipelines for closed-loop user feedback.

Conclusion

SpaceControl is a versatile, training-free, inference-time method for explicit spatial control of 3D generative models. Through direct latent-space intervention, it enables controllable, realistic, and semantically coherent asset generation from arbitrary geometric inputs, addressing a substantial limitation of current 3D generative modeling pipelines. The results set a new empirical standard for spatial faithfulness and practical usability, paving the way for future advances in interactive, geometry-conditioned 3D content creation.

Explain it Like I'm 14

Overview

This paper introduces SpaceControl, a new way to guide 3D object generators so they make shapes the way you want. Instead of relying only on text like “a wooden chair,” SpaceControl lets you sketch simple 3D shapes (like blocks and rounded forms) or use existing 3D models, and the generator will follow that structure. Best of all, it works with powerful existing models and doesn’t need any retraining.

What was the paper’s main goal?

The researchers wanted to make 3D generation both easy to control and high quality. Their big questions were:

  • Can we add clear, precise shape control to pre-trained 3D models without retraining them?
  • Will this control work with many kinds of input, from rough sketches (simple 3D primitives) to detailed meshes?
  • Can users adjust how strongly the generator follows their input shape versus how realistic it looks?
  • Does this method beat existing approaches in how closely the generated shape matches the input?
  • Can it be used interactively by artists and designers?

How did they do it? (Methods explained simply)

Think of the 3D generator as a two-part artist:

  1. Structure first (the object’s shape),
  2. Appearance second (textures, colors, materials).

SpaceControl adds a “shape steering wheel” to the first part.

  • Simple shapes as controls: The method uses “superquadrics,” which are basic 3D forms like squished cubes and rounded cylinders defined by just a few numbers (size and roundness); a small voxelization sketch follows this list. You can also use full 3D meshes.
  • Grid of 3D pixels: Your input shape is turned into a 3D grid of tiny blocks (voxels), like Minecraft but smaller.
  • Latent space guidance: The model normally starts from random noise and “cleans” it step by step to form a believable object. SpaceControl encodes your shape into the model’s hidden code (“latent space”), blends it with noise at a chosen point in time, and then lets the model do its usual cleaning from there. This is like starting a painting from a sketch that’s been lightly smudged, instead of from a blank canvas.
  • A simple control knob (τ₀): This knob decides how much the model should stick to your input shape versus how much it should rely on its training to look realistic. Lower values = more realism, less strict shape matching. Higher values = closer to your input shape, sometimes slightly less natural-looking.
  • Appearance stage: After the structure is set, the model adds textures and colors. You can guide this with text (“a floral chair”) or even an image to get a specific style. The final result can be a mesh, a radiance field, or 3D Gaussians—different ways of representing 3D for various tools.
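
To make the superquadrics and voxel-grid ideas above concrete, here is a small sketch that voxelizes one superquadric into a 64×64×64 occupancy grid using Barr's inside-outside function. The scale and roundness values are illustrative, and this is a simplified stand-in for the paper's actual control pipeline (which also accepts full meshes).

```python
import numpy as np

def superquadric_occupancy(scales=(0.6, 0.6, 0.6), exponents=(1.0, 1.0), res=64):
    """Voxelize one superquadric into a res^3 binary occupancy grid."""
    a1, a2, a3 = scales      # half-extents along x, y, z
    e1, e2 = exponents       # roundness: ~1 gives an ellipsoid, <1 boxy, >1 pinched
    lin = np.linspace(-1.0, 1.0, res)
    x, y, z = np.meshgrid(lin, lin, lin, indexing="ij")
    # Barr's inside-outside function: f(x, y, z) <= 1 means the point is inside.
    f = (np.abs(x / a1) ** (2.0 / e2) + np.abs(y / a2) ** (2.0 / e2)) ** (e2 / e1) \
        + np.abs(z / a3) ** (2.0 / e1)
    return (f <= 1.0).astype(np.uint8)

# A boxy block (small exponents) versus a rounded form (exponents of 1).
boxy = superquadric_occupancy(exponents=(0.3, 0.3))
rounded = superquadric_occupancy(exponents=(1.0, 1.0))
print(boxy.shape, int(boxy.sum()), int(rounded.sum()))
```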

Importantly, SpaceControl doesn’t retrain the model. It adds guidance during generation (test time), so it’s fast and flexible.

What did they find and why is it important?

  • Better shape matching: SpaceControl produced objects that more closely follow the input shapes than other methods that required retraining (like Spice-E) or heavy optimization (like Coin3D). They measured this with the Chamfer Distance, which is a score for how close two shapes are; SpaceControl consistently had lower (better) scores.
  • Similar realism: Even while matching shapes more closely, SpaceControl kept visual quality high, measured by scores like FID (for textures) and P-FID (for geometry). It was usually comparable to the best alternatives.
  • People preferred it: In a user study with 52 volunteers, SpaceControl was chosen more often for overall quality and faithfulness to the input shape.
  • Flexible control: The τ₀ knob lets you tune the balance between accuracy to your input and realism. The paper shows how changing τ₀ affects quality, and suggests mid-range values often work well.
  • Strong alignment: SpaceControl aligned objects correctly even when the input shapes were rotated or not axis-aligned, which other methods sometimes failed to do.
  • Practical tool: They built an interface where you can edit superquadrics live and instantly generate textured 3D assets, which is great for design workflows.

This matters because artists, designers, and game developers often start with rough 3D sketches and need precise control. SpaceControl makes that easy without sacrificing visual quality.

What does this mean for the future?

SpaceControl could speed up creating 3D content for games, VR, simulations, and product design:

  • Faster iteration: Start from simple 3D sketches and quickly get detailed, realistic assets.
  • Precise edits: Adjust parts like a chair’s backrest or add armrests and see the changes immediately.
  • Multi-modal styling: Use text and images to push textures and looks while keeping the shape you want.

The authors note a couple of limitations and future ideas:

  • Choosing τ₀ is manual right now; automatic tuning could make it even smoother.
  • The control strength is uniform across the object; part-specific control (stronger in some regions, looser in others) could add more creative freedom.

Overall, SpaceControl shows that we can guide powerful 3D generators with simple, clear spatial inputs—no retraining required—making 3D creation more accessible and efficient.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of concrete gaps and open questions that remain unresolved and could guide future research:

  • Automatic control-strength selection: τ₀ is manually set per instance; develop data-driven or optimization-based procedures to select τ₀ automatically given the prompt/control geometry and desired realism/faithfulness targets.
  • Spatially varying adherence: the method enforces a uniform adherence level over the whole object; design part-aware or region-weighted guidance (e.g., per-voxel/part masks or schedules) to tightly constrain some regions while allowing freedom elsewhere.
  • Dynamic guidance schedules: current control uses a single start time t₀; investigate time-dependent guidance schedules τ(t) and adaptive strategies that modulate guidance during denoising to improve realism-faithfulness trade-offs.
  • Backbone generality: although claimed to be compatible with SAM 3D, results are only shown with Trellis; rigorously evaluate portability to other 3D generative backbones (e.g., SAM 3D, GET3D variants, mesh-native diffusion/flow models) and identify required adaptations.
  • Encoder dependence: the approach relies on Trellis’ structure encoder E, which is not used at inference in the base model; analyze sensitivity to encoder calibration and out-of-distribution (OOD) control geometries, and compare against alternative encoders or training-free encoders (e.g., errors from voxelization vs SDF/point encoders).
  • Representation bottleneck: structure is constrained to a 64×64×64 occupancy grid; quantify fidelity loss for thin structures and high-frequency details, and explore multi-resolution/hierarchical or higher-resolution structure representations (≥128³) to reduce aliasing.
  • Input modality breadth: control geometry is voxelized meshes or parametric superquadrics; extend to other inputs (point clouds, SDFs, NURBS/CAD, partial scans) and compare their robustness and fidelity after encoding.
  • Robustness to imperfect control: evaluate behavior under noisy, partial, misaligned, or scale-mismatched control geometries; develop canonicalization and uncertainty-aware guidance that tolerates imperfect inputs.
  • Conflict resolution: study failure modes when text/image semantics conflict with control geometry (e.g., geometry of a chair with “airplane” prompt) and propose principled conflict-handling strategies or constraints.
  • Diversity under fixed control: quantify and control sample diversity (appearance and permissible geometric variations) for a fixed spatial control; propose diversity-aware sampling or conditioning schemes that preserve adherence.
  • Appearance disentanglement: image conditioning is only used in the appearance stage and mainly affects texture; investigate stronger, controllable disentanglement of geometry/material/lighting and enable explicit material/BRDF and lighting controls beyond text/image prompts.
  • Realism degradation at high τ₀: high adherence settings degrade FID; design realism-preserving regularizers (e.g., discriminator guidance, priors, or hybrid losses) that maintain plausibility while retaining control.
  • Metric coverage and granularity: current metrics (CD, FID, P-FID, CLIP-I) overlook part-level alignment, normals/curvature, topology/manifoldness, watertightness, and multi-view consistency; introduce and report richer 3D metrics and automatic part-level alignment scores against the control.
  • Conversion quality across output formats: the paper lists GS/RF/mesh decoders but evaluates mainly meshes; systematically compare quality, artifacts, and consistency across formats, including manifoldness and editability of meshes for downstream pipelines.
  • Computational performance: claims of efficiency vs optimization-based methods lack runtime/memory benchmarks; report wall-clock latency, GPU memory, and throughput for different tau0 and resolutions, and compare to baselines.
  • Category and domain coverage: evaluations focus on ShapeNet chairs/tables and Toys4K; assess generalization to complex/topology-rich categories (e.g., bicycles, plants), real-world scans, and out-of-domain assets.
  • Scene-level control: the method targets single-object assets; extend to multi-object scenes with compositional spatial constraints (layout, collisions, occlusions) and evaluate scene realism and controllability.
  • Alignment and coordinate handling: aside from a qualitative example, there is no systematic study of pose/scale alignment errors; benchmark and improve alignment robustness, including automatic normalization of input control to model coordinates.
  • UI and usability evidence: the interface is presented but user-centered evaluation (latency, learning curve, time-to-target, iteration count, satisfaction) is missing; conduct formal HCI studies and integrate with DCC tools (e.g., Blender, Unreal) via plugins.
  • Training-free limits: analyze when latent injection fails (e.g., extreme OOD controls, very sparse primitives) and whether lightweight adapters or minimal fine-tuning could expand the controllable regime without sacrificing generalization.
  • Theoretical understanding: provide analysis of why/when latent-space seeding at t0 yields controllable flows; study the geometry of rectified flows under injected seeds and characterize stability/convergence vs guidance strength.
  • Benchmarking standardization: the custom evaluation uses SuperDec decompositions and Gemini-generated text; release the derived controls/prompts and establish standardized, open benchmarks for spatially controlled 3D generation to enable fair comparisons.
  • Perceptual-grounded evaluation: the user study measures preference but not task-oriented success; add perceptual tests targeting spatial accuracy (e.g., just-noticeable differences in part placement) and realism thresholds with statistically powered sample sizes.
  • Extension to dynamics: investigate whether the approach extends to non-rigid/animated assets (4D), enabling test-time spatial control over motion trajectories and articulated parts.

Glossary

  • 3D Gaussians (GS): Point-based 3D representation using Gaussian primitives for rendering and downstream decoding. "3D gaussians (GS), radiance fields (RF), and meshes (M) via specific decoders D_O = {D_GS, D_RF, D_M}."
  • Appearance Flow Model: The second-stage rectified-flow denoiser in Trellis that generates per-voxel appearance features conditioned by text or images. "denoised by the Appearance Flow Model (FM), using either text or image conditioning."
  • Binary occupancy grid: A discrete 3D voxel volume with 0/1 values indicating empty or occupied cells, representing object structure. "decoded by a decoder D into a binary occupancy grid x ∈ {0,1}^{64×64×64}"
  • Chamfer Distance (CD): A symmetric distance between two point sets used to measure geometric alignment or faithfulness. "Faithfulness to the spatial control is quantified using the L2 Chamfer Distance (CD) between vertices sampled from the input superquadric primitives and the generated mesh decoded by D_M."
  • CLIP similarity (CLIP-I): A metric measuring alignment between image renderings and textual prompts using CLIP embeddings. "Faithfulness to the textual control is quantified with the CLIP similarity (CLIP-I) between the renderings of generated assets and the textual prompts."
  • CLIP text encoder: The text feature extractor from CLIP used to condition generative models. "text conditions are encoded via the CLIP~\citep{radford2021learning} text encoder"
  • Coin3D: A guidance-based 3D generation pipeline that uses 2D single-view synthesis, multi-view diffusion, and SDS to reconstruct 3D. "Coin3D~\citep{dong2024coin3d} uses the shape-guidance to first generate a single view of the desired 3D asset, then leverage a Multi-View-Diffusion model to generate consistent multiple views, and finally extract the 3D representation using a volumetric-based score distillation sampling."
  • ControlNet: A method that adds a trainable control branch connected by zero convolutions to enable conditional guidance without forgetting pretrained knowledge. "ControlNet~\citep{zhang2023adding, bhat2024loosecontrol} which add conditional control to a section of the network by introducing a trainable copy connected to the original via zero convolutions."
  • Diffusion Transformer: A transformer architecture that predicts denoising dynamics (velocity fields) in diffusion/flow models. "the vector field v_θ(·) is predicted for example by a Diffusion Transformer~\citep{peebles2023scalable} as in Trellis~\citep{xiang2024structured} or SAM 3D~\citep{sam3dteam2025sam3d3dfyimages}."
  • DINOv2: A pretrained vision transformer used to encode image conditioning for generative models. "image conditions are encoded via DINOv2~\citep{oquab2024dinov2}"
  • Fréchet Inception Distance (FID): A distributional distance between sets of images assessing visual realism via Inception features. "Fréchet Inception Distance (FID)~\citep{heusel2017gans} on image renderings"
  • Latent space intervention: Injecting control by modifying model latents during inference to steer generation. "conditions a powerful pre-trained generative model (Trellis) on user-defined geometry via latent space intervention, enabling geometry-aware generation without the need for costly fine-tuning."
  • LatentNeRF: A guidance-based method that performs test-time optimization using latent diffusion and NeRF-style rendering. "guidance-based methods such as LatentNeRF \citep{metzer2023latent}"
  • Multi-View-Diffusion model: A diffusion approach that generates consistent multiple 2D views of an object for 3D reconstruction. "then leverage a Multi-View-Diffusion model to generate consistent multiple views"
  • P-FID: A point-cloud analogue of FID that evaluates geometric realism using point-based features. "P-FID~\citep{nichol2022point}, the point cloud analog for FID."
  • PointNet++: A hierarchical neural network for extracting features from point clouds. "PointNet++~\citep{qi2017pointnet++} features"
  • Radiance fields (RF): Implicit volumetric representations modeling color and density for view synthesis. "3D gaussians (GS), radiance fields (RF), and meshes (M)"
  • Rectified flow models: Flow-matching generative models that use a linear interpolation forward process and learn a velocity field to invert noise. "Rectified flow models use a linear interpolation forward (diffusion) process where for a specific time step t ∈ [0,1], the latent z_t can be expressed as z_t = (1 − t) z_0 + t ε"
  • SAM 3D: A recent two-stage rectified-flow 3D generative framework analogous to Trellis. "as in Trellis~\citep{xiang2024structured} or SAM 3D~\citep{sam3dteam2025sam3d3dfyimages}"
  • Score distillation sampling: An optimization technique that distills diffusion model gradients to fit 3D representations. "using a volumetric-based score distillation sampling"
  • SDEdit: A training-free editing method that restarts denoising from a partially noised input to follow coarse guidance. "SDEdit~\citep{meng2021sdedit} which uses stroke paintings to condition the generation of SDE-based generative models for images"
  • Shap-E: A neural generative model for 3D shapes that supports various output formats and can be fine-tuned for control. "by finetuning Shap-E~\citep{jun2023shap} separately on chairs, tables and airplanes from ShapeNet~\citep{chang2015shapenet}."
  • Spatial control: Explicit conditioning that constrains geometry during 3D generation. "introduce SpaceControl, a training-free test-time method for explicit spatial control of 3D generation."
  • Structure Flow Model: The first-stage rectified-flow denoiser in Trellis that generates occupancy structure from noise. "employing the original Structure Flow Model."
  • Superquadrics: A compact parametric family of shapes defined by scales and exponents, suitable as geometric primitives for control. "Superquadrics \citep{Barr1981SuperquadricsAA} provide a compact parametric family of shapes capable of representing diverse geometries."
  • Trellis: A two-stage 3D generative model that separately synthesizes structure and appearance using rectified flows. "Trellis~\citep{xiang2024structured} is a recent 3D generative model which employs rectified flow models to generate 3D assets from either textual or image conditioning."
  • Velocity field: The time-dependent vector field that drives the denoising trajectory in rectified flow models. "The backward (denoising) process is represented by a time dependent velocity field v(z_t, t) = ∇_t z_t."
  • Voxel grid: A structured, discrete 3D grid of voxels used to represent object occupancy or features. "outputs the voxel grid x_0"
  • Voxelization: The process of converting geometric input into a voxel grid for encoding and guidance. "we voxelize it to obtain x_c ∈ {0,1}^{64×64×64}"

Practical Applications

Overview

Below are practical, real-world applications that follow directly from the paper’s findings and innovations in SpaceControl—a training-free, test-time spatial control method for 3D generative modeling. Each item specifies sector alignment, concrete use cases, potential tools/workflows, and key assumptions or dependencies that affect feasibility.

Immediate Applications

These applications can be deployed now using the released Trellis/SAM‑3D-based SpaceControl workflow, the provided superquadrics UI, and standard DCC (digital content creation) tools.

    • Game and VFX asset authoring — Software/Entertainment: Accelerate “blockout-to-hero asset” workflows by turning coarse 3D sketches (superquadrics or rough meshes) and short prompts into high‑quality, textured meshes. Potential tools/workflows: SpaceControl Blender/Unreal/Unity plugins; “Sketch‑to‑Asset” microservice returning glTF/FBX; τ₀ knobs for realism‑faithfulness control; batch asset variant generation. Assumptions/dependencies: Requires pre‑trained Trellis/SAM‑3D weights, GPUs, and domain coverage in training data; voxelization (64³) may limit small details; licensing for commercial use.
    • AR/VR scene building and prototyping — Software/AR/VR: Rapidly populate immersive environments from simple spatial scaffolds while retaining precise placement and dimensions (chairs, tables, props). Potential tools/workflows: In‑editor parametric superquadrics gizmos; image‑conditioned texture transfer for brand fidelity; exports to meshes, radiance fields, or 3D Gaussians. Assumptions/dependencies: Consumes platform‑specific decoders; texture realism primarily driven by appearance stage; performance tuning for runtime constraints.
    • Product and furniture design configurators — Manufacturing/Retail/E‑commerce: Let designers or customers specify dimensions and silhouette via superquadrics and generate photoreal variants (e.g., add armrests, adjust backrest height). Potential tools/workflows: “Parametric Furniture Studio” web app; CAD handoff via mesh export; τ₀ presets for “exact geometry” vs “stylized concept.” Assumptions/dependencies: Structural validity not guaranteed; requires downstream checks for manufacturability, tolerances, and watertight meshes for CAD/3D printing.
    • Interior design visualization — Architecture/Design: Convert massing models into textured furniture/fixtures consistent with design language using image‑conditioned appearance; precisely control footprint and arrangement. Potential tools/workflows: SketchUp/Rhino integration; reference image mood boards driving textures; quick A/B scene variants via τ₀ sweeps. Assumptions/dependencies: Scale and code compliance must be handled downstream; realism vs faithfulness trade‑offs need designer oversight.
    • E‑commerce 3D content production — Retail/Marketing: Generate product 3D assets with accurate silhouettes and brand‑consistent textures from product photos and dimensional specs. Potential tools/workflows: “3D Asset Factory” microservice; CLIP‑prompt catalog; image‑conditioned texture consistency across SKUs; glTF/WebGL delivery. Assumptions/dependencies: IP ownership of training/reference images; alignment to real‑world dimensions requires QA.
    • Synthetic data creation for vision — AI/ML: Produce labeled, geometry‑controlled assets for training object detectors/segmenters; tune τ₀ for diversity vs shape precision; use CLIP‑I/P‑FID for automated QA. Potential tools/workflows: Dataset generator that emits meshes plus renderings and masks; programmatic primitive generation for coverage. Assumptions/dependencies: Domain shift vs target environment; requirement to annotate materials or affordances separately.
    • Robotics simulation assets — Robotics: Quickly generate obstacles and manipulables with precise shape constraints for physics simulators (e.g., cluttered environments, furniture layouts). Potential tools/workflows: “World Builder” that ingests primitive layouts and outputs textured meshes; τ₀ presets for “strict geometry” worlds. Assumptions/dependencies: Physical properties (mass, friction, articulation) not modeled; must pair with physics parameterization.
    • Texture/style transfer for 3D assets — Design/Branding: Use reference images to transfer appearance while holding geometry fixed; ideal for brand look consistency across generated variants. Potential tools/workflows: Batch texture pipeline using image‑conditioned appearance stage; look‑book driven styling. Assumptions/dependencies: Image conditioning primarily affects texture, not geometry; reference image rights required.
    • Education and onboarding to 3D modeling — Education: Lower barrier for novices: edit superquadrics, add a short prompt, generate a textured model; teach realism‑faithfulness trade‑offs via τ₀. Potential tools/workflows: “SpaceControl Classroom” sandbox; rubric based on CD/CLIP‑I/FID for constructive feedback. Assumptions/dependencies: Requires accessible hardware or cloud; curated prompt/reference libraries; content safety filters.
    • Rapid prototyping for 3D printing (concept stage) — Maker/DIY: Move from conceptual shapes to printable meshes for early evaluation; enforce strict geometry via higher τ₀. Potential tools/workflows: STL export; auto‑watertight and thickness checks via downstream tools. Assumptions/dependencies: Structural integrity and tolerances must be validated; post‑processing for printability needed.

Long-Term Applications

These use cases require further research, scaling, or development (e.g., part‑aware control, physical validity, automatic τ₀ tuning, higher‑resolution geometry).

    • CAD co‑pilot with constraint‑aware generation — Manufacturing/Engineering: Generate parametric solids respecting dimensions, tolerances, and mating constraints; per‑part control to lock critical regions while stylizing others. Potential tools/workflows: SpaceControl‑for‑CAD with per‑region τ₀ maps; constraint solver integration; STEP export. Assumptions/dependencies: Requires part‑aware control, solid modeling kernels, and manufacturability checks.
    • Interior layout optimization and compliance — Architecture/Policy: Auto‑generate room layouts meeting accessibility and building codes while preserving designer’s spatial scaffolds. Potential tools/workflows: Constraint‑driven τ₀ tuning; rule‑checking engines; BIM integration. Assumptions/dependencies: Formal encoding of regulations; accurate scale and metadata; human review.
    • Industrial digital twins and asset libraries — Energy/Industrial IoT: Build large, faithful 3D libraries of equipment with controlled geometry for simulations and operator training. Potential tools/workflows: Domain‑specific fine‑tuning or adapters; PBR material pipelines; provenance tags. Assumptions/dependencies: Industry‑specific datasets; physics/material correctness; lifecycle management.
    • Autonomous driving simulation world synthesis — Automotive: Generate varied, constraint‑controlled urban assets (street furniture, signage) for simulation and testing coverage. Potential tools/workflows: Procedural primitive layouts + SpaceControl; scenario parameter sweeps; sensor rendering pipelines. Assumptions/dependencies: Dynamic behavior and traffic rules not covered; large‑scale asset QA.
    • Healthcare prosthetics and aids (concept design) — Healthcare: Create patient‑specific conceptual shapes respecting anatomical constraints; later refined into clinically valid devices. Potential tools/workflows: Scan‑to‑primitive decomposition + SpaceControl; clinical approval workflows. Assumptions/dependencies: Medical validation, biocompatibility, and regulatory compliance; high‑resolution geometry needed.
    • Object synthesis with affordances for robot manipulation — Robotics: Generate objects with graspable regions and tool interfaces based on spatial constraints and task goals. Potential tools/workflows: Affordance‑guided per‑part τ₀; physics simulators; grasp planners. Assumptions/dependencies: Requires physical property modeling and semantic part labeling; task‑specific training.
    • On‑device real‑time generation for AR — Mobile/AR: Deliver interactive, spatially constrained 3D generation on phones/glasses (e.g., interior preview in situ). Potential tools/workflows: Model compression/distillation; streaming decoders; low‑latency voxelization. Assumptions/dependencies: Compute and memory limits; privacy/security for reference images.
    • Marketplace for parametric, controllable 3D assets — Platforms: Offer user‑generated assets with provenance tags and adjustable geometry controls; embed licensing and watermarking. Potential tools/workflows: SpaceControl API; content moderation; provenance standards. Assumptions/dependencies: Clear IP frameworks, dataset transparency, and trust mechanisms.
    • Automatic realism‑faithfulness tuning — AI/Optimization: Learn τ₀ selection from task objectives (e.g., minimize CD under realism constraints) or user preferences. Potential tools/workflows: Bayesian optimization or reinforcement learning over τ₀; multi‑objective scoring. Assumptions/dependencies: Reliable objective metrics; sufficient compute for iterative tuning.
    • 4D dynamic asset generation — Animation/Simulation: Extend spatial control to time‑varying shapes (articulation, deformation) with consistent appearance. Potential tools/workflows: Temporal flow models; sequence decoders; per‑frame part‑aware control. Assumptions/dependencies: New training regimes; increased compute; temporal coherence constraints.
    • Reverse engineering from partial scans — AEC/Industrial: Decompose LiDAR/RGB‑D scans into primitives and regenerate faithful, textured models that fill missing data. Potential tools/workflows: Robust superquadric decomposition; completion via SpaceControl; QA tools for accuracy. Assumptions/dependencies: Scan quality; domain‑specific priors; tolerance to noise and occlusion.
    • Standards and policy for synthetic 3D content — Policy/Compliance: Establish provenance, disclosure, and quality benchmarks for generated assets used in commerce, safety‑critical simulations, or public communications. Potential tools/workflows: Asset watermarking; audit trails; standardized metrics (CD, P‑FID, CLIP‑I) in QA pipelines. Assumptions/dependencies: Cross‑industry coordination; regulatory adoption; transparent dataset documentation.

Notes on Assumptions and Dependencies

  • Pretrained model reliance: SpaceControl depends on models like Trellis/SAM‑3D, their encoders/decoders (CLIP, DINOv2), and their training domain coverage.
  • Geometry resolution: Current voxelization (64×64×64) and latent sizes may miss very fine details; exporting to high‑res meshes may require post‑processing.
  • Human‑in‑the‑loop: The τ₀ parameter is manually tuned; automated selection and per‑part control are future work.
  • Physical validity: Outputs are visually realistic but do not guarantee structural integrity or physical properties; downstream validation is needed for manufacturing/simulation.
  • IP and licensing: Use of training/reference images and model weights must respect licensing and provenance requirements; content moderation may be required.
  • Compute constraints: Real‑time workflows need GPUs; mobile/on‑device scenarios require model compression and optimization.
  • Integration readiness: DCC/CAD/BIM pipelines require format compatibility (glTF/FBX/STL/STEP) and potential custom decoders; QA with metrics (CD, FID, P‑FID, CLIP‑I) should be built into pipelines.

Open Problems

We found no open problems mentioned in this paper.
