Steerable Scene Generation with Post Training and Inference-Time Search (2505.04831v1)

Published 7 May 2025 in cs.RO, cs.GR, and cs.LG

Abstract: Training robots in simulation requires diverse 3D scenes that reflect the specific challenges of downstream tasks. However, scenes that satisfy strict task requirements, such as high-clutter environments with plausible spatial arrangement, are rare and costly to curate manually. Instead, we generate large-scale scene data using procedural models that approximate realistic environments for robotic manipulation, and adapt it to task-specific goals. We do this by training a unified diffusion-based generative model that predicts which objects to place from a fixed asset library, along with their SE(3) poses. This model serves as a flexible scene prior that can be adapted using reinforcement learning-based post training, conditional generation, or inference-time search, steering generation toward downstream objectives even when they differ from the original data distribution. Our method enables goal-directed scene synthesis that respects physical feasibility and scales across scene types. We introduce a novel MCTS-based inference-time search strategy for diffusion models, enforce feasibility via projection and simulation, and release a dataset of over 44 million SE(3) scenes spanning five diverse environments. Website with videos, code, data, and model weights: https://steerable-scene-generation.github.io/

Summary

  • The paper introduces a unified diffusion model that generates diverse, physically grounded 3D scenes critical for simulation-based robotic training.
  • It employs reinforcement learning post-training, conditional generation, and inference-time MCTS to steer scene synthesis toward task-specific objectives.
  • It demonstrates strong improvements over baselines on quantitative metrics such as FID, CA, and APF, while producing simulation-ready scenes without inter-object penetrations.

This paper introduces a comprehensive framework for generating diverse, physically plausible, and steerable 3D scenes for robotic applications, particularly simulation-based training. Recognizing the challenge of manually curating or procedurally generating task-specific, high-clutter environments with strict physical requirements, the authors propose training a unified generative model that can then be adapted to downstream objectives.

The core of the approach is a diffusion-based generative model trained on a massive dataset of procedurally generated SE(3) scenes. A scene is represented as an unordered set of objects, each defined by a category from a fixed library and a continuous SE(3) pose (translation $\mathbf{p} \in \mathbb{R}^3$ and rotation $\mathbf{R}$, represented as a 9D vector projected to SO(3)). The model utilizes a mixed discrete-continuous diffusion framework [MiDiffusion] and a Flux-style architecture [flux2024] modified to preserve object-order equivariance by omitting positional encodings.
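
To make the pose parameterization concrete, below is a minimal sketch of how a raw 9D rotation output can be projected onto SO(3) via SVD-based symmetric orthogonalization; the exact projection used by the authors may differ, and the function name is illustrative.

```python
import numpy as np

def project_to_so3(m9: np.ndarray) -> np.ndarray:
    """Project a raw 9D output onto the nearest rotation matrix in SO(3).

    Symmetric orthogonalization via SVD: R = U * diag(1, 1, det(U V^T)) * V^T,
    which enforces orthogonality and det(R) = +1.
    """
    M = m9.reshape(3, 3)
    U, _, Vt = np.linalg.svd(M)
    d = np.sign(np.linalg.det(U @ Vt))
    return U @ np.diag([1.0, 1.0, d]) @ Vt

# A noisy 9D prediction snapped back onto a valid rotation.
raw = np.eye(3).ravel() + 0.1 * np.random.randn(9)
R = project_to_so3(raw)
assert np.allclose(R @ R.T, np.eye(3)) and np.isclose(np.linalg.det(R), 1.0)
```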

A significant practical consideration is ensuring physical feasibility, as generative models can produce scenes with inter-object penetrations or unstable configurations. The framework includes a crucial post-processing step after generation. This involves a non-linear optimization to project object translations to the nearest collision-free configuration, followed by a physics simulation (using Drake [drake]) to allow unstable objects to settle under gravity, guaranteeing physically plausible and simulation-ready scenes.
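
The toy sketch below illustrates the structure of this two-stage step on sphere-approximated objects; it is not the authors' implementation, which uses a nonlinear optimization over object translations followed by a full Drake physics rollout.

```python
import numpy as np

def make_feasible(positions: np.ndarray, radii: np.ndarray,
                  table_z: float = 0.0, iters: int = 100) -> np.ndarray:
    """Crude two-stage feasibility step on sphere-approximated objects.

    Stage 1 ("projection"): iteratively push apart overlapping pairs until the
    configuration is collision-free (a stand-in for the nonlinear optimization).
    Stage 2 ("settling"): rest each object on the table plane (a stand-in for
    letting unstable objects settle under gravity in simulation). The two toy
    stages can interact; the real optimization + simulation handles this properly.
    """
    p = positions.copy()
    n = len(p)
    for _ in range(iters):
        moved = False
        for i in range(n):
            for j in range(i + 1, n):
                d = p[i] - p[j]
                dist = np.linalg.norm(d)
                overlap = radii[i] + radii[j] - dist
                if overlap > 0:
                    # Split the separating correction evenly between the pair.
                    push = (d / (dist + 1e-9)) * (overlap / 2 + 1e-6)
                    p[i] += push
                    p[j] -= push
                    moved = True
        if not moved:
            break
    p[:, 2] = table_z + radii  # "settle" each sphere onto the table surface
    return p

positions = np.array([[0.00, 0.0, 0.10],
                      [0.02, 0.0, 0.12]])   # two heavily overlapping objects
radii = np.array([0.05, 0.05])
print(make_feasible(positions, radii))
```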

A key contribution is the exploration of methods to steer the pretrained model towards downstream goals, even when these goals differ from the original training data distribution. Three complementary strategies are presented:

  1. Reinforcement Learning (RL)-based Post Training: The continuous diffusion model is fine-tuned using DDPO [black2023ddpo] with task-specific rewards, such as object count (clutter). This process adapts the underlying generative distribution itself. The authors demonstrate successful extrapolation to scenes with significantly higher object counts than seen during pretraining. They note the use of a continuous diffusion model for compatibility with existing RL methods and apply regularization [zhang2024largescalereinforcementlearningdiffusion] for stability.
  2. Conditional Generation: The model supports flexible conditioning at inference time. This includes text conditioning, where BERT embeddings [bert] of text prompts are injected into the model, and classifier-free guidance (CFG) [ho2022classifierfreediffusionguidance] is used to steer generation. It also includes structured inpainting, where missing parts of a scene (objects or attributes) are synthesized while clamping unmasked components, enabling scene completion and rearrangement tasks.
  3. Inference-Time Search via MCTS: A novel Monte Carlo Tree Search procedure steers generation toward an objective by iteratively improving a scene. Starting from a masked scene (e.g., fully masked), MCTS explores partial scene completions. At each step, it selects a child node (a partially inpainted scene), expands it by sampling multiple completions (inpainting the masked objects), scores a rollout scene with a task-specific reward (e.g., the number of physically feasible objects), and backpropagates the reward to update tree statistics. The mask generator and reward function are modular components that can be tailored to different objectives; a schematic sketch of this loop follows the list.
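
The sketch below shows one way such a search loop can be organized; `complete`, `make_mask`, and `reward` are abstract stand-ins for the diffusion inpainting call, the paper's modular mask generator, and the task reward, and the selection rule and rollout handling are generic MCTS choices rather than the authors' exact ones.

```python
import math
import random
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Node:
    scene: object                 # partially inpainted scene (opaque here)
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    total_reward: float = 0.0

def ucb(node: Node, c: float = 1.4) -> float:
    """Upper-confidence bound used for child selection."""
    if node.visits == 0:
        return float("inf")
    return (node.total_reward / node.visits
            + c * math.sqrt(math.log(node.parent.visits) / node.visits))

def mcts_search(root_scene, complete: Callable, make_mask: Callable,
                reward: Callable, iterations: int = 50, branching: int = 4):
    """Schematic MCTS over partial scene completions."""
    root = Node(scene=root_scene)
    best_scene, best_reward = root_scene, float("-inf")
    for _ in range(iterations):
        # Selection: descend by UCB until reaching a leaf node.
        node = root
        while node.children:
            node = max(node.children, key=ucb)
        # Expansion: sample several completions of the masked parts.
        mask = make_mask(node.scene)
        for _ in range(branching):
            node.children.append(Node(scene=complete(node.scene, mask), parent=node))
        # Rollout + scoring on one sampled child.
        child = random.choice(node.children)
        r = reward(child.scene)
        if r > best_reward:
            best_scene, best_reward = child.scene, r
        # Backpropagation of the reward up to the root.
        n = child
        while n is not None:
            n.visits += 1
            n.total_reward += r
            n = n.parent
    return best_scene, best_reward

# Toy usage with dummy callables (scenes are just lists of floats here).
best, score = mcts_search(
    root_scene=[],
    complete=lambda s, m: s + [random.random()],
    make_mask=lambda s: None,
    reward=lambda s: sum(s),
)
print(len(best), score)
```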

The authors generated and released a large-scale dataset of over 44 million unique SE(3) scenes across five diverse scene types (Breakfast Table, Dimsum Table, Living Room Shelf, Pantry Shelf, Restaurant, including low/high clutter variants) using probabilistic context-free scene grammars [greg_thesis]. This dataset significantly surpasses the scale and dimensionality (SE(3) vs SE(2)) of existing scene datasets used in generative modeling. Training data generation was parallelized across numerous CPUs to achieve this scale, highlighting the computational cost of procedural generation compared to sampling from the trained generative model. Text prompts for conditional generation were also automatically generated using a rule-based system based on scene content.
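
For intuition on the data-generation side, here is a toy probabilistic context-free scene grammar and sampler. The rules are invented for illustration and only produce object categories, whereas the grammars in [greg_thesis] also govern spatial placement.

```python
import random

# Toy scene grammar: each nonterminal maps to (probability, expansion) pairs;
# symbols not in the grammar are terminal asset-library categories.
GRAMMAR = {
    "TABLE": [(0.5, ["PLACE_SETTING", "PLACE_SETTING"]),
              (0.5, ["PLACE_SETTING", "PLACE_SETTING", "CLUTTER"])],
    "PLACE_SETTING": [(0.7, ["plate", "fork", "knife"]),
                      (0.3, ["bowl", "spoon"])],
    "CLUTTER": [(0.6, ["mug"]),
                (0.4, ["mug", "CLUTTER"])],   # recursion yields variable clutter
}

def sample(symbol: str) -> list:
    """Recursively expand a symbol into a list of terminal object categories."""
    if symbol not in GRAMMAR:          # terminal: an asset category
        return [symbol]
    r, acc = random.random(), 0.0
    for prob, expansion in GRAMMAR[symbol]:
        acc += prob
        if r <= acc:
            break
    return [obj for child in expansion for obj in sample(child)]

print(sample("TABLE"))  # e.g. ['plate', 'fork', 'knife', 'bowl', 'spoon', 'mug']
```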

Experimental evaluation using image-based metrics (FID, CA) on semantic renderings, along with KL divergence, average prompt-following accuracy (APF), and median total penetration (MTP), demonstrates the effectiveness of the approach. The proposed model achieves strong unconditional and conditional generation quality, outperforms baselines in key metrics (lower CA, FID, MTP, higher APF), and successfully adapts scene distributions via RL post-training and inference-time search. Qualitative results and supplementary videos demonstrate that generated scenes are physically plausible and simulation-ready for robotic manipulation. The authors also show that the model learns the data distribution rather than memorizing training scenes by comparing generated samples to their nearest neighbors using Optimal Transport distance.
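
As a concrete illustration of the memorization check, the sketch below computes an optimal-transport-style distance between two equal-size scenes by optimally matching object translations; the features and ground metric used by the authors may differ, and this translation-only version only shows the idea.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def scene_ot_distance(a: np.ndarray, b: np.ndarray) -> float:
    """OT-style distance between two scenes with the same object count.

    With uniform weights over objects, optimal transport reduces to a
    minimum-cost bipartite matching on pairwise distances (here, translations).
    """
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, N)
    rows, cols = linear_sum_assignment(cost)
    return float(cost[rows, cols].mean())

def nearest_training_scene(generated: np.ndarray, training_set: list) -> tuple:
    """Find the training scene closest to a generated one (memorization check)."""
    dists = [scene_ot_distance(generated, t) for t in training_set]
    i = int(np.argmin(dists))
    return i, dists[i]

# Toy check: one generated 3-object scene against two training scenes.
gen = np.random.rand(3, 3)
train = [np.random.rand(3, 3), np.random.rand(3, 3)]
print(nearest_training_scene(gen, train))
```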

Implementation details include using 8 NVIDIA A100 GPUs for training (40GB or 80GB depending on scene complexity), AdamW optimizer, cosine learning rate schedule, and full precision training for better accuracy on physically grounded metrics. The model architecture involves 15 Transformer blocks and has 88.3 million parameters. Training times ranged from ~1 to 1.65 days. MCTS parameters include branching factor and number of iterations; mask generation involves checking for penetration and static equilibrium using Drake.
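
The reported setup can be summarized as a small config object; fields marked "assumed" are illustrative placeholders not specified in the summary above.

```python
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    """Training setup as reported; 'assumed' fields are placeholders."""
    num_gpus: int = 8                 # NVIDIA A100, 40 GB or 80 GB per scene type
    transformer_blocks: int = 15      # ~88.3 million parameters total
    optimizer: str = "AdamW"
    lr_schedule: str = "cosine"
    full_precision: bool = True       # for accuracy on physically grounded metrics
    learning_rate: float = 1e-4       # assumed: not reported in this summary
    train_days: float = 1.65          # upper end of the reported ~1-1.65 day range

print(TrainingConfig())
```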

Limitations acknowledged include the potential mismatch between procedural data and real-world complexity, the use of continuous-only models for current RL post-training, the focus on rigid bodies (extending to articulated objects is future work), and the need for more sophisticated task-specific objectives and scaling to large-scale autonomous robot training.

Overall, the work presents a practical and scalable approach to creating diverse, physically grounded, and steerable 3D scenes for robot learning in simulation: large-scale procedural data is distilled into a flexible diffusion prior, which is then steered toward downstream objectives through post-training, conditioning, and inference-time search.