The paper "URDFormer: A Pipeline for Constructing Articulated Simulation Environments from Real-World Images" (Chen et al., 19 May 2024 ) introduces a novel system for generating realistic, diverse, and controllable articulated simulation environments directly from real-world RGB images. This addresses a critical bottleneck in data-driven approaches for robotics, computer vision, and other domains: the manual and labor-intensive process of creating high-quality simulation assets with accurate physical and kinematic properties.
The core idea is to train an inverse model that can infer a structured scene description, specifically in the Unified Robot Description Format (URDF), from a single real-world image. Since large datasets of real-world images paired with corresponding URDFs do not exist, the authors propose an "inversion through synthesis" approach. This involves a two-phase pipeline:
- Forward Phase (Data Generation): Create a large, paired dataset of structured simulation scenes (represented as URDFs, $z$) and corresponding realistic RGB images ($x$). This is achieved by rendering procedurally generated or existing simulated scenes and then using controllable text-to-image generative models to augment these renders into visually realistic images while preserving the underlying structure.
- Inverse Phase (Model Training): Train a neural network, named URDFormer, on this synthetic paired dataset to learn the mapping from realistic images ($x$) back to the structured simulation scene descriptions ($z$).
Problem Formulation
The scene structure $z$ is defined as a collection of objects, each specified by its class label, 3D bounding box, 3D transform, kinematic parent, and joint type. This representation is akin to URDF, commonly used for describing robots and articulated objects. The challenge is inferring this complex structure $z$ from a simple observation $x$ (such as an image), which is the result of an unknown forward function $x = f(z)$. The lack of real-world $(x, z)$ pairs necessitates the synthetic data generation approach.
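To make the representation concrete, below is a minimal Python sketch of such a per-object record; the field names and types are illustrative assumptions, not the paper's actual data format:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class JointType(Enum):
    FIXED = "fixed"
    PRISMATIC = "prismatic"   # e.g., a sliding drawer
    REVOLUTE = "revolute"     # e.g., a hinged cabinet door


@dataclass
class SceneObject:
    """One primitive in the structured scene description z."""
    class_label: str                      # e.g., "cabinet", "drawer", "oven"
    bbox_size: tuple                      # 3D bounding-box extents (w, h, d)
    transform: list                       # 4x4 pose relative to the kinematic parent
    parent: Optional[int] = None          # index of the parent object; None for a root
    joint_type: JointType = JointType.FIXED


@dataclass
class Scene:
    """A collection of objects; roughly what would be serialized to URDF."""
    objects: List[SceneObject] = field(default_factory=list)
```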
Controlled Generation of Paired Datasets
To overcome the lack of real-world paired data, URDFormer leverages the capabilities of pre-trained controllable generative models (like Stable Diffusion). Simulated renders, while structurally accurate, often lack visual realism. Generative models can enhance these renders, but naive application can alter structural details. The authors propose a controlled generation process that differentiates between scene-level and object-level data generation:
- Scene-Level: Render an entire simulated scene (e.g., a kitchen) and use a text-to-image diffusion model guided by the rendered image and a text prompt. The diffusion model is conditioned to maintain the global layout from the render but adds realistic textures and details. This process might change low-level object details or categories, so the resulting paired data contains complete images ($x$) but only partial labels ($\tilde{z}$), including high-level object bounding boxes, transforms, and parents, but not accurate low-level part details.
- Object-Level: For individual articulated objects (e.g., cabinets with drawers and doors), the generative process needs to preserve fine-grained part structures. Instead of full image generation, a texture-guided approach is used. Diverse texture images are generated or sourced (Appendix A). These textures are then overlaid onto the rendered object parts using perspective warping based on the known geometry from the simulation. Generative models are then used for background generation and for smoothing boundaries, ensuring consistency at the part level. This results in partial images ($\tilde{x}$, focusing on a single object) but complete labels ($z$) for that object and its parts.
This controlled generation yields two datasets: $\mathcal{D}_{\text{scene}} = \{(x, \tilde{z})\}$ and $\mathcal{D}_{\text{object}} = \{(\tilde{x}, z)\}$.
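The object-level texture overlay can be illustrated with a standard perspective warp. The following is a minimal OpenCV sketch under the assumption that the part's 2D corner coordinates are known from the simulator; the file names and coordinates are placeholders, and the paper additionally uses generative models for backgrounds and boundary smoothing:

```python
import cv2
import numpy as np

# Rendered object image and a generated texture crop (hypothetical file names).
render = cv2.imread("cabinet_render.png")
texture = cv2.imread("generated_wood_texture.png")

# Known 2D corners of one part (e.g., a drawer front) in the render,
# available from the simulator's geometry; the values here are placeholders.
part_corners = np.float32([[120, 80], [260, 82], [258, 190], [118, 188]])

# Corners of the texture image to be mapped onto that part.
h, w = texture.shape[:2]
tex_corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]])

# Perspective-warp the texture into the part's image region.
H = cv2.getPerspectiveTransform(tex_corners, part_corners)
warped = cv2.warpPerspective(texture, H, (render.shape[1], render.shape[0]))

# Composite: replace the part's pixels with the warped texture.
mask = cv2.warpPerspective(np.ones((h, w), np.uint8) * 255, H,
                           (render.shape[1], render.shape[0]))
augmented = render.copy()
augmented[mask > 0] = warped[mask > 0]
cv2.imwrite("cabinet_textured.png", augmented)
```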
Learning Inverse Generative Models (URDFormer Architecture)
The URDFormer architecture is designed to process images and predict URDF primitives, using the two partially complete datasets. Both the scene-level (Global URDFormer) and object-level (Part URDFormer) models share the same fundamental architecture but are trained on their respective datasets.
The architecture processes an input image:
- A Vision Transformer (ViT) extracts global image features.
- Bounding boxes corresponding to objects or parts are provided (either ground truth during training or from a detection model during inference).
- ROI alignment extracts features for each bounding box.
- Box features are combined with learned embeddings of the bounding box coordinates.
- A Transformer processes these features to produce a representation for each object/part.
- An MLP decodes each object/part feature into:
  - An optional base class label (used in object-level prediction).
  - A discretized 3D position and bounding box relative to its parent.
  - Learned child and parent embeddings.
- Hierarchical relationships (parent-child) are predicted using a scene graph generation technique: computing dot products between parent and child embeddings to form a relationship score matrix.
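A rough PyTorch sketch of this per-image computation is shown below. It is a schematic reconstruction from the description above rather than the authors' implementation: a small convolutional backbone stands in for the ViT, and all dimensions, bin counts, and head sizes are assumptions:

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align


class URDFormerSketch(nn.Module):
    def __init__(self, d_model=256, num_classes=32, num_pos_bins=64):
        super().__init__()
        # Stand-in backbone; the paper uses a Vision Transformer for global features.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.box_embed = nn.Linear(4, d_model)           # embed box coordinates
        self.fuse = nn.Linear(2 * d_model, d_model)      # combine ROI + box features
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        # Heads: class label, discretized position/size, parent/child embeddings.
        self.class_head = nn.Linear(d_model, num_classes)
        self.position_head = nn.Linear(d_model, 3 * num_pos_bins)
        self.parent_embed = nn.Linear(d_model, d_model)
        self.child_embed = nn.Linear(d_model, d_model)

    def forward(self, image, boxes):
        # image: (1, 3, H, W); boxes: (N, 4) in pixel coordinates for that image.
        feat = self.backbone(image)                      # (1, C, H/4, W/4)
        roi = roi_align(feat, [boxes], output_size=(7, 7),
                        spatial_scale=0.25).mean(dim=(2, 3))   # (N, C)
        tokens = self.fuse(torch.cat([roi, self.box_embed(boxes)], dim=-1))
        tokens = self.transformer(tokens.unsqueeze(0)).squeeze(0)  # (N, C)
        class_logits = self.class_head(tokens)
        position_logits = self.position_head(tokens)
        # Parent-child relationship scores via dot products (N x N score matrix).
        rel_scores = self.parent_embed(tokens) @ self.child_embed(tokens).T
        return class_logits, position_logits, rel_scores


# Example: one 256x256 image with three detected boxes.
model = URDFormerSketch()
img = torch.rand(1, 3, 256, 256)
boxes = torch.tensor([[10., 20., 80., 120.],
                      [90., 30., 150., 110.],
                      [40., 130., 200., 240.]])
cls, pos, rel = model(img, boxes)
```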
For scene-level prediction, special learned embeddings for root objects (walls, floor, ceiling) are included to attach scene objects. At test time, a real image is fed to a detection model to get initial bounding boxes. The Global URDFormer uses these boxes and the image to predict the high-level scene structure (object positions and parents). Then, regions corresponding to predicted objects are cropped, a second detection model finds part-level boxes, and the Part URDFormer is applied to each object crop and its part boxes to predict the detailed kinematic structure of its parts.
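The two-stage inference described above can be summarized schematically as follows; `detect_objects`, `detect_parts`, and the model call signatures are hypothetical stand-ins for the detection models and URDFormer variants the paper uses:

```python
def crop_image(image, box):
    """Crop a region (x1, y1, x2, y2) from an HxWxC image array."""
    x1, y1, x2, y2 = [int(v) for v in box]
    return image[y1:y2, x1:x2]


def predict_scene_urdf(image, global_model, part_model, detect_objects, detect_parts):
    """Two-stage inference: global scene structure first, then per-object parts."""
    # Stage 1: object-level boxes from a detector, then global scene prediction.
    object_boxes = detect_objects(image)            # list of (x1, y1, x2, y2)
    scene = global_model(image, object_boxes)       # object positions + parent links

    # Stage 2: crop each predicted object, detect its parts, predict part kinematics.
    part_predictions = []
    for box in object_boxes:
        crop = crop_image(image, box)
        part_boxes = detect_parts(crop)
        part_predictions.append(part_model(crop, part_boxes))

    return scene, part_predictions
```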
Using URDFormer for Robotic Control
A key application demonstrated is using URDFormer in a real-to-simulation-to-real pipeline for training robot manipulation policies. Instead of creating a perfect "digital twin" for model-based control (which is fragile due to potential inaccuracies), URDFormer enables training learning-based policies using targeted randomization in simulation.
The pipeline involves:
- Scene Generation: Given a real-world observation (RGB-D point cloud), use URDFormer on the RGB image to predict a URDF structure. Scale the predicted structure using depth measurements.
- Targeted Randomization: Import the predicted URDF into a physics simulator. Collect training data by solving tasks (e.g., opening/closing drawers) using an efficient motion planner (like cuRobo) which has access to privileged simulation information. To bridge the sim2real gap and account for URDFormer's prediction errors (e.g., incorrect mesh details), randomize the simulated environment around the predicted structure. This includes replacing meshes of parts with variations from datasets like PartNet, randomizing textures (by cropping real textures and generating variations with Stable Diffusion), and applying standard image augmentations. This randomization is "targeted" because it is based on the predicted real-world configuration, unlike blind procedural generation.
- Policy Synthesis: Train a robot policy (e.g., a language-conditioned behavior cloning policy operating on RGB point clouds, using an M2T2-like architecture) on the large dataset of successful trajectories collected in the randomized simulation.
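A schematic of the targeted-randomization data collection might look like the following; `load_urdf`, `sample_partnet_mesh`, `randomize_texture`, and `plan_trajectory` are hypothetical placeholders for the simulator import, PartNet asset sampling, Stable-Diffusion-based texture variation, and the cuRobo planner, respectively:

```python
import random


def collect_targeted_randomization_data(predicted_urdf_path, num_episodes,
                                        load_urdf, sample_partnet_mesh,
                                        randomize_texture, plan_trajectory):
    """Collect planner demonstrations around the URDFormer-predicted structure."""
    dataset = []
    for _ in range(num_episodes):
        scene = load_urdf(predicted_urdf_path)       # predicted kinematic structure
        for link in scene.articulated_links:
            # Keep the predicted joint layout, but swap in a mesh variation
            # (e.g., from PartNet) and a randomized texture for that part.
            link.mesh = sample_partnet_mesh(link.category)
            link.texture = randomize_texture(link.texture)
        task = random.choice(["open drawer", "close drawer",
                              "open door", "close door"])
        # Privileged planning in simulation yields a demonstration, if one exists.
        trajectory = plan_trajectory(scene, task)
        if trajectory is not None:                   # keep successful episodes only
            dataset.append((task, trajectory))
    return dataset
```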
This pipeline allows training policies that generalize well to the real world from raw perceptual input with minimal human effort compared to manual scene creation or extensive real-world data collection.
Experiments
The paper evaluates URDFormer in several ways:
- Real-world Robot Control: A UR5 robot with an RGB-D camera is used for articulated object manipulation tasks on five different cabinets. The URDFormer-TR pipeline (URDFormer prediction + Targeted Randomization training) is compared against zero-shot OWL-ViT detection for motion planning, standard Domain Randomization (DR), and a URDFormer-ICP approach (URDFormer prediction + ICP pose tracking for model-based execution). Results show URDFormer-TR achieves an average 78% success rate across tasks, significantly outperforming baselines (DR: 9%, OWL-ViT: 0%, URDFormer-ICP: 53.3% on available tasks), demonstrating the benefit of targeted randomization informed by URDFormer's prediction.
- Simulation Content Generation Accuracy: URDFormer's ability to generate plausible and accurate URDFs from internet images is evaluated on manually labeled test sets of individual objects (300 images) and kitchen scenes (54 images). Metrics include category accuracy, parent accuracy, spatial error, precision, and recall for both high-level objects and low-level parts. Qualitative results show URDFormer captures scene structure reasonably well, though errors occur. An ablation study confirms that training with generated realistic textures improves global scene prediction accuracy, while part prediction is less affected, possibly because bounding-box spatial relationships are sufficient for simple part structures. The paper also highlights the performance gap between using bounding boxes from a fine-tuned detector (a Model Soup of pretrained and fine-tuned Grounding DINO) and ground-truth boxes, though detection performance is improved by the Model Soup technique (F1 79.7% vs. 53.4% for the pretrained detector); a minimal weight-averaging sketch follows this list.
- Generalization: The authors demonstrate URDFormer's ability to generalize to new object categories (toilet, microwave, desk, laptop, chair) and scene categories (bedroom, bathroom, laundry room, paper room) by training on expanded datasets (Figures 9-12). They also show the pipeline can be applied to a different robot (Stretch) for a multi-step task (Figures 13 and 14), showcasing its flexibility.
- Reality Gym: The generated assets form the basis of Reality Gym, a new robot learning suite providing diverse, interactive simulation environments derived from real-world images.
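The Model Soup step mentioned in the content-generation evaluation amounts to averaging the weights of the pretrained and fine-tuned detectors, which share an architecture and initialization. A minimal sketch with PyTorch state dicts follows; the checkpoint names and loading code are hypothetical:

```python
import torch


def model_soup(state_dict_a, state_dict_b, alpha=0.5):
    """Uniformly average two compatible sets of model weights (same architecture)."""
    return {k: alpha * state_dict_a[k] + (1 - alpha) * state_dict_b[k]
            for k in state_dict_a}


# Hypothetical usage with two Grounding DINO checkpoints:
# pretrained = torch.load("groundingdino_pretrained.pth")
# finetuned = torch.load("groundingdino_finetuned.pth")
# detector.load_state_dict(model_soup(pretrained, finetuned))
```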
Limitations
The paper acknowledges several limitations:
- Reliance on the performance of the bounding box detection model.
- Inability to reconstruct accurate meshes or complex textures; relies on predefined meshes and simple texture projection.
- Limited to basic URDF primitives (prismatic/revolute joints), not complex objects like cars.
- Predicted URDFs may have link collisions requiring post-processing.
- The pipeline consists of multiple components that are not trained end-to-end.
- Physical properties (mass, friction) are not inferred from images.
Conclusion
URDFormer presents a significant step towards scalable generation of articulated simulation environments from real-world images. By synthesizing paired data with controllable generative models and training an inverse model, it enables the creation of diverse and realistic simulation assets. Integrating this pipeline with targeted domain randomization proves highly effective for training robot manipulation policies that transfer zero-shot to the real world, reducing the dependency on manual simulation design and extensive real-world data collection. The Reality Gym dataset provides a valuable resource for future research leveraging this approach.