- The paper introduces a diffusion-based method that generates 3D articulated objects from a single image of an object in its resting state, using a vision-language model to infer part connectivity.
- It employs a coarse-to-fine pipeline built on a transformer-based diffusion model, integrating image cross attention and classifier-free guidance to capture part geometry and motion.
- Experiments demonstrate superior reconstruction quality and kinematic plausibility, outperforming baselines on datasets like PartNet-Mobility and ACD.
This paper, "SINGAPO: Single Image Controlled Generation of Articulated Parts in Objects" (2410.16499), addresses the challenging task of creating 3D articulated object assets from a single RGB image showing the object in a resting (closed) state. This is valuable for populating virtual environments used in robotics, embodied AI, and gaming, as manual 3D modeling is time-consuming and expensive. Existing methods often require multi-view or multi-state image inputs, which are not always practical or scalable.
The core challenges in generating articulated objects from a single image include:
- Ambiguity: A single view, especially in a resting state, can hide part shapes and articulation mechanisms due to occlusion.
- Complexity: Articulated objects have diverse and complex part structures within and across categories.
- Perception: Small and thin parts can be difficult to discern from a single image.
To tackle these issues, the authors propose a generative approach using a diffusion model. They argue that a generative model can capture the plausible variations in geometry and kinematics suggested by an ambiguous single image input. The method decomposes the process into a coarse-to-fine pipeline:
- Part Connectivity Graph Inference: Given a single input image, a part connectivity graph is first inferred. The authors found that large vision-language models, specifically GPT-4o with in-context learning, can effectively predict the part connectivity from an image. This graph specifies the parent-child relationships between parts, which is crucial for defining the object's structure and kinematic hierarchy (the first sketch after this list illustrates this step).
- Abstract Part Attribute Generation: A transformer-based diffusion model is designed to generate abstract attributes for each articulated part. These attributes include the part's bounding box, semantic label (e.g., door, drawer, handle), articulation type (e.g., revolute, prismatic), joint axis location/direction, and motion range. The diffusion model is conditioned on both the input image and the inferred part connectivity graph.
- Part Mesh Retrieval and Assembly: Once the abstract attributes are generated, the final 3D object is assembled by retrieving part meshes from a predefined library based on the predicted semantic labels. The retrieved meshes are then positioned and oriented according to the generated bounding boxes and joint parameters. This approach leverages a part shape library rather than generating geometry from scratch, which is practical for household objects whose parts often have simple geometry (the second sketch after this list illustrates the part representation and this assembly step).
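To make the graph-inference step concrete, here is a minimal sketch of querying a vision-language model for a part connectivity graph, assuming the OpenAI Python client; the prompt wording and JSON schema are hypothetical illustrations, not the authors' actual prompt.

```python
import base64
import json

from openai import OpenAI

client = OpenAI()

def infer_part_graph(image_path: str) -> dict:
    """Ask a VLM for a part connectivity graph (parent-child edges) as JSON."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    # In-context (few-shot) examples of image/graph pairs would be prepended here; omitted for brevity.
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "List the articulated parts of this object and each part's parent as JSON, "
                    'e.g. {"parts": [{"id": 0, "label": "base", "parent": -1}]}.'  # hypothetical schema
                )},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return json.loads(response.choices[0].message.content)
```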
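Similarly, the abstract part attributes and the retrieval-based assembly can be pictured roughly as follows; the field names and the mesh-library helpers are hypothetical placeholders rather than the paper's exact parameterization.

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class PartAttributes:
    label: str                 # semantic label, e.g. "door", "drawer", "handle"
    bbox_center: np.ndarray    # (3,) bounding-box center
    bbox_size: np.ndarray      # (3,) bounding-box extents
    joint_type: str            # "revolute", "prismatic", or "fixed"
    joint_axis: np.ndarray     # (3,) joint axis direction
    joint_origin: np.ndarray   # (3,) a point the joint axis passes through
    motion_range: tuple        # (min, max) rotation angle or translation distance
    parent: int                # index of the parent part, -1 for the root

def assemble(parts: list, library) -> list:
    """Retrieve a mesh per semantic label and fit it into the predicted bounding box."""
    assembled = []
    for p in parts:
        mesh = library.retrieve(p.label)           # hypothetical mesh-library lookup
        mesh = mesh.scaled_to(p.bbox_size)         # hypothetical: rescale to the bbox extents
        mesh = mesh.translated_to(p.bbox_center)   # hypothetical: place at the bbox center
        assembled.append((mesh, p))                # joint attributes define the kinematics
    return assembled
```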
The diffusion model's architecture is based on a transformer with multiple attention mechanisms:
- Local Attention: Harmonizes attributes within a single part.
- Global Attention: Coordinates parts to form a coherent object.
- Graph Relation Attention: Incorporates the part connectivity graph using an adjacency matrix as an attention mask.
- Image Cross Attention (ICA): Conditions the part generation on the input image. This module uses DINOv2 features of the image patches. Notably, the attention queries for the ICA are primarily derived from the bounding box parameters, allowing each part's generation to focus on relevant regions in the image without explicit part detection or segmentation supervision during training. A sketch of the graph relation attention and ICA follows this list.
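The graph relation attention and the image cross attention can be sketched as plain scaled dot-product attention in PyTorch; the tensor shapes, projection layers, and the assumption that the adjacency matrix includes self-loops are illustrative simplifications, not the authors' exact architecture.

```python
import torch
import torch.nn.functional as F

def graph_relation_attention(x, adjacency, w_qkv):
    """x: (P, D) part tokens; adjacency: (P, P) boolean connectivity mask (assumed to include self-loops)."""
    q, k, v = w_qkv(x).chunk(3, dim=-1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~adjacency, float("-inf"))  # attend only along graph edges
    return F.softmax(scores, dim=-1) @ v

def image_cross_attention(bbox_tokens, patch_feats, w_q, w_kv):
    """bbox_tokens: (P, D) queries derived from bounding-box parameters;
    patch_feats: (N, D) DINOv2 patch features of the input image."""
    q = w_q(bbox_tokens)
    k, v = w_kv(patch_feats).chunk(2, dim=-1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    attn = F.softmax(scores, dim=-1)  # each part attends over image patches
    return attn @ v, attn             # attention maps are what a foreground loss can supervise
```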
For training the diffusion model, the authors employ a classifier-free guidance strategy: the conditioning inputs (image, graph, and object category) are randomly dropped during training (e.g., a 50% drop rate for the graph and category, 10% for the image). This encourages the model to generate plausible objects even without strict conditioning and allows for better generalization. During inference, guidance is applied to steer the generation towards the provided conditions. The training objective includes a standard noise residual loss (Lϵ) and a novel foreground attention loss (Lfg) that encourages part attention to focus on the object's foreground patches in the image.
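A rough picture of the conditioning dropout and of guided sampling, under the drop rates described above; the model interface, null-condition handling, and guidance scale are assumptions for illustration.

```python
import torch

def drop_conditions(image_feat, graph, category, p_img=0.1, p_graph=0.5, p_cat=0.5):
    """Randomly null out conditions during training (rates as summarized above)."""
    if torch.rand(1).item() < p_img:
        image_feat = torch.zeros_like(image_feat)   # assumed null image embedding
    if torch.rand(1).item() < p_graph:
        graph = None                                # assumed fall-back: unmasked attention
    if torch.rand(1).item() < p_cat:
        category = None
    return image_feat, graph, category

def guided_eps(model, x_t, t, conds, scale=2.0):
    """Classifier-free guidance at inference: mix conditional and unconditional predictions."""
    eps_cond = model(x_t, t, **conds)
    eps_uncond = model(x_t, t)       # all conditions dropped
    return eps_uncond + scale * (eps_cond - eps_uncond)
```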
The method is evaluated quantitatively and qualitatively on the PartNet-Mobility and ACD datasets, which contain household articulated objects. The evaluation uses refined metrics that better capture part-level and state-level similarities between generated and ground-truth objects (gIoU, cDist, CD, across both resting and articulated states), along with graph accuracy and collision detection.
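For context, a generic symmetric Chamfer distance evaluated in both resting and articulated states looks roughly like this; the paper's refined metric variants (gIoU, cDist) and its exact choice of articulation states may differ.

```python
import numpy as np

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """a: (N, 3), b: (M, 3) point clouds sampled from the predicted and ground-truth objects."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def two_state_cd(sample_pred, sample_gt) -> float:
    """Average CD over the resting (closed) and an articulated (open) state.
    sample_pred/sample_gt: callables mapping a normalized joint state in [0, 1] to a point cloud."""
    states = [0.0, 1.0]
    return float(np.mean([chamfer_distance(sample_pred(s), sample_gt(s)) for s in states]))
```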
Experiments show that SINGAPO outperforms state-of-the-art baselines like URDFormer (a regression-based single-image method) and NAP (an unconditional generative method extended with image conditioning, denoted NAP-ICA). SINGAPO achieves better reconstruction quality, better adherence to the input image, and more kinematically plausible part articulations, especially when generalizing to the more complex ACD dataset. A user study also confirms that objects generated by SINGAPO are perceived as more realistic.
Ablation studies highlight the importance of the Image Cross Attention (ICA) module and the classifier-free training strategy for conditioning inputs (image, graph, category) for achieving good performance and generalization.
Limitations include challenges with highly complex object structures, cluttered scenes, and images with challenging textures, which can lead to incorrect graph predictions or less accurate part geometry and arrangement. The reliance on mesh retrieval also means fine geometric details specific to the input image might not be captured. Future work directions include extending the method to a wider range of object categories and developing methods for generating part geometries that are more faithful to the input image details.
In summary, SINGAPO provides a practical and scalable approach for generating 3D articulated objects from a single image by combining a generative diffusion model with vision-LLM-based graph inference and image-based conditioning, leveraging existing part libraries for geometry.