- The paper introduces ArtFormer, a novel framework for controllably generating diverse 3D articulated objects from text or images by representing objects as tree structures and predicting sequential tokens.
- ArtFormer pairs a diverse shape prior for part geometry generation, trained with diffusion and codebooks, with an Articulation Transformer that uses a Tree Position Embedding to model structure and kinematics.
- Experiments show ArtFormer generates higher quality geometry, greater diversity, and better text alignment than baselines, demonstrating the ability to create novel shapes despite dataset limitations.
This paper introduces ArtFormer, a novel framework for the controllable generation of diverse 3D articulated objects from text or image descriptions. The key challenge it addresses is simultaneously generating high-quality geometry for individual parts and accurate kinematic relationships between them, something existing methods struggle with, often falling back on predefined structures or retrieval from a dataset.
The ArtFormer framework addresses these limitations by representing an articulated object as a tree structure, where each node corresponds to a sub-part. Each node/token contains attributes defining the sub-part's geometry (bounding box b_i, geometry latent code z_i) and its kinematic relation to its parent (joint axis j_i, joint limits l_i). This parameterization converts the problem of generating an articulated object into generating a sequence of tokens representing this tree.
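As an illustration of this parameterization, here is a minimal Python sketch of a tree node and a breadth-first flattening into (parent index, attribute vector) tokens; the field shapes (e.g., a 6-D bounding box or joint axis) are assumptions for concreteness, not the paper's exact dimensions.

```python
from dataclasses import dataclass, field
from typing import List, Optional
import numpy as np

@dataclass
class ArticulationNode:
    """One sub-part of the articulated object tree (illustrative field shapes)."""
    bbox: np.ndarray          # b_i: bounding box of the sub-part, e.g. center + size (6,)
    geom_latent: np.ndarray   # z_i: compact latent code decoded by the SDF shape prior
    joint_axis: np.ndarray    # j_i: axis of the joint connecting this part to its parent
    joint_limits: np.ndarray  # l_i: motion range of the joint (e.g. lower/upper bound)
    parent: Optional["ArticulationNode"] = None
    children: List["ArticulationNode"] = field(default_factory=list)

def flatten_to_tokens(root: ArticulationNode):
    """Breadth-first flattening: each token is (parent index fa_i, concatenated attributes)."""
    order, index_of, i = [root], {id(root): 0}, 0
    while i < len(order):
        for child in order[i].children:
            index_of[id(child)] = len(order)
            order.append(child)
        i += 1
    tokens = []
    for node in order:
        fa = index_of[id(node.parent)] if node.parent is not None else -1
        attrs = np.concatenate([node.bbox, node.geom_latent, node.joint_axis, node.joint_limits])
        tokens.append((fa, attrs))
    return tokens
```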
The core of ArtFormer consists of two main components:
- Diverse and Controllable Shape Prior: Instead of directly generating high-dimensional geometry, ArtFormer generates a compact latent code z_i for each sub-part. This latent code is then decoded by a Signed Distance Function (SDF) shape prior. The shape prior is learned via a VAE encoder q(z∣f) and decoder p(f∣z), combined with a generalizable SDF network Ω(f, x), trained on point clouds. To enable controllable and diverse geometry generation, a conditional diffusion model ϵ(z_t, t, c_g, c_s) is trained on the latent space p(z). This diffusion model is conditioned on geometry features c_g = E_g(z) and semantic information c_s = E_s(name). A key ingredient for diversity is discretizing the geometry condition c_g using codebooks and sampling via Gumbel-Softmax during training; at inference, the model predicts logits P_i for sampling from the codebooks rather than predicting the high-dimensional z directly. This approach preserves geometry quality while promoting diversity (a minimal sketch of the codebook sampling follows after this list).
- Articulation Transformer: A transformer architecture is employed to generate the sequence of tokens representing the articulated object tree. Each token comprises the parent index fa_i and the concatenated attributes [b_i, z_i, j_i, l_i], processed by an MLP tokenizer. To effectively model the tree structure, a novel Tree Position Embedding (TPE) is introduced: the path from the root to each node is processed by a GRU and concatenated with absolute position encodings (see the second sketch after this list). Conditional generation (primarily text-guided) is incorporated via cross-attention layers, where conditioning tokens (e.g., from a pre-trained text encoder such as T5) are fed into the transformer. Generation proceeds by iterative decoding, starting from a special start token S and predicting child nodes for the existing nodes at each step until a terminal token T has been output for every node. This autoregressive process helps capture inter-dependencies between parts. The transformer is trained with binary cross-entropy for terminal-token prediction (L_o), MSE for the attributes (L_a), and a KL divergence loss (L_P) that aligns the codebook logits with the distributions derived from the shape prior's latent codes.
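To make the codebook mechanism of the shape prior concrete, the following PyTorch sketch shows one plausible way to quantize the geometry condition c_g with Gumbel-Softmax during training and to sample from transformer-predicted logits at inference. The class name, codebook size, and feature dimension are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CodebookCondition(nn.Module):
    """Discretized geometry condition (a sketch; codebook size and dim are assumed)."""

    def __init__(self, feat_dim: int = 256, num_codes: int = 512):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, feat_dim)

    def forward(self, c_g: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        # Similarity of the continuous geometry feature to every codebook entry -> logits.
        logits = c_g @ self.codebook.weight.t()                  # (B, num_codes)
        # Straight-through Gumbel-Softmax keeps the sampling differentiable during training.
        one_hot = F.gumbel_softmax(logits, tau=tau, hard=True)   # (B, num_codes)
        return one_hot @ self.codebook.weight                    # quantized condition (B, feat_dim)

    @torch.no_grad()
    def sample_from_logits(self, P: torch.Tensor) -> torch.Tensor:
        # Inference path: the logits P come from the Articulation Transformer instead.
        idx = torch.distributions.Categorical(logits=P).sample()
        return self.codebook(idx)
```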
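Similarly, here is a hedged sketch of how a Tree Position Embedding along the lines described above could be computed: the root-to-node path of absolute position encodings is run through a GRU, and the result is concatenated with the node's own encoding. Module names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TreePositionEmbedding(nn.Module):
    """Sketch of a Tree Position Embedding (dimensions and projection are assumptions)."""

    def __init__(self, d_model: int = 256, max_nodes: int = 64):
        super().__init__()
        self.abs_pos = nn.Embedding(max_nodes, d_model)    # absolute position encoding per node index
        self.gru = nn.GRU(d_model, d_model, batch_first=True)
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, paths: list) -> torch.Tensor:
        """paths[i] lists node indices from the root down to node i (inclusive)."""
        embeds = []
        for path in paths:
            idx = torch.tensor(path)
            path_emb = self.abs_pos(idx).unsqueeze(0)      # (1, path_len, d_model)
            _, h = self.gru(path_emb)                      # final hidden state summarizes the path
            node_emb = self.abs_pos(idx[-1:])              # the node's own absolute encoding
            embeds.append(self.proj(torch.cat([h[-1], node_emb], dim=-1)))
        return torch.cat(embeds, dim=0)                    # (num_nodes, d_model)
```

For example, `TreePositionEmbedding()([[0], [0, 1], [0, 2], [0, 1, 3]])` would return one embedding per node for a root with two children and one grandchild.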
For practical implementation, the shape prior is trained first on datasets like PartNet and PartNet-Mobility. Then, the Articulation Transformer is trained on PartNet-Mobility, using text descriptions generated by GPT-4o from object snapshots. The geometry latent codes z from the pre-trained shape prior are used to supervise the attribute prediction, and the derived codebook distributions supervise the logits P.
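The supervision described above can be summarized in a single objective; the sketch below combines the three terms (L_o, L_a, L_P) with hypothetical loss weights, since the exact weighting is not stated here.

```python
import torch
import torch.nn.functional as F

def articulation_transformer_loss(pred_terminal_logits, gt_terminal,
                                  pred_attrs, gt_attrs,
                                  pred_code_logits, target_code_dist,
                                  w_o: float = 1.0, w_a: float = 1.0, w_p: float = 1.0):
    """Sketch of the combined training objective (weights w_o, w_a, w_p are assumptions)."""
    # L_o: binary cross-entropy on whether a node emits a terminal token (gt_terminal is float 0/1).
    L_o = F.binary_cross_entropy_with_logits(pred_terminal_logits, gt_terminal)
    # L_a: MSE between predicted attributes [b, z, j, l] and targets built from the shape prior's latents.
    L_a = F.mse_loss(pred_attrs, gt_attrs)
    # L_P: KL divergence between predicted codebook logits and the target code distribution.
    L_P = F.kl_div(F.log_softmax(pred_code_logits, dim=-1), target_code_dist,
                   reduction="batchmean")
    return w_o * L_o + w_a * L_a + w_p * L_P
```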
Experiments demonstrate ArtFormer's effectiveness compared to baselines like NAP and CAGE. ArtFormer, which directly generates geometry, significantly outperforms NAP variants in metrics like Minimum Matching Distance (MMD), Coverage (COV), and Part Overlapping Ratio (POR), indicating higher quality geometry and more plausible kinematics. Compared to CAGE (which retrieves parts), ArtFormer shows better COV and 1-NNA, suggesting superior diversity and distribution coverage, while CAGE has better MMD and POR. Human studies further support ArtFormer's ability to generate more diverse objects and better align with text instructions. The ability to generate novel shapes is shown by analyzing the Chamfer Distance between generated parts and training set parts. The framework's flexibility is validated through image-guided generation by replacing the text encoder with an image encoder (BLIP-2). Ablation studies highlight the critical contributions of both the Tree Position Embedding and the Shape Prior to the performance.
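For reference, the novelty analysis relies on the Chamfer Distance between generated and training-set parts; a minimal symmetric Chamfer Distance over point clouds could look as follows (whether the paper uses squared distances or this averaged variant is an assumption).

```python
import torch

def chamfer_distance(p1: torch.Tensor, p2: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer Distance between point clouds p1 (N, 3) and p2 (M, 3)."""
    d = torch.cdist(p1, p2)                      # (N, M) pairwise Euclidean distances
    # Average nearest-neighbor distance in both directions.
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```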
While successful, the paper notes several limitations, including the dependence on limited datasets, which restricts the diversity of object types and number of parts that can be generated. Future work could explore larger-scale datasets, incorporate richer multi-modal inputs beyond text and images (e.g., point clouds or target joint structures), and improve the modeling of complex articulation details specified in conditions.