AutoSDF: Shape Priors for 3D Completion, Reconstruction and Generation (2203.09516v3)
Abstract: Powerful priors allow us to perform inference with insufficient information. In this paper, we propose an autoregressive prior for 3D shapes to solve multimodal 3D tasks such as shape completion, reconstruction, and generation. We model the distribution over 3D shapes as a non-sequential autoregressive distribution over a discretized, low-dimensional, symbolic grid-like latent representation of 3D shapes. This enables us to represent distributions over 3D shapes conditioned on information from an arbitrary set of spatially anchored query locations and thus perform shape completion in such arbitrary settings (e.g., generating a complete chair given only a view of the back leg). We also show that the learned autoregressive prior can be leveraged for conditional tasks such as single-view reconstruction and language-based generation. This is achieved by learning task-specific naive conditionals which can be approximated by light-weight models trained on minimal paired data. We validate the effectiveness of the proposed method using both quantitative and qualitative evaluation and show that the proposed method outperforms the specialized state-of-the-art methods trained for individual tasks. The project page with code and video visualizations can be found at https://yccyenchicheng.github.io/AutoSDF/.
Summary
- The paper introduces a non-sequential autoregressive prior learned via patch-wise VQ-VAE for versatile 3D shape modeling.
- It leverages a discretized latent space to reliably map partial observations to local geometry, enhancing shape completion performance.
- The method achieves state-of-the-art results in single-view reconstruction and language-guided generation, offering diverse and plausible 3D outputs.
This paper, "AutoSDF: Shape Priors for 3D Completion, Reconstruction and Generation" (2203.09516), introduces a method for learning a versatile 3D shape prior using a non-sequential autoregressive model trained on a discrete, low-dimensional representation of 3D shapes. The core idea is that a powerful, generic prior learned from abundant 3D data can significantly improve performance on various conditional 3D tasks, even when task-specific paired data (like image-shape or text-shape pairs) is scarce.
The research translates the challenge of modeling high-dimensional continuous 3D shapes into modeling a distribution over a low-dimensional discrete latent space. This is achieved by first training a Vector-Quantized Variational AutoEncoder (VQ-VAE) specifically designed for 3D shapes, called the Patch-wise Encoding VQ-VAE (P-VQ-VAE).
Discretized Latent Space with P-VQ-VAE
The P-VQ-VAE maps a high-resolution 3D shape representation, specifically a 64³ Truncated Signed Distance Field (T-SDF), to a lower-dimensional 8³ grid of discrete latent codes. Each element in this 8³ grid refers to an entry in a learned codebook Z of size 512, where each code vector has a dimensionality of 256.
A key practical innovation here is the "Patch-wise Encoding." Instead of encoding the entire shape with a single network, the input shape is divided into smaller patches (e.g., 8³ sub-volumes within the 64³ grid), and each patch is encoded independently. This is crucial for real-world applications like shape completion where only parts of a shape are observed. If the encoder processed the whole shape, partial observations might map to latent codes that don't correspond to any partial region of the full shape's latent code. By contrast, patch-wise encoding ensures that the latent code for a region depends only on the local geometry, making partial shape observations directly correspond to partial latent code observations.
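Because patches are processed independently, the encoding step is easy to express directly. Below is a minimal sketch in PyTorch-style Python of splitting a 64³ T-SDF into its 8³ patches before independent encoding; the function name `extract_patches` and the exact tensor layout are illustrative, not taken from the paper's code.

```python
import torch

def extract_patches(tsdf: torch.Tensor, patch: int = 8) -> torch.Tensor:
    """Split a (B, 1, 64, 64, 64) T-SDF volume into independent 8^3 patches.

    Returns a tensor of shape (B * 8^3, 1, 8, 8, 8): each patch can be
    encoded by the same small 3D CNN without seeing the rest of the shape.
    """
    B, C, D, H, W = tsdf.shape
    g = D // patch  # 8 patches per axis for a 64^3 volume
    x = tsdf.reshape(B, C, g, patch, g, patch, g, patch)
    x = x.permute(0, 2, 4, 6, 1, 3, 5, 7)            # (B, g, g, g, C, p, p, p)
    return x.reshape(B * g * g * g, C, patch, patch, patch)

# Example: a batch of two 64^3 T-SDFs becomes 2 * 512 patches of size 8^3.
vol = torch.randn(2, 1, 64, 64, 64)
patches = extract_patches(vol)                        # (1024, 1, 8, 8, 8)
```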
The P-VQ-VAE consists of a 3D convolutional encoder (E_ψ) and decoder (D_ψ). The encoder uses a series of 3D convolutions, ResNet blocks, and downsampling layers to map each 8³ patch of the 64³ input to a 1³ feature; assembling the per-patch features yields the 8³ latent representation. The decoder uses 3D convolutions, ResNet blocks, and upsampling layers to reconstruct the 64³ T-SDF from the 8³ grid of latent codes. The architecture details, including layer types, kernel sizes, strides, and input/output dimensions, are provided in the supplementary tables (Tables 4 and 5). Training uses the standard VQ-VAE loss, combining a reconstruction loss with vector-quantization and commitment losses.
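For reference, here is a compact sketch of that standard VQ-VAE objective, with nearest-neighbour quantization and a straight-through estimator. The MSE reconstruction term, the commitment weight `beta`, and the function signature follow the usual VQ-VAE recipe and are assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def vq_vae_loss(z_e, codebook, x, decoder, beta=0.25):
    """Standard VQ-VAE objective: reconstruction + codebook + commitment terms.

    z_e:      continuous encoder outputs, shape (N, 256)  (one row per patch)
    codebook: learnable embeddings, shape (512, 256)
    decoder:  assumed to map quantized patch codes back to T-SDF patches x
    """
    # Nearest-neighbour lookup in the codebook.
    dists = torch.cdist(z_e, codebook)             # (N, 512)
    idx = dists.argmin(dim=1)                      # discrete symbols
    z_q = codebook[idx]                            # quantized vectors

    # Straight-through estimator: gradients flow to the encoder via z_e.
    z_q_st = z_e + (z_q - z_e).detach()
    x_rec = decoder(z_q_st)

    rec = F.mse_loss(x_rec, x)                     # reconstruct the T-SDF patches
    codebook_loss = F.mse_loss(z_q, z_e.detach())  # move codes toward encoder outputs
    commit = F.mse_loss(z_e, z_q.detach())         # keep encoder near chosen codes
    return rec + codebook_loss + beta * commit, idx
```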
Non-sequential Autoregressive Prior
The paper then trains an autoregressive model over the learned 8³ grid of discrete latent codes. Unlike standard autoregressive models that rely on a fixed sequential order (like raster scan), this work proposes a "non-sequential" autoregressive model. This is essential because, for tasks like shape completion, the observed parts of a shape (and thus the observed latent codes) can be at arbitrary, spatially disconnected locations in the 8³ grid, not necessarily the beginning of a fixed sequence.
The model is a Transformer-based architecture trained to predict the distribution over the next latent code given a random subset of observed latent codes and their locations. By training on random permutations of the latent grid elements in each iteration, the Transformer learns to model the joint distribution p(Z) such that it can predict the category of any latent code z_i conditioned on an arbitrary set of other latent codes O. The Transformer consists of 12 encoder layers, 12 attention heads, and a hidden dimension of 768. It uses Fourier features for positional encoding. The model is trained by minimizing the negative log-likelihood of the latent codes over the training dataset, sampling a new random order for each training instance.
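A minimal sketch of the random-order training step is shown below. The `transformer` interface (taking a code sequence, its grid locations, and the locations to predict) is a hypothetical stand-in for the paper's model; the 512-location grid and per-instance random permutations match the description above.

```python
import torch
import torch.nn.functional as F

def random_order_nll(transformer, codes, n_locations=512):
    """One training step of the non-sequential autoregressive prior (sketch).

    codes: (B, 512) discrete latent indices, one per location of the 8^3 grid.
    `transformer` is assumed to take (code sequence, location sequence) plus the
    locations to predict, and return logits over the codebook for each target.
    """
    B = codes.size(0)
    # A fresh random ordering of the grid locations for every instance.
    perm = torch.stack([torch.randperm(n_locations) for _ in range(B)])  # (B, 512)
    shuffled = torch.gather(codes, 1, perm)

    # Teacher forcing: predict element j of the permuted sequence from elements < j.
    # Location indices are fed alongside the codes so the model knows *where*
    # each observed/target code lives in the 8^3 grid.
    logits = transformer(shuffled[:, :-1], perm[:, :-1], target_locations=perm[:, 1:])
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           shuffled[:, 1:].reshape(-1))
```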
During inference, this non-sequential capability allows the model to perform conditional generation from partial latent observations. Given a set of observed latent codes O corresponding to a partial shape, the model can autoregressively sample the remaining unobserved latent codes by querying the Transformer iteratively, conditioning on O and the previously sampled codes. Once the full 8³ latent grid is completed, it is passed through the P-VQ-VAE decoder D_ψ to reconstruct the final 3D shape.
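The completion procedure can then be sketched as a loop that conditions on the observed codes and samples the missing ones. It assumes the same hypothetical `transformer` interface as above; temperature sampling is an illustrative choice, not a detail from the paper.

```python
import torch

@torch.no_grad()
def complete_latents(transformer, observed, n_locations=512, temperature=1.0):
    """Autoregressively fill in the unobserved latent codes (sketch).

    observed: dict {grid_location: code_index} from the partially observed shape.
    Returns a full (512,) tensor of code indices ready for the P-VQ-VAE decoder.
    """
    known_locs = list(observed.keys())
    known_codes = [observed[l] for l in known_locs]
    missing = [l for l in range(n_locations) if l not in observed]

    codes = torch.tensor(known_codes, dtype=torch.long)
    locs = torch.tensor(known_locs, dtype=torch.long)
    for loc in missing:
        # Condition on everything seen or sampled so far; query location `loc`.
        logits = transformer(codes[None], locs[None],
                             target_locations=torch.tensor([[loc]]))[0, -1]
        probs = torch.softmax(logits / temperature, dim=-1)
        next_code = torch.multinomial(probs, 1)
        codes = torch.cat([codes, next_code])
        locs = torch.cat([locs, torch.tensor([loc])])

    # Re-assemble into grid order for decoding.
    full = torch.empty(n_locations, dtype=torch.long)
    full[locs] = codes
    return full
```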
Conditional Generation Framework
For conditional generation tasks like single-view reconstruction or language-guided generation, the conditioning C is not directly a partial latent code. The paper proposes approximating the complex conditional distribution p(Z∣C) using a product of the learned shape prior and task-specific "naive" conditional distributions:
$$p(Z \mid C) \approx p_\theta(Z) \cdot p_\phi(Z \mid C)$$
More formally, when factorized autoregressively along a sequence of grid locations (g_1, …, g_512), the conditional probability is approximated as:
$$p(z_{g_j} \mid z_{g_{<j}}, C) \approx p_\theta(z_{g_j} \mid z_{g_{<j}})^{1-\alpha} \cdot p_\phi(z_{g_j} \mid C)^{\alpha}$$
Here, p_θ(⋅) is the learned autoregressive prior, and p_ϕ(⋅ ∣ C) is the naive conditional distribution that predicts the probability of each latent code z_i given the conditioning C, independently of the other latent codes. α is a hyperparameter balancing the influence of the prior and the naive conditional.
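In log space this blend is just a weighted interpolation of the two categorical distributions followed by renormalization; a small sketch follows (the helper name `blend_logits` is illustrative).

```python
import torch

def blend_logits(prior_logits, cond_logits, alpha=0.5):
    """Combine the shape prior and the task-specific naive conditional (sketch).

    Computes (1 - alpha) * log p_theta + alpha * log p_phi for one grid
    location, i.e. the geometric interpolation in the equation above.
    alpha = 0 ignores the conditioning; alpha = 1 ignores the prior.
    """
    log_prior = torch.log_softmax(prior_logits, dim=-1)
    log_cond = torch.log_softmax(cond_logits, dim=-1)
    blended = (1.0 - alpha) * log_prior + alpha * log_cond
    # Renormalize so the result is a proper categorical distribution.
    return torch.softmax(blended, dim=-1)
```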
The naive conditional p_ϕ(z_i ∣ C) is modeled by a task-specific neural network parameterized by ϕ. This network takes the conditioning input C (e.g., image, text) and outputs a categorical distribution over the codebook entries for each location i in the 8³ latent grid.
- Image Conditioning: For single-view reconstruction, C is an image. The network ϕ uses a ResNet-18 encoder to process the image, followed by linear layers and 3D up-convolutions to produce an 8³ grid where each voxel contains a probability distribution over the 512 codebook entries. (Architecture details in Table 6).
- Language Conditioning: For language-guided generation, C is a text description. The network ϕ uses a BERT encoder to process the text, followed by linear layers and 3D up-convolutions, also outputting an 8³ grid of probability distributions. (Architecture details in Table 7).
These task-specific networks are trained using paired data (image-shape pairs, text-shape pairs) by maximizing the log-likelihood of the ground-truth latent codes z_i. The paper argues that these naive conditionals are light-weight and require less paired data because the primary work of modeling the overall shape distribution is handled by the powerful pre-trained autoregressive prior.
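A sketch of what such a light-weight image conditional could look like is shown below, assuming a ResNet-18 backbone feeding a small 3D up-convolutional head that outputs 512-way logits on the 8³ grid. The intermediate layer sizes are illustrative, not the exact architecture from the paper's Table 6.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class NaiveImageConditional(nn.Module):
    """Sketch of a light-weight image inference module p_phi(z_i | image)."""
    def __init__(self, codebook_size=512):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()                  # keep the 512-d global feature
        self.backbone = backbone
        self.proj = nn.Linear(512, 256 * 2 * 2 * 2)  # seed a 2^3 feature volume
        self.head = nn.Sequential(
            nn.ConvTranspose3d(256, 128, kernel_size=4, stride=2, padding=1),           # 2 -> 4
            nn.ReLU(inplace=True),
            nn.ConvTranspose3d(128, codebook_size, kernel_size=4, stride=2, padding=1), # 4 -> 8
        )

    def forward(self, image):
        feat = self.backbone(image)                    # (B, 512)
        vol = self.proj(feat).view(-1, 256, 2, 2, 2)   # (B, 256, 2, 2, 2)
        return self.head(vol)                          # (B, 512, 8, 8, 8) logits

# Training maximizes the likelihood of the ground-truth codes, e.g.:
# loss = F.cross_entropy(model(image), gt_codes)   # gt_codes: (B, 8, 8, 8) indices
```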
During conditional inference, generation proceeds autoregressively using the combined distribution. At each step, to sample the next latent code z_{g_j}, the model calculates the distribution using the current partial sequence z_{g_{<j}} and the task-specific conditioning C, according to the factored form (Equation 4 in the paper), and then samples from this distribution.
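Putting the pieces together, conditional sampling can be sketched as below, reusing the hypothetical `transformer` interface and the `blend_logits` helper from earlier; iterating the grid locations in a fixed order here is a simplification.

```python
import torch

@torch.no_grad()
def sample_conditional(transformer, cond_net, condition, alpha=0.5, n_locations=512):
    """Autoregressive sampling from the blended distribution (sketch).

    The naive conditional is evaluated once per location from `condition`
    (an image or a text encoding); the prior term is re-evaluated at every
    step because it depends on the codes sampled so far.
    """
    # (1, vocab, 8, 8, 8) -> (512 locations, vocab)
    cond_logits = cond_net(condition).flatten(2).squeeze(0).t()

    codes = torch.empty(0, dtype=torch.long)   # empty prefix = unconditional first step
    locs = torch.empty(0, dtype=torch.long)
    for loc in range(n_locations):
        prior_logits = transformer(codes[None], locs[None],
                                   target_locations=torch.tensor([[loc]]))[0, -1]
        probs = blend_logits(prior_logits, cond_logits[loc], alpha)
        next_code = torch.multinomial(probs, 1)
        codes = torch.cat([codes, next_code])
        locs = torch.cat([locs, torch.tensor([loc])])
    return codes  # decode with the P-VQ-VAE decoder to obtain the T-SDF
```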
Practical Applications and Performance
The paper demonstrates the practical utility of this framework on three diverse 3D tasks:
- Shape Completion: Given a partial T-SDF, the observed parts are encoded into partial latent codes; the non-sequential AR prior completes the remaining latent codes, which are then decoded (an end-to-end sketch of this pipeline follows this list). Experiments on ShapeNet chairs show competitive quantitative results (Table 1) against state-of-the-art point cloud completion methods (MPC, PoinTr) in terms of both fidelity (UHD) and diversity (TMD). Qualitatively (Figures 4, 5), the method generates diverse and plausible shapes that respect the structure of the partial input, even for inputs it wasn't explicitly trained to complete (e.g., specific octants).
- Single-view 3D Prediction (Reconstruction): An image input C is processed by the image inference module ϕ. The combined distribution p_θ(Z) · p_ϕ(Z ∣ I) is used for autoregressive sampling of the latent grid, which is then decoded. Quantitative evaluation on ShapeNet renderings and Pix3D (Table 2) shows state-of-the-art performance compared to deterministic baselines (Pix2Vox, ResNet2TSDF, ResNet2Voxel) and a joint encoding baseline. A key practical benefit shown qualitatively (Figure 6) is the ability to generate multiple plausible reconstructions from a single image, reflecting the inherent ambiguity in single-view prediction, unlike deterministic methods.
- Language-guided Generation: A text input C is processed by the language inference module ϕ. The combined distribution p_θ(Z) · p_ϕ(Z ∣ T) is used for latent grid sampling. Evaluation using a neural evaluator on the ShapeGlot dataset (Table 3) demonstrates that shapes generated by AutoSDF align significantly better with text descriptions compared to baselines (Text2Shape, Joint-Encoding). Qualitative results (Figures 7, 8) show the generation of diverse and realistic shapes consistent with text descriptions, even for less common descriptions or those referring to specific parts.
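As referenced in the shape-completion item above, the full completion pipeline chains the hypothetical helpers sketched earlier (`extract_patches`, `complete_latents`); the encoder, decoder, and codebook interfaces below are assumptions made for illustration.

```python
import torch

@torch.no_grad()
def complete_shape(encoder, codebook, transformer, decoder, partial_tsdf, observed_mask):
    """End-to-end completion sketch tying the earlier pieces together.

    partial_tsdf:  (1, 1, 64, 64, 64) T-SDF with unobserved regions undefined.
    observed_mask: (512,) bool flags marking which of the 8^3 patches were seen.
    encoder is assumed to map each 8^3 patch to a 256-d feature.
    """
    # 1. Encode only the observed patches into discrete codes.
    patches = extract_patches(partial_tsdf)                 # (512, 1, 8, 8, 8)
    z_e = encoder(patches[observed_mask]).flatten(1)        # (n_obs, 256)
    idx = torch.cdist(z_e, codebook).argmin(dim=1)          # nearest codebook entries

    # 2. Let the non-sequential prior fill in the missing locations.
    observed = dict(zip(observed_mask.nonzero(as_tuple=True)[0].tolist(), idx.tolist()))
    full_codes = complete_latents(transformer, observed)    # (512,) indices

    # 3. Decode the completed 8^3 code grid back to a 64^3 T-SDF.
    z_q = codebook[full_codes].view(1, 8, 8, 8, 256).permute(0, 4, 1, 2, 3)
    return decoder(z_q)
```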
Implementation Considerations and Limitations
- Computational Resources: Training the entire system involves training a VQ-VAE, a large Transformer, and task-specific networks, requiring substantial computational power (GPUs) and time. Inference, being autoregressive, involves sequential sampling of latent codes, which can be slower than feed-forward deterministic methods, especially for larger latent grids.
- Data: While the task-specific conditionals are designed to be light-weight and trainable with less paired data, the core shape prior (VQ-VAE and Transformer) relies on large datasets of 3D shapes (ShapeNet).
- Representation: The method is tied to volumetric representations (T-SDFs). Applying it to other 3D formats like meshes or newer neural implicit representations would require significant adaptation or a different VQ-VAE structure.
- Alignment: The approach might be sensitive to the alignment and normalization of input shapes. Consistent pre-processing is necessary.
- Prior Bias: The learned prior reflects the distribution of shapes in the training data (ShapeNet), primarily artificial CAD models. Generating shapes far outside these categories is therefore not well supported.
- Approximation: The factorization of the conditional distribution as a product of the prior and naive conditional is an approximation. While effective in the low-paired-data regime, it might not fully capture complex dependencies and could be suboptimal compared to training a single end-to-end model on very large task-specific paired datasets (if available).
In summary, AutoSDF presents a practical framework for unified 3D shape generation and inference by leveraging a powerful non-sequential autoregressive prior learned over a patch-encoded discrete latent space. This allows flexible conditioning on various input modalities, outperforming specialized methods in several multi-modal tasks and offering capabilities like diverse multi-modal generation.
Related Papers
- SDFusion: Multimodal 3D Shape Completion, Reconstruction, and Generation (2022)
- Octree Transformer: Autoregressive 3D Shape Generation on Hierarchically Structured Sequences (2021)
- NAISR: A 3D Neural Additive Model for Interpretable Shape Representation (2023)
- PT43D: A Probabilistic Transformer for Generating 3D Shapes from Single Highly-Ambiguous RGB Images (2024)
- Learning Compositional Shape Priors for Few-Shot 3D Reconstruction (2021)