AutoSDF: Learned Shape Priors for 3D Modeling
- AutoSDF is a framework that uses learned SDF representations and symbolic latent encodings for effective 3D shape completion, reconstruction, and generation.
- It employs a VQ-VAE model and an autoregressive transformer prior to capture fine geometric details, achieving superior performance on benchmarks like ShapeNet and Pix3D.
- The methodology enables robust conditional inference from partial, noisy data and integrates multi-modal cues from images and text to guide shape generation.
AutoSDF refers to a family of methodologies leveraging learned shape priors encoded as Signed Distance Fields (SDFs) or their low-dimensional symbolic representations for 3D reasoning tasks. Across the primary literature, AutoSDF techniques encompass generative and discriminative pipelines for 3D shape completion, scene annotation, reconstruction from limited or ambiguous data, and conditional generation guided by modalities such as images or language (Mittal et al., 2022, Zakharov et al., 2019). While methodologies vary, all AutoSDF approaches are united by the use of geometric priors that allow inference even from severely partial or noisy observations, with core modules including neural SDF decoders, symbolic latent encodings (often via vector quantization), and compositional autoregressive priors.
1. Symbolic and Implicit Shape Representation
The representation backbone of AutoSDF is the Signed Distance Field (SDF), a continuous implicit surface model that predicts, for any spatial query and shape code, the signed distance to the object's surface (Zakharov et al., 2019). In generative pipelines, volumetric variants such as truncated SDFs (TSDF) on a $64^3$ grid are discretized for compatibility with patchwise vector-quantized variational autoencoders (P-VQ-VAE). The encoder divides the volume into $8^3$ non-overlapping patches (each of size $8^3$), independently encodes each to a 256-dimensional vector, and assigns symbolic indices via nearest-codebook quantization (Mittal et al., 2022). This yields a compact latent tensor $Z \in \{1, \dots, 512\}^{8 \times 8 \times 8}$, effecting a dimensionality collapse while preserving fine surface geometry (cf. supplemental Fig. S1 in (Mittal et al., 2022)).
The VQ-VAE loss function combines reconstruction, vector-quantization, and commitment penalties: $\mathcal{L} = \lVert x - D(e) \rVert_2^2 + \lVert \mathrm{sg}[E(x)] - e \rVert_2^2 + \beta \lVert E(x) - \mathrm{sg}[e] \rVert_2^2$, where $\mathrm{sg}[\cdot]$ denotes the stop-gradient operator, $E$ and $D$ the encoder and decoder, $e$ the selected codebook embedding, and $\beta$ the commitment weight.
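As a concrete illustration, the nearest-codebook assignment and the two quantization penalties can be sketched in a few lines of NumPy. Sizes here are toy values, not the paper's codebook of 512 entries with 256-dimensional embeddings, and the stop-gradient is a no-op outside an autodiff framework:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, P = 16, 8, 5                  # toy codebook size, embedding dim, patch count
codebook = rng.normal(size=(K, D))  # learnable embeddings e_k
z_e = rng.normal(size=(P, D))       # encoder outputs E(x), one per patch

# Nearest-codebook quantization: one symbolic index per patch.
dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
idx = dists.argmin(axis=1)          # symbolic latent entries (the "tokens")
z_q = codebook[idx]                 # quantized embeddings fed to the decoder

# Quantization penalties from the loss above (reconstruction term omitted;
# beta = 0.25 is an illustrative commitment weight, not necessarily the paper's).
beta = 0.25
codebook_term = ((z_e - z_q) ** 2).mean()   # ||sg[E(x)] - e||^2
commitment_term = beta * codebook_term      # beta * ||E(x) - sg[e]||^2
```

In a real training loop the two terms differ only in where gradients flow (into the codebook versus into the encoder); numerically they coincide.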
2. Autoregressive Prior over Symbolic Grids
AutoSDF introduces a learned prior over the symbolic latent space using an autoregressive regime. Rather than imposing a fixed raster order, the prior factorizes over arbitrary spatial permutations $g$: $p(Z) = \prod_{i=1}^{n} p\big(z_{g(i)} \mid z_{g(1)}, \dots, z_{g(i-1)}\big)$ (training objective: minimize the negative log-likelihood averaged over random permutations; see Eq. (2) in (Mittal et al., 2022)). The prior is modeled as a transformer, allowing each step to be conditioned on an arbitrary set of spatially anchored prior tokens, a strict generalization of sequential raster ordering. Inputs to the transformer include observed token values and their Fourier-embedded 3D coordinates.
For shape completion, given a partial observation encoded as a token subset $\hat{Z} \subseteq Z$, the completion posterior is sampled autoregressively: $p(Z \setminus \hat{Z} \mid \hat{Z}) = \prod_{i} p\big(z_{g(i)} \mid \hat{Z},\, z_{g(1)}, \dots, z_{g(i-1)}\big)$, visiting the unobserved locations in a random order. The completed shape is reconstructed by decoding the full latent grid through the VQ-VAE decoder.
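The non-sequential completion loop can be sketched as follows; `toy_logits` is a hypothetical stand-in for the transformer's per-step prediction, not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(1)
G, K = 8, 4                          # flattened latent-grid size, codebook size (toy)
tokens = np.full(G, -1)              # -1 marks unobserved grid locations
observed = np.array([0, 3, 5])       # positions encoded from the partial shape
tokens[observed] = rng.integers(0, K, size=observed.size)

def toy_logits(tokens, pos):
    # Placeholder for the transformer conditioned on all known tokens and
    # their (Fourier-embedded) coordinates; here just a smoothed histogram.
    known = tokens[tokens >= 0]
    return np.log(np.bincount(known, minlength=K) + 1.0)

missing = np.flatnonzero(tokens < 0)
for pos in rng.permutation(missing): # arbitrary spatial order, not raster
    logits = toy_logits(tokens, pos)
    p = np.exp(logits - logits.max())
    tokens[pos] = rng.choice(K, p=p / p.sum())
# `tokens` now holds a fully populated latent grid, ready for the VQ-VAE decoder.
```

Observed tokens stay fixed throughout; only the missing locations are sampled, which is what preserves partial geometric cues in the completed shape.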
3. Conditioning and Multi-modal Inference
For conditional tasks, AutoSDF approximates the true conditional $p(Z \mid c)$ (for context $c$ such as an image or text prompt) by a multiplicative or "naive" factorization: $p(Z \mid c) \approx \prod_{i} p(z_i \mid z_{<i})\, p(z_i \mid c)$ (see Eq. (4) in (Mittal et al., 2022)). The task-specific network $p(z_i \mid c)$ is trained per-index via cross-entropy for each location $i$, with distinct backbones for visual (ResNet-18) and language (BERT+MLP) conditioning (cf. Figure 1). At inference, tokens are sampled from a weighted mixture of the prior and conditional logits, $\ell_i = \lambda\, \ell_i^{\text{prior}} + (1 - \lambda)\, \ell_i^{\text{cond}}$, with $\lambda$ tuned by task.
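A minimal sketch of the per-location mixture step, assuming logit-space interpolation as described above (`lam` is a hypothetical name for the task-tuned weight):

```python
import numpy as np

def mixed_distribution(prior_logits, cond_logits, lam=0.5):
    """Combine autoregressive-prior and conditional logits, then normalize."""
    logits = lam * prior_logits + (1.0 - lam) * cond_logits
    p = np.exp(logits - logits.max())   # softmax with max-shift for stability
    return p / p.sum()

prior = np.array([2.0, 0.0, -1.0])      # prior favors symbol 0
cond = np.array([0.0, 3.0, 0.0])        # image/text evidence favors symbol 1
p = mixed_distribution(prior, cond, lam=0.5)
print(p.argmax())                        # -> 1: conditional evidence dominates
```

Setting `lam=1.0` recovers unconditional sampling from the prior; lowering it lets the image or language evidence steer individual token choices.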
4. Training and Inference Pipelines
The AutoSDF pipeline comprises the following stages:
- VQ-VAE Training: Minimize $\mathcal{L}$ on ShapeNet CAD shapes (13 object classes), codebook size 512, embedding dimension 256, patch size $8^3$ yielding an $8^3$ latent grid (Mittal et al., 2022).
- Autoregressive Prior: Train a 12-layer, 12-head transformer on random spatial permutations, minimizing cross-entropy over all token predictions.
- Naive-Conditional Modules: For each task, train the appropriate network to maximize log-likelihood across latent indices using available context-annotation pairs (image/TSDF or text/TSDF).
- Inference Algorithms: Provided as explicit pseudocode.
- ShapeCompletion(): Encode observed tokens, sequentially sample missing ones by autoregressive prior, then decode.
- ConditionalGeneration(): Evaluate naive conditionals, combine with prior, sample autoregressively, decode to TSDF.
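The random-permutation training signal for the prior (second stage above) reduces to averaging per-token cross-entropy along a shuffled factorization order. A toy version with an untrained, uniform stand-in for the transformer (`model_logits` is a hypothetical placeholder):

```python
import numpy as np

rng = np.random.default_rng(2)
G, K = 8, 4                          # latent-grid size, codebook size (toy)
target = rng.integers(0, K, size=G)  # ground-truth symbolic indices for one shape

def model_logits(context_tokens, context_pos, query_pos):
    # Placeholder for the transformer; untrained, hence uniform over symbols.
    return np.zeros(K)

perm = rng.permutation(G)            # one random factorization order
nll = 0.0
for i, pos in enumerate(perm):
    logits = model_logits(target[perm[:i]], perm[:i], pos)
    log_z = logits.max() + np.log(np.exp(logits - logits.max()).sum())
    nll -= logits[target[pos]] - log_z   # cross-entropy at this step
avg_nll = nll / G                    # uniform model: exactly log(K) per token
```

A real run would draw a fresh permutation per batch and backpropagate `nll` through the transformer; the uniform stand-in simply verifies the bookkeeping, giving log(K) nats per token.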
5. Empirical Evaluation
Shape Completion: On ShapeNet chairs with severely partial input (bottom half or octant), the method achieves unidirectional Hausdorff distance (UHD) of 0.0567 and total mutual diversity (TMD) of 0.0341, outperforming MPC [Wu20] and PoinTr [Yu21]; see Tab. 1 in (Mittal et al., 2022). Qualitative results demonstrate preservation of partial geometric cues and higher diversity of plausible outputs (Figs. 4–5).
Single-view Reconstruction: Benchmarked on ShapeNet and Pix3D, AutoSDF achieves IoU 0.577 (CD 1.331, F-score@1% 0.414) on ShapeNet, outperforming ResNet2TSDF (IoU 0.554), Pix2Vox, and joint autoregressive baselines; similar trends on Pix3D; see Tab. 2 and Fig. 6.
Language-guided Generation: On ShapeGlot, a learned neural evaluator ("Listener") prefers AutoSDF completions over Text2Shape [Chen18] in 66% of pairs (vs. 18% for baseline, 16% confused cases), and over a joint encoder (61% vs. 23%); see Tab. 3 and Fig. 7.
6. Limitations and Future Prospects
The naive conditional factorization does not model higher-order correlations between locations, and thus underperforms end-to-end joint modeling when large paired datasets are available. All current AutoSDF methods require gridded SDF/voxel representations; extension to mesh representations or continuous implicit function models is an open direction. Results and priors are sensitive to the canonical alignment of input CAD shapes; generalization to organically shaped or in-the-wild scanned objects remains to be demonstrated. When autolabels are used for downstream tasks such as detection, certain detection metrics drop relative to training on ground-truth labels, particularly under monocular and 3D evaluation protocols (Zakharov et al., 2019).
7. Relationship to Other SDF-based Autolabeling Pipelines
AutoSDF also refers to methods for autolabeling 3D objects by optimizing SDF-based shape priors against 2D and sparse 3D measurements (Zakharov et al., 2019). These approaches leverage implicit shape models (e.g., DeepSDF), differentiable rendering pipelines (surface-tangent disc surfels, NOCS color projection), and combined 2D/3D alignment losses to refine cuboid and shape predictions for dataset annotation (e.g., KITTI3D). Label quality is improved via curriculum learning, alternating between automatic label generation and retraining of the CSS network, with performance saturating after two self-improvement loops. The method recovers cuboids and shapes from detector outputs with high accuracy under both 2D/BEV and 3D evaluation (IoU and nuScenes-style metrics).
These two lines of AutoSDF research demonstrate the breadth of SDF priors in 3D vision, uniting generative shape modeling and discriminative annotation pipelines via the use of learned geometric structure priors.
References:
- "AutoSDF: Shape Priors for 3D Completion, Reconstruction and Generation" (Mittal et al., 2022)
- "Autolabeling 3D Objects with Differentiable Rendering of SDF Shape Priors" (Zakharov et al., 2019)