Non-sequential Autoregressive Shape Priors

Updated 23 February 2026
  • The paper introduces a novel framework combining vector-quantized autoencoding with transformer-based autoregressive modeling for 3D shape inference.
  • It employs grid-based latent representations and non-fixed sampling orders to condition on arbitrary spatial subsets.
  • Experimental results demonstrate improved shape completion metrics and multimodal generation capabilities over state-of-the-art baselines.

Non-sequential autoregressive shape priors are probabilistic models designed to capture the joint distribution of 3D geometric object representations, enabling flexible, multimodal inference conditioned on partial or arbitrary observations. By combining grid-based discretization, vector-quantized autoencoding, and transformer-based non-sequential autoregressive modeling, these priors support tasks such as shape completion, generation, and conditional reconstruction from image, language, or partial-shape inputs. The central feature is an autoregressive factorization over latent shape codes in which the sampling/conditioning order is not fixed (non-sequential), allowing the model to accommodate arbitrary observed spatial patterns, a property crucial for 3D reasoning under uncertainty and limited evidence (Mittal et al., 2022).

1. Discretized Grid-based Latent Representations

Non-sequential autoregressive shape priors operate on a structured, symbolic latent encoding of 3D shapes. The input is a Truncated Signed Distance Field (T-SDF), sampled on a uniform $64^3$ grid. Shapes are compressed using a Vector-Quantized Variational Autoencoder (VQ-VAE) with a patchwise encoder $E_\psi$ that splits the input into $8^3$ non-overlapping subvolumes. Each subvolume is projected to a codebook index, yielding a latent tensor $Z \in \{1, \ldots, K\}^{8 \times 8 \times 8}$, where $K$ is the codebook size. The full shape is then summarized by the collection of these indices, and a shared decoder $D_\psi$ reconstructs the T-SDF from $Z$. The VQ-VAE is trained with standard reconstruction, codebook, and commitment losses (Mittal et al., 2022).
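The encode-and-quantize step above can be sketched in a few lines of numpy. This is an illustrative stand-in only: a random codebook and a nearest-neighbour lookup take the place of the learned encoder $E_\psi$ and its trained codebook, but the patch layout ($64^3$ grid, $8^3$ patches, an $8\times8\times8$ index tensor) follows the description above.

```python
import numpy as np

def quantize_tsdf(tsdf, codebook):
    """Split a 64^3 T-SDF into 8^3 non-overlapping patches and map each
    patch to the index of its nearest codebook entry (stand-in for E_psi + VQ)."""
    d, p = 64, 8
    n = d // p  # 8 patches per axis -> an 8x8x8 latent index grid
    Z = np.empty((n, n, n), dtype=np.int64)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                patch = tsdf[i*p:(i+1)*p, j*p:(j+1)*p, k*p:(k+1)*p].ravel()
                # nearest-neighbour lookup over all K codebook vectors
                dists = np.linalg.norm(codebook - patch, axis=1)
                Z[i, j, k] = int(np.argmin(dists))
    return Z

K = 512  # illustrative codebook size, not necessarily the paper's value
rng = np.random.default_rng(0)
codebook = rng.normal(size=(K, 8**3))   # random stand-in for a learned codebook
tsdf = rng.normal(size=(64, 64, 64))    # random stand-in for a real T-SDF
Z = quantize_tsdf(tsdf, codebook)
print(Z.shape)  # (8, 8, 8)
```

A real VQ-VAE would apply convolutional encoding before the nearest-neighbour lookup and would backpropagate through the quantization via the straight-through estimator; the sketch shows only the discretization geometry.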

2. Non-sequential Autoregressive Prior Factorization

The core methodological advance is learning a joint prior $P(Z)$ over the latent grid $Z$ that is factorized as

$$P(Z) = \Pr(z_1, \ldots, z_N) = \prod_{j=1}^{N} P_\theta\left(z_{g_j} \mid z_{g_1}, \ldots, z_{g_{j-1}}\right)$$

where $(g_1, \ldots, g_N)$ is a randomly chosen permutation of all $N = 512$ latent positions (for $d = 8$). During training, every batch samples a new permutation, and a Transformer $T_\theta$ is trained with upper-triangular masking to predict the next code conditioned on any arbitrary, previously chosen subset. Positional encoding via Fourier features conveys 3D cell coordinates. The objective is maximum likelihood across all permutations, supporting conditioning on any subset of cells at inference time (Mittal et al., 2022).
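The permuted-order training setup can be illustrated by sampling a generation order and deriving the attention mask it induces. The following minimal numpy sketch covers only the masking logic (omitting the Transformer itself): under the sampled permutation, the mask is exactly the upper-triangular causal mask described above.

```python
import numpy as np

def nonsequential_mask(N, rng):
    """Sample a random generation order over N latent cells and build the
    attention mask it induces: allowed[a, b] is True iff cell b precedes
    cell a in this batch's sampled permutation."""
    g = rng.permutation(N)           # generation order g_1, ..., g_N
    rank = np.empty(N, dtype=np.int64)
    rank[g] = np.arange(N)           # rank[cell] = position of cell in the order
    allowed = rank[:, None] > rank[None, :]
    return g, allowed

rng = np.random.default_rng(0)
g, mask = nonsequential_mask(512, rng)
# the first cell in the order attends to nothing; the last attends to all others
print(mask[g[0]].sum(), mask[g[-1]].sum())  # 0 511
```

Resampling the permutation every batch is what lets a single trained model condition on any observed subset at inference time: whatever cells are observed can simply be placed first in the order.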

3. Task-specific Conditional Models and Inference

The non-sequential prior can be combined with task-specific conditionals:

$$p_\phi(z_{\mathbf i} \mid C)$$

for each latent cell $\mathbf i$, where $C$ is task-dependent conditioning (e.g., image, text, or observed shape part). These conditionals are modeled independently for each cell using compact neural networks, such as ResNet-18 (for images) or BERT (for text), each trained to maximize the likelihood of the code given the conditioning, using minimal paired data. At inference, a factorized conditional distribution is approximated by fusing the prior and conditional terms:

$$P(Z \mid C) \approx \prod_{j=1}^{N} \left[P_\theta(z_{g_j} \mid z_{g_{<j}})\right]^{1-\alpha} \left[p_\phi(z_{g_j} \mid C)\right]^{\alpha}$$

for some hyperparameter $\alpha \in [0,1]$ (e.g., $0.75$ for images, $0.5$ for language). Sampling proceeds autoregressively over a random ordering, leveraging both the generic shape prior and the task-specific conditional (Mittal et al., 2022).
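The fusion rule for a single cell can be sketched directly: combine the two categorical distributions geometrically in log space (which is numerically stabler than multiplying probabilities), renormalize, and sample. The distributions below are random placeholders standing in for the outputs of the trained prior and conditional networks.

```python
import numpy as np

def fuse_and_sample(prior_logp, cond_logp, alpha, rng):
    """Sample one latent code from the geometric fusion
    P ∝ prior^(1 - alpha) * conditional^alpha, computed in log space."""
    logp = (1.0 - alpha) * prior_logp + alpha * cond_logp
    p = np.exp(logp - logp.max())   # subtract max for numerical stability
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

K = 512  # illustrative codebook size
rng = np.random.default_rng(0)
# random stand-ins for P_theta(z | z_<j) and p_phi(z | C) at one cell
prior_logp = np.log(rng.dirichlet(np.ones(K)))
cond_logp = np.log(rng.dirichlet(np.ones(K)))
z = fuse_and_sample(prior_logp, cond_logp, alpha=0.75, rng=rng)
print(0 <= z < K)  # True
```

With $\alpha = 0$ the sample comes purely from the shape prior; with $\alpha = 1$ purely from the task conditional; intermediate values trade off generic shape plausibility against conditioning fidelity, matching the per-task settings quoted above.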

4. Experimental Evaluation

Comprehensive evaluation demonstrates the expressiveness and versatility of non-sequential autoregressive shape priors. In shape completion tasks on ShapeNet (e.g., bottom-half and octant settings), fidelity is measured via Unidirectional Hausdorff Distance (UHD $\downarrow$), and diversity via Total Mutual Difference (TMD $\uparrow$). The model outperforms specialized baselines (e.g., UHD $0.0567$ vs $0.0572$ for MPC; TMD $0.0693$ vs $0.0376$ for PoinTr). Single-view reconstruction yields higher IoU, lower Chamfer distance, and better F-Score than state-of-the-art methods (e.g., IoU $0.577$ vs $0.486$ for Image-JE; CD $1.331$ vs $1.972$; F-Score $0.414$ vs $0.338$). For language-conditioned generation on ShapeGlot, the model is preferred 66% to 18% over Text2Shape. Multimodality is supported by the ability to generate multiple plausible completions from the same input (Mittal et al., 2022).
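As a rough illustration of the two completion metrics, the numpy sketch below implements plausible point-set definitions of UHD (every observed point should lie near some point of the completion) and TMD (completions of the same input should differ from one another). Exact evaluation protocols in the literature may differ, e.g., in normalization, point counts, and averaging conventions.

```python
import numpy as np

def uhd(partial, completion):
    """Unidirectional Hausdorff distance from the partial input to a
    completion: the worst-case distance from an observed point to its
    nearest completed point (fidelity to the observation)."""
    d = np.linalg.norm(partial[:, None, :] - completion[None, :, :], axis=-1)
    return d.min(axis=1).max()

def tmd(completions):
    """Total Mutual Difference: for each completion, the mean Chamfer
    distance to the other completions, summed over all completions
    (diversity across samples)."""
    def chamfer(a, b):
        d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
        return d.min(axis=1).mean() + d.min(axis=0).mean()
    k = len(completions)
    return sum(
        np.mean([chamfer(completions[i], completions[j]) for j in range(k) if j != i])
        for i in range(k)
    )

rng = np.random.default_rng(0)
partial = rng.normal(size=(64, 3))                                   # toy observed points
comps = [partial + rng.normal(scale=0.01, size=(64, 3)) for _ in range(3)]  # toy completions
print(uhd(partial, comps[0]) < 0.2, tmd(comps) > 0.0)  # True True
```

The pairing of the two metrics captures the tension discussed above: a degenerate model could score well on UHD by copying the input, but only a genuinely multimodal prior also achieves high TMD.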

5. Advantages and Generalization Properties

Non-sequential autoregressive shape priors present several key advantages:

  • Conditioning on arbitrary spatial subsets without retraining, so a single trained prior applies across completion, inpainting, and synthesis tasks.
  • A single, generic prior suffices across multiple modalities and tasks, requiring only lightweight supervision for new modalities.
  • Multimodal output generation, capturing diverse plausible shapes under ambiguous or limited evidence.

However, this approach is grid-based, limiting direct application to meshes or continuous implicit fields and requiring global alignment of input shapes (canonical frame). Approximate factorization of the conditional may underperform fully joint modeling when sufficient paired data is available. Biases can arise if training shapes are concentrated in a few categories (Mittal et al., 2022).

6. Context and Related Paradigms

Non-sequential autoregressive modeling extends prior work in sequentially factorized probabilistic shape modeling by relaxing the rigid coordinate ordering and enabling flexible inference over partial observations. The approach leverages Transformer architectures, drawing on advances in non-local attention and unordered sequence modeling. It contrasts with models in which the factorization is fixed and conditioning on arbitrary observation subsets is not directly accommodated. The paradigm is particularly relevant for applications demanding shape completion from arbitrary viewpoints, partial scans, or uncertain context, supporting research in robotic perception, reverse engineering, and rapid prototyping. Continued work aims to extend these methods beyond voxel grids to mesh, point cloud, or implicit representations, with challenges in scalability and alignment (Mittal et al., 2022).

7. Limitations and Future Directions

Limitations include reliance on grid-aligned T-SDF data, approximate conditioned inference, and performance sensitivity to global shape alignment. Extending the non-sequential approach to more general geometric and topological domains remains nontrivial. Future work targets adaptation to continuous or mesh-based representations, mitigation of category bias, and integration of richer modalities or context. Analysis of learned latent structures and interpretability of multimodal outcomes are promising research directions (Mittal et al., 2022).
