Autoregressive 3D Generation

Updated 23 April 2026

Autoregressive 3D generation is a sequential modeling approach that synthesizes 3D structures by factorizing their probability into conditioned, next-step predictions.
It leverages transformer decoders and discrete tokenization to compress and reconstruct high-dimensional 3D data through hierarchical, multiscale architectures.
This method enables efficient sampling and controllable generation of complex objects, often outperforming diffusion models in speed and fidelity.

Autoregressive 3D generation defines a family of generative modeling approaches in which 3D objects, scenes, or molecular structures are synthesized sequentially via next-step prediction, conditioned on previous outputs. Unlike diffusion or variational models that often favor holistic or iterative reconstruction, the autoregressive (AR) paradigm factorizes the data distribution as a chain of conditional probabilities, supporting exact likelihoods, flexible conditioning, and controllable, step-by-step construction of complex 3D configurations.

1. Mathematical Foundations and Autoregressive Factorization

Autoregressive 3D models decompose the joint probability of a structure into sequential conditional factors, enabling tractable likelihood computation and incremental synthesis. Concretely, for an ordered sequence of tokens $t_1, \ldots, t_N$ encoding the 3D data (such as atom types and coordinates, octree nodes, mesh tokens, or codebook indices), the probability is expressed as

$p(t_1,\ldots,t_N) = \prod_{i=1}^N p(t_i \mid t_{<i}).$

This factorization allows either classic "next-token" autoregression, which predicts the next symbol given all previous ones, or "next-scale" autoregression, which hierarchically predicts blocks or scales of tokens conditioned on coarser, previously generated ones (Chen et al., 2024, Medi et al., 2024, Meng et al., 11 Mar 2025). For molecular 3D generation, a specialized factorization may be used:

$p(a_{1:n},\,x_{1:n}) = \prod_{i=0}^{n-1} p(a_{i+1}, x_{i+1} \mid a_{1:i}, x_{1:i}),$

where $a_i$ and $x_i$ denote discrete atom types and continuous positions, respectively (Cheng et al., 20 May 2025).
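As a minimal sketch of the chain-rule factorization, the following Python snippet evaluates an exact log-likelihood by summing conditional log-probabilities and draws a sample one token at a time. The function `toy_conditional`, the vocabulary size, and the dependence on only the previous token are illustrative assumptions and do not come from any cited model; a real system would replace the toy conditional with a transformer decoder over the full prefix.

```python
import math
import random

VOCAB_SIZE = 4  # hypothetical vocabulary size, chosen for illustration


def toy_conditional(prefix):
    """Return p(t_i | t_<i) as a list of VOCAB_SIZE probabilities.

    Stand-in for a learned model: it favours repeating the previous token.
    """
    if not prefix:
        return [1.0 / VOCAB_SIZE] * VOCAB_SIZE
    probs = [0.1] * VOCAB_SIZE
    probs[prefix[-1]] = 0.7  # 0.7 + 3 * 0.1 = 1.0
    return probs


def log_likelihood(tokens):
    """Chain rule: log p(t_1..t_N) = sum_i log p(t_i | t_<i)."""
    total = 0.0
    for i, tok in enumerate(tokens):
        total += math.log(toy_conditional(tokens[:i])[tok])
    return total


def sample(length, rng):
    """Ancestral sampling: draw each token given the tokens drawn so far."""
    tokens = []
    for _ in range(length):
        probs = toy_conditional(tokens)
        tokens.append(rng.choices(range(VOCAB_SIZE), weights=probs)[0])
    return tokens
```

Calling `log_likelihood([2, 2, 0])` sums the three conditional terms log 0.25 + log 0.7 + log 0.1, which is the sense in which AR likelihoods are exact rather than bounded or estimated.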

Tokenization schemes are model-dependent. For continuous-space data, key structures include:

Sequences of atom types and 3D positions for molecules (Cheng et al., 20 May 2025, Li et al., 31 Oct 2025)
Hierarchical or multiscale point-cluster/voxel/octree or wavelet representations for shapes and scenes (Wei et al., 14 Apr 2025, Medi et al., 2024, Chen et al., 2024, Ibing et al., 2021, Cheng et al., 2022, Meng et al., 11 Mar 2025)
Modular part-based or asset-based sequences for engineered objects (Zhu et al., 12 Feb 2026, Li et al., 2024, Chen et al., 17 Jul 2025)
Meshes and surface elements as 1D sequences by quantization and face/vertex orderings (Li et al., 29 Jan 2026)
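To make the idea of turning geometry into a 1D sequence concrete, here is a hedged sketch of one possible vertex tokenization: coordinates are quantized into integer bins, vertices are sorted, and the result is flattened. The bin count, the coordinate range, and the (z, y, x) ordering are assumptions chosen for illustration; the cited models each define their own quantization and ordering.

```python
def quantize(value, lo, hi, bins):
    """Map a coordinate in [lo, hi] to an integer bin in [0, bins - 1]."""
    scaled = (value - lo) / (hi - lo) * bins
    return min(bins - 1, max(0, int(scaled)))


def vertices_to_tokens(vertices, lo=-1.0, hi=1.0, bins=128):
    """Flatten 3D vertices into a 1D token sequence.

    Vertices are quantized, sorted by (z, y, x), and emitted as
    z, y, x triples. The ordering is an assumed convention.
    """
    quantized = [tuple(quantize(c, lo, hi, bins) for c in v) for v in vertices]
    ordered = sorted(quantized, key=lambda q: (q[2], q[1], q[0]))
    tokens = []
    for x, y, z in ordered:
        tokens.extend([z, y, x])
    return tokens


def tokens_to_vertices(tokens, lo=-1.0, hi=1.0, bins=128):
    """Invert the tokenization up to quantization error (bin centres)."""
    step = (hi - lo) / bins

    def centre(token):
        return lo + (token + 0.5) * step

    return [
        (centre(tokens[i + 2]), centre(tokens[i + 1]), centre(tokens[i]))
        for i in range(0, len(tokens), 3)
    ]
```

A fixed sort order makes the sequence deterministic for a given vertex set, which is what lets a next-token model be trained on it; the round trip is lossy by at most half a bin width per coordinate.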

2. Model Architectures and Representation Strategies

AR 3D generators are unified by sequential backbone architectures—most commonly transformer decoders—with domain-specific preprocessing, embedding, and sequence design.

Canonical sequence definition and geometric equivariance: For molecules and atomic structures, canonicalization of atom order and pose is crucial. Inertial frame alignment, canonical Weisfeiler–Lehman labeling, and permutation-invariant tokenization are directly implemented in advanced models such as InertialAR (Li et al., 31 Oct 2025) and Uni-3DAR (Lu et al., 20 Mar 2025).

Discrete codebook or VQ-based hierarchies: High-dimensional 3D data are commonly compressed using multi-scale residual vector quantization or VQ-VAEs, producing discrete token streams that can be modeled efficiently by a transformer-based AR model. This enables aggressive compression (e.g., up to 2000 $\times$; Zhang et al., 2024) and tractable sequence lengths for high-resolution shapes.
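The residual vector quantization idea can be sketched in a few lines: each stage picks the nearest codebook entry for whatever the previous stages failed to explain, so a vector becomes a short list of indices. The hand-written codebooks below are hypothetical; in practice they are learned jointly with an encoder and decoder.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the remaining residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical hand-written codebooks: coarse first, then finer corrections.
CODEBOOKS = [
    [[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]],
    [[-0.5, -0.5], [-0.5, 0.5], [0.5, -0.5], [0.5, 0.5]],
    [[-0.25, -0.25], [-0.25, 0.25], [0.25, -0.25], [0.25, 0.25]],
]
```

With these codebooks, the vector `[0.8, -1.3]` is stored as three small integers, and decoding them gives a reconstruction whose error shrinks as more stages are used. The coarse-to-fine index order is also what a next-scale AR model would predict.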

Continuous vs. discrete output decomposition: For molecules, AR approaches often hybridize, using transformers to generate discrete atom types and auxiliary MLPs or diffusion heads to model conditional continuous coordinates (Cheng et al., 20 May 2025, Li et al., 31 Oct 2025). In "next-scale" AR, coarse voxel/point/octree blocks are autoregressively predicted, then refined via deterministic or diffusion upsampling (Chen et al., 2024, Medi et al., 2024, Rasoulzadeh et al., 2024).

Geometric positional encoding: For spatial coherence, models employ 3D rotary positional encodings (RoPE3D) (Li et al., 31 Oct 2025, Wei et al., 14 Apr 2025, Lützow et al., 27 Mar 2026), absolute or relative 3D coordinate embeddings, or block/scale-specific embeddings to guide self-attention and inductive bias for Euclidean structure.
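One way to extend rotary encodings to three dimensions is to split the feature vector into three chunks and apply a standard 1D rotary rotation per axis. The sketch below follows that assumption, with an interleaved-pair layout and a base of 10000 as further assumptions; the cited models may organize frequencies and channels differently.

```python
import math


def rope_3d(vec, position, base=10000.0):
    """Rotate feature pairs of vec by angles derived from a 3D position.

    The feature dimension is split into three equal chunks, one per axis
    (x, y, z); each chunk receives standard 1D rotary encoding for its
    axis. len(vec) must be divisible by 6.
    """
    if len(vec) % 6 != 0:
        raise ValueError("feature dimension must be divisible by 6")
    chunk = len(vec) // 3
    out = []
    for axis in range(3):
        part = vec[axis * chunk:(axis + 1) * chunk]
        for pair in range(chunk // 2):
            freq = base ** (-2.0 * pair / chunk)
            angle = position[axis] * freq
            a, b = part[2 * pair], part[2 * pair + 1]
            out.append(a * math.cos(angle) - b * math.sin(angle))
            out.append(a * math.sin(angle) + b * math.cos(angle))
    return out


def dot(u, v):
    """Plain dot product, used to inspect attention scores."""
    return sum(a * b for a, b in zip(u, v))
```

Because each pair is rotated by an angle linear in the coordinate, the dot product between an encoded query and key depends only on their relative offset, so translating both positions by the same amount leaves the attention score unchanged.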

3. Learning, Inference, and Sampling Efficiency

Unlike diffusion models, which iteratively denoise all tokens at every timestep, AR 3D generation typically requires only a single network forward pass per generated step, which can confer significant efficiency advantages (Cheng et al., 20 May 2025, Qian et al., 2024, Rasoulzadeh et al., 2024):

Quetzal achieves a 22.5–128$\times$ speedup over diffusion baselines on molecules (Cheng et al., 20 May 2025).
OctGPT realizes 69$\times$ faster sampling via multi-token MAR decoding (Wei et al., 14 Apr 2025).
HiFi-Mesh unlocks 6$\times$ longer sequences and 3$\times$ faster generation via latent hierarchical AR (Li et al., 29 Jan 2026).

Most AR models optimize cross-entropy losses over discrete next-step predictions, possibly augmented by variational, commitment, or denoising score matching losses for continuous structures. Exact likelihoods are directly computable (using, e.g., the trace of a 3 $\times$ 3 per-atom Jacobian for molecular DiffMLP heads (Cheng et al., 20 May 2025)), enabling unbiased model evaluation, Bayesian inference, and applications to energy-based modeling.
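The cross-entropy objective for discrete next-step prediction can be written down directly. The sketch below assumes the per-position logits have already been produced by some model under teacher forcing (that is, conditioned on the ground-truth prefix); the function names are illustrative and not taken from any cited codebase.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def next_token_loss(logits_per_step, targets):
    """Mean negative log-likelihood of the target tokens (teacher forcing).

    logits_per_step[i] holds the model's logits for position i, computed
    from the ground-truth prefix targets[:i].
    """
    nll = 0.0
    for logits, target in zip(logits_per_step, targets):
        nll -= log_softmax(logits)[target]
    return nll / len(targets)
```

Multiplying the mean loss by the sequence length gives the negative log-likelihood of the whole sequence, so the training objective and the exact likelihood used for evaluation are the same quantity up to scaling.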

Blockwise or next-scale AR (generating a set of tokens per scale rather than a single token) further accelerates sampling by reducing the number of sequential decoding steps. This approach has been shown to reduce latency from several seconds for full 3D diffusion to sub-second synthesis for high-fidelity objects (Chen et al., 2024, Zhang et al., 2024, Medi et al., 2024).
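A back-of-the-envelope count shows why predicting a whole scale at once helps. The sketch below assumes cubic token grids at each scale and counts sequential forward passes only; the resolutions are hypothetical, and the count is a rough proxy for latency because a next-scale pass processes more tokens than a single next-token pass.

```python
def sequential_steps(scale_resolutions):
    """Compare decoding steps for next-token and next-scale generation.

    Each entry r in scale_resolutions is an r x r x r token grid.
    Next-token AR needs one forward pass per token; next-scale AR needs
    one per scale, emitting all tokens of that scale in parallel.
    """
    tokens_per_scale = [r ** 3 for r in scale_resolutions]
    return {
        "total_tokens": sum(tokens_per_scale),
        "next_token_steps": sum(tokens_per_scale),
        "next_scale_steps": len(scale_resolutions),
    }
```

For scales of resolution 1, 2, 4, and 8, this gives 585 tokens in total, hence 585 sequential passes for next-token decoding against 4 for next-scale decoding.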

4. Applications: Molecular, Shape, Asset, and Part-Based Generation

Autoregressive 3D generation encompasses a spectrum of domains:

Molecular Structure Synthesis: Quetzal and InertialAR represent state-of-the-art AR methods for 3D molecule generation, supporting atom-by-atom or canonical SE(3)-invariant sequence generation, scaffold completion, and arbitrary-length sampling (Cheng et al., 20 May 2025, Li et al., 31 Oct 2025). Uni-3DAR extends these capabilities to large molecules, crystals, and proteins using hierarchical octree compression (Lu et al., 20 Mar 2025).

Point Cloud and Shape Synthesis: PointARU (Meng et al., 11 Mar 2025), Canonical Mapping (Cheng et al., 2022), and G3PT (Zhang et al., 2024) define a family of AR models utilizing multi-scale, compositionally ordered codebooks, transformer-based upsampling, and cross-scale querying. These approaches sidestep the challenges of unordered point sets by autoregressively predicting coarse-to-fine token maps, facilitating state-of-the-art completion and unconditional generation.

Voxel, Octree, and Wavelet-Based Generation: Octree Transformers and OctGPT leverage spatially compact octree sequences reducible to tractable lengths using adaptive convolutional compression and multi-scale tokenization (Ibing et al., 2021, Wei et al., 14 Apr 2025). Wavelet-guided AR (3D-WAG) and hierarchical diffusion-based upsampling models (e.g., ArchComplete) use multi-resolution decomposition, enabling efficient reconstruction and improved detail (Medi et al., 2024, Rasoulzadeh et al., 2024).
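To illustrate how an octree turns sparse occupancy into a short sequence, the following sketch serializes a set of occupied voxels breadth-first, emitting one 8-bit child-occupancy mask per visited node. The bit layout and traversal order are assumptions made for this example, and the additional compression used by the cited models is not reproduced here.

```python
def octree_tokens(voxels, size):
    """Serialize occupied voxels into breadth-first octree tokens.

    voxels: set of (x, y, z) integer coordinates in [0, size)^3.
    size: grid side length, a power of two and at least 2.
    Each token is an 8-bit mask telling which child octants are occupied.
    Only occupied octants are subdivided further.
    """
    tokens = []
    level = [((0, 0, 0), size, set(voxels))]  # (origin, side, voxels inside)
    while level:
        next_level = []
        for (ox, oy, oz), side, inside in level:
            half = side // 2
            children = [set() for _ in range(8)]
            for x, y, z in inside:
                idx = (
                    int(x - ox >= half)
                    | (int(y - oy >= half) << 1)
                    | (int(z - oz >= half) << 2)
                )
                children[idx].add((x, y, z))
            tokens.append(sum(1 << i for i in range(8) if children[i]))
            if half > 1:
                for i in range(8):
                    if children[i]:
                        origin = (
                            ox + half * (i & 1),
                            oy + half * ((i >> 1) & 1),
                            oz + half * ((i >> 2) & 1),
                        )
                        next_level.append((origin, half, children[i]))
        level = next_level
    return tokens
```

For a 4 x 4 x 4 grid with only two opposite corners occupied, the dense representation has 64 occupancy bits while the octree sequence has three tokens, since empty octants are never expanded.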

Mesh, Part, and Asset Generation: AR models for mesh and part-based data sequence quantized mesh tokens, asset primitives, or SDF latent representations (Zhu et al., 12 Feb 2026, Li et al., 29 Jan 2026, Li et al., 2024, Chen et al., 17 Jul 2025). HiFi-Mesh demonstrates scalable mesh synthesis with sub-linear complexity via compact AR dependencies and parallel decoding (Li et al., 29 Jan 2026). AR asset models (AssetFormer) and part generators (AutoPartGen) generalize modular design, enabling user-guided zero-shot shape editing, scene completion, and UGC pipeline integration (Zhu et al., 12 Feb 2026, Chen et al., 17 Jul 2025).

5. Evaluation, Comparative Performance, and Scaling Laws

State-of-the-art AR 3D generators report highly competitive or even superior sample fidelity, diversity, and structural validity compared to diffusion models:

Molecules: Quetzal and InertialAR achieve near-perfect atom and molecule stability; InertialAR achieves state-of-the-art results on 7 of 10 unconditional metrics and all five conditional targets across QM9, GEOM-Drugs, and B3LYP (Li et al., 31 Oct 2025, Cheng et al., 20 May 2025). Uni-3DAR improves QM9 molecule stability from 82.0% (EDM) to 93.7%, and GEOM-Drugs validity from 92.6% (EDM) to 99.4% (Lu et al., 20 Mar 2025).
Shapes/Point Clouds: PointARU, 3D-WAG, and G3PT approach or surpass diffusion models on coverage and MMD/CD while markedly reducing computation time (e.g., 3D-WAG achieves coverage similar to UDiff at 2.5 s vs. 39 s per shape (Medi et al., 2024); G3PT (1.5B) achieves an F-score of 83.0 vs. 63.4 for CLAY diffusion (Zhang et al., 2024)).
Meshes/Assets: HiFi-Mesh attains a Chamfer distance of 0.075, far outperforming baselines (0.501–3.413), with 81.17% user preference (Li et al., 29 Jan 2026); AssetFormer achieves FID ≈55 and CLIP score ≈0.32, outpacing procedural and mesh diffusion methods (Zhu et al., 12 Feb 2026).
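Since several of these comparisons rest on Chamfer distance, a brute-force sketch of one common definition may help. The version below uses squared Euclidean nearest-neighbour terms averaged in both directions; this is an assumption, as papers differ in whether they use squared or unsquared distances, sums or means, and extra scaling factors, so values reported under different conventions are not directly comparable.

```python
def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance with squared Euclidean nearest-neighbour terms.

    Returns mean_a min_b ||a - b||^2 + mean_b min_a ||a - b||^2.
    Brute force, O(len(a) * len(b)); fine for small illustrative inputs.
    """
    def sq_dist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    def one_sided(src, dst):
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_sided(points_a, points_b) + one_sided(points_b, points_a)
```

For example, comparing `[(0, 0, 0), (1, 0, 0)]` with `[(0, 0, 0)]` gives 0.5: the second point of the first set is at squared distance 1 from its nearest neighbour, averaged over two points, and the reverse direction contributes 0.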

Scaling behavior: G3PT is the first to demonstrate power-law scaling of AR loss with model size in 3D generation, confirming the optimization and sample-efficiency benefits when scaling transformer backbones to the order of billions of parameters (Zhang et al., 2024).
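A power-law relationship between loss and model size appears as a straight line in log-log space, so it can be checked with an ordinary least-squares fit. The sketch below assumes a pure power law with no irreducible-loss offset and uses synthetic data generated from a known law; the numbers are not measurements from G3PT or any other paper.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of loss = a * size**(-b) in log-log space.

    Returns (a, b). Assumes a pure power law with no irreducible-loss offset.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic data generated from a known law, NOT measurements from any paper.
SIZES = [1e8, 5e8, 1e9, 1.5e9]
LOSSES = [3.0 * s ** -0.1 for s in SIZES]
```

Running `fit_power_law(SIZES, LOSSES)` recovers the generating coefficient and exponent up to floating-point error; on real training curves the residuals of this fit indicate how well the power-law description holds.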

6. Limitations, Open Challenges, and Future Directions

Despite rapid progress, open issues remain:

Ordering and permutation dependence: Fixed sequence orderings (in .xyz or arbitrary canonicalizations) can limit generalization and cause order bias (Cheng et al., 20 May 2025). Some models fail under order randomization; canonical approaches (e.g., SE(3)-invariant tokenization (Li et al., 31 Oct 2025, Lu et al., 20 Mar 2025)) partially resolve this.
Error accumulation: Stepwise prediction errors accumulate, because mistakes made early in the sequence condition every later step; the effect is most severe for large structures where global geometric consistency is critical (Cheng et al., 20 May 2025, Li et al., 29 Jan 2026).
Scale and complexity: AR models tend to be costlier for very large shapes or scenes unless highly compressed representations (e.g., VAT, OctGPT) or next-scale AR are leveraged (Zhang et al., 2024, Wei et al., 14 Apr 2025, Chen et al., 2024). For scenes or articulated objects, more sophisticated context modeling and long-horizon compositional structure are required (Lützow et al., 27 Mar 2026, Wu et al., 14 Mar 2026).
Conditioning and controllability: While text, image, shape, or partial observations can be incorporated via cross-attention or prompt prepending, further advances are needed for real-world robustness, unified multimodal models, and continuous-parameter or hybrid discrete/continuous outputs (Zhu et al., 12 Feb 2026).
Generalization and variable size: Many AR models, particularly those with fixed-scale or sequence length, cannot generate objects larger than training data or out-of-distribution part topologies without architectural modification (Cheng et al., 20 May 2025).

Research avenues include order-agnostic or masked-diffusion AR, hybrid AR-diffusion models for resolving error accumulation, and unified next-step multimodal transformers for joint 3D generation and understanding (Cheng et al., 20 May 2025, Chen et al., 2024, Lu et al., 20 Mar 2025).
