
Autoregressive Polygon Decoders in 3D Mesh Generation

Updated 10 March 2026
  • Autoregressive polygon decoders are neural architectures that sequentially generate 3D meshes by predicting vertices and faces using discrete tokenization and Transformer-based models.
  • They utilize explicit mesh representations with quantized coordinates and special tokens, achieving metrics such as 85-90% prediction accuracy and low bits/vertex values.
  • Conditioning on class, image, or voxel inputs allows these models to reconstruct diverse and detailed meshes, matching real data distributions and improving reconstruction quality.

Autoregressive polygon decoders refer to neural architectures that generate 3D mesh representations by sequentially predicting mesh vertices and faces, typically using an autoregressive probabilistic model. This approach directly models the discrete polygonal mesh structure, in contrast to earlier methods that employ alternative representations such as voxels or implicit functions. A canonical example is PolyGen, which leverages a two-stage Transformer-based architecture to produce polygon meshes suitable for graphics, robotics, and related tasks (Nash et al., 2020).

1. Mesh Representation and Tokenization

Autoregressive polygon decoders operate on explicit mesh representations $M = (V, F)$, where $V$ is the ordered set of vertices and $F$ the set of polygonal faces. PolyGen adopts an 8-bit uniform quantization of each 3D coordinate, mapping every $x$, $y$, and $z$ value to one of 256 discrete bins. Two special tokens, “s” (end-of-sequence) and “n” (new-face marker, used only in face decoding), demarcate structural boundaries.

Vertex Sequence Construction:

Vertices are sorted by ascending $z$-coordinate, then $y$, then $x$, and serialized into a 1D sequence as

$$V_{seq} = [z_1, y_1, x_1, z_2, y_2, x_2, \ldots, z_{N_V}, y_{N_V}, x_{N_V}, s]$$

where $z_i, y_i, x_i \in \{0, \ldots, 255\}$ and $s$ is the end-of-sequence token.

Face Sequence Construction:

After vertices are sampled, faces are enumerated in increasing order of their smallest vertex index. Within each face, indices are cyclically rotated so that the smallest index appears first. The faces are then flattened, separated by “n” markers, and terminated by “s”:

$$F_{seq} = [f_1^{(1)}, \ldots, f_{N_1}^{(1)}, n, f_1^{(2)}, \ldots, f_{N_2}^{(2)}, n, \ldots, s]$$

where each $f_k^{(j)}$ is a vertex index from $1$ to $N_V$.
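This serialization can be sketched in a few lines. The token ids, helper names, and assumed input conventions (coordinates in $[-1, 1]$, columns ordered $(x, y, z)$) below are illustrative assumptions, not PolyGen's reference implementation:

```python
import numpy as np

STOP = 256      # "s" token: one id beyond the 256 coordinate bins (assumed)
NEW_FACE = -1   # "n" marker used only in the face sequence (assumed)

def quantize(vertices, bits=8):
    """Map float coordinates in [-1, 1] to integer bins {0, ..., 2^bits - 1}."""
    n_bins = 2 ** bits
    v = np.clip((np.asarray(vertices) + 1.0) / 2.0, 0.0, 1.0 - 1e-9)
    return (v * n_bins).astype(np.int64)

def serialize_vertices(vertices):
    """Sort by (z, y, x) and flatten to [z1, y1, x1, ..., STOP]."""
    q = quantize(vertices)
    # lexsort: last key is the primary sort key, so z dominates, then y, then x.
    order = np.lexsort((q[:, 0], q[:, 1], q[:, 2]))
    zyx = q[order][:, ::-1]  # reorder columns from (x, y, z) to (z, y, x)
    return zyx.reshape(-1).tolist() + [STOP]

def serialize_faces(faces):
    """Rotate each face so its smallest index is first, sort faces, flatten."""
    rotated = []
    for f in faces:
        k = f.index(min(f))
        rotated.append(f[k:] + f[:k])
    rotated.sort()  # lexicographic order, i.e. by smallest vertex index first
    seq = []
    for f in rotated:
        seq.extend(f)
        seq.append(NEW_FACE)
    seq.append(STOP)
    return seq

print(serialize_vertices([[0.5, -0.5, 0.0], [-1.0, -1.0, -1.0]]))
# [0, 0, 0, 128, 64, 192, 256]
print(serialize_faces([[2, 0, 1], [1, 2, 3]]))
# [0, 1, 2, -1, 1, 2, 3, -1, 256]
```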

Model inputs to the Transformer include learned value embeddings (mapping quantized values or special tokens to $D$-dimensional vectors), coordinate-type embeddings (distinguishing $x$, $y$, and $z$), and position embeddings (indexing vertex or face sequence position).
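The three embedding types are summed per token. A minimal NumPy sketch, with random stand-ins for trained lookup tables and assumed sizes (the coordinate ordering, table capacity, and helper name are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 256        # embedding width, matching the hidden dim reported in the paper
VOCAB = 257    # 256 quantized coordinate values + the stop token (assumed)

# Random stand-ins for learned parameters.
value_emb = rng.normal(size=(VOCAB, D))
coord_emb = rng.normal(size=(3, D))     # 0 = z, 1 = y, 2 = x (assumed order)
pos_emb = rng.normal(size=(1000, D))    # one row per vertex slot (assumed cap)

def embed_vertex_tokens(v_seq):
    """Each token's input is value + coordinate-type + position embedding."""
    out = np.zeros((len(v_seq), D))
    for t, tok in enumerate(v_seq):
        out[t] = value_emb[tok] + coord_emb[t % 3] + pos_emb[t // 3]
    return out

E = embed_vertex_tokens([0, 0, 0, 128, 64, 192])
print(E.shape)  # (6, 256)
```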

2. Autoregressive Probability Modeling

The underlying generative framework factorizes mesh probability as

$$p(M) = p(V) \cdot p(F \mid V)$$

with sequential expansion within each phase:

$$p(V_{seq}; \theta) = \prod_{t=1}^{L_V} p(v_t \mid v_{<t}; \theta), \qquad p(F_{seq} \mid V; \theta) = \prod_{t=1}^{L_F} p(f_t \mid f_{<t}, V; \theta)$$

where $L_V = 3N_V + 1$ and $L_F = \sum_i N_i + N_{\text{faces}} + 1$. During sampling, $V_{seq}$ is first generated until “s” is emitted, then decoded into the vertex set $V$. With $V$ fixed, the face sequence $F_{seq}$ is autoregressively sampled and decoded into the polygon faces $F$. Both steps are formally probabilistic, supporting multiple plausible outputs for a given condition.
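The chain-rule factorization means the log-probability of a whole sequence is just the sum of per-step conditional log-probabilities. A toy illustration with a hypothetical stand-in model (the function and its preference rule are invented for demonstration):

```python
import numpy as np

def next_token_probs(prefix, vocab=4):
    """Hypothetical model: categorical over the next token given the prefix,
    with a slight preference for repeating the last token."""
    p = np.ones(vocab)
    if prefix:
        p[prefix[-1]] += 2.0
    return p / p.sum()

def sequence_log_prob(seq):
    """log p(seq) = sum_t log p(v_t | v_<t), mirroring p(V_seq; theta)."""
    return sum(np.log(next_token_probs(seq[:t])[seq[t]])
               for t in range(len(seq)))

lp = sequence_log_prob([1, 1, 2])
# Steps: p=1/4 (uniform start), p=1/2 (repeat bonus), p=1/6, so lp = ln(1/48).
```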

3. Transformer Architecture and Conditioning

The decoder stack for vertices employs 18 layers of decoder-only Transformers (hidden dim $D = 256$, 8 heads), with masked self-attention restricting access to prior tokens $v_{<t}$. Face prediction uses an encoder-decoder configuration: an encoder over all vertex coordinates yields contextual embeddings $E_v \in \mathbb{R}^{(N_V + 2) \times D}$ (covering the actual vertex coordinates and the two special tokens), and a 12-layer Transformer decoder operates over $F_{seq}$ with cross-attention on $E_v$.

Conditioning is accomplished by introducing additional signals into the generation stack:

  • Class-conditioning: Each class is assigned a learned embedding, projected to $D$ dimensions and incorporated into each Transformer layer.
  • Image-conditioning: A 2D ResNet-style encoder produces $16 \times 16$ spatial feature maps, reshaped to a 256-length sequence for cross-attention.
  • Voxel-conditioning: Analogous to image-conditioning, but with a 3D ResNet yielding $(7 \times 7 \times 7) \times D$ embeddings.

These mechanisms allow the model to synthesize meshes from diverse upstream signals, enabling reconstruction from images, voxels, or class contexts.
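For the image and voxel cases, conditioning reduces to flattening the encoder's spatial grid into a sequence of context vectors that the decoder cross-attends to. A minimal sketch with zero-filled stand-ins for the encoder outputs:

```python
import numpy as np

D = 256  # decoder hidden width

# Stand-ins for encoder feature maps (a trained ResNet would produce these).
image_feats = np.zeros((16, 16, D))    # 2D ResNet: 16 x 16 spatial grid
voxel_feats = np.zeros((7, 7, 7, D))   # 3D ResNet: 7 x 7 x 7 spatial grid

# Collapse all spatial axes into one sequence axis for cross-attention.
image_ctx = image_feats.reshape(-1, D)  # (256, D) context sequence
voxel_ctx = voxel_feats.reshape(-1, D)  # (343, D) context sequence
print(image_ctx.shape, voxel_ctx.shape)
```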

4. Training, Optimization, and Metrics

Training maximizes the joint log-likelihood over a dataset of meshes $\{M^{(i)}\}$, equivalently minimizing the negative log-likelihood objective:

$$\mathcal{L}(\theta) = -\sum_i \left[\log p(V_{seq}^{(i)}; \theta) + \log p(F_{seq}^{(i)} \mid V^{(i)}; \theta)\right]$$

To enable comparison across variable-sized meshes, bits-per-vertex is reported:

$$\text{bits/vertex} = \frac{-\log_2 p(M)}{N_V}$$
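As a worked example of the metric (the mesh and its likelihood here are hypothetical numbers, chosen to land on PolyGen's reported total):

```python
import math

neg_log2_p = 428.0  # -log2 p(M) for a hypothetical mesh, in bits
n_vertices = 100
bits_per_vertex = neg_log2_p / n_vertices
print(bits_per_vertex)  # 4.28

# If the model reports log-likelihood in nats, convert to bits first:
neg_log_p_nats = neg_log2_p * math.log(2)
assert abs(neg_log_p_nats / math.log(2) / n_vertices - bits_per_vertex) < 1e-12
```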

Unconditional Baselines (ShapeNet):

  • Uniform valid baseline: 21.41 bits/vertex (vertices), 25.79 bits/vertex (faces)
  • Draco (8-bit quant.): ≈27.68 bits/vertex (total)
  • PolyGen achieves: 2.46 bits/vertex (vertices, ≈85% accuracy), 1.82 bits/vertex (faces, ≈90% accuracy), total 4.28 bits/vertex

For mesh reconstruction from conditionals, symmetric Chamfer distance (CD) between two meshes’ point clouds is used:

$$\mathcal{L}_{\text{chamfer}}(P, Q) = \sum_{p \in P} \min_{q \in Q} \|p - q\|^2 + \sum_{q \in Q} \min_{p \in P} \|p - q\|^2$$

where $P, Q$ are point clouds sampled from the two meshes.
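The symmetric Chamfer distance is straightforward to compute with broadcasting; a minimal NumPy sketch (quadratic in memory, fine for evaluation-sized point clouds):

```python
import numpy as np

def chamfer_distance(P, Q):
    """Symmetric Chamfer distance between point clouds P (n, 3) and Q (m, 3)."""
    # Pairwise squared distances: d2[i, j] = ||P[i] - Q[j]||^2.
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)
    # Nearest-neighbor terms in both directions.
    return d2.min(axis=1).sum() + d2.min(axis=0).sum()

P = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
Q = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
print(chamfer_distance(P, Q))  # 2.0
```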

PolyGen’s conditional mesh generation from images or voxels achieves next-step accuracy of 85.7% (vertices) and 90.0% (faces) for image conditionals, with 4.11 bits/vertex, and analogous performance for class- and voxel-conditioned setups.

5. Inference, Sampling, and Uncertainty Capture

At inference, vertices are autoregressively sampled token-by-token, with logits computed at each step by the Transformer stack and mapped via softmax to obtain sampling probabilities. A similar procedure is used for faces, using cross-attention on previously determined vertices. The sampling pseudocode is as follows:

V_seq = []
for t in range(max_len):
    logits = TransformerVertex(V_seq)
    probs = softmax(logits)
    v_t = sample_from(probs)  # supports top-p nucleus sampling
    if v_t == "s": break
    V_seq.append(v_t)
V = decode(V_seq)  # V_seq -> vertex set V

F_seq = []
for t in range(max_len):
    ptr_logits = TransformerFace(F_seq, encoder_outputs=V_emb)
    ptr_probs = softmax(ptr_logits)
    f_t = sample_from(ptr_probs)
    if f_t == "s": break
    F_seq.append(f_t)
F = decode(F_seq)  # F_seq -> face list F

return Mesh(V, F)

Nucleus (top-$p$) sampling with $p = 0.9$ is employed to avoid low-probability tails, empirically yielding more realistic mesh diversity. The probabilistic formulation allows ensemble sampling from a single input to capture multi-modal ambiguities inherent in, for example, image-based mesh inference. This supports greater diversity and plausible structural variation in outputs.
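Nucleus sampling keeps only the smallest set of tokens whose cumulative probability reaches $p$, renormalizes over that set, and samples from it. A self-contained sketch (the function name is ours; this is the standard technique, not PolyGen's exact code):

```python
import numpy as np

def nucleus_sample(probs, p=0.9, rng=None):
    """Sample from the smallest token set whose cumulative mass reaches p."""
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]        # tokens by descending probability
    csum = np.cumsum(probs[order])
    cutoff = np.searchsorted(csum, p) + 1  # just enough tokens to reach p
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum()  # renormalize the nucleus
    return int(rng.choice(kept, p=kept_probs))

probs = np.array([0.5, 0.3, 0.15, 0.05])
samples = {nucleus_sample(probs, p=0.9) for _ in range(200)}
# Token 3 (mass 0.05) lies outside the 0.9 nucleus and is never drawn.
```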

6. Performance, Evaluation, and Use Cases

PolyGen’s autoregressive polygon decoder produces mesh samples that match the empirical distributions of real data across vertex count, face count, average node degree, average face area, and average edge length. For unconditional sample statistics, the distribution of PolyGen samples aligns with the ShapeNet test set.

Qualitatively, generated meshes are directly usable, typically requiring no post-processing or triangulation. Notably, PolyGen produces large flat polygons and exhibits consistent part symmetries where present in the data. In conditional reconstruction (e.g., generating a mesh from a single image), a single PolyGen sample lags AtlasNet (a competing patch-based reconstruction baseline) in Chamfer distance, but the best of ten samples matches or slightly outperforms AtlasNet. This suggests that diverse sampling enables competitive or superior mesh quality in ambiguous or multi-modal settings (Nash et al., 2020).

Summary evaluation metrics for PolyGen:

| Conditioning | Bits/vertex | Vertex Acc. | Face Acc. |
| --- | --- | --- | --- |
| Class-conditional | 4.24 | 85.3% | 89.9% |
| Image-conditional | 4.11 | 85.7% | 90.0% |
| Voxel-conditional | 4.01 | 85.9% | 90.0% |

A plausible implication is that autoregressive polygon decoders offer a unified framework for both unconditional mesh synthesis and conditional mesh reconstruction, delivering strong performance across different scenarios without training on task-specific metrics.

Autoregressive polygon decoders illustrate a paradigm shift, modeling explicit mesh structures with deep autoregressive architectures rather than alternative volumetric or implicit approaches. This methodology may support integration with downstream applications needing high-quality, directly usable mesh outputs. Extensions could involve alternative quantization strategies, further architectural scaling, or incorporating more complex mesh attributes.

PolyGen defines the current benchmark for end-to-end probabilistic 3D mesh modeling via an autoregressive polygon decoder (Nash et al., 2020). Future research may explore enhancements including richer conditioning signals, tighter integration of reconstruction loss terms, or improved sampling strategies to further close the gap on task-directed reconstruction metrics.

References

  • Nash, C., Ganin, Y., Eslami, S. M. A., & Battaglia, P. W. (2020). PolyGen: An Autoregressive Generative Model of 3D Meshes. Proceedings of the 37th International Conference on Machine Learning (ICML).
