Autoregressive Polygon Decoders in 3D Mesh Generation
- Autoregressive polygon decoders are neural architectures that sequentially generate 3D meshes by predicting vertices and faces using discrete tokenization and Transformer-based models.
- They utilize explicit mesh representations with quantized coordinates and special tokens, achieving metrics such as 85-90% prediction accuracy and low bits/vertex values.
- Conditioning on class, image, or voxel inputs allows these models to reconstruct diverse and detailed meshes, matching real data distributions and improving reconstruction quality.
Autoregressive polygon decoders refer to neural architectures that generate 3D mesh representations by sequentially predicting mesh vertices and faces, typically using an autoregressive probabilistic model. This approach directly models the discrete polygonal mesh structure, in contrast to earlier methods that employ alternative representations such as voxels or implicit functions. A canonical example is PolyGen, which leverages a two-stage Transformer-based architecture to produce polygon meshes suitable for graphics, robotics, and related tasks (Nash et al., 2020).
1. Mesh Representation and Tokenization
Autoregressive polygon decoders operate on explicit mesh representations $\mathcal{M} = (\mathcal{V}, \mathcal{F})$, where $\mathcal{V}$ is the ordered set of vertices and $\mathcal{F}$ the set of polygonal faces. PolyGen adopts an 8-bit uniform quantization of each 3D coordinate, mapping every $x$, $y$, and $z$ value to one of 256 discrete bins. Two special tokens, “s” (end-of-sequence) and “n” (new-face marker, used only in face decoding), demarcate structural boundaries.
Vertex Sequence Construction:
Vertices are sorted by ascending $z$-coordinate, then $y$, then $x$, and serialized into a 1D sequence as
$$V^{\text{seq}} = (z_1, y_1, x_1,\; z_2, y_2, x_2,\; \dots,\; s),$$
where each token is either a quantized coordinate value in $\{0, \dots, 255\}$ or the stop token $s$.
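A minimal sketch of this tokenization in Python (using NumPy), assuming coordinates normalized to $[-1, 1]$ and the $z$-then-$y$-then-$x$ sort order described above; the function names are illustrative, not from the PolyGen codebase:

```python
import numpy as np

def quantize(coords, bits=8):
    """Map coordinates in [-1, 1] to integer bins 0 .. 2**bits - 1."""
    bins = 2 ** bits
    return np.clip(((coords + 1.0) / 2.0 * bins).astype(int), 0, bins - 1)

def vertex_sequence(vertices):
    """Sort quantized vertices by (z, y, x) and flatten into a token list."""
    q = quantize(np.asarray(vertices, dtype=float))       # rows are (x, y, z)
    order = np.lexsort((q[:, 0], q[:, 1], q[:, 2]))       # last key (z) is primary
    seq = q[order][:, ::-1].reshape(-1).tolist()          # emit z, y, x per vertex
    return seq + ["s"]                                    # end-of-sequence marker

seq = vertex_sequence([[0.5, -0.5, 0.0], [-1.0, 1.0, -1.0]])
# the vertex with the lowest z comes first; each vertex contributes 3 tokens
```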
Face Sequence Construction:
After vertices are sampled, faces are enumerated by increasing order of their smallest vertex index. Within each face, indices are cyclically rotated so that the smallest index appears first. These are flattened, separated by “n” markers, and terminated by “s”:
$$F^{\text{seq}} = \big(f^{(1)}_1, f^{(1)}_2, \dots,\; n,\; f^{(2)}_1, f^{(2)}_2, \dots,\; s\big),$$
where each $f^{(i)}_j$ is a vertex index from $1$ to $|\mathcal{V}|$.
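The face serialization above can be sketched as follows; this is an illustrative reimplementation of the stated rules (cyclic rotation, lowest-index-first ordering, “n”/“s” markers), not PolyGen’s own code:

```python
def face_sequence(faces):
    """Serialize faces: rotate each so its smallest vertex index leads,
    order faces by lowest index, join with 'n', terminate with 's'."""
    rotated = []
    for face in faces:
        k = face.index(min(face))
        rotated.append(face[k:] + face[:k])   # cyclic rotation, smallest first
    rotated.sort()                             # faces with lowest indices first
    seq = []
    for i, face in enumerate(rotated):
        if i > 0:
            seq.append("n")                    # new-face marker
        seq.extend(face)
    return seq + ["s"]

seq = face_sequence([[3, 4, 2], [1, 2, 3]])
# [3, 4, 2] rotates to [2, 3, 4]; [1, 2, 3] sorts first
```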
Model inputs to the Transformer include learned value embeddings (mapping quantized values or special tokens to $d$-dimensional vectors), coordinate-type embeddings (distinguishing $x$, $y$, $z$), and position embeddings (indexing vertex or face sequence position).
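A toy sketch of how the three embedding tables combine per vertex token, assuming (as is standard for learned embeddings) that the input embedding is their elementwise sum; the table names and the tiny dimension are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                        # toy embedding dimension
n_values = 256 + 2           # 256 quantized bins + the "s" and "n" tokens
value_emb = rng.normal(size=(n_values, d))   # learned value embeddings
coord_emb = rng.normal(size=(3, d))          # coordinate type: z=0, y=1, x=2
pos_emb   = rng.normal(size=(512, d))        # sequence-position embeddings

def embed_vertex_token(value_id, seq_pos):
    """Input embedding = value + coordinate-type + position embedding."""
    coord_type = seq_pos % 3                 # tokens cycle z, y, x, z, y, x, ...
    return value_emb[value_id] + coord_emb[coord_type] + pos_emb[seq_pos]

e = embed_vertex_token(value_id=128, seq_pos=4)
```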
2. Autoregressive Probability Modeling
The underlying generative framework factorizes mesh probability as
$$p(\mathcal{M}) = p(\mathcal{V})\, p(\mathcal{F} \mid \mathcal{V}),$$
with sequential expansion within each phase:
$$p(\mathcal{V}) = \prod_{t} p(v_t \mid v_{<t}), \qquad p(\mathcal{F} \mid \mathcal{V}) = \prod_{t} p(f_t \mid f_{<t}, \mathcal{V}),$$
where $v_{<t} = (v_1, \dots, v_{t-1})$ and $f_{<t} = (f_1, \dots, f_{t-1})$. During sampling, $V^{\text{seq}}$ is generated first until “s” is emitted and decoded as the vertex set $\mathcal{V}$. With $\mathcal{V}$ fixed, the face sequence $F^{\text{seq}}$ is then autoregressively sampled and decoded to generate the polygon faces $\mathcal{F}$. Both steps are formally probabilistic, supporting multiple plausible outputs for a given condition.
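The chain-rule factorization means a mesh’s log-likelihood is just a sum of per-step log-probabilities; a minimal sketch with illustrative per-step probabilities:

```python
import math

def sequence_log_prob(step_probs):
    """Log-probability of a sequence under the chain rule:
    log p(s) = sum_t log p(s_t | s_<t)."""
    return sum(math.log(p) for p in step_probs)

# The mesh log-likelihood factorizes as log p(V) + log p(F | V).
log_p_vertices = sequence_log_prob([0.9, 0.8, 0.95])   # toy vertex-step probs
log_p_faces    = sequence_log_prob([0.7, 0.85])        # toy face-step probs
log_p_mesh = log_p_vertices + log_p_faces
```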
3. Transformer Architecture and Conditioning
The vertex decoder is an 18-layer decoder-only Transformer (8 attention heads), with masked self-attention restricting each prediction to the prior tokens $v_{<t}$. Face prediction uses an encoder-decoder configuration: an encoder over all vertex coordinates yields contextual embeddings (covering the actual vertex coordinates plus the two special tokens), and a 12-layer Transformer decoder operates over $F^{\text{seq}}$ with cross-attention on these vertex embeddings.
Conditioning is accomplished by introducing additional signals into the generation stack:
- Class-conditioning: Each class is assigned a learned embedding, projected to the model dimension and incorporated into each Transformer layer.
- Image-conditioning: A 2D ResNet-style encoder produces (16×16) spatial feature maps, reshaped to a 256-length sequence for cross-attention.
- Voxel-conditioning: Analogous to the image case, but with a 3D ResNet yielding 7×7×7 = 343 spatial embeddings.
These mechanisms allow the model to synthesize meshes from diverse upstream signals, enabling reconstruction from images, voxels, or class contexts.
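For image and voxel conditioning, the encoder’s spatial feature grid is simply flattened into a token sequence that the decoder cross-attends to; a shape-only sketch (toy dimension, illustrative variable names):

```python
import numpy as np

d = 16                                  # toy model dimension
img_feats = np.zeros((16, 16, d))       # 2D encoder output: (H, W, d)
vox_feats = np.zeros((7, 7, 7, d))      # 3D encoder output: (D, H, W, d)

# Flatten spatial axes into a sequence of feature vectors for cross-attention.
img_seq = img_feats.reshape(-1, d)      # 16*16   = 256 tokens
vox_seq = vox_feats.reshape(-1, d)      # 7*7*7   = 343 tokens
```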
4. Training, Optimization, and Metrics
Training maximizes the joint log-likelihood over a dataset of meshes $\{\mathcal{M}_i\}_{i=1}^{N}$, with the negative log-likelihood objective:
$$\mathcal{L} = -\sum_{i=1}^{N} \Big[ \log p(\mathcal{V}_i) + \log p(\mathcal{F}_i \mid \mathcal{V}_i) \Big].$$
To enable comparison across variable-sized meshes, bits-per-vertex is reported:
$$\text{bits/vertex} = \frac{-\log_2 p(\mathcal{M})}{|\mathcal{V}|}.$$
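The conversion is mechanical; a one-function sketch, assuming the training loss is reported in nats:

```python
import math

def bits_per_vertex(nll_nats, num_vertices):
    """Convert a mesh's negative log-likelihood (in nats) to bits per vertex."""
    return nll_nats / math.log(2) / num_vertices

# e.g. a 100-vertex mesh whose NLL is 400 bits (400 * ln 2 nats)
bpv = bits_per_vertex(math.log(2) * 400, 100)   # 4.0 bits/vertex
```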
Unconditional Baselines (ShapeNet):
- Uniform valid baseline: 21.41 bits/vertex (vertices), 25.79 bits/vertex (faces)
- Draco (8-bit quant.): ≈27.68 bits/vertex (total)
- PolyGen achieves: 2.46 bits/vertex (vertices, ≈85% accuracy), 1.82 bits/vertex (faces, ≈90% accuracy), total 4.28 bits/vertex
For mesh reconstruction from conditionals, the symmetric Chamfer distance (CD) between two meshes’ point clouds is used:
$$\mathrm{CD}(P, Q) = \frac{1}{|P|} \sum_{p \in P} \min_{q \in Q} \|p - q\|_2 \;+\; \frac{1}{|Q|} \sum_{q \in Q} \min_{p \in P} \|q - p\|_2,$$
where $P$ and $Q$ are point clouds sampled from the two meshes.
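A straightforward NumPy implementation of this symmetric Chamfer distance (brute-force pairwise distances; fine for small clouds, a k-d tree would be used at scale):

```python
import numpy as np

def chamfer_distance(P, Q):
    """Symmetric Chamfer distance between point clouds P (N,3) and Q (M,3):
    mean nearest-neighbor distance in each direction, summed."""
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)  # (N, M) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()

P = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
Q = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
cd = chamfer_distance(P, Q)
```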
PolyGen’s conditional mesh generation achieves next-step accuracy of 85.7% (vertices) and 90.0% (faces) for image conditionals, at 4.11 bits/vertex, with comparable performance for class- and voxel-conditioned setups.
5. Inference, Sampling, and Uncertainty Capture
At inference, vertices are autoregressively sampled token-by-token, with logits computed at each step by the Transformer stack and mapped via softmax to obtain sampling probabilities. A similar procedure is used for faces, using cross-attention on previously determined vertices. The sampling pseudocode is as follows:
```
V_seq = []
for t in 1..max_len:
    logits = TransformerVertex(V_seq)
    probs = softmax(logits)
    v_t = sample_from(probs)      # supports top-p nucleus sampling
    if v_t == "s": break
    append(V_seq, v_t)
decode V_seq → vertex set V

F_seq = []
for t in 1..max_len:
    ptr_logits = TransformerFace(F_seq; encoder_outputs=V_emb)
    ptr_probs = softmax(ptr_logits)
    f_t = sample_from(ptr_probs)
    if f_t == "s": break
    append(F_seq, f_t)
decode F_seq → face list F

return Mesh(V, F)
```
Nucleus (top-$p$) sampling is employed to avoid sampling from low-probability tails, empirically yielding more realistic mesh diversity. The probabilistic formulation allows ensemble sampling from a single input to capture multi-modal ambiguities inherent in, for example, image-based mesh inference. This supports greater diversity and plausible structural variation in outputs.
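The `sample_from` step above can implement nucleus sampling as follows; a minimal NumPy sketch of the standard top-$p$ procedure (not PolyGen’s own implementation):

```python
import numpy as np

def nucleus_sample(probs, p=0.9, rng=None):
    """Sample from the smallest set of tokens whose cumulative probability
    reaches p (top-p / nucleus sampling); the low-probability tail is dropped."""
    if rng is None:
        rng = np.random.default_rng()
    order = np.argsort(probs)[::-1]          # token ids, descending probability
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1     # size of the nucleus
    keep = order[:cutoff]
    renorm = probs[keep] / probs[keep].sum() # renormalize within the nucleus
    return int(rng.choice(keep, p=renorm))

probs = np.array([0.5, 0.3, 0.15, 0.05])
token = nucleus_sample(probs, p=0.9)         # token 3 (the tail) is never drawn
```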
6. Performance, Evaluation, and Use Cases
PolyGen’s autoregressive polygon decoder produces mesh samples that match the empirical distributions of real data across vertex count, face count, average node degree, average face area, and average edge length. For unconditional sample statistics, the distribution of PolyGen samples aligns with the ShapeNet test set.
Qualitatively, generated meshes are directly usable, typically requiring no post-processing or triangulation. Notably, PolyGen produces large flat polygons and exhibits consistent part symmetries where present in the data. In conditional reconstruction (e.g., generating a mesh from a single image), a single PolyGen sample lags AtlasNet (a competing patch-based reconstruction baseline) in Chamfer distance, but the best of ten samples matches or slightly outperforms AtlasNet. This suggests that diverse sampling enables competitive or superior mesh quality in ambiguous or multi-modal settings (Nash et al., 2020).
Summary evaluation metrics for PolyGen:
| Conditioning | Bits/vertex | Vertex Acc. | Face Acc. |
|---|---|---|---|
| Class-conditional | 4.24 | 85.3% | 89.9% |
| Image-conditional | 4.11 | 85.7% | 90.0% |
| Voxel-conditional | 4.01 | 85.9% | 90.0% |
A plausible implication is that autoregressive polygon decoders offer a unified framework for both unconditional mesh synthesis and conditional mesh reconstruction, delivering strong performance across different scenarios without training on task-specific metrics.
7. Related Research and Future Directions
Autoregressive polygon decoders illustrate a paradigm shift, modeling explicit mesh structures with deep autoregressive architectures rather than alternative volumetric or implicit approaches. This methodology may support integration with downstream applications needing high-quality, directly usable mesh outputs. Extensions could involve alternative quantization strategies, further architectural scaling, or incorporating more complex mesh attributes.
PolyGen defines the current benchmark for end-to-end probabilistic 3D mesh modeling via an autoregressive polygon decoder (Nash et al., 2020). Future research may explore enhancements including richer conditioning signals, tighter integration of reconstruction loss terms, or improved sampling strategies to further close the gap on task-directed reconstruction metrics.