MeshXL Autoregressive Transformers
- MeshXL Autoregressive Transformers are generative transformer models that synthesize 3D polygon meshes via autoregressive sequence modeling.
- They leverage a fixed mesh-to-sequence conversion and a Neural Coordinate Field embedding to transform complex geometric data into a discrete token sequence.
- Experimental evaluations show that MeshXL achieves high-fidelity, storage-efficient, and fast-rendering outputs, outperforming prior 3D generative methods.
MeshXL Autoregressive Transformers comprise a family of generative pre-trained transformer models for direct 3D polygon mesh synthesis via autoregressive sequence modeling. Capitalizing on a fixed mesh-to-sequence ordering and the Neural Coordinate Field (NeurCF) embedding, MeshXL applies large-scale language modeling methodologies to the domain of 3D mesh generation, yielding high-fidelity, storage-efficient, and fast-rendering outputs. The architecture, training pipeline, and applications establish a new foundation for 3D generative modeling while outperforming prior approaches in both quantitative and human-evaluated benchmarks (Chen et al., 31 May 2024).
1. Mesh-to-Sequence Conversion
MeshXL addresses the inherent irregularities of 3D polygonal meshes by imposing a pre-defined, globally consistent ordering, enabling sequence-based modeling. The process is as follows:
- Normalization and Quantization: Each mesh is normalized such that its longest axis fits a unit cube, followed by discretization of all vertex coordinates into unsigned integers along each axis.
- Face and Vertex Ordering: For each $n$-sided face, its vertices are cyclically permuted so that the coordinate triples are in ascending lexicographic order, thereby preserving normal-vector orientation.
- Face Sorting: Faces are ordered globally by the lexicographically smallest permuted vertex.
- Sequence Flattening: The mesh is flattened into a 1D sequence of coordinate tokens (nine per triangular face), optionally augmented with special tokens.
Given this ordering, mesh generation is cast as modeling the distribution

$$p(\mathcal{M}; \Theta) = \prod_{t=1}^{N} p(x_t \mid x_{<t}; \Theta),$$

where $x_t$ denotes the $t$-th token of the flattened sequence of length $N$. This formalizes mesh generation as an autoregressive prediction over a discrete token sequence.
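The conversion above can be summarized in a short sketch. The snippet below is a minimal illustration assuming triangular faces, NumPy arrays for vertices and face indices, 128 quantization bins, and plain (x, y, z) lexicographic priority; the function name and these defaults are illustrative, not the paper's released code.

```python
import numpy as np

def mesh_to_sequence(vertices, faces, num_bins=128):
    """Illustrative mesh-to-sequence conversion for triangular faces."""
    # 1. Normalize: shift to the origin and scale the longest axis into the unit cube.
    v = vertices - vertices.min(axis=0)
    v = v / v.max()
    # 2. Quantize coordinates to unsigned integers in [0, num_bins - 1].
    vq = np.clip(np.round(v * (num_bins - 1)).astype(np.int64), 0, num_bins - 1)

    # 3. Cyclically permute each face so its lexicographically smallest vertex comes
    #    first; a cyclic rotation keeps the winding (normal-vector) orientation.
    ordered = []
    for f in faces:
        coords = [tuple(vq[i]) for i in f]                 # (x, y, z); axis priority assumed
        start = min(range(len(f)), key=lambda k: coords[k])
        ordered.append([int(f[(start + k) % len(f)]) for k in range(len(f))])

    # 4. Sort faces globally by their (permuted) first vertex.
    ordered.sort(key=lambda face: tuple(vq[face[0]]))

    # 5. Flatten: nine coordinate tokens per triangular face.
    return [int(c) for face in ordered for i in face for c in vq[i]]

# Toy usage: two triangles over four random vertices -> 18 integer tokens.
verts = np.random.rand(4, 3)
tris = np.array([[0, 1, 2], [1, 3, 2]])
seq = mesh_to_sequence(verts, tris)
```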
2. Neural Coordinate Field Representation
To mitigate the semantic gap between geometric data and sequence modeling, MeshXL introduces the Neural Coordinate Field:
- Token Embeddings: Each quantized vertex $v = (x, y, z)$ is represented by three independently learned embeddings, $e_v = \big(E(x),\, E(y),\, E(z)\big)$, where $E$ is a trainable embedding table shared across all axes.
- Face Encodings: An $n$-sided face with vertices $v_1, \dots, v_n$ is encoded as the ordered concatenation of its vertex embeddings, $e_f = \big(e_{v_1}, e_{v_2}, \dots, e_{v_n}\big)$.
This hybridizes explicit 3D geometric structure with implicit, high-dimensional neural embeddings, providing a representation that is both expressive and amenable to sequence modeling.
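A minimal PyTorch sketch of this embedding scheme follows. The bin count, embedding width, and module name are assumptions for illustration; the essential point is a single trainable table shared across the x, y, and z axes.

```python
import torch
import torch.nn as nn

class NeurCFEmbedding(nn.Module):
    """Sketch of a NeurCF-style embedding: one table shared across all three axes."""

    def __init__(self, num_bins=128, dim=768):
        super().__init__()
        # Shared trainable table E: one row per quantized coordinate value.
        self.table = nn.Embedding(num_bins, dim)

    def forward(self, coord_tokens):
        # coord_tokens: (batch, seq_len) integers in [0, num_bins); the sequence
        # interleaves x, y, z values, so a vertex contributes three consecutive
        # embeddings and an n-sided face the concatenation of its n vertices.
        return self.table(coord_tokens)        # (batch, seq_len, dim)

# Toy usage: embed two triangular faces (9 coordinate tokens each).
tokens = torch.randint(0, 128, (1, 18))
emb = NeurCFEmbedding()(tokens)                # torch.Size([1, 18, 768])
```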
3. Transformer Architecture and Input Encoding
MeshXL is instantiated as a decoder-only transformer, built on the OPT codebase, with no 3D-specific attention mechanisms. Its main architectural variants include:
| Model Size | Layers | Attention Heads | Hidden Dim | FFN Dim |
|---|---|---|---|---|
| MeshXL-125M | 12 | 12 | 768 | 3072 |
| MeshXL-350M | 24 | 16 | 1024 | 4096 |
| MeshXL-1.3B | 24 | 32 | 2048 | 8192 |
- Input Vocabulary: Tokens consist of discrete coordinate integers (for each axis), specialized tags (tri, quad) for face types, and bos/eos delimiters.
- Embedding Pipeline: Each token is mapped to an embedding of the model's hidden dimension; standard learned positional embeddings, one per position up to the maximum sequence length, are summed prior to transformer processing.
- Attention: The model uses standard autoregressive causal attention without geometric augmentation. All mesh topology and geometry are captured via ordering and embedding.
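Because the backbone is a standard OPT-style decoder, the 125M variant can be sketched with a generic causal-LM configuration. The sketch below uses the Hugging Face `transformers` OPT classes with the layer, head, hidden, and FFN sizes from the table above; the vocabulary size (coordinate bins plus special tokens) and the context length are illustrative assumptions.

```python
from transformers import OPTConfig, OPTForCausalLM

# 125M-scale hyperparameters from the table above; vocab_size and
# max_position_embeddings are assumed values, not taken from the paper.
config = OPTConfig(
    vocab_size=128 + 4,             # assumed: 128 coordinate bins + tri/quad/bos/eos
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    ffn_dim=3072,
    max_position_embeddings=8192,   # assumed context length for ~800-face meshes
)

# Decoder-only causal transformer with learned positional embeddings; no
# 3D-specific attention. With the small mesh vocabulary the parameter count
# is lower than a text OPT-125M, since the word-embedding matrix shrinks.
model = OPTForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```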
4. Training Objectives and Conditional Variants
MeshXL is trained via maximum likelihood estimation:
- Unconditional: The loss function for an input sequence $x_{1:N}$ is $\mathcal{L} = -\sum_{t=1}^{N} \log p(x_t \mid x_{<t}; \Theta)$.
- Conditional Tasks: When generating meshes conditioned on auxiliary modalities (e.g., image, text) encoded as a condition $c$, the loss adopts the form $\mathcal{L} = -\sum_{t=1}^{N} \log p(x_t \mid x_{<t}, c; \Theta)$.
- Regularization: No auxiliary geometry-specific losses are used. Training relies on AdamW (weight decay 0.1) and gradient clipping.
- Conditional Prefixes: For text-to-mesh, BERT embeddings pass through a Q-Former (32 learnable queries) to generate a 32-token prefix. Image-to-mesh uses a ViT (Vision Transformer) encoder with the same Q-Former architecture.
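A minimal sketch of the maximum-likelihood objective, with an optional conditioning prefix masked out of the loss, is shown below. The helper names, the prefix-masking convention, and the clipping norm are assumptions consistent with the description above rather than the reference implementation.

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits, targets, ignore_index=-100):
    """Next-token cross-entropy: -sum_t log p(x_t | x_<t)."""
    logits = logits[:, :-1, :]                 # position t predicts token t + 1
    targets = targets[:, 1:]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=ignore_index,
    )

def build_targets(mesh_tokens, prefix_len=0, ignore_index=-100):
    """Targets aligned with a [prefix | mesh] input; prefix positions carry no loss."""
    pad = torch.full(
        (mesh_tokens.size(0), prefix_len), ignore_index,
        dtype=mesh_tokens.dtype, device=mesh_tokens.device,
    )
    return torch.cat([pad, mesh_tokens], dim=1)

# Optimization as described above: AdamW with weight decay 0.1 plus gradient
# clipping (the clipping norm is not specified in the source, so 1.0 is assumed).
# optimizer = torch.optim.AdamW(model.parameters(), weight_decay=0.1)
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```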
5. Training Regimen and Dataset Construction
MeshXL is pre-trained on a large-scale, multi-domain mesh corpus:
- Data Sources: ShapeNet V2 (51k CAD models), 3D-FUTURE (10k), Objaverse (800k), and Objaverse-XL (10M).
- Filtering and Augmentation: Meshes above 800 faces are planar-decimated (from up to 20k faces) and further filtered, yielding 2.51M meshes for training and 18k for validation. Augmentations include random rotations and per-axis scaling.
- Pre-training Tokens: The cumulative token count observed during pre-training is 150 billion.
- Fine-tuning: For shape completion, 50% of the face tokens are provided as a prefix and the remainder is predicted. Conditional tasks prepend 32-token prefixes to the mesh sequence, as described previously.
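The augmentation and shape-completion setup can be illustrated as follows; the rotation axis, the scaling range, and the face-boundary cut are assumptions, since the source does not specify them.

```python
import numpy as np

def augment(vertices, scale_low=0.9, scale_high=1.1, rng=np.random):
    """Random rotation plus independent per-axis scaling (ranges are assumed)."""
    theta = rng.uniform(0.0, 2.0 * np.pi)
    rot = np.array([[np.cos(theta), 0.0, np.sin(theta)],     # rotation about the
                    [0.0,           1.0, 0.0          ],     # vertical axis (assumed)
                    [-np.sin(theta), 0.0, np.cos(theta)]])
    scale = rng.uniform(scale_low, scale_high, size=3)
    return (vertices @ rot.T) * scale

def completion_split(face_tokens, prefix_ratio=0.5):
    """Shape completion: the first 50% of face tokens form the given prefix."""
    cut = int(len(face_tokens) * prefix_ratio)
    cut -= cut % 9                     # cut on a face boundary (9 tokens per triangle)
    return face_tokens[:cut], face_tokens[cut:]
```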
6. Experimental Evaluation and Model Benchmarking
Extensive experiments validate the model’s performance:
- Unconditional Generation: On ShapeNet (1,000 samples per category), MeshXL-350M outperforms PolyGen, GET3D, and MeshGPT for chairs on standard generation metrics (COV, MMD, 1-NNA, JSD).
- Model Scaling: On Objaverse, as model capacity increases (125M → 350M → 1.3B), COV improves, MMD decreases (from $5.21$ at 125M), JSD decreases (from $26.0$ at 125M), and 1-NNA approaches the balanced 50% regime (see the sketch after this list).
- User Studies: For chairs, 434 annotators rated Quality, Artistic, and Triangulation on a [0, 5] scale:
- PolyGen: (2.53, 2.72, 3.15)
- GET3D: (3.15, 2.46, 3.15)
- MeshXL: (3.96, 3.45, 3.72)
- Ground-truth: (4.08, 3.33, 3.75)
- Conditional Generation: For shape completion, MeshXL generates plausible completions from half-meshes. Text-to-mesh and image-to-mesh tasks yield high-fidelity, well-triangulated meshes aligned with prompt content.
- Qualitative Results: Samples exhibit sharp edges, smooth surfaces, and plausible designs. Outputs are compatible with standard texturing workflows (e.g., Paint3D).
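For readers unfamiliar with the scaling metrics, the sketch below computes 1-NNA from a pairwise distance matrix between generated and reference samples (in practice built from, e.g., Chamfer distances between point clouds sampled from the meshes); a value near 50% indicates the generated set is statistically indistinguishable from the reference set. The helper name and the toy usage are illustrative.

```python
import numpy as np

def one_nna(dist, n_gen):
    """Leave-one-out 1-nearest-neighbor accuracy (1-NNA); ~0.5 is the balanced ideal.

    dist: (n_gen + n_ref, n_gen + n_ref) symmetric matrix of sample distances.
    n_gen: rows [0, n_gen) are generated samples, the remainder are references.
    """
    d = dist.astype(float).copy()
    np.fill_diagonal(d, np.inf)             # leave-one-out: ignore self-distances
    nearest = d.argmin(axis=1)              # index of each sample's nearest neighbor
    labels = np.arange(d.shape[0]) < n_gen  # True = generated, False = reference
    return float((labels[nearest] == labels).mean())

# Toy usage with random distances (a real evaluation would use Chamfer distances).
rng = np.random.default_rng(0)
m = rng.random((20, 20))
print(one_nna((m + m.T) / 2, n_gen=10))
```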
7. Significance and Core Innovations
MeshXL’s approach crystallizes around three central ideas:
- The Neural Coordinate Field embedding, fusing explicit (integer) 3D geometry with implicit, high-dimensional spaces via axis-aligned embedding tables.
- A globally consistent flattening strategy—surface orientation-preserving ordering at the vertex and face levels—enabling canonical sequence modeling.
- Direct application of large-scale, decoder-only transformers (OPT), eschewing multi-stage post-processing or VQ-based quantization pipelines.
This design enables a fully end-to-end pipeline for 3D mesh synthesis, outperforming prior methods in diversity, fidelity, and user preference, and providing a foundation for mesh-based conditional generative tasks across multiple modalities (Chen et al., 31 May 2024).