MeshXL Autoregressive Transformers
- MeshXL Autoregressive Transformers are generative transformer models that synthesize 3D polygon meshes via autoregressive sequence modeling.
- They leverage a fixed mesh-to-sequence conversion and a Neural Coordinate Field embedding to transform complex geometric data into a discrete token sequence.
- Experimental evaluations show that MeshXL achieves high-fidelity, storage-efficient, and fast-rendering outputs, outperforming prior 3D generative methods.
MeshXL Autoregressive Transformers comprise a family of generative pre-trained transformer models for direct 3D polygon mesh synthesis via autoregressive sequence modeling. Capitalizing on a fixed mesh-to-sequence ordering and the Neural Coordinate Field (NeurCF) embedding, MeshXL applies large-scale language modeling methodologies to the domain of 3D mesh generation, yielding high-fidelity, storage-efficient, and fast-rendering outputs. The architecture, training pipeline, and applications establish a new foundation for 3D generative modeling while outperforming prior approaches in both quantitative and human-evaluated benchmarks (Chen et al., 31 May 2024).
1. Mesh-to-Sequence Conversion
MeshXL addresses the inherent irregularities of 3D polygonal meshes by imposing a pre-defined, globally consistent ordering, enabling sequence-based modeling. The process is as follows:
- Normalization and Quantization: Each mesh is normalized such that its longest axis fits a unit cube, followed by discretization of all vertex coordinates into unsigned integers along each axis.
- Face and Vertex Ordering: For each $n$-sided face, its vertices are cyclically permuted so that the coordinate triples are in ascending lexicographic order, thereby preserving normal-vector orientation.
- Face Sorting: Faces are ordered globally by the lexicographically smallest permuted vertex.
- Sequence Flattening: The mesh is flattened into a 1D sequence of coordinate tokens (nine per triangular face), optionally augmented with special tokens.
Given this ordering, mesh generation is cast as modeling the distribution

$$p(\mathcal{M}; \Theta) = \prod_{t=1}^{N} p(x_t \mid x_{<t}; \Theta),$$

where $x_t$ denotes the $t$-th token of the flattened sequence of length $N$. This formalizes mesh generation as an autoregressive prediction over a discrete token sequence.
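The conversion above can be summarized in a short sketch. The snippet below is a minimal illustration assuming triangular faces, NumPy arrays for vertices and face indices, 128 quantization bins, and plain (x, y, z) lexicographic priority; the function name and these defaults are illustrative, not the paper's released code.

```python
import numpy as np

def mesh_to_sequence(vertices, faces, num_bins=128):
    """Illustrative mesh-to-sequence conversion for triangular faces."""
    # 1. Normalize: shift to the origin and scale the longest axis into the unit cube.
    v = vertices - vertices.min(axis=0)
    v = v / v.max()
    # 2. Quantize coordinates to unsigned integers in [0, num_bins - 1].
    vq = np.clip(np.round(v * (num_bins - 1)).astype(np.int64), 0, num_bins - 1)

    # 3. Cyclically permute each face so its lexicographically smallest vertex comes
    #    first; a cyclic rotation keeps the winding (normal-vector) orientation.
    ordered = []
    for f in faces:
        coords = [tuple(vq[i]) for i in f]                 # (x, y, z); axis priority assumed
        start = min(range(len(f)), key=lambda k: coords[k])
        ordered.append([int(f[(start + k) % len(f)]) for k in range(len(f))])

    # 4. Sort faces globally by their (permuted) first vertex.
    ordered.sort(key=lambda face: tuple(vq[face[0]]))

    # 5. Flatten: nine coordinate tokens per triangular face.
    return [int(c) for face in ordered for i in face for c in vq[i]]

# Toy usage: two triangles over four random vertices -> 18 integer tokens.
verts = np.random.rand(4, 3)
tris = np.array([[0, 1, 2], [1, 3, 2]])
seq = mesh_to_sequence(verts, tris)
```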
2. Neural Coordinate Field Representation
To mitigate the semantic gap between geometric data and sequence modeling, MeshXL introduces the Neural Coordinate Field:
- Token Embeddings: Each quantized vertex $v = (x, y, z)$ is represented by three independently learned embeddings, $e_v = \big(E(x),\, E(y),\, E(z)\big)$, where $E$ is a trainable embedding table shared across all axes.
- Face Encodings: An $n$-sided face with vertices $v_1, \dots, v_n$ is encoded as the ordered concatenation of its vertex embeddings, $e_f = \big(e_{v_1}, e_{v_2}, \dots, e_{v_n}\big)$.
This hybridizes explicit 3D geometric structure with implicit, high-dimensional neural embeddings, providing a representation that is both expressive and amenable to sequence modeling.
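A minimal PyTorch sketch of this embedding scheme follows. The bin count, embedding width, and module name are assumptions for illustration; the essential point is a single trainable table shared across the x, y, and z axes.

```python
import torch
import torch.nn as nn

class NeurCFEmbedding(nn.Module):
    """Sketch of a NeurCF-style embedding: one table shared across all three axes."""

    def __init__(self, num_bins=128, dim=768):
        super().__init__()
        # Shared trainable table E: one row per quantized coordinate value.
        self.table = nn.Embedding(num_bins, dim)

    def forward(self, coord_tokens):
        # coord_tokens: (batch, seq_len) integers in [0, num_bins); the sequence
        # interleaves x, y, z values, so a vertex contributes three consecutive
        # embeddings and an n-sided face the concatenation of its n vertices.
        return self.table(coord_tokens)        # (batch, seq_len, dim)

# Toy usage: embed two triangular faces (9 coordinate tokens each).
tokens = torch.randint(0, 128, (1, 18))
emb = NeurCFEmbedding()(tokens)                # torch.Size([1, 18, 768])
```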
3. Transformer Architecture and Input Encoding
MeshXL is instantiated as a decoder-only transformer, built on the OPT codebase, with no 3D-specific attention mechanisms. Its main architectural variants include:
| Model Size | Layers | Attention Heads | Hidden Dim | FFN Dim |
|---|---|---|---|---|
| MeshXL-125M | 12 | 12 | 768 | 3072 |
| MeshXL-350M | 24 | 16 | 1024 | 4096 |
| MeshXL-1.3B | 24 | 32 | 2048 | 8192 |
- Input Vocabulary: Tokens consist of discrete coordinate integers (for each axis), specialized tags (tri, quad) for face types, and bos/eos delimiters.
- Embedding Pipeline: Each token is mapped to an embedding of the model's hidden dimension; standard learned positional embeddings, one per position up to the maximum sequence length, are summed prior to transformer processing.
- Attention: The model uses standard autoregressive causal attention without geometric augmentation. All mesh topology and geometry are captured via ordering and embedding.
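Because the backbone is a standard OPT-style decoder, the 125M variant can be sketched with a generic causal-LM configuration. The sketch below uses the Hugging Face `transformers` OPT classes with the layer, head, hidden, and FFN sizes from the table above; the vocabulary size (coordinate bins plus special tokens) and the context length are illustrative assumptions.

```python
from transformers import OPTConfig, OPTForCausalLM

# 125M-scale hyperparameters from the table above; vocab_size and
# max_position_embeddings are assumed values, not taken from the paper.
config = OPTConfig(
    vocab_size=128 + 4,             # assumed: 128 coordinate bins + tri/quad/bos/eos
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    ffn_dim=3072,
    max_position_embeddings=8192,   # assumed context length for ~800-face meshes
)

# Decoder-only causal transformer with learned positional embeddings; no
# 3D-specific attention. With the small mesh vocabulary the parameter count
# is lower than a text OPT-125M, since the word-embedding matrix shrinks.
model = OPTForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```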
4. Training Objectives and Conditional Variants
MeshXL is trained via maximum likelihood estimation:
- Unconditional: The loss function for an input sequence $x_{1:N}$ is $\mathcal{L} = -\sum_{t=1}^{N} \log p(x_t \mid x_{<t}; \Theta)$.
- Conditional Tasks: When generating meshes conditioned on auxiliary modalities (e.g., image, text) encoded as a condition $c$, the loss adopts the form $\mathcal{L} = -\sum_{t=1}^{N} \log p(x_t \mid x_{<t}, c; \Theta)$.
- Regularization: No auxiliary geometry-specific losses are used. Training relies on AdamW (weight decay 0.1) and gradient clipping.
- Conditional Prefixes: For text-to-mesh, BERT embeddings pass through a Q-Former (32 learnable queries) to generate a 32-token prefix. Image-to-mesh uses a ViT (Vision Transformer) encoder with the same Q-Former architecture.
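A minimal sketch of the maximum-likelihood objective, with an optional conditioning prefix masked out of the loss, is shown below. The helper names, the prefix-masking convention, and the clipping norm are assumptions consistent with the description above rather than the reference implementation.

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits, targets, ignore_index=-100):
    """Next-token cross-entropy: -sum_t log p(x_t | x_<t)."""
    logits = logits[:, :-1, :]                 # position t predicts token t + 1
    targets = targets[:, 1:]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=ignore_index,
    )

def build_targets(mesh_tokens, prefix_len=0, ignore_index=-100):
    """Targets aligned with a [prefix | mesh] input; prefix positions carry no loss."""
    pad = torch.full(
        (mesh_tokens.size(0), prefix_len), ignore_index,
        dtype=mesh_tokens.dtype, device=mesh_tokens.device,
    )
    return torch.cat([pad, mesh_tokens], dim=1)

# Optimization as described above: AdamW with weight decay 0.1 plus gradient
# clipping (the clipping norm is not specified in the source, so 1.0 is assumed).
# optimizer = torch.optim.AdamW(model.parameters(), weight_decay=0.1)
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```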
5. Training Regimen and Dataset Construction
MeshXL is pre-trained on a large-scale, multi-domain mesh corpus:
- Data Sources: ShapeNet V2 (51k CAD models), 3D-FUTURE (10k), Objaverse (800k), and Objaverse-XL (10M).
- Filtering and Augmentation: Meshes above 800 faces are planar-decimated (from up to 20k faces) and further filtered, yielding 2.51M meshes for training and 18k for validation. Augmentations include random rotations and per-axis scaling.
- Pre-training Tokens: The cumulative token count observed during pre-training is 150 billion.
- Fine-tuning: For shape completion, 50% of the face tokens are provided as a prefix and the remainder is predicted. Conditional tasks prepend 32-token prefixes to the mesh sequence, as described previously.
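The augmentation and shape-completion setup can be illustrated as follows; the rotation axis, the scaling range, and the face-boundary cut are assumptions, since the source does not specify them.

```python
import numpy as np

def augment(vertices, scale_low=0.9, scale_high=1.1, rng=np.random):
    """Random rotation plus independent per-axis scaling (ranges are assumed)."""
    theta = rng.uniform(0.0, 2.0 * np.pi)
    rot = np.array([[np.cos(theta), 0.0, np.sin(theta)],     # rotation about the
                    [0.0,           1.0, 0.0          ],     # vertical axis (assumed)
                    [-np.sin(theta), 0.0, np.cos(theta)]])
    scale = rng.uniform(scale_low, scale_high, size=3)
    return (vertices @ rot.T) * scale

def completion_split(face_tokens, prefix_ratio=0.5):
    """Shape completion: the first 50% of face tokens form the given prefix."""
    cut = int(len(face_tokens) * prefix_ratio)
    cut -= cut % 9                     # cut on a face boundary (9 tokens per triangle)
    return face_tokens[:cut], face_tokens[cut:]
```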
6. Experimental Evaluation and Model Benchmarking
Extensive experiments validate the model’s performance:
- Unconditional Generation: On ShapeNet (1,000 samples per category), MeshXL-350M outperforms PolyGen, GET3D, and MeshGPT for chairs on standard generation metrics (COV, MMD, 1-NNA, JSD).
- Model Scaling: On Objaverse, as model capacity increases (125M → 350M → 1.3B), COV improves, MMD decreases (from $5.21$ at 125M), JSD decreases (from $26.0$ at 125M), and 1-NNA approaches the balanced 50% regime (see the sketch after this list).
- User Studies: For chairs, 434 annotators rated Quality, Artistic, and Triangulation on a [0, 5] scale:
- PolyGen: (2.53, 2.72, 3.15)
- GET3D: (3.15, 2.46, 3.15)
- MeshXL: (3.96, 3.45, 3.72)
- Ground-truth: (4.08, 3.33, 3.75)
- Conditional Generation: For shape completion, MeshXL generates plausible completions from half-meshes. Text-to-mesh and image-to-mesh tasks yield high-fidelity, well-triangulated meshes aligned with prompt content.
- Qualitative Results: Samples exhibit sharp edges, smooth surfaces, and plausible designs. Outputs are compatible with standard texturing workflows (e.g., Paint3D).
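For readers unfamiliar with the scaling metrics, the sketch below computes 1-NNA from a pairwise distance matrix between generated and reference samples (in practice built from, e.g., Chamfer distances between point clouds sampled from the meshes); a value near 50% indicates the generated set is statistically indistinguishable from the reference set. The helper name and the toy usage are illustrative.

```python
import numpy as np

def one_nna(dist, n_gen):
    """Leave-one-out 1-nearest-neighbor accuracy (1-NNA); ~0.5 is the balanced ideal.

    dist: (n_gen + n_ref, n_gen + n_ref) symmetric matrix of sample distances.
    n_gen: rows [0, n_gen) are generated samples, the remainder are references.
    """
    d = dist.astype(float).copy()
    np.fill_diagonal(d, np.inf)             # leave-one-out: ignore self-distances
    nearest = d.argmin(axis=1)              # index of each sample's nearest neighbor
    labels = np.arange(d.shape[0]) < n_gen  # True = generated, False = reference
    return float((labels[nearest] == labels).mean())

# Toy usage with random distances (a real evaluation would use Chamfer distances).
rng = np.random.default_rng(0)
m = rng.random((20, 20))
print(one_nna((m + m.T) / 2, n_gen=10))
```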
7. Significance and Core Innovations
MeshXL’s approach crystallizes around three central ideas:
- The Neural Coordinate Field embedding, fusing explicit (integer) 3D geometry with implicit, high-dimensional spaces via axis-aligned embedding tables.
- A globally consistent flattening strategy—surface orientation-preserving ordering at the vertex and face levels—enabling canonical sequence modeling.
- Direct application of large-scale, decoder-only transformers (OPT), eschewing multi-stage post-processing or VQ-based quantization pipelines.
This design enables a fully end-to-end pipeline for 3D mesh synthesis, outperforming prior methods in diversity, fidelity, and user preference, and providing a foundation for mesh-based conditional generative tasks across multiple modalities (Chen et al., 31 May 2024).