ProcGen3D: Procedural 3D Asset Generation
- ProcGen3D is a framework that unifies procedural graph grammars with neural transformer models to achieve scalable and interpretable 3D asset generation.
- It employs edge-based tokenization and autoregressive modeling to convert image cues into compact, editable 3D representations with high fidelity.
- The integration of MCTS-guided sampling ensures semantic and geometric consistency, maintaining both global structure and fine details in 3D reconstructions.
ProcGen3D refers to a significant line of research and practical systems for procedural generation and neural representation in 3D content creation, image-based reconstruction, and interactive editing. In contemporary literature, "ProcGen3D" encompasses both model- and data-driven methods for compact, interpretable, and highly controllable 3D asset generation—from invertible procedural grammars to transformer-based neural autoregressive models, with applications ranging from assets for visual effects and games to novel urban and natural scene synthesis. The central theme is the unification of parametric, grammar-based, or procedural representations with machine learning techniques to enable scalable, diverse, and editable 3D content.
1. Procedural Graph Representations of 3D Objects
Core to ProcGen3D (Zhang et al., 10 Nov 2025) is the abstraction of 3D assets as procedural graphs $G = (V, E)$:
- $V$ is a set of nodes corresponding to discrete semantic components, each carrying attribute vectors (e.g., 3D coordinates, radii, or tags).
- $E$ is a set of edges encoding relations such as connectivity or attachment, with attributes (e.g., limb type, edge length).
- The graph is generated via a context-free graph grammar, whose production rules sequentially expand nonterminals (uninstantiated nodes) into structural and attribute-labeled subgraphs.
This representation enables a decoupling of the procedural space (underlying parametric, rule-based model) from explicit geometric details, allowing image-to-3D systems to output generator-friendly, compact, and interpretable descriptions that are directly usable by procedural mesh generators (e.g., Infinigen, CEM).
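To make the abstraction concrete, the following sketch shows one plausible encoding of such a procedural graph together with a toy production rule; the class names, attribute layout, and branching rule are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field

# Illustrative encoding of a procedural graph G = (V, E); the paper's actual
# node/edge attribute schemas are category-specific and not reproduced here.
@dataclass
class Node:
    node_id: int
    semantic_class: str                        # e.g., "trunk", "branch", "pier"
    attrs: dict = field(default_factory=dict)  # e.g., {"pos": (x, y, z), "radius": r}
    is_nonterminal: bool = False               # uninstantiated, awaiting expansion

@dataclass
class Edge:
    src: int
    dst: int
    edge_class: str                            # e.g., "limb", "attachment"
    attrs: dict = field(default_factory=dict)  # e.g., {"length": l}

@dataclass
class ProcGraph:
    nodes: dict = field(default_factory=dict)  # node_id -> Node
    edges: list = field(default_factory=list)

def expand_branch(graph: ProcGraph, parent_id: int, n_children: int = 2):
    """Toy production rule: rewrite a nonterminal node into an
    attribute-labeled subgraph of child branches."""
    graph.nodes[parent_id].is_nonterminal = False
    for _ in range(n_children):
        child_id = max(graph.nodes) + 1
        graph.nodes[child_id] = Node(child_id, "branch", is_nonterminal=True)
        graph.edges.append(Edge(parent_id, child_id, "limb"))
```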
2. Edge-Based Tokenization for Neural Sequentialization
To bridge procedural abstraction with neural sequence modeling, ProcGen3D utilizes edge-based tokenization. Each edge is mapped to a token tuple: continuous attributes are discretized into a fixed number of bins, and categorical class labels are enumerated, yielding a total vocabulary of hundreds to thousands of tokens.
A traversal order (DFS for plants, BFS for bridges) produces a linear edge sequence, with consecutive edges delimited by a special separator token. The resulting sequence supplies the autoregressive context required by transformer architectures; a minimal sketch follows.
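The sketch below reuses the illustrative Edge/ProcGraph classes from Section 1; the bin count, separator name, and attribute ordering are assumptions, as the paper's exact vocabulary design is not reproduced here.

```python
N_BINS = 128   # assumed bin count; the paper's exact discretization is not given here
SEP = "<SEP>"  # assumed name for the special edge-separator token

def quantize(value: float, lo: float, hi: float, n_bins: int = N_BINS) -> str:
    """Discretize a continuous attribute into one of n_bins uniform bins."""
    t = (value - lo) / (hi - lo)
    return f"bin_{min(n_bins - 1, max(0, int(t * n_bins)))}"

def edge_to_tokens(edge, attr_ranges: dict) -> list:
    """Map one edge to its token tuple: enumerated class label, then
    discretized continuous attributes in a fixed order."""
    tokens = [f"cls_{edge.edge_class}"]
    for name in sorted(edge.attrs):
        lo, hi = attr_ranges[name]
        tokens.append(quantize(edge.attrs[name], lo, hi))
    return tokens

def graph_to_sequence(graph, root_id: int, attr_ranges: dict) -> list:
    """DFS edge traversal (as used for plants) flattens the graph into a
    linear token sequence, with SEP delimiting consecutive edges."""
    adj = {}
    for e in graph.edges:
        adj.setdefault(e.src, []).append(e)
    seq, stack, visited = [], [root_id], set()
    while stack:
        nid = stack.pop()
        if nid in visited:
            continue
        visited.add(nid)
        for e in adj.get(nid, []):
            seq += edge_to_tokens(e, attr_ranges) + [SEP]
            stack.append(e.dst)
    return seq
```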
3. Autoregressive Transformer Priors for Procedural Graph Generation
ProcGen3D employs a GPT-style transformer to model $p(t_i \mid t_{<i}, I)$, where the tokens $t_{<i}$ represent the procedural graph prefix and $I$ is an image embedding (e.g., from a ResNet encoder). The transformer architecture mirrors the OPT-350M configuration: 24 layers, 1024-dimensional embeddings and hidden states, 16 attention heads, and 4096-dimensional feed-forward blocks.
Each token receives a learned embedding and positional encoding; image context is fused via concatenation or cross-attention. Training minimizes the standard autoregressive cross-entropy loss:

$$\mathcal{L} = -\sum_{i} \log p_\theta\!\left(t_i \mid t_{<i}, I\right)$$
This formulation allows learned procedural priors to be conditioned on diverse, real-world imagery, while maintaining procedural structure and editability in the output.
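As a concrete illustration, here is a minimal PyTorch sketch of such a conditioned decoder, fusing the image embedding as a single prefix token; the fusion choice, the 2048-dimensional ResNet feature size, and all class/function names are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class ProcGraphPrior(nn.Module):
    """GPT-style decoder modeling p(t_i | t_<i, I), with the image
    embedding I fused as a single prefix token (one plausible choice)."""
    def __init__(self, vocab_size, d_model=1024, n_layers=24, n_heads=16,
                 d_ff=4096, max_len=2048, d_image=2048):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len + 1, d_model)
        self.img_proj = nn.Linear(d_image, d_model)  # assumed ResNet feature dim
        layer = nn.TransformerEncoderLayer(d_model, n_heads, d_ff,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, img_feat):
        # tokens: (B, T) token ids; img_feat: (B, d_image) image embedding.
        T = tokens.size(1)
        x = torch.cat([self.img_proj(img_feat).unsqueeze(1),  # image prefix
                       self.tok_emb(tokens)], dim=1)          # (B, T + 1, d)
        x = x + self.pos_emb(torch.arange(T + 1, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(T + 1).to(tokens.device)
        h = self.blocks(x, mask=mask)
        # Position i attends to (image, t_0..t_{i-1}) and predicts t_i;
        # the final position predicts past the end and is dropped.
        return self.head(h[:, :-1])                           # (B, T, vocab)

def ar_loss(model, tokens, img_feat):
    """Standard autoregressive cross-entropy over the edge-token sequence."""
    logits = model(tokens, img_feat)
    return nn.functional.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                       tokens.reshape(-1))
```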
4. MCTS-Guided Sampling for Image-Consistent 3D Reconstruction
To enforce semantic and geometric faithfulness between the generated procedural graph and the input image, ProcGen3D introduces an MCTS-guided decoding procedure. At test time:
- The search state is a partial token sequence.
- Successor states extend the partial sequence with candidate next edges proposed from the transformer's logits.
- The selection strategy employs an Upper Confidence Bound, $\mathrm{UCB}(s, a) = \bar{Q}(s, a) + c \sqrt{\ln N(s) / N(s, a)}$, where $\bar{Q}(s, a)$ is the mean rollout reward of taking candidate edge $a$ in state $s$, $N(\cdot)$ are visit counts, and $c$ trades off exploration against exploitation.
- Simulations roll out the transformer predictor for a fixed number of further steps, assemble the procedural asset, and compute a silhouette-based reward quantifying agreement between the rendered prediction $\hat{M}$ and the input mask $M$, e.g. $r = \mathrm{IoU}(\hat{M}, M)$.
Within the search budget, the most promising next edge is committed at each decoding iteration, yielding discrete, interpretable, and image-matched procedural graphs (see the sketch below).
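A compact sketch of this decoding loop follows, assuming an IoU silhouette reward and UCT-style selection; the exploration constant, budgets, and the `model.top_candidates` / `model.rollout` / `render_silhouette` helpers are hypothetical stand-ins for the paper's exact components.

```python
import math

C_EXPLORE = 1.0      # assumed exploration constant
N_SIMS = 64          # assumed simulations per emitted edge
ROLLOUT_STEPS = 32   # assumed rollout horizon

def iou(pred_mask, gt_mask):
    """Silhouette agreement as intersection-over-union of boolean masks."""
    inter = (pred_mask & gt_mask).sum()
    union = (pred_mask | gt_mask).sum()
    return inter / max(union, 1)

class SearchNode:
    def __init__(self, seq):
        self.seq = seq           # search state: a partial token sequence
        self.children = {}       # candidate next-edge token -> SearchNode
        self.N, self.W = 0, 0.0  # visit count, cumulative reward

    def ucb(self, child):
        if child.N == 0:
            return float("inf")
        return child.W / child.N + C_EXPLORE * math.sqrt(math.log(self.N) / child.N)

def select_next_edge(root, model, render_silhouette, gt_mask):
    """One decoding iteration: run simulations, return the most-visited edge."""
    for _ in range(N_SIMS):
        node, path = root, [root]
        # Selection: descend via UCB to an unexpanded node.
        while node.children:
            parent = node
            node = max(parent.children.values(), key=lambda c: parent.ucb(c))
            path.append(node)
        # Expansion: candidate edges proposed from the transformer's logits.
        for tok in model.top_candidates(node.seq):        # hypothetical helper
            node.children[tok] = SearchNode(node.seq + [tok])
        # Simulation: roll out, assemble the asset, render, score vs. mask.
        rollout = model.rollout(node.seq, ROLLOUT_STEPS)  # hypothetical helper
        reward = iou(render_silhouette(rollout), gt_mask)
        # Backpropagation of the silhouette reward along the visited path.
        for n in path:
            n.N += 1
            n.W += reward
    return max(root.children, key=lambda tok: root.children[tok].N)
```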
5. Comparative Evaluation and Ablations
ProcGen3D is empirically validated on synthetic datasets of cacti, trees, and bridges (each with 10,000 instances; node counts from 20 to 600), employing metrics such as chamfer distance (CD), LPIPS, and CLIP similarity.
| Category | CD ↓ | LPIPS ↓ | CLIP-Sim ↑ |
|---|---|---|---|
| Cactus | 0.0297 | 0.097 | 0.9268 |
| Tree | 0.0265 | 0.081 | 0.9769 |
| Leafy Tree | 0.0648 | 0.168 | 0.9493 |
| Pine Tree | 0.0302 | 0.079 | 0.9680 |
| Bridge | 0.0141 | 0.052 | 0.9820 |
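For reference, the chamfer distance reported above can be computed as in this minimal NumPy sketch; the point-sampling density and normalization conventions are assumptions, since the paper's exact evaluation protocol is not reproduced here.

```python
import numpy as np

def chamfer_distance(p: np.ndarray, q: np.ndarray) -> float:
    """Symmetric chamfer distance between point sets p (N, 3) and q (M, 3):
    mean nearest-neighbor squared distance in both directions. Whether
    squared or unsquared distances are averaged varies across papers."""
    d2 = ((p[:, None, :] - q[None, :, :]) ** 2).sum(-1)  # (N, M) pairwise
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

# e.g., point clouds sampled from predicted and ground-truth meshes
pred = np.random.rand(1024, 3)
gt = np.random.rand(1024, 3)
print(chamfer_distance(pred, gt))
```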
Ablation studies demonstrate:
- Superior performance from the incorporation of RGB cues and DFS tokenization (e.g., Tree: mask input gives CD=0.0485, RGB input gives CD=0.0265).
- Consistent improvement from MCTS guidance versus naive autoregressive decoding.
- The ability to preserve both global topological and fine structural fidelity, outperforming alternatives such as neural fields post-processed with Marching Cubes.
6. Limitations, Applicability, and Extensions
Limitations:
- Applicability is restricted by the availability of procedural generators for the target asset category.
- MCTS-based search introduces significant test-time compute overhead, especially for highly complex graphs (e.g., tens of minutes for large trees).
ProcGen3D directly enables:
- Generation of parametric 3D asset libraries amenable to high-level editing and style transfer.
- Automated stylized asset creation from photographic inputs, supporting rapid iteration.
- Generalization to domains with rule-governed structure (e.g., urban architecture, plant modeling).
Planned or suggested future work includes development of differentiable procedural surrogates for end-to-end learning and expanding grammar expressiveness to broader asset classes.
7. Context within the Procedural Generation Ecosystem
ProcGen3D advances the paradigm of procedural modeling by bridging explicit, interpretable graph grammars with neural autoregressive modeling and search. It complements prior work on inverse procedural modeling (e.g., genetic/memetic optimization of generator parameters (Garifullin et al., 2023)) and grammar-based L-systems, and remains distinct from mesh-centric generative methods (e.g., voxel fields, mesh autoencoders).
The methodology is robust to domain shift: models are trained on synthetic data yet generalize to real imagery. The sequenced, interpretable procedural graphs are immediately compatible with downstream rule-based generators, enabling both high-level controllability and production-ready mesh output.