SceneNAT: Masked Non-Autoregressive 3D Scene Synthesis

Updated 19 January 2026
  • SceneNAT is a masked non-autoregressive Transformer for 3D indoor scene synthesis that reconstructs discretized object and scene representations from language directives.
  • It employs a two-level masking scheme and iterative parallel decoding to efficiently recover object attributes and spatial layouts.
  • Explicit relational reasoning via a triplet prediction module enforces semantic and spatial constraints, outperforming autoregressive and diffusion-based models.

SceneNAT is a masked non-autoregressive Transformer model for instruction-conditioned 3D indoor scene synthesis. The system generates complete 3D layouts from natural language directives by reconstructing fully discretized representations of both semantic and spatial object attributes, leveraging a two-level masking scheme and explicit relational reasoning. SceneNAT achieves state-of-the-art results in semantic compliance, physical plausibility, and computational efficiency compared to autoregressive and diffusion-based baselines on standard 3D indoor datasets (Choi et al., 12 Jan 2026).

1. Discretized Scene and Object Representation

SceneNAT encodes a 3D indoor scene as a set of up to $N$ objects, each object $r_j$ described by five types of discretized tokens: semantic category, appearance, position, scale, and rotation. The formal scene matrix:

$$R = \{ (x_j, v_j, t_j, l_j, \phi_j) \}_{j=1}^{N}$$

  • $x_j \in C$: object class (from a vocabulary of size $|C|$)
  • $v_j \in \{1,\dots,K\}^4$: four appearance tokens from a VQ-VAE (codebook size $K=64$, code dimension $m=512$)
  • $t_j \in \{1,\dots,B\}^3$: 3D translation ($B=64$ bins)
  • $l_j \in \{1,\dots,B\}^3$: 3D scale
  • $\phi_j \in \{1,\dots,\Theta\}$: yaw rotation ($\Theta=36$)

All tokens are embedded through learnable lookup tables with input/output weight tying for consistency. Appearance tokens are generated using features from a frozen OpenShape encoder passed into a vector quantized bottleneck. Translation, scale, and rotation are uniformly quantized. This fully discrete, matrix-based scene representation enables efficient parallel token prediction and supports rigorous handling of instance, attribute, and relational information.
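The uniform quantization of translation, scale, and rotation can be sketched as below. The bin count $B=64$ comes from the paper, while the metre range $[-4, 4]$ is an illustrative assumption (the paper's exact value ranges are not stated here):

```python
import numpy as np

def quantize_uniform(x, lo, hi, n_bins):
    """Map continuous values in [lo, hi] to 1-based bin tokens {1, ..., n_bins}."""
    x = np.clip(np.asarray(x, dtype=float), lo, hi)
    bins = np.floor((x - lo) / (hi - lo) * n_bins).astype(int)
    return np.minimum(bins, n_bins - 1) + 1

def dequantize_uniform(token, lo, hi, n_bins):
    """Recover the bin-centre value for a token."""
    return lo + (np.asarray(token) - 0.5) / n_bins * (hi - lo)

# B = 64 translation bins over an assumed room extent of [-4, 4] metres
token = quantize_uniform([1.23], -4.0, 4.0, 64)
approx = dequantize_uniform(token, -4.0, 4.0, 64)
```

Round-tripping a coordinate through the bins loses at most half a bin width (here $8/64/2 = 0.0625$ m), which bounds the spatial error introduced by discretization.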

2. Masked Generation Objective and Loss Functions

SceneNAT formulates synthesis as a masked-token reconstruction problem, learning to recover masked object attributes from both unmasked tokens and language instructions. The masking operates at two levels:

  • Object-level masking: entire rows (objects) are masked
  • Attribute-level masking: individual tokens within objects are masked

The total masking ratio $m_{\text{total}}$ is dynamically scheduled as $m_{\text{total}} = \cos\left(\frac{\pi \tau}{2}\right)$ with $\tau \sim \mathrm{Uniform}(0,1)$. The split between object and token masking is also stochastic, promoting both local and holistic structural learning.

During training:

  • 10% of masked tokens are replaced by random alternatives
  • 88% replaced by the [MASK] token
  • 2% remain unchanged (BERT-style)
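The schedule and corruption steps above can be sketched as follows; the `[MASK]` id and vocabulary layout are placeholders, not the paper's actual indices:

```python
import math
import random

MASK_TOKEN = -1  # placeholder id for [MASK]; the real vocabulary index is model-specific

def masking_ratio():
    """Cosine schedule: m_total = cos(pi * tau / 2), tau ~ Uniform(0, 1)."""
    tau = random.random()
    return math.cos(math.pi * tau / 2)

def corrupt(tokens, vocab_size, mask_positions):
    """BERT-style corruption of the selected positions:
    88% -> [MASK], 10% -> random token, 2% -> left unchanged."""
    out = list(tokens)
    for i in mask_positions:
        r = random.random()
        if r < 0.88:
            out[i] = MASK_TOKEN
        elif r < 0.98:
            out[i] = random.randrange(1, vocab_size + 1)
        # else: keep the original token
    return out
```

Leaving a small fraction of "masked" positions unchanged (the 2%) forces the model to also verify tokens it can see, rather than trusting every visible token blindly.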

The reconstruction loss aggregates cross-entropy over all attribute types and masked positions:

$$L_{\text{recon}} = \sum_{i \in R_M} \left[ \lambda_x L_x + \lambda_v L_v + \lambda_t L_t + \lambda_l L_l + \lambda_{\phi} L_{\phi} \right]$$

where $L_x$, $L_v$, $L_t$, $L_l$, $L_{\phi}$ are per-attribute cross-entropy losses, $R_M$ indexes masked positions, and the $\lambda_\star$ are weights (all set to 1.0).
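A minimal sketch of how this loss aggregates per-attribute cross-entropy over masked slots only; the dictionary-based heads are an illustrative simplification of the model's actual output heads:

```python
import numpy as np

def cross_entropy(logits, target):
    """CE for one token position: -log softmax(logits)[target]."""
    z = logits - logits.max()
    return float(np.log(np.exp(z).sum()) - z[target])

def recon_loss(head_logits, targets, masked, weights):
    """Weighted per-attribute CE, summed over masked object slots only.
    head_logits[attr][j]: logits for attribute `attr` of object slot j."""
    total = 0.0
    for j in masked:                      # only masked positions contribute
        for attr, w in weights.items():   # the lambda_* weights (all 1.0)
            total += w * cross_entropy(head_logits[attr][j], targets[attr][j])
    return total
```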

An auxiliary triplet loss (see Section 4) is incorporated for relational reasoning:

$$L_{\text{total}} = L_{\text{recon}} + \lambda_{\text{triplet}} L_{\text{triplet}}$$

3. Model Architecture and Parallel Decoding

SceneNAT adopts a non-autoregressive Transformer architecture for efficient and parallel synthesis.

  • Scene Decoder: An $L_{\text{scene}}$-layer Transformer (8 layers in SceneNAT-B, 4 in SceneNAT-S) operating on $N$ object slots, each containing embedded or masked attribute tokens alongside the CLIP-encoded instruction (dimension 512). Each layer comprises self-attention, cross-attention to the text, a feed-forward network, pre-layernorm, and dropout ($p=0.1$).
  • Attribute Heads: Three output heads predict the object class, the four appearance tokens, and the layout tokens (translation, scale, rotation); all heads reuse the embedding tables from the input stage.
  • Iterative Parallel Decoding: The model synthesizes scenes in $T$ parallel, non-autoregressive passes (typically $T = 30$–$50$). At each pass:

    1. Tokens are scored by confidence.
    2. The bottom $m_t$ fraction is remasked using the training masking scheme.
    3. Decoding proceeds from most to least ambiguous attributes, similar to coarse-to-fine denoising.
    4. The annealing schedule takes $m_t$ from fully masked down to zero, matching training.

This approach attains competitive synthesis quality with orders-of-magnitude fewer decoding steps than diffusion models and no sequential token dependencies.
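The four decoding steps above can be sketched in the style of MaskGIT-like decoders; the flat token layout, the `predict_fn` interface, and the exact cosine anneal are simplifying assumptions:

```python
import math
import numpy as np

def parallel_decode(predict_fn, n_tokens, steps=30, mask=-1):
    """Confidence-based iterative parallel decoding (sketch).
    predict_fn(tokens) -> (n_tokens, vocab) array of per-token probabilities."""
    tokens = np.full(n_tokens, mask)
    for t in range(steps):
        probs = predict_fn(tokens)            # score every position in parallel
        proposal = probs.argmax(axis=-1)
        conf = probs.max(axis=-1)
        tokens = proposal.copy()
        # anneal the remask fraction m_t from ~1 down to 0, matching training
        m_t = math.cos(math.pi / 2 * (t + 1) / steps)
        n_remask = int(math.floor(m_t * n_tokens))
        if n_remask > 0:
            # remask the least confident (most ambiguous) positions
            low = np.argsort(conf)[:n_remask]
            tokens[low] = mask
    return tokens
```

Because every position is scored in one forward pass, the number of network evaluations is $T$ rather than one per token, which is the source of the speedup over autoregressive and diffusion decoding.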

4. Explicit Relational Reasoning via Triplet Predictor

SceneNAT models object relations specified in the instruction using a dedicated triplet prediction module:

  • Triplet Queries: $N_q = 4$ learnable queries are used in a two-layer Transformer decoder, cross-attending to both text and per-object embeddings and producing $N_q$ relation-aware features.

  • Triplet Classification: Each feature predicts subject, predicate (spatial relation), and object using independent MLP heads.
  • Hungarian Matching and Loss: Ground-truth relation triplets are matched to outputs with minimal cross-entropy loss using the Hungarian algorithm. The triplet loss:

$$L_{\text{triplet}} = \sum_{k=1}^{N_q} \left[ \mathrm{HCE}(\hat{S}_k, S_k^*) + \mathrm{HCE}(\hat{P}_k, P_k^*) + \mathrm{HCE}(\hat{O}_k, O_k^*) \right]$$

A reduced weight (0.1) is applied to slots assigned the null triplet. By decoupling this task from attribute prediction, SceneNAT enforces explicit spatial and semantic constraints as dictated by language directives.
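With only $N_q = 4$ queries, Hungarian matching can be done by brute force over permutations. The sketch below assumes per-head logit lists and a hypothetical convention in which null slots penalize only a "no relation" class 0 at the reduced weight; the paper's exact null handling may differ:

```python
from itertools import permutations
import math

def nll(logits, target):
    """Cross-entropy of a target class under unnormalized logits."""
    z = [v - max(logits) for v in logits]
    logsum = math.log(sum(math.exp(v) for v in z))
    return -(z[target] - logsum)

def match_triplets(pred, gt, null_weight=0.1):
    """Exhaustive Hungarian matching for small N_q (brute force over
    permutations). pred[k] = (subj_logits, pred_logits, obj_logits);
    gt[g] = (subj, pred, obj) class ids, or None for the null triplet."""
    n = len(pred)
    def cost(k, g):
        if gt[g] is None:
            # hypothetical convention: null slots only penalise class 0
            return null_weight * nll(pred[k][1], 0)
        s, p, o = gt[g]
        return nll(pred[k][0], s) + nll(pred[k][1], p) + nll(pred[k][2], o)
    best = min(permutations(range(n)),
               key=lambda perm: sum(cost(k, perm[k]) for k in range(n)))
    return best, sum(cost(k, best[k]) for k in range(n))
```

For $N_q = 4$ this enumerates only $4! = 24$ assignments; a scalable implementation would use the Hungarian algorithm proper (e.g. `scipy.optimize.linear_sum_assignment`).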

5. Language Instruction Conditioning

SceneNAT conditions its generative process on natural language via cross-attention to CLIP text features:

  • Text is encoded by a frozen CLIP model into an $L_T \times 512$ feature sequence.
  • In each scene and triplet decoder layer, cross-attention maps object and triplet queries to the text sequence:

$$\operatorname{CrossAttn}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q W_q (K W_k)^\top}{\sqrt{d}}\right) V W_v$$

This tight fusion enables the model to integrate both fine-grained and global semantic directives throughout the synthesis process.
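The cross-attention formula can be written out directly in NumPy (single-head, no bias terms, purely illustrative shapes):

```python
import numpy as np

def cross_attn(Q, K, V, Wq, Wk, Wv):
    """Single-head cross-attention: softmax(Q Wq (K Wk)^T / sqrt(d)) V Wv."""
    q, k = Q @ Wq, K @ Wk
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ (V @ Wv)
```

Here $Q$ holds the object (or triplet) queries and $K = V$ hold the CLIP text features, so each scene slot pools a weighted summary of the instruction.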

Classifier-free guidance is employed: during training, the text conditioning is randomly dropped with probability $p=0.1$; at inference, conditional and unconditional logits are mixed via $w = (1+\gamma) w_c - \gamma w_u$ (with $\gamma = 1.0$), enhancing instruction fidelity.
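The guidance mixing rule is a one-liner; the sketch below also includes the training-time condition dropout, where the null-embedding stand-in is an assumption (the real model would use a learned null token):

```python
import numpy as np

def cfg_logits(cond, uncond, gamma=1.0):
    """Classifier-free guidance mix: w = (1 + gamma) * w_c - gamma * w_u."""
    return (1.0 + gamma) * cond - gamma * uncond

def maybe_drop_condition(text_feats, null_feats, p=0.1, rng=None):
    """Training-time conditioning dropout: with probability p, replace the
    text features with a null embedding (here just a stand-in array)."""
    rng = rng or np.random.default_rng()
    return null_feats if rng.random() < p else text_feats
```

With $\gamma = 0$ the mix reduces to the conditional logits; larger $\gamma$ pushes predictions further from the unconditional distribution, trading diversity for instruction fidelity.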

6. Empirical Results and Computational Analysis

Extensive evaluation on the 3D-FRONT dataset demonstrates that SceneNAT-B outperforms ATISS, DiffuScene, and InstructScene in both semantic and physical fidelity (see summary table):

Method          iRecall↑   FID↓     Speed (s/batch)↓   Params (M)   TFLOPs
SceneNAT-B      70.45%     109.55   1.02               53           7.9
InstructScene   66.72%     118.05   6.73               87.7         44.3
DiffuScene      45.98%     191.14   33.28              63.4         63.5

SceneNAT achieves higher recall of instructed relations (iRecall), lower FID, and faster inference. FID-versus-inference-steps curves show that SceneNAT matches diffusion-model quality with two orders of magnitude fewer decoding steps; for instance, ~30 steps in SceneNAT correspond to >1000 in diffusion.

Physical plausibility, as measured by intersection volume, is also improved despite the absence of an explicit collision loss. SceneNAT attains the lowest $V_{\text{sum}}$ across all room types, indicating successful global modeling of layout constraints.

7. Ablation, Limitations, and Future Directions

Ablation studies reveal key dependencies:

  • Removing object masking causes a 4.34% iRecall drop; removing token masking, a 2.19% drop; removing the replace/remask scheme severely degrades generation (FID rises by 80%).
  • Eliminating the triplet predictor reduces iRecall by 7.68%.
  • Sweeping the token discretization granularity (32–256 bins) shows the best tradeoff near $B=64$.

Limitations:

  • The fixed codebook restricts appearance and geometric fidelity.
  • Relation prediction is capped at $N_q = 4$ triplets; more complex instructions would require more queries.
  • The approach is dependent on retrieval from a closed object library; generative mesh decoding is unaddressed.
  • Masking schedules and classifier-free guidance parameters are sensitive hyperparameters.

Potential extensions include adaptive quantization, hyperparameter search (AutoNAT), conditioning on structural priors (e.g., floorplans), and integration with end-to-end mesh synthesis workflows (Choi et al., 12 Jan 2026).


SceneNAT’s two-level masked modeling, non-autoregressive parallel decoding, and explicit relational supervision position it as a computationally efficient and instruction-compliant solution for 3D indoor scene synthesis from language. Its discrete scene matrix formulation supports scalable, interpretable synthesis, while outperforming current autoregressive and diffusion-based paradigms in both semantic and physical scene correctness.
