SceneNAT: Masked Non-Autoregressive 3D Scene Synthesis
- SceneNAT is a masked non-autoregressive Transformer for 3D indoor scene synthesis that reconstructs discretized object and scene representations from language directives.
- It employs a two-level masking scheme and iterative parallel decoding to efficiently recover object attributes and spatial layouts.
- Explicit relational reasoning via a triplet prediction module enforces semantic and spatial constraints, outperforming autoregressive and diffusion-based models.
SceneNAT is a masked non-autoregressive Transformer model for instruction-conditioned 3D indoor scene synthesis. The system generates complete 3D layouts from natural language directives by reconstructing fully discretized representations of both semantic and spatial object attributes, leveraging a two-level masking scheme and explicit relational reasoning. SceneNAT achieves state-of-the-art results in semantic compliance, physical plausibility, and computational efficiency compared to autoregressive and diffusion-based baselines on standard 3D indoor datasets (Choi et al., 12 Jan 2026).
1. Discretized Scene and Object Representation
SceneNAT encodes a 3D indoor scene as a set of up to $N$ objects, each described by five types of discretized tokens: semantic category, appearance, position, scale, and rotation. Formally, a scene is the token matrix $S = \{(c_i, \mathbf{a}_i, \mathbf{t}_i, \mathbf{s}_i, r_i)\}_{i=1}^{N}$ with:
- $c_i$: object class, drawn from a fixed category vocabulary
- $\mathbf{a}_i$: four appearance tokens from a VQ-VAE codebook
- $\mathbf{t}_i$: 3D translation, uniformly binned per axis
- $\mathbf{s}_i$: 3D scale, uniformly binned
- $r_i$: yaw rotation, uniformly binned
All tokens are embedded through learnable lookup tables with input/output weight tying for consistency. Appearance tokens are generated using features from a frozen OpenShape encoder passed into a vector quantized bottleneck. Translation, scale, and rotation are uniformly quantized. This fully discrete, matrix-based scene representation enables efficient parallel token prediction and supports rigorous handling of instance, attribute, and relational information.
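As a concrete illustration of the uniform quantization step, the following sketch maps continuous layout values to bin tokens and back; the bin count (64) and value range are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def quantize_uniform(x, lo, hi, n_bins):
    """Map continuous values in [lo, hi] to integer bin tokens in [0, n_bins - 1]."""
    x = np.clip(x, lo, hi)
    idx = np.floor((x - lo) / (hi - lo) * n_bins).astype(int)
    return np.minimum(idx, n_bins - 1)  # x == hi lands in the last bin

def dequantize_uniform(idx, lo, hi, n_bins):
    """Recover the bin-center value for a token index."""
    return lo + (idx + 0.5) * (hi - lo) / n_bins

# e.g. tokenize a yaw rotation over [-pi, pi) with 64 bins (illustrative count)
yaw_token = quantize_uniform(np.array(0.25), -np.pi, np.pi, 64)
```

Dequantization returns bin centers, so the round-trip error is bounded by half a bin width; finer bins trade vocabulary size for geometric precision.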
2. Masked Generation Objective and Loss Functions
SceneNAT formulates synthesis as a masked-token reconstruction problem, learning to recover masked object attributes from both unmasked tokens and language instructions. The masking operates at two levels:
- Object-level masking: entire rows (objects) are masked
- Attribute-level masking: individual tokens within objects are masked
The total masking ratio is resampled each training step from a dynamically annealed schedule. The split between object-level and attribute-level masking is likewise stochastic, promoting both local and holistic structural learning.
During training:
- 10% of masked tokens are replaced by random alternatives
- 88% replaced by the [MASK] token
- 2% remain unchanged (BERT-style)
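A minimal sketch of the two-level masking and BERT-style corruption described above; the `MASK_ID` value, the object/token split heuristic, and the tensor layout are illustrative assumptions.

```python
import torch

MASK_ID = 0  # assumed reserved id for the [MASK] token

def sample_mask(n_obj, n_attr, ratio, obj_frac=0.5):
    """Two-level mask: mask some whole object rows, then individual tokens."""
    mask = torch.zeros(n_obj, n_attr, dtype=torch.bool)
    n_total = int(ratio * n_obj * n_attr)
    n_rows = min(n_obj, int(obj_frac * n_total / n_attr))
    mask[torch.randperm(n_obj)[:n_rows]] = True          # object-level masking
    remaining = max(n_total - n_rows * n_attr, 0)
    free = (~mask).flatten().nonzero().squeeze(1)
    pick = free[torch.randperm(free.numel())[:remaining]]
    mask.view(-1)[pick] = True                           # attribute-level masking
    return mask

def corrupt_tokens(tokens, mask, vocab_size):
    """BERT-style corruption of masked positions:
    88% -> [MASK], 10% -> random token, 2% -> left unchanged."""
    out = tokens.clone()
    r = torch.rand_like(tokens, dtype=torch.float)
    to_mask = mask & (r < 0.88)
    to_rand = mask & (r >= 0.88) & (r < 0.98)
    out[to_mask] = MASK_ID
    out[to_rand] = torch.randint(1, vocab_size, (int(to_rand.sum()),))
    return out
```

Unmasked positions are always passed through untouched; only the selected positions are corrupted.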
The reconstruction loss aggregates cross-entropy over all attribute types and masked positions:

$$\mathcal{L}_{\text{rec}} = \sum_{i \in \mathcal{M}} \left( \lambda_c \mathcal{L}^{(i)}_c + \lambda_a \mathcal{L}^{(i)}_a + \lambda_t \mathcal{L}^{(i)}_t + \lambda_s \mathcal{L}^{(i)}_s + \lambda_r \mathcal{L}^{(i)}_r \right)$$

where $\mathcal{L}_c$, $\mathcal{L}_a$, $\mathcal{L}_t$, $\mathcal{L}_s$, $\mathcal{L}_r$ are per-attribute cross-entropy losses for class, appearance, translation, scale, and rotation, $\mathcal{M}$ indexes masked positions, and $\lambda_c, \dots, \lambda_r$ are weights (all set to 1.0).
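The aggregation over attribute types and masked positions can be sketched as follows; the dictionary layout and attribute names are illustrative.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(logits, targets, mask, weights=None):
    """Sum weighted cross-entropy over masked positions only.
    `logits`, `targets`, `mask` are dicts keyed by attribute name
    (e.g. class / appearance / translation / scale / rotation)."""
    weights = weights or {k: 1.0 for k in logits}  # all weights set to 1.0
    total = torch.tensor(0.0)
    for k in logits:
        m = mask[k]  # boolean mask of hidden positions for this attribute
        if m.any():
            total = total + weights[k] * F.cross_entropy(logits[k][m], targets[k][m])
    return total
```

Restricting the loss to masked positions is what makes the objective a denoising one: visible tokens serve only as context.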
An auxiliary triplet loss (see Section 4) is incorporated for relational reasoning, yielding the total objective $\mathcal{L} = \mathcal{L}_{\text{rec}} + \lambda_{\text{tri}} \mathcal{L}_{\text{tri}}$.
3. Model Architecture and Parallel Decoding
SceneNAT adopts a non-autoregressive Transformer architecture for efficient and parallel synthesis.
- Scene Decoder: An $L$-layer Transformer ($L = 8$ in SceneNAT-B, $L = 4$ in SceneNAT-S) operating on object slots, each slot containing the embedded/masked attributes, conditioned on the CLIP-encoded instruction (dim 512). Each layer comprises self-attention, cross-attention to the text, a feed-forward network, pre-layernorm, and dropout.
- Attribute Heads: Three output heads predict:
- Object class
- Four appearance tokens
- Layout tokens (translation, scale, rotation)
- All heads reuse the embedding tables from the input stage.
- Iterative Parallel Decoding: The model synthesizes scenes in a small number of parallel, non-autoregressive passes (up to $50$). At each pass:
- Tokens are scored by confidence.
- The lowest-confidence fraction of predictions is remasked, mirroring the training masking scheme.
- Decoding proceeds from most to least ambiguous attributes, similar to coarse-to-fine denoising.
- The masking ratio anneals from fully masked to zero across passes, matching the training schedule.
This approach attains competitive synthesis quality with orders-of-magnitude fewer decoding steps than diffusion models and no sequential token dependencies.
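The decoding loop can be sketched in MaskGIT style; the linear remasking anneal and the confidence heuristic below are simplifications of the scheme described above.

```python
import torch

def iterative_decode(model, tokens, mask_id, steps=30):
    """MaskGIT-style parallel decoding: predict all masked tokens at once,
    keep the most confident predictions, remask the rest, and repeat while
    annealing the masked fraction to zero (linear anneal here)."""
    seq = tokens.clone()
    for step in range(steps):
        masked = seq == mask_id
        if not masked.any():
            break
        logits = model(seq)                    # (L, V) logits for every position
        conf, pred = logits.softmax(-1).max(-1)
        seq[masked] = pred[masked]             # commit all predictions in parallel
        # number of positions to push back to [MASK] for the next round
        keep_masked = int(masked.sum().item() * (1 - (step + 1) / steps))
        if keep_masked > 0:
            conf = conf.masked_fill(~masked, float("inf"))
            seq[conf.argsort()[:keep_masked]] = mask_id  # remask least confident
    return seq
```

Every pass touches all positions simultaneously, which is why the step count stays in the tens rather than the thousands required by sequential or diffusion decoding.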
4. Explicit Relational Reasoning via Triplet Predictor
SceneNAT models object relations specified in the instruction using a dedicated triplet prediction module:
- Triplet Queries: $K$ learnable queries are fed to a two-layer Transformer decoder that cross-attends to both the text and per-object embeddings, producing relation-aware features.
- Triplet Classification: Each feature predicts subject, predicate (spatial relation), and object using independent MLP heads.
- Hungarian Matching and Loss: Ground-truth relation triplets are matched to predicted slots by minimizing total cross-entropy cost with the Hungarian algorithm; the triplet loss then sums the subject, predicate, and object cross-entropies over the matched pairs:

$$\mathcal{L}_{\text{tri}} = \sum_{(k, m) \in \hat{\sigma}} \left( \mathcal{L}^{\text{subj}}_{\text{CE}} + \mathcal{L}^{\text{pred}}_{\text{CE}} + \mathcal{L}^{\text{obj}}_{\text{CE}} \right)$$

where $\hat{\sigma}$ is the optimal slot-to-triplet assignment.
A reduced weight (0.1) is applied to slots assigned the null triplet. By decoupling this task from attribute prediction, SceneNAT enforces explicit spatial and semantic constraints as dictated by language directives.
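The Hungarian matching step can be sketched with SciPy's `linear_sum_assignment`, here simplified to a single label per triplet slot (the full model matches over subject, predicate, and object jointly):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_triplets(pred_logprobs, gt_labels):
    """DETR-style set matching: assign ground-truth triplets to prediction
    slots so that the total cross-entropy cost is minimal.
    pred_logprobs: (K, C) per-slot log-probabilities; gt_labels: (M,) ids."""
    cost = -pred_logprobs[:, gt_labels]        # cost[k, m] = -log p_k(gt_m)
    rows, cols = linear_sum_assignment(cost)   # Hungarian algorithm
    return rows, cols                          # matched (slot, ground-truth) pairs
```

Unmatched slots fall to the null triplet and receive the down-weighted loss, so the number of queries can safely exceed the number of instructed relations.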
5. Language Instruction Conditioning
SceneNAT conditions its generative process on natural language via cross-attention to CLIP text features:
- Text is encoded by a frozen CLIP text encoder into a sequence of 512-dimensional token features.
- In each scene-decoder and triplet-decoder layer, cross-attention maps the object and triplet queries onto the text sequence, with keys and values drawn from the CLIP features.
This tight fusion enables the model to integrate both fine-grained and global semantic directives throughout the synthesis process.
Classifier-free guidance is employed: during training, the text conditioning is randomly dropped with a fixed probability; at inference, conditional and unconditional logits are mixed as $\ell = \ell_{\varnothing} + w(\ell_{\text{cond}} - \ell_{\varnothing})$ with guidance scale $w$, enhancing instruction fidelity.
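A minimal sketch of both halves of classifier-free guidance; the drop probability and the null-embedding handling are illustrative assumptions.

```python
import torch

def maybe_drop_condition(text_emb, null_emb, p_drop=0.1):
    """Training side: replace the text conditioning with a null embedding
    with probability p_drop (value illustrative)."""
    return null_emb if torch.rand(()) < p_drop else text_emb

def cfg_logits(logits_cond, logits_uncond, w):
    """Inference side: extrapolate conditional logits away from the
    unconditional ones; w > 1 strengthens instruction adherence."""
    return logits_uncond + w * (logits_cond - logits_uncond)
```

With $w = 1$ the mixture reduces to the conditional logits; larger $w$ trades sample diversity for tighter instruction compliance.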
6. Empirical Results and Computational Analysis
Extensive evaluation on the 3D-FRONT dataset demonstrates that SceneNAT-B outperforms ATISS, DiffuScene, and InstructScene in both semantic and physical fidelity (see summary table):
| Method | iRecall↑ | FID↓ | Speed (s/batch) | Params (M) | TFLOPs |
|---|---|---|---|---|---|
| SceneNAT-B | 70.45% | 109.55 | 1.02 | 53 | 7.9 |
| InstructScene | 66.72% | 118.05 | 6.73 | 87.7 | 44.3 |
| DiffuScene | 45.98% | 191.14 | 33.28 | 63.4 | 63.5 |
SceneNAT achieves higher recall of instructed relations (iRecall), lower FID, and faster inference. FID-versus-inference-step curves show that SceneNAT matches diffusion-model quality with two orders of magnitude fewer decoding steps; roughly 30 SceneNAT steps correspond to more than 1,000 diffusion steps.
Physical plausibility, as measured by intersection volume, is also improved despite the absence of an explicit collision loss: SceneNAT attains the lowest intersection volume across all room types, indicating successful global modeling of layout constraints.
7. Ablation, Limitations, and Future Directions
Ablation studies reveal key dependencies:
- Removing object masking causes a 4.34% iRecall drop; removing token masking, 2.19%; removing the replace/remask scheme severely degrades generation (FID rises by 80%).
- Eliminating the triplet predictor reduces iRecall by 7.68%.
- Sweeping the token discretization granularity (32–256 bins) shows the best quality–efficiency tradeoff at an intermediate bin count.
Limitations:
- The fixed codebook restricts appearance and geometric fidelity.
- The number of predicted relations is capped by the $K$ triplet queries; more complex instructions would require more queries.
- The approach is dependent on retrieval from a closed object library; generative mesh decoding is unaddressed.
- Masking schedules and classifier-free guidance parameters are sensitive hyperparameters.
Potential extensions include adaptive quantization, hyperparameter search (AutoNAT), conditioning on structural priors (e.g., floorplans), and integration with end-to-end mesh synthesis workflows (Choi et al., 12 Jan 2026).
SceneNAT’s two-level masked modeling, non-autoregressive parallel decoding, and explicit relational supervision position it as a computationally efficient and instruction-compliant solution for 3D indoor scene synthesis from language. Its discrete scene matrix formulation supports scalable, interpretable synthesis, while outperforming current autoregressive and diffusion-based paradigms in both semantic and physical scene correctness.