ScenethesisLang: 3D Scene DSL

Updated 3 July 2026

ScenethesisLang is a domain-specific IR designed for composable, fine-grained, and verifiable 3D scene synthesis.
It employs forward and reverse pipelines to translate natural language into executable 3D scenes and back.
The system supports targeted modifications and high constraint satisfaction, ensuring accurate and efficient scene generation.

ScenethesisLang is a domain-specific, constraint-expressive intermediate representation (IR) developed for the modular synthesis, editing, and understanding of 3D scenes and interactive 3D software. It is designed to bridge the gap between natural language descriptions and executable 3D environments, emphasizing composability, fine-grained control, traceability, and systematic specification and satisfaction of spatial constraints. ScenethesisLang supports both forward (language-to-scene) and reverse (scene-to-language) synthesis pipelines, providing a formal language that encodes objects, regions, attributes, and a rich algebra of spatial and physical constraints while maintaining a direct mapping to 3D assets and runtime environments such as Unity (Li et al., 24 Jul 2025, Ling et al., 5 May 2025, Li et al., 20 Sep 2025).

1. Motivations and Core Principles

ScenethesisLang is motivated by key limitations in prior approaches to text-to-3D and programmatic scene generation. Existing methods—such as scene graphs, attribute lists, or end-to-end learned models—typically lack support for:

Composability and Fine-Grained Editability: Prior systems often require scene-wide regeneration for any edit, lack localized modifiability, or expose only high-level semantics.
Expressive, Verifiable Constraints: Conventional IRs support only discrete, binary spatial relations and do not encode continuous, real-world constraints, physical laws, or logical composition.
Traceability: Verifying, debugging, or mapping generated content back to user intent or requirements is difficult without a formal, inspectable representation.

ScenethesisLang addresses these with a modular IR that is both human- and machine-interpretable, exposing all objects, regions, constraints, and relationships for stepwise synthesis and verification. It enables independent pipeline stages for formalization, verification, asset grounding, spatial constraint solving, and code generation, allowing updates to one aspect of the scene without affecting unrelated components (Li et al., 24 Jul 2025).

2. Language Syntax, Grammar, and Semantics

ScenethesisLang is defined via an EBNF (Extended Backus-Naur Form) grammar with explicit support for object/region declaration, attribute and pose assignment, logical spatial assertions, and physical constraints. The top-level program is a sequence of statements:

$\begin{array}{l}
\langle \mathit{Program} \rangle ::= \langle \mathit{Stmt} \rangle\;{\tt ;}\;\bigl[\langle\mathit{Program}\rangle\bigr] \\
\langle \mathit{Stmt} \rangle ::= \langle \mathit{Decl} \rangle \mid \langle \mathit{Const} \rangle \mid \langle \mathit{Assign} \rangle \\
\langle \mathit{Decl} \rangle ::= {\tt object}\;id \mid {\tt region}\;id \mid \tau\;id \\
\langle \mathit{Const} \rangle ::= {\tt assert}\;\phi \mid {\tt allowCollide}(id,id) \mid {\tt allowOutside}(id) \\
\langle \mathit{Assign} \rangle ::= id.\alpha \leftarrow e \mid id.\beta \leftarrow e \mid id \leftarrow e \\
\alpha ::= {\tt color}\mid{\tt material}\mid{\tt features},\quad \beta ::= {\tt pos}\mid{\tt rot}\mid{\tt scale} \\
\phi ::= e\;\bowtie\;e \mid {\tt inside}(id,id) \mid \phi\wedge\phi \mid \phi\vee\phi \mid \neg\phi \\
\tau ::= {\tt Number}\mid{\tt Degree}\mid{\tt Bool}\mid{\tt Vector3}\mid{\tt Rotation}\mid{\tt Color}\mid{\tt Material} \\
e ::= n\mid id\mid s \mid e\odot e \mid {\tt rand}(e,e) \mid {\tt vec3}(e,e,e) \mid {\tt rot}(e,e,e) \mid {\tt dot}(e,e) \mid id.p \\
\bowtie ::= {=}\mid{\neq}\mid{<}\mid{\le}\mid{>}\mid{\ge},\quad \odot ::= {+}\mid{-}\mid{*}\mid{/}
\end{array}$

Objects and regions form the primary semantic elements. Each object is associated with explicit pose (${\tt pos}$, ${\tt rot}$, ${\tt scale}$) and appearance attributes (${\tt color}$, ${\tt material}$, ${\tt features}$), and all geometric and spatial constraints are encoded with logical formulae. Hard constraints (via ${\tt assert}$) mandate satisfaction in the final synthesized layout; soft constraints can be integrated into the solver objective. Physical constraints such as gravity, boundary inclusion, and support relationships are expressed with formal assertions (Li et al., 24 Jul 2025).
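
To make the top-level productions concrete, the following Python sketch splits a program on `;` and tags each statement as Decl, Const, or Assign. It is an illustration only, not the reference parser: the function names `classify` and `split_program` are hypothetical, and the ASCII `<-` spelling of the assignment arrow is an assumption taken from the code example later in this article.

```python
import re

# Type keywords from the grammar's tau production.
TYPES = {"Number", "Degree", "Bool", "Vector3", "Rotation", "Color", "Material"}


def classify(stmt: str) -> str:
    """Return 'Decl', 'Const', or 'Assign' for one statement without its ';'."""
    stmt = stmt.strip()
    head = stmt.split(None, 1)[0] if stmt else ""
    if head in {"object", "region"} or head in TYPES:
        return "Decl"
    if head == "assert" or re.match(r"allow(Collide|Outside)\s*\(", stmt):
        return "Const"
    if "<-" in stmt:  # assumed ASCII spelling of the grammar's left arrow
        return "Assign"
    raise ValueError(f"unrecognised statement: {stmt!r}")


def split_program(src: str) -> list[tuple[str, str]]:
    """Split on ';' and tag each statement with its production name.

    Simplification: a ';' inside a string literal would be split too.
    """
    return [(classify(s), s.strip()) for s in src.split(";") if s.strip()]
```

A full implementation would parse expressions and formulae as well; this sketch only shows how the three statement forms can be told apart from their leading token.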

3. Pipeline Architecture and Synthesis Stages

Scenethesis-based 3D software synthesis is segmented into modular stages, each operating on or emitting a ScenethesisLang IR:

Stage I: Requirement Formalization. User requirements in natural language (NL) are mapped to formal DSL via LLM-based scene-type classification, prompt expansion (with inferred constraints), region partitioning, and entity extraction. Spatial and hidden physical constraints are generated and embedded as formal assertions (e.g., collision avoidance, gravity, boundaries) in the DSL (Li et al., 24 Jul 2025).
Stage II: Asset Retrieval and Feature Assignment. Each object is matched to a database asset using a bi-modal similarity metric combining visual and semantic retrieval (Li et al., 24 Jul 2025).
Stage III: Spatial Constraint Solving. A dedicated constraint solver iteratively refines the scene layout to maximize satisfaction of all encoded constraints. Hard constraints are strictly enforced, and the system is capable of handling >100 constraints per scene.
Stage IV: Code Generation and Scene Export. The fully realized scene, along with all DSL metadata, is exported (e.g., as a Unity scene with embedded IR for round-trip traceability) (Li et al., 24 Jul 2025).

This staged approach enables targeted modification (edit only specific DSL statements), systematic verification (static and dynamic analysis on the IR), and efficient scene regeneration.
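
The source does not describe the internals of the Stage III solver, so the sketch below substitutes a plain random-perturbation hill climber to illustrate what "iteratively refines the scene layout" can mean in practice. Everything here is an assumption: the layout is a dictionary from object name to an `[x, y, z]` position, constraints are Python predicates over that dictionary, and the names `violations` and `refine` are hypothetical. It also does not distinguish hard from soft constraints.

```python
import random


def violations(layout, constraints):
    """Count constraints that the layout fails to satisfy."""
    return sum(1 for c in constraints if not c(layout))


def refine(layout, constraints, iterations=200, step=0.5, seed=0):
    """Perturb one coordinate at a time and keep any change that does not
    increase the number of violated constraints (stand-in algorithm)."""
    rng = random.Random(seed)
    best = {k: list(v) for k, v in layout.items()}  # copy, input is untouched
    best_cost = violations(best, constraints)
    for _ in range(iterations):
        if best_cost == 0:
            break  # every constraint holds
        cand = {k: list(v) for k, v in best.items()}
        name = rng.choice(sorted(cand))
        axis = rng.randrange(3)
        cand[name][axis] += rng.uniform(-step, step)
        cost = violations(cand, constraints)
        if cost <= best_cost:  # accepting ties lets the search cross plateaus
            best, best_cost = cand, cost
    return best, best_cost
```

Because the cost is a violation count, the returned layout never violates more constraints than the input, but convergence to zero violations is not guaranteed by this simple scheme.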

4. Expressivity, Constraint Algebra, and Verification

ScenethesisLang supports a rich algebra over constraints:

Continuous Predicates:

Assertions such as $\|{\tt pos}(o_i) - {\tt pos}(o_j)\|_2 \le d$ (distance), ${\tt assert}~o_i~{\it isSupportedBy}~o_j$ (gravity), and $\neg{\tt collides}(o_i,o_j)$ (collision avoidance).
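
As a rough illustration of how such predicates could be evaluated, the following Python sketch implements the three examples over axis-aligned boxes. The box convention (centre position plus full extents, y axis up), the contact tolerance, and the names `Box`, `distance`, `collides`, and `is_supported_by` are assumptions made for this example; the source does not specify how geometry is represented.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box: centre position and full extents (assumed convention)."""
    pos: tuple
    size: tuple

    def lo(self):
        return tuple(p - s / 2 for p, s in zip(self.pos, self.size))

    def hi(self):
        return tuple(p + s / 2 for p, s in zip(self.pos, self.size))


def distance(a: Box, b: Box) -> float:
    """Euclidean distance between the two centres."""
    return math.dist(a.pos, b.pos)


def collides(a: Box, b: Box) -> bool:
    """True when the boxes overlap with positive extent on all three axes."""
    return all(al < bh and bl < ah
               for al, ah, bl, bh in zip(a.lo(), a.hi(), b.lo(), b.hi()))


def is_supported_by(a: Box, b: Box, tol: float = 1e-3) -> bool:
    """a rests on b: a's bottom face meets b's top face (y up) and their
    footprints overlap in x and z."""
    touching = abs(a.lo()[1] - b.hi()[1]) <= tol
    footprint = all(a.lo()[i] < b.hi()[i] and b.lo()[i] < a.hi()[i]
                    for i in (0, 2))
    return touching and footprint
```

With this convention, a cup whose bottom face sits exactly on a table top satisfies `is_supported_by` without counting as a collision, since touching faces have zero overlap.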

Logical Composition:

Constraints can be combined by conjunction ($\wedge$), disjunction ($\vee$), and negation ($\neg$).

Name resolution and type checking:

The IR maintains a symbol table for consistent resolution and ensures that all assignment and assertion types are correct.
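
A minimal sketch of such a symbol table is shown below. The class name `SymbolTable`, its methods, and the attribute-to-type mapping are hypothetical: the grammar lists the attribute names and the type keywords, but the pairing between them (for instance that `pos` is a `Vector3`, or that `features` takes a string) is an assumption made for illustration.

```python
# Assumed attribute types; "String" is not a type keyword in the grammar.
ATTR_TYPES = {
    "pos": "Vector3", "rot": "Rotation", "scale": "Vector3",
    "color": "Color", "material": "Material", "features": "String",
}


class SymbolTable:
    """Tracks declared identifiers and checks attribute assignments."""

    def __init__(self):
        self._kinds = {}  # identifier -> "object", "region", or a type name

    def declare(self, name: str, kind: str) -> None:
        if name in self._kinds:
            raise NameError(f"duplicate declaration of {name!r}")
        self._kinds[name] = kind

    def check_assign(self, target: str, attr: str, value_type: str) -> None:
        """Raise if the target is undeclared or the value type mismatches."""
        if target not in self._kinds:
            raise NameError(f"undeclared identifier {target!r}")
        expected = ATTR_TYPES.get(attr)
        if expected is None:
            raise AttributeError(f"unknown attribute {attr!r}")
        if expected != value_type:
            raise TypeError(f"{target}.{attr} expects {expected}, got {value_type}")
```

Running these checks before solving means that a misspelled object name or a vector assigned to a color attribute is rejected before any layout is computed.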

Systematic Verification:

For every candidate layout, ScenethesisLang permits both syntactic (type, name, and structure) and semantic (constraint satisfaction, boundary inclusion) checks via the $\llbracket \cdot \rrbracket_L$ evaluation function over scene layouts (Li et al., 24 Jul 2025).
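
The semantic check can be pictured as a recursive evaluator over the formula grammar. The Python sketch below is one possible reading of $\llbracket \cdot \rrbracket_L$, not the published definition: the node classes, the function names `eval_expr` and `holds`, and the choice to model a layout as a flat dictionary from dotted paths such as `"lamp.pos.y"` to numbers are all assumptions.

```python
import operator
from dataclasses import dataclass

CMP = {"=": operator.eq, "!=": operator.ne, "<": operator.lt,
       "<=": operator.le, ">": operator.gt, ">=": operator.ge}
ARITH = {"+": operator.add, "-": operator.sub,
         "*": operator.mul, "/": operator.truediv}


@dataclass(frozen=True)
class Num:
    value: float


@dataclass(frozen=True)
class Ref:
    path: str  # dotted path into the layout, e.g. "lamp.pos.y"


@dataclass(frozen=True)
class BinOp:
    op: str
    left: object
    right: object


@dataclass(frozen=True)
class Cmp:
    op: str
    left: object
    right: object


@dataclass(frozen=True)
class And:
    left: object
    right: object


@dataclass(frozen=True)
class Or:
    left: object
    right: object


@dataclass(frozen=True)
class Not:
    arg: object


def eval_expr(e, layout):
    """Evaluate an arithmetic expression against a layout."""
    if isinstance(e, Num):
        return e.value
    if isinstance(e, Ref):
        return layout[e.path]
    if isinstance(e, BinOp):
        return ARITH[e.op](eval_expr(e.left, layout), eval_expr(e.right, layout))
    raise TypeError(f"not an expression: {e!r}")


def holds(phi, layout) -> bool:
    """Evaluate a formula to True or False for the given layout."""
    if isinstance(phi, Cmp):
        return CMP[phi.op](eval_expr(phi.left, layout), eval_expr(phi.right, layout))
    if isinstance(phi, And):
        return holds(phi.left, layout) and holds(phi.right, layout)
    if isinstance(phi, Or):
        return holds(phi.left, layout) or holds(phi.right, layout)
    if isinstance(phi, Not):
        return not holds(phi.arg, layout)
    raise TypeError(f"not a formula: {phi!r}")
```

For example, the assertion that a lamp hangs at least 0.2 above a table top can be built as `Cmp(">", Ref("lamp.pos.y"), BinOp("+", BinOp("+", Ref("table.pos.y"), Ref("table.scale.y")), Num(0.2)))` and passed to `holds` together with a candidate layout.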

5. Editing, Traceability, and Targeted Modifications

One of the IR's primary strengths is support for targeted scene modification and round-trip traceability. Developers and downstream modules can identify, update, or replace any aspect of the DSL (object attributes, constraints) without re-synthesizing unaffected scene components. The generated IR is embedded verbatim as metadata in the output (e.g., a Unity scene), enabling reverse mapping to requirements or user queries (Li et al., 24 Jul 2025). Independent re-execution of downstream synthesis or constraint-solving stages after any localized edit enables efficient iteration and supports diverse use cases, including UI design, robotics, and simulation.
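
One way to picture selective re-execution is a table that maps each kind of edit to the earliest pipeline stage it invalidates. The Python sketch below is hypothetical throughout: the stage identifiers, the edit categories, and in particular the mapping in `FIRST_AFFECTED` are assumptions chosen for illustration, since the source states only that downstream stages can be re-run independently after a localized edit.

```python
# Stage order follows the four-stage pipeline; identifiers are made up here.
STAGES = ("formalize", "retrieve_assets", "solve_layout", "export")

# Assumed mapping from edit kind to the first stage that must be re-run.
FIRST_AFFECTED = {
    "appearance": "retrieve_assets",  # color, material, features
    "pose": "solve_layout",           # pos, rot, scale
    "constraint": "solve_layout",     # assert, allowCollide, allowOutside
}


def stages_to_rerun(edit_kinds):
    """Return the stages to re-execute, starting at the earliest affected one."""
    if not edit_kinds:
        return []
    start = min(STAGES.index(FIRST_AFFECTED[k]) for k in edit_kinds)
    return list(STAGES[start:])
```

Under these assumptions, tightening a single assertion re-runs only layout solving and export, while changing an object's features also repeats asset retrieval.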

6. Evaluation and Empirical Results

ScenethesisLang enables high-fidelity, controllable, and verifiable 3D scene synthesis at scale. Key empirical results include:

Requirement Capture:

F1 > 80% for τ = 0.8; object constraint F1 > 97%, layout constraints ≈70%.

Constraint Satisfaction:

≥93% satisfied after 5 solver iterations (k = 3, T = 5), supporting over 100 concurrent constraints.

Scene-Query Coherence:

BLIP-2 score improvement of 42.8% versus previous methods (mean 74.3 vs. 52.0).

User Studies:

Higher layout coherence (4.12 vs. 3.68), spatial realism (3.89 vs. 3.42), and overall consistency (4.05 vs. 3.61) compared to Holodeck (n=20) (Li et al., 24 Jul 2025).

Code Example:

A conference table and lamp specification:

region Hall;
object table;
table.scale <- vec3(2.0, 0.75, 1.0);
object lamp; lamp.features <- "modern pendant";
assert lamp.pos.y > table.pos.y + table.scale.y + 0.2;
assert |lamp.pos.x - table.pos.x| < 0.1;
assert |lamp.pos.z - table.pos.z| < 0.1;
assert ¬collides(lamp, table);

This fine-grained, constraint-first approach yields scenes that maintain both the high-level intent and low-level physical and geometric correctness, validated by both automated metrics and human assessment (Li et al., 24 Jul 2025).

7. Related Frameworks

ScenethesisLang has catalyzed a broader class of constraint-aware scene representations. Related frameworks include:

SceneScript: A sequence-based parametric language with extensible commands for walls, doors, windows, and objects, used for layout estimation and 3D object detection (Avetisyan et al., 2024).
Programmatic Scene Languages: Hierarchical DSLs (e.g., HDSL) that leverage tree-structured representations for localized editing, hierarchical planning, and verifiable spatial validation (Li et al., 8 Jun 2026).
Scene-to-Language Pipelines: Reverse direction frameworks (e.g., Text-Scene) that parse 3D geometry into explicit spatially-grounded textual graphs, supporting downstream reasoning and planning (Li et al., 20 Sep 2025).

Unlike methods which rely solely on learned spatial priors (e.g., language-driven diffusion models), ScenethesisLang formalizes both real-world continuous and logical constraints and exposes all decision points for verification and revision.

By systematically bridging language, spatial semantics, and formal specification, ScenethesisLang establishes a foundation for programmatic, transparent, and verifiable synthesis and manipulation of complex 3D software and environments (Li et al., 24 Jul 2025, Ling et al., 5 May 2025, Li et al., 20 Sep 2025).