SynthFun3D: Functional 3D Scene Synthesis
- SynthFun3D is a method for task-based 3D scene synthesis that creates indoor environments with precise functional element segmentation.
- It integrates LLM-based prompt parsing, multi-database retrieval, and constraint-driven layout optimization to ensure scene-prompt coherence.
- Empirical results demonstrate improved segmentation accuracy and scalability, offering a cost-effective alternative to manual real-world annotations.
SynthFun3D is a method for task-based 3D scene synthesis, designed to generate indoor environments tailored to specific functionality prompts expressed in natural language. SynthFun3D enables the creation of annotated 3D data with a focus on functional elements—such as handles, switches, or knobs—necessary for specified actions. It addresses the limitations of prior object-centric synthesis frameworks by supporting fine-grained part-level reasoning and segmentation within synthesized scenes. The pipeline incorporates LLM reasoning, multi-database retrieval, constraint-driven layout optimization, and scalable annotation. Quantitative and qualitative studies indicate that SynthFun3D substantially improves scene-prompt coherence and functional element segmentation while maintaining cost-effectiveness and scalability compared to manually annotated real-world datasets (Corsetti et al., 28 Nov 2025).
1. Motivation and Background
SynthFun3D is motivated by the requirements of embodied-AI agents, which must learn to perform fine-grained interactions in 3D environments by identifying not just objects but their functional subcomponents. Existing datasets, such as SceneFun3D, offer part-level masks but are prohibitively expensive to construct and annotate. Prior synthesis methods either prioritize object-level layout (retrieval-based) or appearance (diffusion-based: e.g., SceneFactor, Diffuscene) and cannot ensure that synthesized scenes support the execution of a specific functional task described in language. SynthFun3D pursues the objective of generating 3D scenes in which given functional action descriptions are feasible, with precise retrieval and placement of target functional elements, and delivery of multi-view, multi-instance segmentation data at scale.
2. Technical Pipeline
SynthFun3D’s synthesis pipeline is organized into sequential modules, each orchestrated to fulfill the requirements of the task description. The overall flow proceeds as follows:
- Prompt Parsing: An LLM (e.g., GPT-OSS-20B) processes the natural-language action description and extracts a structured layout prompt, a contextual object name, an object type classification, and a context-free prompt with positional cues removed. This parsing seeds subsequent asset queries and placement constraints.
- Asset Retrieval: For each object in the scene, short appearance and size descriptions are produced by the LLM. Two databases are used:
- For non-target/contextual objects: Objaverse, containing ~818K unannotated meshes indexed by CLIP embeddings.
- For target/functional objects: PartNet-Mobility, comprising ~2K articulated household objects with part-level segmentations and semantic labels, indexed by PerceptionEncoder embeddings.
Retrieval scores are computed by comparing the embedding of the LLM-generated appearance/size description against each asset's precomputed embedding (CLIP for Objaverse, PerceptionEncoder for PartNet-Mobility); the top-$k$ candidates above a similarity threshold are selected. A minimal scoring sketch follows this list.
- Requirement-Based Filtering: The LLM examines the functional part labels of each candidate asset together with the prompt and returns the required part name along with any numerical constraints (e.g., a minimum handle count). Assets that do not satisfy these requirements are discarded.
- Functional-Arrangement Filtering: For each candidate asset, the masks of its functional parts are extracted. Their 3D centroids are normalized into two dimensions, and part labels are enriched with hierarchical parent information (e.g., "drawer handle"). The LLM evaluates these spatial configurations against the prompt's positional cues and outputs the specific part ID to use for mask retrieval; a centroid-normalization sketch also follows this list.
- Layout Optimization: All relationships in the layout prompt are converted into hard constraint clauses (e.g., spatial relations between object pairs). A two-stage depth-first search (DFS), inspired by Holodeck, places mandatory objects under strict constraints and then optional context objects for scene variability:
```python
def place_mandatory(constraints, scene):
    """Backtracking DFS that must satisfy every hard constraint."""
    if not constraints:                     # all mandatory objects placed
        return scene                        # return the completed solution
    c = constraints[0]                      # pick the next hard constraint
    for placement in valid_placements(c, scene):
        scene.apply(placement)              # apply placement p
        solution = place_mandatory(constraints[1:], scene)
        if solution is not None:
            return solution
        scene.undo(placement)               # backtrack
    return None                             # failure: no feasible placement

def place_optional(soft_constraints, scene):
    """Same search, but a soft constraint may be skipped if unsatisfiable."""
    ...
```
- Rendering and Annotation: The finalized scene is rendered via Blender along multiple sampled camera trajectories, producing RGB frames, multi-instance segmentation masks for all objects, and a functional-element mask for the specific target part. Optionally, photorealistic style transfer is applied using Cosmos-Transfer 2.5, directed by environment and style captions generated through Cosmos-Reason1 and Llama-Nemotron.
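The asset-retrieval scoring referenced above is not reproduced in formula form; the following is a minimal sketch, assuming cosine similarity between the description embedding and precomputed asset embeddings (the function name, `k`, and threshold values are illustrative, not taken from the paper):

```python
import numpy as np

def retrieve_assets(query_emb, asset_embs, asset_ids, k=10, threshold=0.25):
    """Rank assets by cosine similarity to the LLM-generated description.

    query_emb:  (d,) embedding of the appearance/size description.
    asset_embs: (N, d) precomputed CLIP or PerceptionEncoder embeddings.
    Returns up to k (asset_id, score) pairs above the similarity threshold.
    """
    q = query_emb / np.linalg.norm(query_emb)
    a = asset_embs / np.linalg.norm(asset_embs, axis=1, keepdims=True)
    scores = a @ q                                   # cosine similarity per asset
    order = np.argsort(-scores)                      # best-scoring assets first
    keep = [i for i in order[:k] if scores[i] >= threshold]
    return [(asset_ids[i], float(scores[i])) for i in keep]
```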
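For the functional-arrangement filter, the sketch below shows one way the 3D part centroids could be reduced to normalized 2D coordinates for the LLM to reason over; the axis treated as depth is an assumption, since the paper does not specify the exact projection:

```python
import numpy as np

def normalize_part_centroids(centroids_3d, drop_axis=1):
    """Project (P, 3) part centroids to 2D and rescale each axis to [0, 1].

    drop_axis selects the axis treated as depth and discarded; this is an
    assumed convention, not the paper's stated one.
    """
    keep = [a for a in range(3) if a != drop_axis]
    xy = centroids_3d[:, keep]
    lo, hi = xy.min(axis=0), xy.max(axis=0)
    return (xy - lo) / np.maximum(hi - lo, 1e-8)     # guard against zero extent
```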
3. Asset Databases and Annotation Structure
The pipeline draws from two distinct asset repositories:
| Database | Mesh Count | Annotation Level | Indexing |
|---|---|---|---|
| Objaverse | ~818K | None (unannotated) | CLIP/text |
| PartNet-Mobility | ~2K | Part-level, hierarchy | PerceptionEncoder |
Objaverse provides a breadth of categories but lacks part masks. PartNet-Mobility contains articulated objects with detailed segmentation, semantic labels, and parent-child part hierarchies, enabling fine-grained functional reasoning and retrieval.
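The contrast in annotation level between the two databases can be summarized with a simple record type; this is an illustrative sketch, not the paper's actual schema, and all field names are hypothetical:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class FunctionalPart:
    part_id: str
    label: str                         # e.g. "handle"
    parent_label: Optional[str]        # hierarchy info, e.g. "drawer" -> "drawer handle"
    mask_path: str                     # per-part segmentation mask

@dataclass
class AssetRecord:
    asset_id: str
    source: str                        # "objaverse" or "partnet_mobility"
    mesh_path: str
    embedding_index: str               # key into the CLIP / PerceptionEncoder index
    parts: List[FunctionalPart] = field(default_factory=list)  # empty for Objaverse assets
```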
4. Evaluation Protocols and Empirical Results
SynthFun3D’s outputs are subject to extensive user studies and downstream segmentation benchmarking:
- Scene-Prompt Coherence (85 prompts, expert annotators): Success indicates the presence of required objects and correct spatial relationships. SynthFun3D attains a higher success rate than Holodeck, and ablations that drop requirement filtering or hierarchy metadata reduce performance.
- Functional Mask Retrieval Correctness (118 prompts, 354 masks): Annotators verify correct segmentation of functional parts. SynthFun3D yields higher correctness than its ablated variants.
- Object Retrieval Preference (118 prompts, 547 comparisons, non-expert pairwise forced-choice): SynthFun3D is preferred over Holodeck more often than the reverse (with some comparisons judged equal), with a particularly strong preference for structural and small elements.
- Downstream Functionality Segmentation (SceneFun3D validation; metrics: mAP, AP50/25, mAR, AR50/25, mIoU): Models trained only on SynthFun3D data achieve $11.99$ mIoU versus $14.04$ for real-only training; pretraining on synthetic data and fine-tuning on real data yields $16.16$ mIoU, a gain over the real-only baseline that increases further to $16.58$ with photorealistic style transfer.
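For reference, the mIoU figures above average per-mask intersection-over-union between predicted and ground-truth functional-element masks; the sketch below is a generic formulation, not the benchmark's official evaluation code:

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection-over-union between two boolean segmentation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union else 1.0

def mean_iou(mask_pairs):
    """Average IoU over (prediction, ground-truth) mask pairs."""
    return float(np.mean([mask_iou(p, g) for p, g in mask_pairs]))
```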
Key findings indicate that synthetic data produced by SynthFun3D offers near-parity with manually annotated real data for functionality segmentation, and supplementation via domain adaptation further improves benchmark performance.
5. Limitations and Future Research Directions
The current system is constrained by the coverage and diversity within Objaverse and PartNet-Mobility; certain object types, colors, or styles are underrepresented. Reliance on canonical object orientations and conventional CAD preprocessing introduces susceptibility to placement or retrieval errors. While photorealistic style transfer with Cosmos-Transfer 2.5 expands visual diversity, it can hallucinate or modify functional elements, sometimes invalidating ground truth masks.
Ongoing research aims to:
- Expand the asset corpus and integrate community-sourced models to enhance coverage.
- Support multi-step action prompts, advancing beyond single-action functionality synthesis.
- Integrate physics-grounded constraints, such as collision detection and articulation feasibility.
- Jointly optimize scene synthesis for both perceptual and planning tasks in embodied AI.
- Refine prompt engineering and in-model control to mitigate style transfer hallucinations.
6. Significance and Practical Implications
SynthFun3D enables inexpensive, scalable generation of functional 3D scene data, achieving task-driven synthesis at part-level granularity without reliance on supervised learning pipelines. It supports research in functionality understanding for embodied agents, benchmarking of segmentation models, and can broaden training resources for a range of data-hungry 3D applications. The pipeline is training-free apart from prompt engineering, which reduces annotation cost dramatically compared to the ~$25K cost of SceneFun3D's manual annotation, and its modularity allows integration with future methods in scene synthesis and embodied AI. Empirical evidence suggests that synthetic data from SynthFun3D can replace, or meaningfully supplement, real annotated data with minimal loss or measurable gain in segmentation accuracy (Corsetti et al., 28 Nov 2025).