RoomEditor++: Diffusion-Based Indoor Scene Synthesis
- RoomEditor++ is a diffusion-based indoor scene synthesis system that achieves high-fidelity virtual furniture insertion through a shared dual diffusion backbone.
- It leverages a parameter-sharing dual-diffusion architecture, instantiated with either a U-Net or a Diffusion Transformer backbone, for effective inpainting, texture preservation, and precise geometric alignment.
- Benchmarking on RoomBench++ demonstrates superior reconstruction metrics and robust cross-dataset generalization, eliminating the need for task-specific fine-tuning.
RoomEditor++ is a diffusion-based architecture for virtual furniture insertion and indoor scene synthesis that achieves high-fidelity, geometry-consistent furniture insertion, inpainting, and scene editing by leveraging a parameter-sharing dual diffusion backbone. Designed to address limitations in previous furniture synthesis and compositional room editing techniques, RoomEditor++ is supported by RoomBench++, a large-scale benchmark with real and realistic scene pairs. The system enables precise geometric alignment, seamless texture and identity preservation, and strong generalization across diverse scenes without task-specific fine-tuning (Wang et al., 19 Dec 2025). The RoomEditor++ design framework has influenced and integrated insights from approaches such as layout-controllable mesh generation (Feng et al., 9 Sep 2024), LLM-driven programmatic mesh and texture editing (Kim et al., 21 Jun 2025), and compositional layout graph diffusion (Zheng et al., 3 Oct 2024), consolidating the state of the art for interactive 2D/3D room-content synthesis.
1. Problem Formulation and Architectural Principles
RoomEditor++ addresses the task of virtual furniture synthesis and geometric inpainting. Let $x$ be the ground-truth composite image, with binary mask $M$ (equal to 0 inside the inpainting region and 1 elsewhere) specifying the inpainting region and its complement $\bar{M} = 1 - M$ covering the reference object. The masked reference and masked background are derived as $x_{\mathrm{ref}} = x \odot \bar{M}$ and $x_{\mathrm{bg}} = x \odot M$, respectively, where $\odot$ denotes element-wise multiplication.
Given noisy inputs $(x_{\mathrm{ref},t}, x_{\mathrm{bg},t})$ at diffusion timestep $t$, the model learns to reconstruct $x$, inpainting $x_{\mathrm{bg}}$ while pasting an aligned reference from $x_{\mathrm{ref}}$.
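A minimal sketch of this input construction in PyTorch, following the mask convention above ($M = 0$ inside the inpainting region); the helper name `make_masked_inputs` is illustrative, not from the paper:

```python
import torch

def make_masked_inputs(x: torch.Tensor, mask: torch.Tensor):
    """Derive the masked reference and masked background from a composite.

    x:    (B, 3, H, W) ground-truth composite image.
    mask: (B, 1, H, W) binary mask M, 0 inside the inpainting region, 1 elsewhere.
    """
    x_ref = x * (1.0 - mask)  # complement of M: keeps only the reference object
    x_bg = x * mask           # M: background with the inpainting region zeroed out
    return x_ref, x_bg
```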
The architecture centers on a parameter-sharing dual-diffusion backbone. Both $x_{\mathrm{ref}}$ and $x_{\mathrm{bg}}$ pass through the same backbone $\epsilon_\theta$, with explicit feature interaction at every block. For the U-Net variant, hierarchical 2D feature maps are processed with local ResNet + LayerNorm, self-attention (reference stream), mixture-attention (background stream attending over both reference and background features), a GELU-based FFN, and residual connections. For the DiT (Diffusion Transformer) variant, tokenized sequences are processed via AdaLN-Zero, sequence attention, MLP branches, and gated residuals.
The core innovation is parameter sharing: both reference and background features are processed by weights shared across both streams, with interaction modules guaranteeing feature alignment—an essential property for geometric coherence.
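The interaction pattern can be illustrated with a simplified transformer-style block in which the reference stream uses self-attention and the background stream uses mixture attention over concatenated reference and background tokens, all through shared weights. The module name `SharedDualStreamBlock` and the exact wiring are assumptions for illustration, not the paper's implementation:

```python
import torch
import torch.nn as nn

class SharedDualStreamBlock(nn.Module):
    """One backbone block shared by the reference and background streams (sketch).

    Both streams run through the *same* weights; the background stream
    additionally attends over the concatenation of reference and background
    tokens ("mixture attention"), as described above.
    """

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, ref: torch.Tensor, bg: torch.Tensor):
        # Reference stream: plain self-attention over its own tokens.
        r = self.norm1(ref)
        ref = ref + self.attn(r, r, r, need_weights=False)[0]

        # Background stream: mixture attention over reference + background tokens,
        # reusing the same attention weights as the reference stream.
        b = self.norm1(bg)
        ctx = torch.cat([self.norm1(ref), b], dim=1)
        bg = bg + self.attn(b, ctx, ctx, need_weights=False)[0]

        # Shared feed-forward applied to both streams with residual connections.
        ref = ref + self.ffn(self.norm2(ref))
        bg = bg + self.ffn(self.norm2(bg))
        return ref, bg
```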
2. Learning and Loss Functions
RoomEditor++ is trained with a single DDPM inpainting/denoising loss:

$$\mathcal{L} = \mathbb{E}_{x,\, \epsilon \sim \mathcal{N}(0, I),\, t}\Big[\big\| \epsilon - \epsilon_\theta(x_t,\, x_{\mathrm{ref}},\, x_{\mathrm{bg}},\, M,\, t) \big\|_2^2\Big],$$

where $x_t$ is the noised composite at timestep $t$ and $\epsilon_\theta$ is the shared backbone.
No explicit adversarial or perceptual loss is used in the main configuration, though these can be included as options.
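A minimal training-step sketch of this objective, assuming an ε-prediction model in PyTorch; the conditioning signature `model(x_t, x_ref, x_bg, mask, t)` and the precomputed `alphas_cumprod` schedule are illustrative assumptions, not the paper's exact interface:

```python
import torch
import torch.nn.functional as F

def ddpm_inpainting_loss(model, x, x_ref, x_bg, mask, alphas_cumprod):
    """Single DDPM denoising loss, following the objective above (sketch).

    model: eps-prediction network taking the noisy composite plus the masked
           reference/background conditions, the mask, and the timestep.
    alphas_cumprod: 1D tensor of cumulative products of the noise schedule.
    """
    alphas_cumprod = alphas_cumprod.to(x.device)
    b = x.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=x.device)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)

    eps = torch.randn_like(x)
    x_t = a_bar.sqrt() * x + (1.0 - a_bar).sqrt() * eps  # forward noising

    eps_pred = model(x_t, x_ref, x_bg, mask, t)          # shared dual-diffusion backbone
    return F.mse_loss(eps_pred, eps)
```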
RoomBench++ supports training with a variety of augmentations (horizontal flip, rotation, scaling, random crop; mask dilation/erosion, boundary blurring, etc.) and includes both real-scene (automatically segmented indoor video frames) and realistic-rendering subsets (professional design renders with precise foreground/background mask annotation).
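A sketch of one paired augmentation step covering a subset of the listed operations (horizontal flip, rotation, mask dilation/erosion, boundary blurring), assuming torchvision tensors; parameter ranges and kernel sizes are illustrative assumptions:

```python
import random
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF

def augment_pair(image: torch.Tensor, mask: torch.Tensor):
    """Paired geometric augmentation plus mask perturbation (sketch).

    image: (3, H, W), mask: (1, H, W). Geometric transforms are applied
    identically to image and mask; dilation/erosion and boundary blurring
    perturb only the mask, mimicking looser masks at deployment time.
    """
    if random.random() < 0.5:                       # horizontal flip
        image, mask = TF.hflip(image), TF.hflip(mask)
    angle = random.uniform(-10, 10)                 # small rotation
    image, mask = TF.rotate(image, angle), TF.rotate(mask, angle)

    k = random.choice([3, 5])                       # dilation/erosion via max-pooling
    m = mask.unsqueeze(0)
    if random.random() < 0.5:
        m = F.max_pool2d(m, k, stride=1, padding=k // 2)               # dilate
    else:
        m = 1.0 - F.max_pool2d(1.0 - m, k, stride=1, padding=k // 2)   # erode
    m = TF.gaussian_blur(m, kernel_size=5)          # boundary blurring
    return image, m.squeeze(0)
```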
Two primary backbone instantiations are supported:
- RoomEditor++(U): U-Net backbone, initialized from Stable Diffusion 1.5 inpainting.
- RoomEditor++(DiT): 57-block Diffusion Transformer, initialized from FLUX.1 Fill.
Training is performed with the Prodigy optimizer (weight decay 0.01), an input resolution of 768×768, LoRA rank 256 for efficient fine-tuning, and a batch size of 32 on four A800 GPUs.
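A hedged configuration sketch of this setup using the `prodigyopt`, `peft`, and `diffusers` packages; the checkpoint identifier, LoRA target modules, and per-GPU batch split are assumptions not specified in the text:

```python
from diffusers import UNet2DConditionModel
from peft import LoraConfig, get_peft_model
from prodigyopt import Prodigy

# Load an SD-1.5 inpainting U-Net as the shared backbone (the checkpoint id is a
# placeholder; use whichever SD-1.5 inpainting weights are available).
SD15_INPAINT = "path-or-hub-id/of/sd-1.5-inpainting"
backbone = UNet2DConditionModel.from_pretrained(SD15_INPAINT, subfolder="unet")

# LoRA rank 256 on the attention projections (target module names follow
# diffusers' U-Net attention layers and are assumptions, not from the paper).
lora_cfg = LoraConfig(r=256, lora_alpha=256,
                      target_modules=["to_q", "to_k", "to_v", "to_out.0"])
backbone = get_peft_model(backbone, lora_cfg)

# Prodigy is learning-rate-free; lr=1.0 is its recommended setting.
optimizer = Prodigy(backbone.parameters(), lr=1.0, weight_decay=0.01)

# Reported setup: 768x768 inputs, global batch size 32 on four A800 GPUs
# (e.g., 8 per GPU under data parallelism; the split is an assumption).
```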
3. Dataset: RoomBench++ and Annotation Structure
RoomBench++ comprises 112,851 training and 1,832 test image pairs, subdivided into:
- Realistic-scene subset: Renderings with precisely annotated masks.
- Real-scene subset: Extracted from public indoor videos, segmenting repeated objects across frames with Sa2Va + CLIP/DINO-based filtering and clustering.
Furniture category coverage includes chairs, sofas, tables, beds, cabinets, etc. Masks are annotated for both training (sub-pixel precision) and testing (looser masks that more closely reflect practical deployment). Test curation for the real subset enforces pose diversity. The dataset supports rigorous probing of geometry, appearance, and compositional generalization (Wang et al., 19 Dec 2025).
RoomBench++ provides real-world and physically plausible test cases that expose geometric misalignment, synthetic artifacts, and texture transfer failures that are not captured in smaller, overfitted benchmarks.
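The CLIP/DINO-based cross-frame filtering used to curate the real-scene subset can be approximated with off-the-shelf embeddings. The sketch below uses a CLIP vision encoder from `transformers`; the similarity threshold, helper names, and the omission of the Sa2Va segmentation step are assumptions for illustration:

```python
import torch
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
clip = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14").eval()

@torch.no_grad()
def embed(crops):
    """L2-normalized CLIP image embeddings for a list of PIL object crops."""
    inputs = processor(images=crops, return_tensors="pt")
    feats = clip(**inputs).image_embeds
    return torch.nn.functional.normalize(feats, dim=-1)

def same_instance_pairs(crops_a, crops_b, thresh=0.85):
    """Keep cross-frame crop pairs whose CLIP similarity exceeds a threshold (illustrative value)."""
    sim = embed(crops_a) @ embed(crops_b).T
    return [(i, j) for i in range(sim.shape[0])
                   for j in range(sim.shape[1]) if sim[i, j] > thresh]
```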
4. Evaluation Metrics and Empirical Findings
Quantitative evaluation employs the following metrics (a minimal computation sketch follows the list):
- Reconstruction: SSIM↑, PSNR↑
- Perceptual: FID↓, LPIPS↓
- Semantic: CLIP-score↑, DINO-score↑
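A minimal sketch for the reconstruction and perceptual metrics using `torchmetrics`; FID follows the same pattern via `FrechetInceptionDistance`, and CLIP/DINO scores can be computed as cosine similarities between image embeddings (as in the filtering sketch above). Value ranges and normalization conventions are standard assumptions:

```python
from torchmetrics.image import StructuralSimilarityIndexMeasure, PeakSignalNoiseRatio
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
psnr = PeakSignalNoiseRatio(data_range=1.0)
lpips = LearnedPerceptualImagePatchSimilarity(net_type="vgg")

def reconstruction_metrics(pred, target):
    """pred/target: (B, 3, H, W) images in [0, 1]; LPIPS expects inputs in [-1, 1]."""
    return {
        "ssim": ssim(pred, target).item(),
        "psnr": psnr(pred, target).item(),
        "lpips": lpips(pred * 2 - 1, target * 2 - 1).item(),
    }
```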
On the full test set, RoomEditor++(DiT) achieves state-of-the-art results relative to DreamFuse:

| Metric | RoomEditor++(DiT) | DreamFuse |
| --- | --- | --- |
| FID ↓ | 11.49 | 12.67 |
| SSIM ↑ | 0.905 | 0.886 |
| PSNR ↑ | 26.82 | 25.52 |
| LPIPS ↓ | 0.067 | 0.079 |
| CLIP ↑ | 91.39 | 90.67 |
| DINO ↑ | 91.03 | 89.15 |

Qualitative outputs demonstrate preservation of reference object identity, tight geometric alignment (no scale/pose drift), and realistic textural harmonization.
Human preference studies (N=20, 100 test cases) record RoomEditor++ as top-ranked in 35.5% for fidelity, 31.0% for harmony, and 33.2% for overall quality (mean rank 1.69–1.89), significantly outperforming MimicBrush and DreamFuse baselines.
Ablation studies confirm that parameter-sharing reduces feature error at all backbone layers. Removing this sharing (i.e., two separately trained U-Nets) degrades FID from 11.93 to 13.12, demonstrating architectural necessity. Adding auxiliary vision encoders (CLIP, SigLIP) provides marginal gains but at increased computational cost.
5. Generalization: Cross-Dataset Transfer and Feature Consistency
RoomEditor++ exhibits robust cross-dataset generalization:
- 3D-FUTURE pseudo-pairs: FID of 5.03 vs. baselines >14, SSIM 0.826 vs 0.667, LPIPS 0.110 vs 0.252.
- DreamBooth (out-of-domain objects): FID 57.64 vs DreamFuse 58.98, SSIM 0.728 vs 0.591.
Without any fine-tuning, RoomEditor++ adapts to novel furniture instances and even arbitrary object insertion, retaining plausible geometry and seamless content integration. This suggests the spatial and feature-alignment constraints encoded by parameter-sharing provide priors for generalizable compositionality.
Feature consistency analyses show that weight sharing enforces nearly isometric reference-background feature flows, enabling precise geometric transforms and appearance transfer required by challenging test set scenarios.
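One simple way to quantify such feature consistency is to compare pooled reference and background features block by block, e.g. via cosine similarity. The sketch below assumes per-block token features collected with forward hooks; it is an illustrative probe, not the paper's exact analysis protocol:

```python
import torch
import torch.nn.functional as F

def feature_consistency(ref_feats, bg_feats):
    """Mean cosine similarity between reference and background features per block.

    ref_feats / bg_feats: lists of (B, N, C) token features, one entry per
    shared backbone block (collected, e.g., via forward hooks).
    """
    scores = []
    for r, b in zip(ref_feats, bg_feats):
        r = F.normalize(r.mean(dim=1), dim=-1)   # pool tokens, normalize per sample
        b = F.normalize(b.mean(dim=1), dim=-1)
        scores.append((r * b).sum(dim=-1).mean().item())
    return scores  # one consistency score per block
```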
6. Integration with Contemporary and Complementary Approaches
RoomEditor++’s architecture and evaluation protocol have influenced, and been influenced by, parallel advances in controllable 3D mesh synthesis (Feng et al., 9 Sep 2024), LLM-driven visual programming and module orchestration (Kim et al., 21 Jun 2025), and compositional graph diffusion for layout editing (Zheng et al., 3 Oct 2024).
- Prim2Room (Feng et al., 9 Sep 2024) suggests incorporating adaptive viewpoint selection and non-rigid depth registration. Its primitive retrieval (based on CLIP and aspect-ratio matching), mask-based controllable inpainting, and edit propagation enable highly flexible, user-interactive workflows that can be embedded into RoomEditor++ as submodules or parallel pipelines.
- Programmable-Room (Kim et al., 21 Jun 2025) demonstrates the utility of LLM (GPT-4)-generated module programs for flexible pipeline assembly. Analytical depth/spherical projections, multi-modal diffusion for panoramic texture, and CSS-style furniture arrangement via LayoutGPT can augment RoomEditor++'s functionality and increase extensibility.
- EditRoom (Zheng et al., 3 Oct 2024) introduces LLM-parameterized command planning and unified graph diffusion for 3D layout editing, revealing that integration of semantic layout graphs and multi-turn edit sequences (add/delete/transform/replace) further empowers language-guided, multi-step editing pipelines.
A plausible implication is that future RoomEditor++ variants can actively compose programmable visual pipelines, fuse primitive- and graph-based representations, and exploit large-scale language-conditioned data to further enhance compositional flexibility and real-time interaction.
7. Limitations and Prospects
Despite robust quantitative and qualitative performance, RoomEditor++ inherits some structural limitations from the broader class of diffusion-based and LLM-driven approaches:
- Although object collisions and geometric consistency are handled via mixture-attention and prompt-based constraints, physics-aware or constraint-based explicit postprocessing is not yet embedded.
- Appearance/style transfer is not directly conditioned—future directions include joint geometry-appearance diffusion and explicit collision-avoidance in sampling.
- Rendering at high resolution remains computationally intensive; progressive refinement, latent-space diffusion, or view-conditioned acceleration can mitigate latency for interactive editing.
- Dataset limitations (RoomBench++ focus on bedrooms/living rooms) may restrict generalization to more complex room types (e.g., kitchens, offices) without further fine-tuning or expanded annotated data.
Future work is likely to explore tighter LLM–diffusion integration with joint spatial reasoning fine-tuning, more expressive compositional constraints, real-time edit propagation, and expanded semantic scope of recognizable and editable object categories.
RoomEditor++ represents a parameter-sharing, diffusion-based furniture and indoor scene synthesis system benchmarked on RoomBench++, delivering state-of-the-art fidelity, geometry alignment, and generalization for photorealistic room composition and interactive editing (Wang et al., 19 Dec 2025). Its architecture and empirical findings inform ongoing methodological synthesis at the intersection of generative diffusion, language-guided program synthesis, and 3D layout controllability.