
MVRoom: Layout-Conditioned Multi-View Diffusion

Updated 7 December 2025
  • MVRoom is a layout-conditioned multi-view diffusion pipeline that synthesizes controllable and photorealistic 3D indoor scenes.
  • It leverages a novel layout-aware epipolar attention mechanism and multi-layer spatial priors to ensure spatially coherent renderings across multiple views.
  • The system uses an iterative, recursive generation framework with depth-guided reprojection for precise scene assembly in VR/AR and design applications.

MVRoom is a two-stage, layout-conditioned, multi-view diffusion pipeline for controllable novel view synthesis (NVS) in 3D indoor scenes. It leverages a user-editable, coarse 3D layout and an initial style image to synthesize high-fidelity, multi-view-consistent scene renderings. MVRoom’s architecture is designed to address the challenges faced by prior text-to-3D and single-image NVS methods in generating spatially coherent multi-object scenes under explicit layout control. Central to its approach is a novel layout-aware epipolar attention mechanism and an iterative, recursive generation framework supporting variable scene complexity and object composition (Fang et al., 3 Dec 2025).

1. Motivation and Problem Setting

Contemporary applications in augmented and virtual reality, gaming, and film production demand high-quality, 3D indoor scenes. Manual asset creation is inefficient, and while text-to-3D or single-image approaches perform well for isolated objects, they fail to ensure multi-view consistency and user-directed spatial control for full scenes. MVRoom directly addresses these limitations by conditioning a multi-view diffusion model on a coarse 3D layout containing oriented object bounding boxes and an initial style image. The core objective is the controllable generation of multiple, spatially consistent views that together describe an entire indoor scene, facilitating downstream scene editing and layout specification by end-users.

2. Two-Stage Generation Pipeline

MVRoom’s pipeline operates in two sequential stages:

  • Stage 1: Image-Based Layout Encoding
  • Multi-layer semantic prior $P^i_{\mathrm{sem}} \in \mathbb{R}^{H\times W\times m}$: ray intersections with layout primitives for object-class encoding.
  • Multi-layer depth prior $P^i_{\mathrm{depth}} \in \mathbb{R}^{H\times W\times m}$: metric depth for each intersection.
  • Local spatial embedding $P^i_{\mathrm{local}} \in \mathbb{R}^{H\times W\times 3}$: $(u,v)$ coordinates and face index for visible box faces.
  • Global spatial embedding $P^i_{\mathrm{global}} \in \mathbb{R}^{H\times W\times 3}$: world-space coordinates of surface points.
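The semantic and depth priors above can be illustrated with a minimal ray-casting sketch. This assumes axis-aligned boxes for simplicity (the paper uses oriented object bounding boxes); the function names and the flat list of per-pixel ray directions are illustrative, not from the paper.

```python
import numpy as np

def ray_box_hit(origin, direction, box_min, box_max):
    """Slab test: return the entry distance of a ray into an axis-aligned
    box, or None if the ray misses it."""
    inv = 1.0 / np.where(direction == 0.0, 1e-12, direction)
    t0 = (box_min - origin) * inv
    t1 = (box_max - origin) * inv
    t_near = np.minimum(t0, t1).max()
    t_far = np.maximum(t0, t1).min()
    return t_near if (t_near <= t_far and t_far > 0.0) else None

def layout_priors(origin, directions, boxes, classes, m=2):
    """Per-ray multi-layer priors: depth of the first m box intersections
    (P_depth) and the class id of each intersected box (P_sem, 0 = empty)."""
    depth = np.zeros((len(directions), m))
    sem = np.zeros((len(directions), m), dtype=int)
    for i, d in enumerate(directions):
        hits = []
        for (bmin, bmax), c in zip(boxes, classes):
            t = ray_box_hit(origin, np.asarray(d, float), bmin, bmax)
            if t is not None:
                hits.append((max(t, 0.0), c))
        # keep the m nearest intersections as depth/semantic layers
        for k, (t, c) in enumerate(sorted(hits)[:m]):
            depth[i, k], sem[i, k] = t, c
    return depth, sem
```

With two boxes stacked along a ray, the first layer records the nearer object and the second layer the occluded one, which is exactly the information a single-layer prior would lose.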

Simultaneously, an initial image $X_0$ is warped to each pose $p^i$ through depth-guided reprojection:

$$\hat{X}^i = \mathrm{proj}\left(X_0;\; p_0, p^i, \hat{D}_0\right),$$

where $\hat{D}_0$ is the depth prediction for $X_0$, aligned with the layout.
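A minimal sketch of such depth-guided reprojection, assuming a pinhole camera with shared intrinsics $K$ and a relative pose $T$ from the source to the target view (occlusion handling and image resampling, which a full warp needs, are omitted):

```python
import numpy as np

def reproject(depth0, K, T):
    """Depth-guided reprojection: back-project source pixels using their
    predicted depth, apply the relative pose T (source -> target), and
    project with the target intrinsics. Returns the target (u, v)
    coordinate for every source pixel."""
    H, W = depth0.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)]).reshape(3, -1).astype(float)
    pts = np.linalg.inv(K) @ pix * depth0.reshape(1, -1)  # 3D in source frame
    pts = T[:3, :3] @ pts + T[:3, 3:4]                    # into target frame
    proj = K @ pts
    return (proj[:2] / proj[2:]).T.reshape(H, W, 2)
```

With an identity pose, every pixel maps back to itself, which is a quick sanity check for the intrinsics handling.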

  • Stage 2: Multi-View Diffusion with Layout-Aware Epipolar Attention
    The system builds on a frozen Stable Diffusion v2-1 latent U-Net. For $N$ poses, it samples $N$ latent images, which are denoised jointly using information from all views. The model’s key innovation is a layout-aware epipolar attention module inserted into the cross-view fusion blocks. The diffusion training loss is:

$$\mathcal{L}_{\rm diffusion} = \mathbb{E}_{t,\,\mathbf{z}_0,\,\epsilon} \left\|\epsilon - \epsilon_\theta\left(\mathbf{z}_t, \{\hat{X}^i\}, \{P^i\}, \{p^i\}\right)\right\|^2.$$

Cross-view feature matching is restricted, via masking, to epipolar line segments consistent with the 3D layout, ensuring correspondence and multi-view consistency:

$$w_{ij} = \mathrm{softmax}\left(\frac{q_i \cdot k_j^\top}{\sqrt{d}}\right), \quad j \in \mathrm{EpiLayout}(i).$$
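The masking described above amounts to a standard scaled dot-product attention whose logits are set to $-\infty$ outside the layout-consistent epipolar segment. A minimal numpy sketch, assuming the mask has already been derived from the layout and that every query has at least one admissible key:

```python
import numpy as np

def layout_epipolar_attention(q, k, v, mask):
    """Cross-view attention restricted by a layout-derived epipolar mask:
    mask[i, j] is True iff key j lies on the epipolar segment that the
    3D layout deems consistent with query i."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)
    logits = np.where(mask, logits, -np.inf)   # block off-segment pairs
    logits -= logits.max(axis=-1, keepdims=True)
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)         # softmax over admissible keys
    return w @ v
```

Masked-out keys receive exactly zero weight, so a query whose segment contains a single key simply copies that key's value.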

3. Iterative Scene Completion and Text-to-Scene Generation

MVRoom supports recursive scene completion by operating along camera trajectories automatically generated from the layout $\mathcal{L}$, ensuring coverage of the entire room while handling variable scene complexity and object count. At each iteration, the global point cloud accumulator guides pose selection, layout priors and masked references are rendered, and new multi-view blocks are synthesized. Scene-wide consistency is promoted by updating the point cloud with successfully synthesized and depth-validated views. Upon completion, a 3D-Gaussian-Splatting model is optimized using the synthesized images and associated depth, enabling dense, photorealistic 3D reconstruction.
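The control flow of this loop can be sketched as follows. This is a toy version only: `synthesize` and `validate` are hypothetical stand-ins for the multi-view diffusion stage and the depth-consistency check, and the trajectory is reduced to an ordered list of pose indices.

```python
def complete_scene(num_poses, synthesize, validate, block=3):
    """Toy recursive completion loop: the trajectory is visited in blocks
    of poses; every depth-validated view is added to a global accumulator
    that conditions the next block."""
    accumulator, done = [], set()
    while len(done) < num_poses:
        # next block of unvisited poses along the trajectory
        batch = [p for p in range(num_poses) if p not in done][:block]
        views = synthesize(batch, accumulator)  # joint multi-view denoising
        for p, view in zip(batch, views):
            if validate(view):                  # keep only consistent views
                accumulator.append(view)
            done.add(p)
    return accumulator
```

The key design point mirrored here is that generation is blockwise rather than one-shot, so each new block is conditioned on everything validated so far.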

4. Training Paradigm and Architectural Specifics

MVRoom is trained solely with the diffusion noise-prediction loss in the latent space:

$$\mathcal{L}_{\rm diffusion} = \mathbb{E}_{t,\,\mathbf{z}_0,\,\epsilon} \left\|\epsilon - \epsilon_\theta(\mathbf{z}_t, \mathrm{conds})\right\|^2,$$

with $\epsilon \sim \mathcal{N}(0, I)$ and “conds” denoting all per-view condition signals. The backbone is a latent U-Net (Stable Diffusion v2-1), frozen during training. Only the T2I-Adapters (layout adapters) and the cross-attention modules introduced for layout and image fusion are trained. Warp-injected image conditions are encoded via the Stable Diffusion text branch; layout priors are injected at each resolution through FiLM-modulated adapters. Multi-view latent batches are fused at each step by concatenation along a view axis and processed with the layout-aware epipolar attention for consistent denoising.
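One noise-prediction training step can be sketched in numpy as below. Here `eps_theta` is a hypothetical stand-in for the frozen U-Net plus trainable adapters, and `alpha_bar` for the cumulative noise schedule; the optional `t`/`eps` arguments exist only to make the sketch deterministic.

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_loss(z0, conds, eps_theta, alpha_bar, t=None, eps=None):
    """One latent-space noise-prediction step: sample a timestep and
    Gaussian noise, form the noised latent z_t, and score the predicted
    noise against the true noise with MSE."""
    t = int(rng.integers(len(alpha_bar))) if t is None else t
    eps = rng.standard_normal(z0.shape) if eps is None else eps
    a = alpha_bar[t]
    z_t = np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps  # forward noising
    return np.mean((eps - eps_theta(z_t, t, conds)) ** 2)
```

A denoiser that recovers the injected noise exactly drives the loss to zero, which is the target the adapter parameters are trained toward.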

5. Empirical Evaluation and Ablation

Experiments utilize the 3D-Front dataset, filtered to 6,287 complex indoor scenes with approximately 700 synthesized viewpoints per room. Quantitative evaluation includes:

  • Inception Score (IS) for perceptual image quality.
  • PSNR / SSIM for novel view fidelity vs. ground truth.
  • User study (N=20) scoring Perceptual Quality (PQ), 3D Consistency (3DC), Layout Plausibility (LP) on a 1–5 scale.
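Of these metrics, PSNR is simple enough to state inline. A minimal sketch for images with values in $[0, \text{peak}]$ (SSIM is more involved and typically taken from a library such as scikit-image):

```python
import numpy as np

def psnr(ref, test, peak=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(peak^2 / MSE)."""
    mse = np.mean((ref - test) ** 2)
    return float("inf") if mse == 0.0 else 10.0 * np.log10(peak ** 2 / mse)
```

For example, a uniform error of 0.1 on a unit-range image gives an MSE of 0.01 and thus a PSNR of 20 dB, which calibrates the ~20–23 dB range in the tables below.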

Quantitative Comparison — Multi-View Fusion Ablation

Method                 IS     PSNR   SSIM
MVDiffusion            4.035  20.45  0.7247
Correspondence-Aware   4.113  20.72  0.7435
3D Self-Attention      4.232  22.06  0.7979
Plain Epipolar         4.129  22.12  0.8077
LA-Epipolar (Ours)     4.170  22.66  0.8154

User Study vs. Baselines

Method         PQ    3DC   LP
Text2Room      1.99  2.84  2.84
LucidDreamer   1.93  2.72  2.78
Set-the-Scene  1.53  2.79  2.59
MVRoom         4.59  4.37  4.21

Ablation reveals:

  • Replacing layout-aware epipolar attention with plain epipolar or correspondence-aware attention reduces SSIM (0.8154 → 0.8077 and 0.7435, respectively).
  • Multi-layer layout priors improve PSNR by ~0.5 dB over single-layer.
  • Spatial embeddings and depth alignment yield incremental gains, with depth alignment providing ~1.7 dB improvement.

6. Limitations, Extensions, and Application Scenarios

Key limitations include reliance on synthetic or text-generated initial images and the constraint imposed by a frozen Stable Diffusion backbone, which curtails multi-view coherence when compared to emerging video diffusion models.

Potential extensions involve:

  • Adapting the architecture to real-world RGB-D data through joint fine-tuning of the backbone and layout adapters.
  • Integrating large-scale video diffusion models to improve temporal and multi-view consistency.
  • Supporting dynamic scenes through temporal conditioning.

MVRoom’s utility spans rapid, layout-guided prototyping in architectural design, interactive VR/AR content generation, game-level design, and virtual real-estate staging with explicit 3D layout input.

7. Summary and Significance

MVRoom implements a flexible, controllable method for high-fidelity, 3D indoor scene synthesis by decomposing the task into layout-conditioned multi-view image generation and recursive spatial completion. The introduction of layout-aware epipolar attention and multi-layer spatial priors results in state-of-the-art multi-view and layout consistency. These capabilities position MVRoom as an advanced tool for controllable, photorealistic scene assembly in research and industry, surpassing existing methods in both objective and human-evaluated metrics (Fang et al., 3 Dec 2025).
