RoomBench++: Benchmark for Furniture Synthesis
- RoomBench++ is a large-scale benchmark dataset designed for high-fidelity virtual furniture synthesis by integrating professionally rendered and real-world indoor scene data.
- It combines precisely annotated realistic renderings with challenging real-scene videos to enable robust evaluation of geometric coherence, fine-grained texture preservation, and background inpainting.
- The benchmark supports standardized evaluation of methods like RoomEditor++ using metrics such as FID, SSIM, PSNR, and LPIPS to assess seamless virtual furniture integration.
RoomBench++ is a comprehensive, large-scale benchmark dataset specifically designed for the evaluation and advancement of high-fidelity furniture synthesis in indoor scenes. It enables principled assessment of virtual furniture insertion, requiring seamless geometric and visual integration of reference objects within a target scene. RoomBench++ addresses longstanding gaps due to the scarcity of reproducible benchmarks and the limitations of existing methodologies in handling geometric coherence, fine-grained texture preservation, and robust background inpainting. The dataset forms the empirical foundation for the evaluation of RoomEditor++, a diffusion-based architecture, and state-of-the-art composition frameworks (Wang et al., 19 Dec 2025).
1. Dataset Construction and Subset Composition
RoomBench++ integrates two complementary data sources, each with distinct characteristics regarding visual style, annotation precision, and scene variability:
- Realistic-scene subset: Extracted from professionally produced home-design renderings by major retailers (IKEA, West Elm, World Market). Designers' renderings are manually classified into "furniture foreground" and "empty-room background," then filtered with GPT-4o to exclude occluded, partially visible, or style-mismatched pairs (a hedged sketch of such a filter follows this bullet). Pixel-wise object masks are manually annotated, yielding high-precision masks for training and deliberately coarse masks in the test split to simulate user-provided labels.
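The paper does not describe the GPT-4o prompt; the following is a minimal sketch of how such a filtering pass could look with the OpenAI Python SDK, where the prompt text and the KEEP/DROP protocol are assumptions:

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You are given a furniture rendering and an empty-room background. "
    "Answer KEEP only if the furniture is fully visible, unoccluded, and "
    "stylistically consistent with the room; otherwise answer DROP."
)

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def keep_pair(foreground_path: str, background_path: str) -> bool:
    """Return True if GPT-4o judges the foreground/background pair usable."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/jpeg;base64,{encode_image(foreground_path)}"}},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/jpeg;base64,{encode_image(background_path)}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip().upper().startswith("KEEP")
```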
- Real-scene subset: Curated from publicly available indoor walkthrough videos sourced from real-estate platforms (Beike, Lianjia). The automated pipeline, sketched in code after this list, consists of:
- Frame extraction with motion-blur filtering.
- Furniture detection via Qwen2.5-VL.
- Pixel-wise segmentation using Sa2Va, retaining only masks where the largest connected component exceeds 95% of the area.
- Category verification via Qwen2.5-VL for exclusion of mis-segmented non-furniture objects.
- Object-instance clustering across frames using DINOv2 features to ensure reference and target images correspond to the same physical object.
- Held-out test curation favoring sequences with substantial pose or viewpoint changes to robustly evaluate geometric transformation.
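A compact sketch of this curation pipeline is given below. The motion-blur test uses the standard variance-of-Laplacian heuristic (the paper's exact criterion and threshold are not specified), and `detect`, `segment`, `verify`, and `embed` stand in for Qwen2.5-VL detection, Sa2Va segmentation, the Qwen2.5-VL category check, and DINOv2 features, respectively:

```python
import cv2
import numpy as np

BLUR_THRESHOLD = 100.0  # assumed value; the paper does not state its criterion

def is_sharp(frame: np.ndarray) -> bool:
    """Variance-of-Laplacian heuristic for motion-blur filtering."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var() > BLUR_THRESHOLD

def largest_component_ratio(mask: np.ndarray) -> float:
    """Area of the largest connected component over total mask area
    (mask is binary {0, 1})."""
    num_labels, labels = cv2.connectedComponents(mask.astype(np.uint8))
    if num_labels <= 1 or mask.sum() == 0:
        return 0.0
    areas = [(labels == i).sum() for i in range(1, num_labels)]
    return max(areas) / mask.sum()

def curate_frames(frames, detect, segment, verify, embed):
    """detect/segment/verify/embed are hypothetical wrappers around
    Qwen2.5-VL (detection), Sa2Va (segmentation), Qwen2.5-VL
    (category verification), and DINOv2 (instance features)."""
    instances = []
    for frame in frames:
        if not is_sharp(frame):          # 1. motion-blur filtering
            continue
        for box in detect(frame):        # 2. furniture detection
            mask = segment(frame, box)   # 3. pixel-wise segmentation
            if largest_component_ratio(mask) < 0.95:
                continue                 # drop fragmented masks
            if not verify(frame, mask):  # 4. category verification
                continue
            instances.append((frame, mask, embed(frame, mask)))
    return instances  # 5. cluster instances by DINOv2 feature similarity
```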
In both subsets, each sample comprises:
- A background image with a binary mask removing the furniture region.
- A reference image depicting the target furniture.
In the realistic subset, designer renderings enforce accurate scale, color, and viewpoint alignment. In the real subset, geometric coherence arises naturally from multi-view video frames, introducing real-world pose, lighting, and occlusion variations. No explicit 3D coordinates are provided; geometry is implicitly captured through image pairing.
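A minimal, illustrative sample container (field names are ours, not the released schema) and the background-preservation constraint it implies:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class RoomBenchSample:
    """Illustrative container; the released file format may differ."""
    background: np.ndarray  # H x W x 3 scene with the furniture region masked out
    mask: np.ndarray        # H x W binary mask of the removed furniture region
    reference: np.ndarray   # image of the target furniture object
    category: str           # furniture category label
    pair_id: str            # correspondence between reference and target frames

def apply_background_constraint(sample: RoomBenchSample,
                                generated: np.ndarray) -> np.ndarray:
    """Only the masked region may change; everything else must match the
    original background (the benchmark's background-preservation requirement)."""
    m = sample.mask.astype(bool)[..., None]
    return np.where(m, generated, sample.background)
```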
2. Dataset Statistics and Annotation Protocol
Data Splits and Category Coverage
The RoomBench++ dataset is divided as follows:
| Split | Total Pairs | Realistic | Real |
|---|---|---|---|
| Training | 112,851 | 7,298 | 105,553 |
| Testing | 1,832 | 895 | 937 |
The realistic-scene subset covers eight major furniture categories: sofas, chairs, tables, beds, cabinets, shelves, desks, and decor, each with 800–1,200 reference images paired with 1–2 matching backgrounds. The real-scene subset captures a broad spectrum of household furniture, dominated by chairs and tables, drawn from diverse indoor settings ranging from sparsely furnished to highly cluttered rooms.
Annotations and Metadata
Annotations include high-precision or coarse pixel-wise object masks, furniture category labels, and pairing information indicating frame correspondences. The dataset does not provide explicit 3D information; instead, geometric factors—pose, scale, and viewpoint—are implicitly represented through paired multi-view samples.
3. Benchmark Task Definition and Evaluation Metrics
Primary Task
Given a masked background image $I_{\text{bg}}$ and a masked reference furniture image $I_{\text{ref}}$, the objective is to generate a composite image $\hat{I}$ that seamlessly integrates the reference furniture with contextually and geometrically accurate placement while preserving scene integrity. Auxiliary targets include background inpainting, faithful texture transfer, and style alignment.
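In symbols (notation ours; the paper's exact formulation may differ), with $M$ the binary furniture mask and $\odot$ elementwise multiplication:

$$
\hat{I} = f_\theta\!\left(I_{\text{bg}},\, M,\, I_{\text{ref}}\right),
\qquad
\hat{I} \odot (1 - M) \approx I_{\text{bg}} \odot (1 - M),
$$

where the constraint expresses that pixels outside the masked region must remain unchanged, i.e. the background-preservation requirement.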
Evaluation Metrics
Quantitative performance is evaluated on six metrics:
- Fréchet Inception Distance (FID):

  $$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)$$

  where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ denote the mean and covariance of Inception embeddings for real and generated images, respectively.
- Peak Signal-to-Noise Ratio (PSNR): $\mathrm{PSNR} = 10 \log_{10}\!\left(\mathrm{MAX}_I^2 / \mathrm{MSE}\right)$, where $\mathrm{MAX}_I$ is the maximum pixel value and MSE is the mean squared error against the ground truth.
- Structural Similarity Index (SSIM): $\mathrm{SSIM}(x, y) = \dfrac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$, computed over local windows of generated and ground-truth images.
- Learned Perceptual Image Patch Similarity (LPIPS): Deep-feature distance metric.
- CLIP-score and DINO-score: Cosine similarity of image embeddings between generated and ground-truth targets.
Evaluation is conducted on the full held-out test set (1,832 pairs) at 768×768 resolution with per-image averaging. Optionally, human perceptual studies are conducted by sampling 100 test cases and collecting fidelity, harmony, and overall-quality rankings.
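A hedged sketch of the per-image metrics using the `scikit-image` and `lpips` packages (the LPIPS backbone choice `net="alex"` is an assumption); FID is a set-level statistic and is typically computed separately, for example with `torchmetrics`' `FrechetInceptionDistance` over the full test set:

```python
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # backbone choice is an assumption

def to_lpips_tensor(img: np.ndarray) -> torch.Tensor:
    """uint8 H x W x 3 -> float tensor in [-1, 1], shape (1, 3, H, W)."""
    t = torch.from_numpy(img).permute(2, 0, 1).unsqueeze(0).float()
    return t / 127.5 - 1.0

def pixel_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Per-image PSNR / SSIM / LPIPS. FID is computed over the whole
    test set rather than per image, so it is omitted here."""
    with torch.no_grad():
        lpips_val = lpips_fn(to_lpips_tensor(pred), to_lpips_tensor(gt))
    return {
        "psnr": peak_signal_noise_ratio(gt, pred, data_range=255),
        "ssim": structural_similarity(gt, pred, channel_axis=-1, data_range=255),
        "lpips": float(lpips_val),
    }
```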
4. Benchmarked Methods and Comparative Results
RoomBench++ is a primary benchmark for comparing virtual furniture synthesis algorithms. The major baselines are:
- AnyDoor (encoder-based diffusion)
- MimicBrush (dual-U-Net diffusion)
- DreamFuse (DiT-based fusion)
- RoomEditor++ (parameter-sharing dual diffusion backbone with U-Net or DiT variants)
Key Results
| Model | FID ↓ (Realistic) | FID ↓ (Full Test) | SSIM ↑ | PSNR ↑ (dB) | LPIPS ↓ |
|---|---|---|---|---|---|
| AnyDoor (pretrained) | 28.03 | 22.84 | — | — | — |
| MimicBrush (pretrained) | 22.50 | 23.34 | — | — | — |
| DreamFuse (pretrained) | 18.50 | 20.30 | — | — | — |
| AnyDoor (ft.) | 24.71 | 19.55 | — | — | — |
| MimicBrush (ft.) | 17.58 | 14.54 | — | — | — |
| DreamFuse (ft.) | 16.85 | 12.67 | — | — | — |
| RoomEditor++ (U-Net) | 16.22 | 11.93 | 0.891 | 26.10 | 0.071 |
| RoomEditor++ (DiT) | 15.88 | 11.49 | 0.905 | 26.82 | 0.067 |
In human preference studies (100 examples, 20 annotators), RoomEditor++ received 35.5% “best-rank” votes (avg. rank ≈ 1.7), compared to DreamFuse (22.1%, avg. rank ≈ 2.6), indicating superior perceived fidelity, harmony, and quality.
5. Data Usage Guidelines and Recommended Practices
Code and data are distributed under a permissive academic license (https://github.com/stonecutter-21/roomeditor), subject to the original retailers' Terms of Service, which prohibit commercial redistribution of proprietary images and circumvention of access restrictions.
Best practices for training and evaluation include the following (a configuration sketch is given after the list):
- Application of provided augmentations: horizontal flip, rotation (≤ 30°), scaling (±20%), random cropping (≥ 75% preserved), and mask perturbations (dilation/erosion, blurring, bounding-box substitution, each with 25% probability).
- Use of input resolution 768×768, Prodigy optimizer with safeguard warmup, weight decay 0.01, and LoRA fine-tuning (rank 256).
- For reproducibility and fair comparison, reporting all six evaluation metrics over the full held-out test set, optionally supplemented by human perceptual rankings.
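The mask-perturbation and optimizer settings above translate roughly into the sketch below, using the `prodigyopt` and `peft` packages; kernel and blur sizes and the LoRA `target_modules` are assumptions, as the paper does not specify them:

```python
import random
import cv2
import numpy as np
import torch
from peft import LoraConfig, get_peft_model  # pip install peft
from prodigyopt import Prodigy               # pip install prodigyopt

def perturb_mask(mask: np.ndarray) -> np.ndarray:
    """Each perturbation fires independently with 25% probability,
    mimicking the coarse masks users draw at test time. Kernel and
    blur sizes are assumed values."""
    kernel = np.ones((15, 15), np.uint8)
    if random.random() < 0.25:  # dilation or erosion
        mask = random.choice([cv2.dilate, cv2.erode])(mask, kernel)
    if random.random() < 0.25:  # boundary blurring
        blurred = cv2.GaussianBlur(mask.astype(np.float32), (21, 21), 0)
        mask = (blurred > 0.5).astype(np.uint8)
    if random.random() < 0.25:  # bounding-box substitution
        ys, xs = np.nonzero(mask)
        if len(xs) > 0:
            box = np.zeros_like(mask)
            box[ys.min():ys.max() + 1, xs.min():xs.max() + 1] = 1
            mask = box
    return mask

def build_model_and_optimizer(model: torch.nn.Module):
    """LoRA rank 256 and Prodigy with safeguard warmup, per the reported
    settings; target_modules are assumed attention projections. lr=1.0
    follows the Prodigy convention, since it adapts step sizes internally."""
    lora_cfg = LoraConfig(r=256, lora_alpha=256,
                          target_modules=["to_q", "to_k", "to_v"])
    model = get_peft_model(model, lora_cfg)
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = Prodigy(trainable, lr=1.0, weight_decay=0.01,
                        safeguard_warmup=True)
    return model, optimizer
```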
6. Significance and Research Applications
RoomBench++ supplies a ready-to-use, dual-source benchmark with extensive annotation fidelity and realistic scene diversity. It enables standardized, large-scale training and rigorous evaluation of methods targeting high-fidelity indoor scene editing and furniture synthesis. Its design—incorporating both controlled renderings and the complexity of real-world videos—supports robust generalization and addresses challenges of geometric alignment, photorealistic blending, and preservation of background integrity. All results in the associated literature leverage RoomBench++ as the standard for empirical comparison in virtual furniture synthesis research (Wang et al., 19 Dec 2025).