- The paper introduces an approach that reconstructs 3D room structures from ordinary RGB videos using only human-drawn 2D segmentation masks.
- The pipeline combines 2D point tracking, 3D plane estimation with a joint loss function, spatial-extent estimation, and automatic quality assurance via reprojection IoU.
- The method was applied to 2246 RealEstate10K scenes; validation against ScanNet ground truth yielded an average reprojection IoU of 0.9 and a depth error of about 20 cm, indicating high reconstruction fidelity.
Estimating Generic 3D Room Structures from 2D Annotations
This paper addresses the challenge of estimating 3D room structures from commonly available 2D video data. The authors propose a method that bypasses complex and costly 3D data acquisition systems by leveraging 2D segmentation masks, which are easy for humans to annotate. This enables reconstructing 3D room layouts from plain RGB videos, broadening the accessibility and potential applications of 3D scene understanding.
Methodology Overview
The paper introduces a systematic pipeline for deriving 3D room layouts from 2D annotations. The core innovation lies in starting from human-drawn amodal segmentation masks in video frames. These masks cover the entire surface of structural elements such as walls, floors, and ceilings, including parts occluded by furniture or other objects. The annotation protocol keeps the task simple: annotators only reason about what appears within the frames and mark occlusion edges, leaving all 3D reasoning to the automatic pipeline. The annotations are then processed through several key computational stages:
- 2D Point Tracking: Track 2D points across video frames to establish correspondences for each structural element over time. The pipeline uses optical flow techniques such as RAFT to propagate points between frames, providing geometric evidence for the 3D reconstruction (a chaining sketch follows this list).
- 3D Plane Estimation: Estimate a 3D plane equation for each structural element by minimizing a joint loss function that (a) fits the plane to the tracked points, (b) matches 2D edge annotations with the intersections of adjacent 3D planes, and (c) encourages perpendicularity between walls and floors/ceilings (see the loss sketch below).
- Estimating Spatial Extent: Use the estimated plane equations to infer the spatial extent of each element, computed as the union of all parts observed throughout the video and then refined to correct artifacts and ensure completeness (see the extent sketch below).
- Quality Assurance: Automatically filter out poor reconstructions by measuring reprojection IoU against the 2D annotations, keeping only high-fidelity scenes in the final dataset (see the IoU sketch below).
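The 2D point tracking step can be pictured as chaining per-frame optical flow into long-range tracks. Below is a minimal sketch; `estimate_flow` is a hypothetical stand-in for any dense flow model such as RAFT (the paper names RAFT, but this chaining loop and its nearest-neighbour flow sampling are illustrative simplifications, not the authors' implementation).

```python
import numpy as np

def chain_tracks(frames, estimate_flow, seed_points):
    """Propagate seed 2D points through a video by composing per-frame flow.

    frames:        list of T HxWx3 RGB images
    estimate_flow: callable (frame_t, frame_t_plus_1) -> HxWx2 flow field
    seed_points:   Nx2 array of (x, y) positions in frames[0]
    Returns a (T, N, 2) array of tracked positions.
    """
    pts = seed_points.astype(np.float32)
    tracks = [pts.copy()]
    for t in range(len(frames) - 1):
        flow = estimate_flow(frames[t], frames[t + 1])
        h, w = flow.shape[:2]
        # Sample the flow at the current point locations (nearest-neighbour
        # here for brevity; bilinear sampling is the usual choice).
        xi = np.clip(np.round(pts[:, 0]).astype(int), 0, w - 1)
        yi = np.clip(np.round(pts[:, 1]).astype(int), 0, h - 1)
        pts = pts + flow[yi, xi]
        tracks.append(pts.copy())
    return np.stack(tracks)
```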
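The joint loss in the plane-estimation step combines the three terms described above. The sketch below is schematic, under simplifying assumptions: tracked points and edge samples are already lifted to 3D (the actual optimization works through the camera model), plane normals are unit-length, and the weights `w_edge` and `w_perp` are illustrative rather than taken from the paper. The edge term uses the fact that a point on the intersection of two planes must lie on both.

```python
import numpy as np

def point_to_plane(pts, n, d):
    """Signed distances of Nx3 points to the plane n.x + d = 0 (|n| = 1)."""
    return pts @ n + d

def joint_loss(planes, track_pts, edge_pts, wall_ids, floor_ids,
               w_edge=1.0, w_perp=0.1):
    """planes:    dict element_id -> (unit normal n, offset d)
    track_pts: dict element_id -> Nx3 tracked points on that element
    edge_pts:  list of (id_a, id_b, Mx3 points) sampled along an annotated
               edge shared by elements id_a and id_b
    """
    loss = 0.0
    # 1) Each plane should fit its tracked 3D points.
    for eid, (n, d) in planes.items():
        loss += np.mean(point_to_plane(track_pts[eid], n, d) ** 2)
    # 2) Annotated edges should coincide with plane intersections:
    #    an edge point must lie on both adjacent planes.
    for id_a, id_b, pts in edge_pts:
        for eid in (id_a, id_b):
            n, d = planes[eid]
            loss += w_edge * np.mean(point_to_plane(pts, n, d) ** 2)
    # 3) Walls should be perpendicular to floors/ceilings (dot product -> 0).
    for wid in wall_ids:
        for fid in floor_ids:
            loss += w_perp * float(np.dot(planes[wid][0], planes[fid][0]) ** 2)
    return loss
```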
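For the spatial-extent step, the union of observed parts can be accumulated on a 2D grid laid out on each plane. The sketch below assumes per-frame homographies `image_to_plane_h`, mapping image pixels onto that grid, are already available from the camera poses and plane equation; both that helper input and the grid resolution are hypothetical, and the paper's refinement stage is not shown.

```python
import cv2
import numpy as np

def extent_on_plane(masks, image_to_plane_h, grid_hw=(512, 512)):
    """Union of an element's observed parts in plane coordinates.

    masks:            per-frame binary masks of the element's visible part
    image_to_plane_h: per-frame 3x3 homographies from image pixels to the
                      2D plane grid (hypothetical precomputed inputs)
    """
    extent = np.zeros(grid_hw, dtype=bool)
    for mask, H in zip(masks, image_to_plane_h):
        # Warp the frame's mask onto the plane grid; nearest-neighbour
        # interpolation keeps the mask binary.
        warped = cv2.warpPerspective(mask.astype(np.uint8), H,
                                     (grid_hw[1], grid_hw[0]),
                                     flags=cv2.INTER_NEAREST)
        extent |= warped.astype(bool)
    return extent
```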
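Finally, the quality-assurance criterion is a straightforward mask IoU between the 2D annotation and the reconstruction reprojected into the same frame. A minimal sketch follows; the acceptance threshold is illustrative, not taken from the paper.

```python
import numpy as np

def reprojection_iou(reprojected_mask, annotated_mask):
    """IoU between a reprojected 3D element mask and its 2D annotation."""
    inter = np.logical_and(reprojected_mask, annotated_mask).sum()
    union = np.logical_or(reprojected_mask, annotated_mask).sum()
    return inter / union if union > 0 else 1.0

def passes_qa(per_frame_ious, threshold=0.8):
    """Accept a scene only if its mean IoU clears the (illustrative) bar."""
    return float(np.mean(per_frame_ious)) >= threshold
```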
Dataset Creation and Evaluation
Using the method, the authors annotated 2246 scenes from the RealEstate10K dataset, producing a substantial real-world dataset of 3D room structures derived from RGB videos. Annotation quality was verified against ground-truth 3D data on the ScanNet dataset, achieving an average reprojection IoU of 0.9 and a depth error of about 20 cm in rooms roughly 7 m wide. The authors also performed a thorough manual inspection to confirm high reconstruction recall and precision.
Implications and Future Directions
This method stands to significantly advance 3D scene understanding by democratizing the creation and use of high-quality 3D datasets. Its reliance on simple 2D annotations rather than complex and expensive 3D acquisition setups could dramatically broaden access to 3D data across various domains, including virtual reality, robotics, and indoor navigation systems.
Future advancements may involve extending this methodology to accommodate non-planar surfaces or enhancing the extrapolation capabilities for reconstructing unobserved parts of rooms. Additionally, integrating this approach with machine learning models could refine the accuracy and efficiency of automated 3D layout estimation. The released dataset provides a rich platform for further research and development in 3D computer vision, promising to drive innovative methods and applications in the field.