- The paper introduces a two-stage approach that employs associative embedding to segment plane instances and estimate their 3D parameters.
- The method couples a CNN with an efficient variant of mean shift clustering, achieving real-time performance at roughly 30 fps.
- Experiments on ScanNet and NYUv2 show that the approach outperforms state-of-the-art methods in accuracy and robustness.
Single-Image Piece-wise Planar 3D Reconstruction via Associative Embedding
The presented work addresses the complex task of single-image piece-wise planar 3D reconstruction. Specifically, the method aims to simultaneously segment plane instances and determine their corresponding 3D plane parameters from a single RGB image. This task is fundamental in computer vision with significant applications in areas such as augmented reality, robotics, and visual SLAM.
Methodology
The paper introduces a two-stage approach based on associative embedding, inspired by its success in instance segmentation and other computer vision tasks. The proposed method combines the strengths of bottom-up and top-down approaches, sidestepping a key limitation of contemporary methods such as PlaneNet, which must assume a fixed maximum number of planes arranged in a specific order.
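To make the embedding idea concrete, here is a minimal PyTorch sketch of the pull/push style discriminative loss commonly used to train per-pixel instance embeddings; the margins `delta_pull`/`delta_push` and the equal weighting of the two terms are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def embedding_loss(embeddings, instance_masks, delta_pull=0.5, delta_push=1.5):
    """Pull/push loss over per-pixel embeddings (sketch, not the paper's
    exact loss). embeddings: (D, H, W) tensor; instance_masks: list of
    (H, W) boolean masks, one per ground-truth plane instance."""
    centers, pull = [], 0.0
    for mask in instance_masks:
        inst = embeddings[:, mask]                # (D, N_i): one plane's embeddings
        center = inst.mean(dim=1)                 # instance center in embedding space
        centers.append(center)
        # Pull: each pixel should lie within delta_pull of its instance center.
        dist = (inst - center[:, None]).norm(dim=0)
        pull = pull + torch.clamp(dist - delta_pull, min=0).pow(2).mean()
    pull = pull / len(instance_masks)

    # Push: centers of different instances should stay at least delta_push apart.
    push, pairs = 0.0, 0
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            gap = (centers[i] - centers[j]).norm()
            push = push + torch.clamp(delta_push - gap, min=0).pow(2)
            pairs += 1
    return pull + (push / pairs if pairs > 0 else 0.0)
```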
Stage One: Plane Instance Detection
In the first stage, a convolutional neural network (CNN) is trained to map each pixel into an embedding space where pixels belonging to the same plane instance have similar embeddings. The network predicts both a planar/non-planar segmentation mask and pixel-level embeddings. To group pixels into plane instances, an efficient variant of the mean shift clustering algorithm is employed: it operates on a small set of anchor points in the embedding space rather than on all pixels, which cuts the per-iteration cost enough to enable real-time performance, as sketched below.
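The sketch below shows how such an anchor-based mean shift might look in PyTorch; random anchor sampling, the Gaussian kernel bandwidth, and the merge threshold are all illustrative assumptions rather than the paper's exact design.

```python
import torch

def cluster_plane_instances(embeddings, planar_mask, num_anchors=100,
                            bandwidth=0.5, iters=10, merge_thresh=0.3):
    """Mean shift run over a small set of anchors instead of every pixel
    (sketch; hyperparameters are illustrative). embeddings: (D, H, W);
    planar_mask: (H, W) bool mask of planar pixels."""
    pix = embeddings[:, planar_mask].t()              # (N, D) planar-pixel embeddings
    # Sample anchors from the pixel embeddings; the paper places anchors
    # differently, random sampling just keeps this sketch simple.
    idx = torch.randperm(pix.shape[0])[:num_anchors]
    anchors = pix[idx].clone()                        # (K, D)

    for _ in range(iters):
        # Shift every anchor toward the kernel-weighted mean of all pixels.
        d2 = torch.cdist(anchors, pix).pow(2)         # (K, N) squared distances
        w = torch.exp(-d2 / (2 * bandwidth ** 2))     # Gaussian kernel weights
        anchors = (w @ pix) / w.sum(dim=1, keepdim=True)

    # Merge anchors that converged to the same mode into cluster centers.
    centers = []
    for a in anchors:
        if all((a - c).norm() > merge_thresh for c in centers):
            centers.append(a)
    centers = torch.stack(centers)                    # (C, D): one row per plane

    # Assign each planar pixel to its nearest cluster center.
    labels = torch.cdist(pix, centers).argmin(dim=1)  # (N,)
    return centers, labels
```

Because the shift step compares K anchors against N pixels instead of all pixel pairs, the per-iteration cost drops from roughly O(N^2) to O(KN), which is what makes real-time clustering feasible.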
Stage Two: Plane Parameter Estimation
The second stage estimates the parameters of each detected plane instance. A plane parameter network is trained to predict plane parameters at the pixel level, which are then aggregated over the plane instances segmented in the first stage, as sketched below. This aggregation enforces instance-level geometric consistency and lets the method handle an arbitrary number of planes without assumptions about their spatial configuration.
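A minimal sketch of this aggregation step, reusing the clustering output from stage one; note the paper weights pixels when averaging, whereas this sketch takes a plain per-instance mean.

```python
import torch

def aggregate_plane_parameters(pixel_params, planar_mask, labels):
    """Average per-pixel plane parameters within each segmented instance
    (sketch; the paper uses a weighted average). pixel_params: (3, H, W)
    per-pixel plane parameters; planar_mask: (H, W) bool; labels: (N,)
    cluster labels for the planar pixels, as returned above."""
    params = pixel_params[:, planar_mask].t()         # (N, 3)
    instance_params = []
    for k in labels.unique():                         # one entry per detected plane
        members = params[labels == k]                 # parameters of one instance
        instance_params.append(members.mean(dim=0))   # instance-level estimate
    return torch.stack(instance_params)               # (num_planes, 3)
```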
Evaluation and Results
The approach is validated through extensive experiments on the ScanNet and NYUv2 datasets. The method detects an arbitrary number of planes and outperforms both learning-based baselines such as PlaneNet and traditional methods such as NYU-Toolbox and Manhattan World Stereo.
Notably, the method runs at about 30 frames per second, fast enough for real-time applications. The results also highlight the model's generalizability and robustness in complex indoor environments: it outperforms prior approaches even when those approaches are given ground-truth depth maps.
Implications and Future Directions
The implications of this research are substantial for the development of real-time applications that require efficient and flexible 3D scene reconstruction from monocular images. The ability to detect and reconstruct scene planes dynamically without predefined constraints offers potential benefits for SLAM systems, particularly in robotics and AR interfaces.
Future research may integrate semantic information to further improve plane segmentation, especially where visually similar planes are easily confused. Additionally, extending the approach to video input could leverage temporal consistency for improved 3D reconstruction quality.
By addressing the limitations inherent in existing methodologies, this paper significantly contributes to the domain of 3D reconstruction, offering both theoretical advancements and practical performance improvements.