- The paper introduces a framework of consecutive 2D and 3D UNets bridged by a novel Features Line of Sight Projection (FLoSP) that lifts 2D features into a 3D representation.
- The work employs tailored loss functions, a Scene-Class Affinity Loss and a Frustum Proportion Loss, to sharpen class-wise predictions and enforce spatially coherent semantics.
- Experimental results on NYUv2 and SemanticKITTI show consistent gains in IoU and mIoU over RGB-only baselines, validating dense 3D semantic scene completion from a single monocular image.
Overview of MonoScene: Monocular 3D Semantic Scene Completion
The paper "MonoScene: Monocular 3D Semantic Scene Completion" by Anh-Quan Cao and Raoul de Charette addresses the challenging task of inferring dense 3D geometry and semantics from a single monocular RGB image. This work departs from traditional methods in Semantic Scene Completion (SSC) that typically rely on depth data or 3D inputs, such as point clouds or occupancy grids. The proposed framework leverages deep learning architectures with innovative components to tackle both indoor and outdoor scenes, making it a versatile contribution in the field of computer vision.
Methodological Contributions
MonoScene uses consecutive 2D and 3D UNets connected through a novel Features Line of Sight Projection (FLoSP) mechanism. FLoSP back-projects 2D features into the 3D volume along their optical rays, letting the 3D network self-discover which 2D features are relevant for each voxel (a minimal sketch of this lifting step follows). The 3D network is further enhanced with a 3D Context Relation Prior (3D CRP) component that provides a global receptive field and improves spatio-semantic coherence.
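To make the lifting step concrete, below is a minimal sketch of a FLoSP-style projection, assuming a pinhole camera model and a single 2D feature scale; the `flosp_lift` helper, its tensor shapes, and variable names are illustrative rather than the authors' implementation. Every voxel centre is projected into the image with the camera intrinsics and the 2D feature map is bilinearly sampled at that pixel, so all voxels along a given line of sight receive the same 2D feature.

```python
import torch
import torch.nn.functional as F

def flosp_lift(feat_2d, voxel_xyz_cam, K, image_hw):
    """Lift a 2D feature map to per-voxel features (illustrative sketch).

    feat_2d:       (1, C, H, W)   features from the 2D UNet
    voxel_xyz_cam: (N_voxels, 3)  voxel centres in camera coordinates
    K:             (3, 3)         pinhole camera intrinsics
    image_hw:      (H_img, W_img) size of the original image
    returns:       (1, C, N_voxels) sampled features per voxel
    """
    H_img, W_img = image_hw
    x, y, z = voxel_xyz_cam.unbind(-1)
    z_safe = z.clamp(min=1e-6)                     # avoid division by zero
    u = K[0, 0] * x / z_safe + K[0, 2]             # pixel column
    v = K[1, 1] * y / z_safe + K[1, 2]             # pixel row

    # Normalise pixel coordinates to [-1, 1] for grid_sample; voxels that
    # project outside the image receive zero features via padding_mode.
    grid = torch.stack([2 * u / (W_img - 1) - 1,
                        2 * v / (H_img - 1) - 1], dim=-1)
    grid = grid.view(1, 1, -1, 2)                  # (1, 1, N_voxels, 2)
    sampled = F.grid_sample(feat_2d, grid, mode="bilinear",
                            padding_mode="zeros", align_corners=True)
    sampled = sampled.squeeze(2)                   # (1, C, N_voxels)

    # Zero out voxels that lie behind the camera.
    in_front = (z > 0).view(1, 1, -1).to(sampled.dtype)
    return sampled * in_front
```

In the paper this lifting is applied at several scales of the 2D decoder and the resulting feature volumes are aggregated before entering the 3D UNet, which is what allows the 3D network to choose the most informative 2D features on its own.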
The researchers also introduce loss functions tailored to SSC. The Scene-Class Affinity Loss (SCAL) directly optimizes class-wise precision, recall, and specificity at the scene level, while the Frustum Proportion Loss aligns the predicted class distribution within local 3D frustums with the ground truth, giving the network cues to extrapolate plausible semantics in occluded regions and beyond the visible field of view (a sketch of the affinity term follows).
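The affinity term can be read as a soft, differentiable form of scene-level precision, recall, and specificity per class. Below is a hedged sketch of that idea, assuming softmax probabilities and integer labels; the variable names and the handling of ignored voxels are illustrative, not the paper's exact code.

```python
import torch
import torch.nn.functional as F

def scene_class_affinity_loss(probs, labels, eps=1e-6):
    """Illustrative scene-class affinity style loss.

    probs:  (N_voxels, C) softmax probabilities
    labels: (N_voxels,)   ground-truth class indices (long tensor)
    """
    num_classes = probs.shape[1]
    one_hot = F.one_hot(labels, num_classes).to(probs.dtype)

    loss = probs.new_zeros(())
    for c in range(num_classes):
        p, y = probs[:, c], one_hot[:, c]
        tp = (p * y).sum()                                    # soft true positives
        precision = tp / (p.sum() + eps)
        recall = tp / (y.sum() + eps)
        specificity = ((1 - p) * (1 - y)).sum() / ((1 - y).sum() + eps)
        loss = loss - (torch.log(precision + eps)
                       + torch.log(recall + eps)
                       + torch.log(specificity + eps))
    return loss / num_classes
```

In the paper, an affinity loss of this kind is applied to both the semantic and the geometric (occupancy) predictions.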
Experimental Evaluation
The paper presents comprehensive experiments showing that MonoScene outperforms several baseline methods adapted to the RGB-only setting on the NYUv2 and SemanticKITTI benchmarks.
- On the NYUv2 dataset, MonoScene achieves an IoU of 42.51% and an mIoU of 26.94%, outperforming the other RGB-inferred methods. Evaluations on SemanticKITTI likewise show an mIoU improvement over competitors, demonstrating the framework's robustness across indoor and outdoor scene complexities and input settings (the reported metrics are sketched below).
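For reference, the reported numbers follow the standard SSC protocol: a binary occupied-versus-empty IoU for scene completion and a mean IoU over the semantic classes. The sketch below shows how such metrics are typically computed; the empty-class index and ignore label are assumptions made for illustration.

```python
import numpy as np

def completion_iou(pred, gt, empty_class=0, ignore_index=255):
    """Binary occupied-vs-empty IoU (scene completion score)."""
    valid = gt != ignore_index
    p_occ = (pred != empty_class) & valid
    g_occ = (gt != empty_class) & valid
    union = np.logical_or(p_occ, g_occ).sum()
    return float(np.logical_and(p_occ, g_occ).sum() / union) if union else 0.0

def semantic_miou(pred, gt, num_classes, ignore_index=255):
    """Mean IoU over semantic classes (class 0 assumed 'empty' and excluded)."""
    valid = gt != ignore_index
    ious = []
    for c in range(1, num_classes):
        inter = np.logical_and(pred == c, gt == c)[valid].sum()
        union = np.logical_or(pred == c, gt == c)[valid].sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0
```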
Qualitative results reinforce these findings, with MonoScene adeptly reconstructing occluded geometries and hallucinating plausible scene structures beyond the immediate view captured by the camera.
Implications and Future Directions
MonoScene's approach opens important avenues in leveraging monocular vision for comprehensive 3D scene understanding. The method's applicability in both indoor and outdoor settings enhances its potential utility in diverse real-world applications, such as augmented reality and autonomous navigation, where compact and cost-effective solutions are desirable.
Future research could explore integrating other modalities or further refining the back-projection mechanism to improve accuracy on smaller or semantically similar object classes. Moreover, adapting the framework to varying environmental conditions and camera types could further validate and expand its applicability.
Overall, this work is a significant step toward simplifying and broadening the reach of 3D semantic scene inference, offering both practical tools for current applications and foundational insights for future work on AI-driven scene reconstruction.