
Masked Scene Modeling: Narrowing the Gap Between Supervised and Self-Supervised Learning in 3D Scene Understanding (2504.06719v1)

Published 9 Apr 2025 in cs.CV and cs.AI

Abstract: Self-supervised learning has transformed 2D computer vision by enabling models trained on large, unannotated datasets to provide versatile off-the-shelf features that perform similarly to models trained with labels. However, in 3D scene understanding, self-supervised methods are typically only used as a weight initialization step for task-specific fine-tuning, limiting their utility for general-purpose feature extraction. This paper addresses this shortcoming by proposing a robust evaluation protocol specifically designed to assess the quality of self-supervised features for 3D scene understanding. Our protocol uses multi-resolution feature sampling of hierarchical models to create rich point-level representations that capture the semantic capabilities of the model and, hence, are suitable for evaluation with linear probing and nearest-neighbor methods. Furthermore, we introduce the first self-supervised model that performs similarly to supervised models when only off-the-shelf features are used in a linear probing setup. In particular, our model is trained natively in 3D with a novel self-supervised approach based on a Masked Scene Modeling objective, which reconstructs deep features of masked patches in a bottom-up manner and is specifically tailored to hierarchical 3D models. Our experiments not only demonstrate that our method achieves competitive performance to supervised models, but also surpasses existing self-supervised approaches by a large margin. The model and training code can be found at our Github repository (https://github.com/phermosilla/msm).

Authors (3)
  1. Pedro Hermosilla (32 papers)
  2. Christian Stippel (4 papers)
  3. Leon Sick (7 papers)

Summary

Overview of "Masked Scene Modeling: Narrowing the Gap Between Supervised and Self-Supervised Learning in 3D Scene Understanding"

The paper by Pedro Hermosilla, Christian Stippel, and Leon Sick presents a novel approach to self-supervised learning for 3D scene understanding. It addresses a long-standing gap: while self-supervised learning has transformed 2D computer vision, in 3D it has so far served mainly as a weight-initialization step for task-specific fine-tuning rather than as a source of general-purpose, off-the-shelf features.

The authors introduce a new evaluation protocol and a new self-supervised model that, for the first time, performs competitively with supervised models on 3D scene understanding tasks using only off-the-shelf features in a linear probing setup. This is significant given how poorly the success of 2D self-supervised methods has so far translated to 3D.

Key Contributions

  1. Evaluation Protocol: The authors propose an evaluation protocol tailored to 3D scenes that samples features at multiple resolutions of a hierarchical model and combines them into rich point-level representations. This addresses a common shortcoming of evaluating self-supervised 3D models through the final layer alone, which discards semantic information encoded at coarser levels of the hierarchy.
  2. Masked Scene Modeling (MSM): The paper introduces a novel self-supervised framework built around a bottom-up hierarchical masking mechanism that reconstructs deep features of masked patches, using a multi-resolution strategy aligned with the multi-scale semantics of 3D scenes. This contrasts with previous approaches that use masking only as a pretext task before task-specific fine-tuning (a sketch of the objective follows this list).
  3. Experimental Validation: Extensive experiments on standard benchmarks (ScanNet, ScanNet200, S3DIS) demonstrate that the proposed self-supervised model significantly narrows the performance gap to its supervised counterparts and surpasses previous self-supervised approaches by a large margin, confirming that it captures semantically rich features.
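
To make the MSM objective in contribution 2 concrete, below is a minimal PyTorch-style sketch of a per-level masked feature-reconstruction loss. It assumes a student/teacher setup (e.g., an exponential-moving-average teacher providing targets on the unmasked scene) and boolean patch masks per hierarchy level; these are illustrative assumptions drawn from common masked-modeling practice, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def masked_feature_reconstruction_loss(student_feats, teacher_feats, mask):
    """One hierarchy level of a masked-scene-modeling objective.

    student_feats: (N, C) features predicted by the student, computed
                   from the scene with the masked patches hidden.
    teacher_feats: (N, C) target deep features from a frozen/EMA teacher
                   that sees the full, unmasked scene.
    mask:          (N,) boolean, True where a patch was masked.
    """
    # Compare only at masked positions; cosine distance on normalized
    # features is a standard choice for feature-reconstruction targets.
    s = F.normalize(student_feats[mask], dim=-1)
    t = F.normalize(teacher_feats[mask], dim=-1)
    return (1.0 - (s * t).sum(dim=-1)).mean()

def msm_loss(student_levels, teacher_levels, masks):
    """Accumulate the reconstruction loss bottom-up over all levels."""
    return sum(
        masked_feature_reconstruction_loss(s, t, m)
        for s, t, m in zip(student_levels, teacher_levels, masks)
    )
```

Summing the per-level losses is what makes the objective "bottom-up": every resolution of the hierarchy receives a reconstruction signal, rather than only the final output.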

Detailed Findings

  • Hierarchical Feature Extraction: The paper shows that extracting features from multiple levels of the hierarchy significantly outperforms using only the final layer's output for self-supervised 3D models. This finding is substantiated through a pilot study and motivates the multi-resolution sampling in the evaluation protocol (see the probing sketch after this list).
  • Robust Semantic Segmentation: On semantic segmentation, the model achieves results comparable to fully supervised models, showcasing the potential of the self-supervised approach in real-world applications. The reported metrics show a lead of more than 30% over other self-supervised frameworks on some benchmarks.
  • Versatility Across Tasks: Beyond segmentation, the model is also tested on instance segmentation and 3D visual grounding, further proving its ability to learn object-aware and contextually enriched features. The model's versatility is a strong argument for its applicability across various practical tasks in 3D scene understanding.
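
The following is a minimal sketch of how such a multi-resolution probing evaluation could be implemented: per-point features are gathered from every level of a frozen hierarchical backbone (using point-to-patch index maps, here assumed precomputed) and concatenated, then a single linear layer is trained on top. The function names and index-based gathering are illustrative assumptions; the paper's exact sampling scheme may differ.

```python
import torch
import torch.nn as nn

def pointwise_multires_features(per_level_feats, per_level_idx):
    """Build point-level representations from a hierarchical encoder.

    per_level_feats: list of (M_l, C_l) feature tensors, one per level.
    per_level_idx:   list of (N,) index tensors mapping each of the N
                     input points to its patch/voxel at level l.
    Returns an (N, sum_l C_l) tensor: per-point features gathered from
    every resolution and concatenated.
    """
    gathered = [feats[idx] for feats, idx in zip(per_level_feats, per_level_idx)]
    return torch.cat(gathered, dim=-1)

def train_linear_probe(features, labels, num_classes, epochs=100, lr=1e-3):
    """Linear probing: a single linear layer on frozen features."""
    probe = nn.Linear(features.shape[-1], num_classes)
    opt = torch.optim.AdamW(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(features), labels)
        loss.backward()
        opt.step()
    return probe
```

A nearest-neighbor evaluation would use the same concatenated features directly, assigning each query point the label of its closest training feature, so both protocols measure the quality of the frozen representation rather than of any fine-tuned head.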

Implications and Future Outlook

This research offers a substantial advancement in self-supervised learning for 3D environments, paving the way for future exploration in this underdeveloped domain. The improvements shown in this paper could lead to broader applications, enabling efficient and scalable solutions in robotics, augmented reality, and autonomous systems where 3D data is abundant.

The next steps could involve expanding this framework to handle more diverse datasets, potentially across different domains such as outdoor scenes or non-urban environments. Additionally, integrating this method with existing 2D-3D synergistic learning models might also be an interesting avenue to explore, combining the strengths of both modalities for enhanced scene understanding.

In conclusion, this paper addresses a critical gap in 3D scene understanding, providing a robust solution that can effectively leverage the benefits of self-supervised learning, challenging the traditional reliance on large labeled datasets. This work not only advances state-of-the-art methodologies in 3D learning but also sets a new direction for future research endeavors in this field.
