Unsupervised Semantic Segmentation of Urban Scenes via Cross-modal Distillation
This paper presents an innovative approach to semantic segmentation in urban environments without relying on manually annotated data, explicitly leveraging the strengths of both image and LiDAR data modalities. The method, called Drive&Segment, addresses the challenges of unsupervised semantic segmentation through a threefold contribution: using cross-modal data for learning, aligning and clustering 3D object proposals into pseudo-classes, and training a transformer-based model with a cross-modal distillation process.
Core Contributions
- Cross-modal Data Utilization: The method learns without labels from synchronized LiDAR and image data, crucially including an object proposal module that exploits the LiDAR point cloud. This module extracts spatially coherent 3D objects, which are then projected into the image plane for further analysis (see the projection-and-clustering sketch after this list).
- Object Alignment and Clustering: By aligning the 3D proposals with the corresponding images, the method clusters them into semantically coherent pseudo-classes. The clustering operates on unsupervised image features, allowing the segmentation model to recognize a variety of objects without manual labeling, including challenging urban categories such as pedestrians and traffic lights.
- Cross-modal Distillation Training: A teacher-student setup then trains a transformer-based segmentation model on these pseudo-classes. Because the pseudo-labels carry geometric cues extracted from LiDAR, this cross-modal distillation sharpens the model's predictions and improves generalization across datasets (see the training sketch after this list).
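The pseudo-label generation described in the first two points can be illustrated in a few lines. The snippet below is a minimal sketch rather than the paper's implementation: it assumes a pinhole camera with known intrinsics K and LiDAR-to-camera extrinsics, uses random arrays in place of real proposals and self-supervised image features, and introduces hypothetical helper names such as `project_lidar_to_image` and `pool_proposal_features`.

```python
# Sketch of the pseudo-label generation stage: project LiDAR object proposals
# into the image and cluster their pooled image features into pseudo-classes.
# All names, shapes, and parameters are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans


def project_lidar_to_image(points_xyz, K, T_cam_from_lidar):
    """Project Nx3 LiDAR points into pixel coordinates of a pinhole camera.

    K: 3x3 intrinsics, T_cam_from_lidar: 4x4 extrinsics (LiDAR -> camera).
    Returns Nx2 pixel coordinates and a mask of points in front of the camera.
    """
    pts_h = np.hstack([points_xyz, np.ones((points_xyz.shape[0], 1))])  # Nx4 homogeneous
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]                     # Nx3 in camera frame
    in_front = pts_cam[:, 2] > 0.1                                      # keep points ahead of the camera
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                                         # perspective division
    return uv, in_front


def pool_proposal_features(feature_map, uv, stride=8):
    """Average-pool dense image features at the projected locations of one proposal.

    feature_map: HxWxC unsupervised image features (e.g. from a self-supervised backbone).
    """
    cols = np.clip((uv[:, 0] / stride).astype(int), 0, feature_map.shape[1] - 1)
    rows = np.clip((uv[:, 1] / stride).astype(int), 0, feature_map.shape[0] - 1)
    return feature_map[rows, cols].mean(axis=0)                         # C-dim proposal descriptor


def cluster_proposals(descriptors, num_pseudo_classes=20, seed=0):
    """Cluster per-proposal descriptors into pseudo-classes with k-means."""
    descriptors = descriptors / np.linalg.norm(descriptors, axis=1, keepdims=True)
    km = KMeans(n_clusters=num_pseudo_classes, n_init=10, random_state=seed)
    return km.fit_predict(descriptors)                                  # pseudo-class id per proposal


# Toy usage with random data standing in for real proposals and features.
rng = np.random.default_rng(0)
K = np.array([[720.0, 0, 640], [0, 720.0, 360], [0, 0, 1]])
T = np.eye(4)
feature_map = rng.normal(size=(90, 160, 64))                            # 1/8-resolution feature map
descriptors = []
for _ in range(100):                                                    # 100 synthetic 3D proposals
    points = rng.normal(loc=[0, 0, 15], scale=[2, 1, 3], size=(200, 3))
    uv, valid = project_lidar_to_image(points, K, T)
    descriptors.append(pool_proposal_features(feature_map, uv[valid]))
pseudo_labels = cluster_proposals(np.stack(descriptors))
print(pseudo_labels[:10])
```

In practice, proposals from many driving frames would be pooled before clustering so that the resulting pseudo-classes capture dataset-wide semantics rather than the content of a single scene.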
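The third point, cross-modal distillation, then amounts to standard supervised training of a student network on the LiDAR-derived pseudo-label maps. The sketch below is an assumption-laden illustration: it uses PyTorch, swaps the paper's transformer-based student for a tiny convolutional stand-in (`TinySegNet` is a hypothetical name), and picks arbitrary shapes and class counts.

```python
# Sketch of training the segmentation student on pseudo-labels.
# The transformer backbone used in the paper is replaced by a tiny stand-in
# network; class counts, shapes, and names are illustrative assumptions.
import torch
import torch.nn as nn

NUM_PSEUDO_CLASSES = 20
IGNORE_INDEX = 255  # pixels not covered by any projected LiDAR proposal


class TinySegNet(nn.Module):
    """Stand-in for the transformer-based student: image -> per-pixel logits."""
    def __init__(self, num_classes=NUM_PSEUDO_CLASSES):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, num_classes, 1),
        )

    def forward(self, x):
        return self.net(x)


def train_step(model, optimizer, images, pseudo_labels):
    """One distillation step: fit the student to LiDAR-derived pseudo-labels."""
    criterion = nn.CrossEntropyLoss(ignore_index=IGNORE_INDEX)
    logits = model(images)                      # B x C x H x W
    loss = criterion(logits, pseudo_labels)     # pseudo_labels: B x H x W class ids
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Toy usage with random tensors standing in for images and pseudo-label maps.
model = TinySegNet()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
images = torch.randn(2, 3, 128, 256)
pseudo_labels = torch.randint(0, NUM_PSEUDO_CLASSES, (2, 128, 256))
pseudo_labels[:, :, :64] = IGNORE_INDEX         # simulate regions with no LiDAR coverage
print(train_step(model, optimizer, images, pseudo_labels))
```

The ignore index matters here: pixels without any projected proposal carry no geometric evidence, so they contribute nothing to the loss and the student is supervised only where the LiDAR teacher is informative.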
Evaluations and Results
The method is rigorously tested across multiple datasets, including Cityscapes, Dark Zurich, Nighttime Driving, and ACDC, demonstrating marked improvements over existing unsupervised methods. Notably, Drive&Segment increases mIoU from 15.8 to 21.8 on the Cityscapes dataset and from 4.6 to 14.2 on Dark Zurich, showcasing a robust learning capability from uncurated real-world data.
Implications and Future Directions
The research underscores the potential of integrating cross-modal data in unsupervised settings, advancing autonomous driving perception without heavy reliance on annotated datasets. It effectively addresses the critical need for scalable and less biased training methods in real-world applications. While demonstrated on urban driving environments, the approach can spearhead future research in unsupervised segmentation across other robotics and AI domains. Combining additional sensor modalities with LiDAR also invites further exploration and could lead to even more capable perception systems.
In conclusion, this research offers a significant step towards more autonomous and scalable AI models for real-world environments, paving the way for further developments in unsupervised learning. The method's success under challenging conditions such as nighttime and adverse weather reinforces its applicability to diverse datasets, broadening its practical scope.