Self-Supervised Visual Terrain Classification from Unsupervised Acoustic Feature Learning (1912.03227v1)

Published 6 Dec 2019 in cs.RO, cs.CV, and cs.LG

Abstract: Mobile robots operating in unknown urban environments encounter a wide range of complex terrains to which they must adapt their planned trajectory for safe and efficient navigation. Most existing approaches utilize supervised learning to classify terrains from either an exteroceptive or a proprioceptive sensor modality. However, this requires a tremendous amount of manual labeling effort for each newly encountered terrain as well as for variations of terrains caused by changing environmental conditions. In this work, we propose a novel terrain classification framework leveraging an unsupervised proprioceptive classifier that learns from vehicle-terrain interaction sounds to self-supervise an exteroceptive classifier for pixel-wise semantic segmentation of images. To this end, we first learn a discriminative embedding space for vehicle-terrain interaction sounds from triplets of audio clips formed using visual features of the corresponding terrain patches and cluster the resulting embeddings. We subsequently use these clusters to label the visual terrain patches by projecting the traversed tracks of the robot into the camera images. Finally, we use the sparsely labeled images to train our semantic segmentation network in a weakly supervised manner. We present extensive quantitative and qualitative results that demonstrate that our proprioceptive terrain classifier exceeds the state-of-the-art among unsupervised methods and our self-supervised exteroceptive semantic segmentation model achieves a comparable performance to supervised learning with manually labeled data.

Authors (3)
  1. Jannik Zürn (8 papers)
  2. Wolfram Burgard (149 papers)
  3. Abhinav Valada (117 papers)
Citations (92)

Summary

  • The paper demonstrates a self-supervised framework that leverages unsupervised acoustic feature learning to classify urban terrains.
  • It employs a Siamese encoder with triplet loss guided by visual similarities to embed audio features without manual labels.
  • The approach achieves competitive semantic segmentation accuracy, reducing annotation needs and enhancing autonomous navigation.

Overview of Self-Supervised Visual Terrain Classification from Unsupervised Acoustic Feature Learning

The paper "Self-Supervised Visual Terrain Classification from Unsupervised Acoustic Feature Learning" presents an approach that lets mobile robots classify terrain in urban environments within a self-supervised framework. By using vehicle-terrain interaction sounds as an automatic labeling source, it largely eliminates the need for manual data annotation. The framework couples a proprioceptive modality (audio) with an exteroceptive one (vision): labels derived from the former supervise a classifier for the latter, so diverse terrains can be recognized without hand-labeled data.
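
The core self-labeling idea, in which audio supplies the labels that the vision system learns from, can be sketched end-to-end on synthetic data. Everything below (the toy embeddings, centroids, and clip-to-patch mapping) is an illustrative stand-in for the learned components described in the sections that follow, not the authors' implementation:

```python
import numpy as np

# Toy end-to-end illustration of the self-labeling idea: audio clip
# embeddings are clustered, and each cluster id is transferred to the
# image patch the robot was driving over when the clip was recorded.
# All data here is synthetic; the real system uses learned embeddings
# and geometric projection of the robot's track into the camera.

rng = np.random.default_rng(0)

# Three terrain types, well separated in a 2-D embedding space.
centroids = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
audio_embeddings = np.vstack([c + 0.1 * rng.standard_normal((4, 2))
                              for c in centroids])   # 12 clips, 4 per terrain

# Nearest-centroid assignment stands in for k-means clustering.
dists = np.linalg.norm(audio_embeddings[:, None] - centroids[None], axis=-1)
cluster_ids = dists.argmin(axis=1)

# Each clip is associated with the terrain patch it was recorded over;
# the patch inherits the clip's cluster id as a visual pseudo-label.
patch_of_clip = np.arange(12) // 2          # two clips per patch
pseudo_labels = {}
for clip, patch in enumerate(patch_of_clip):
    pseudo_labels[patch] = cluster_ids[clip]

print(pseudo_labels)
```

In the real pipeline, clustering of learned triplet embeddings replaces the nearest-centroid step, and camera projection of the robot's traversed track replaces the clip-to-patch table.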

Methodological Highlights

  1. Unsupervised Proprioceptive Classification:
    • The method begins with an unsupervised learning phase, where audio clips of different terrains are transformed into spectrograms and embedded into a discriminative embedding space. The embeddings are generated using a Siamese Encoder with a reconstruction loss (SE-R), which ensures robustness and discriminative capability without the need for manual labels.
  2. Triplet Sampling using Visual Features:
    • Triplets of audio data are formed based on visual similarities of terrain captured by cameras, allowing the use of triplet loss for training the encoder. This innovative approach to triplet selection leverages pre-trained visual feature extractors, overcoming the typical requirement for ground truth labels in metric learning.
  3. Self-Supervised Semantic Segmentation:
    • Subsequently, the audio embeddings are clustered, and the clusters are used to automatically label visual terrain patches: the robot projects its traversed path into the camera images, and the resulting sparse labels are used to train a semantic segmentation model in a weakly supervised manner. The trained model performs pixel-wise classification of the terrain ahead of the robot, enabling terrain-aware navigation.
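
The triplet objective in steps 1 and 2 can be illustrated with a minimal, self-contained sketch. The margin value, the toy 2-D vectors, and the function name are illustrative assumptions; in the paper the embeddings come from a Siamese encoder over audio spectrograms.

```python
import numpy as np

# Toy illustration of the triplet objective used to learn the audio
# embedding space. Real inputs would be spectrogram embeddings from a
# shared (Siamese) encoder; here we use small hand-made vectors.
# The margin value is illustrative, not taken from the paper.

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss pulling the anchor toward the positive and pushing
    it away from the negative by at least the margin."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

# Anchor and positive come from visually similar terrain patches,
# the negative from a visually dissimilar one.
anchor   = np.array([0.0, 1.0])
positive = np.array([0.1, 0.9])   # same terrain: small distance
negative = np.array([3.0, -1.0])  # different terrain: large distance

loss = triplet_loss(anchor, positive, negative)
print(round(loss, 4))  # well-separated triplet -> loss is 0.0
```

Because the positive/negative assignment comes from visual similarity rather than ground-truth labels, the encoder can be trained on unlabeled drives.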

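Step 3 trains the segmentation network from sparse pseudo-labels only, which amounts to evaluating the loss exclusively at labeled pixels. A minimal numpy sketch of such a masked cross-entropy follows; the shapes and the ignore value of -1 are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

# Sketch of the weak supervision in step 3: the projected robot track
# labels only a sparse set of pixels, so the segmentation loss is
# evaluated only where a label exists.

def sparse_pixel_cross_entropy(logits, labels, ignore_index=-1):
    """Mean cross-entropy over labeled pixels only.

    logits: (H, W, C) raw class scores; labels: (H, W) integers, with
    ignore_index marking pixels outside the projected track."""
    # Numerically stable log-softmax over the class dimension.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

    mask = labels != ignore_index
    if not mask.any():
        return 0.0
    picked = log_probs[mask, labels[mask]]  # log-prob of the true class
    return float(-picked.mean())

# 2x2 image, 3 terrain classes; only one pixel was traversed.
logits = np.zeros((2, 2, 3))
logits[1, 0, 2] = 5.0          # confident prediction at the labeled pixel
labels = np.array([[-1, -1],   # top row: never traversed -> unlabeled
                   [ 2, -1]])
print(sparse_pixel_cross_entropy(logits, labels))
```

The unlabeled pixels contribute no gradient, so the network must generalize from the traversed track to the rest of the image.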
Empirical Findings

  • The proposed framework demonstrates superior performance in unsupervised terrain classification, surpassing existing unsupervised methods in both clustering accuracy and normalized mutual information.
  • The self-supervised semantic segmentation model achieves accuracy comparable to supervised models that require extensive manual labeling.
  • Experiments under varying environmental conditions, such as changes in lighting and weather, highlight the robustness of the audio classifier and the adaptability of the visual classifier through self-supervised fine-tuning.
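
Normalized mutual information, one of the metrics above, scores how well predicted clusters agree with reference labels regardless of how the clusters are numbered. A minimal implementation on toy assignments (all numbers illustrative):

```python
import numpy as np

# NMI between two label assignments, with arithmetic-mean
# normalization so the score lies in [0, 1].

def normalized_mutual_info(labels_a, labels_b):
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    n = len(a)
    eps = 1e-12

    def entropy(x):
        _, counts = np.unique(x, return_counts=True)
        p = counts / n
        return -np.sum(p * np.log(p + eps))

    # Mutual information from the joint distribution of (a, b).
    mi = 0.0
    for ca in np.unique(a):
        for cb in np.unique(b):
            p_joint = np.mean((a == ca) & (b == cb))
            if p_joint > 0:
                p_a = np.mean(a == ca)
                p_b = np.mean(b == cb)
                mi += p_joint * np.log(p_joint / (p_a * p_b))

    return mi / ((entropy(a) + entropy(b)) / 2 + eps)

true_terrain = [0, 0, 1, 1, 2, 2]
clusters     = [1, 1, 0, 0, 2, 2]   # relabeled but perfect clustering
print(round(normalized_mutual_info(true_terrain, clusters), 3))  # -> 1.0
```

Because NMI is invariant to permutations of cluster ids, it suits unsupervised classifiers whose clusters carry no fixed terrain names.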

Implications and Future Prospects

  • This research substantially reduces the dependency on manual annotation, paving the way for scalable deployment of autonomous robots in diverse environments.
  • By facilitating lifelong learning and adaptation to changing terrains, this method can significantly extend the operational range and safety of robots during navigation.
  • Future developments could explore integrating more modalities, including inertial and tactile data, to further enhance terrain classification accuracy and adaptability.
  • Expanding this framework can foster advancements in other areas of mobile robotics, encouraging developments in fully autonomous vehicle systems where sensor fusion and self-supervised learning play central roles.

This work represents a significant advancement in robotics, showcasing how the synergy between proprioceptive and exteroceptive sensing can yield robust terrain classification. It aligns closely with the growing interest in self-supervised and unsupervised learning paradigms, which are crucial for achieving true autonomy in robotic systems. As AI systems continue to evolve, the methodologies presented in this paper could drive substantial progress in autonomous navigation and terrain-aware interaction.
