- The paper proposes a CNN model that bypasses traditional validity masks to efficiently process sparse LiDAR and optional RGB inputs.
- It employs a training strategy with variable input sparsity, outperforming existing methods on the KITTI Depth Completion Benchmark.
- The late fusion of modality-specific features enhances semantic segmentation, demonstrating strong potential for applications in autonomous systems.
Sparse and Dense Data with CNNs: Depth Completion and Semantic Segmentation
In the paper "Sparse and Dense Data with CNNs: Depth Completion and Semantic Segmentation," the authors address the challenge of processing sparse data with Convolutional Neural Networks (CNNs), which are traditionally designed for dense inputs. Recognizing how common sparsity is in vision data, such as LiDAR scans and stereo depth maps, they propose a method that handles sparse depth data effectively, with optional integration of dense RGB data. The method is evaluated on depth completion and semantic segmentation, improving over existing techniques through careful choices of network architecture and training procedure.
Overview of Methodology
The core contribution of the paper is a modified neural network architecture based on NASNet, adapted to process sparse input data without requiring a validity mask. The approach is validated on depth completion, where the network predicts dense depth maps from sparse LiDAR inputs, and on semantic segmentation, where sparse depth data is incorporated to improve segmentation performance.
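To make the idea concrete, the sketch below shows a deliberately small encoder-decoder (not the authors' NASNet-based model) that consumes a sparse depth map directly, with missing pixels set to zero and no separate mask channel; all layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SparseDepthNet(nn.Module):
    """Toy encoder-decoder that takes a sparse depth map without a validity mask."""
    def __init__(self, in_channels=1):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),  # dense depth output
        )

    def forward(self, sparse_depth):
        # sparse_depth: (B, 1, H, W), zeros where no LiDAR return exists
        return self.decoder(self.encoder(sparse_depth))

model = SparseDepthNet()
sparse = torch.zeros(1, 1, 64, 64)
sparse[:, :, ::8, ::4] = 10.0      # scatter a few "measured" depth values
dense_pred = model(sparse)         # (1, 1, 64, 64) dense prediction
```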
A key finding of the paper is that the validity mask traditionally used to flag observed pixels in sparse inputs can be omitted. The authors' analysis suggests that such masks add little benefit and encourage over-smoothed output, since the mask conveys less and less information about the sparsity pattern as it propagates through deeper layers. Instead, they propose a training strategy with variable input sparsities, which improves the network's ability to generalize across different input densities rather than overfitting to a single training sparsity.
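A minimal sketch of the varying-sparsity training idea follows; the function name and keep-ratio range are illustrative assumptions, not the authors' exact procedure.

```python
import torch

def random_sparsify(sparse_depth: torch.Tensor, min_keep: float = 0.1, max_keep: float = 1.0):
    """Randomly keep between min_keep and max_keep of the valid depth samples."""
    keep_ratio = torch.empty(1).uniform_(min_keep, max_keep).item()
    valid = sparse_depth > 0                          # validity implied by nonzero depth
    drop = torch.rand_like(sparse_depth) > keep_ratio
    out = sparse_depth.clone()
    out[valid & drop] = 0.0                           # remove a random subset of points
    return out

# Inside a training loop (illustrative):
#   input_depth = random_sparsify(batch_sparse_depth)
#   loss = criterion(model(input_depth), batch_ground_truth)
```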
Key Findings and Numerical Results
The authors report significant performance gains on depth completion, outperforming existing methods on the KITTI Depth Completion Benchmark. Through an ablation study simulating various LiDAR layer configurations (e.g., 8, 16, 32 layers), their method maintains robust depth predictions even with drastically reduced input density, demonstrating its resilience and practical applicability.
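One simple way to emulate lower-resolution scanners is to keep only a subset of scan lines, as in the sketch below; the row-based subsampling is an assumption for illustration, since projected KITTI depth maps are not perfectly row-aligned.

```python
import torch

def subsample_scan_lines(sparse_depth: torch.Tensor, keep_every: int) -> torch.Tensor:
    """Keep only every `keep_every`-th image row of depth samples, zeroing the rest."""
    out = torch.zeros_like(sparse_depth)
    out[..., ::keep_every, :] = sparse_depth[..., ::keep_every, :]
    return out

# e.g. roughly emulate a 16-layer scanner from 64-layer data:
# lidar16 = subsample_scan_lines(lidar64, keep_every=4)
```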
Moreover, their approach to fusing RGB and sparse depth data, referred to as "late fusion," proves advantageous compared to traditional early fusion. The late-fusion strategy achieves superior results by first extracting modality-specific features and only then integrating them, allowing the distinct characteristics of each data type to be exploited more effectively.
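The sketch below illustrates the late-fusion idea under the same assumptions as the earlier toy network (small illustrative layers, not the authors' NASNet blocks): each modality gets its own encoder, and the features are concatenated only at the bottleneck before decoding.

```python
import torch
import torch.nn as nn

def make_encoder(in_ch: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
    )

class LateFusionNet(nn.Module):
    """Separate RGB and sparse-depth encoders, fused at the bottleneck."""
    def __init__(self, out_channels=1):
        super().__init__()
        self.rgb_enc = make_encoder(3)      # dense RGB branch
        self.depth_enc = make_encoder(1)    # sparse depth branch
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, out_channels, 4, stride=2, padding=1),
        )

    def forward(self, rgb, sparse_depth):
        fused = torch.cat([self.rgb_enc(rgb), self.depth_enc(sparse_depth)], dim=1)
        return self.decoder(fused)

# Early fusion, by contrast, would be a single encoder over torch.cat([rgb, depth], dim=1).
net = LateFusionNet()
pred = net(torch.rand(1, 3, 64, 64), torch.zeros(1, 1, 64, 64))  # (1, 1, 64, 64)
```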
When applied to semantic segmentation, the method demonstrates that combining sparse depth data with RGB yields noticeable improvements over RGB-only baselines. This lays the groundwork for including additional modalities in tasks that have historically relied on a single data source.
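As an illustration, the same toy fusion backbone sketched above could be re-headed for segmentation by predicting per-pixel class logits instead of a single depth channel; the class count and ignore index below are assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

num_classes = 19                                     # illustrative
seg_net = LateFusionNet(out_channels=num_classes)    # toy module from the sketch above
criterion = nn.CrossEntropyLoss(ignore_index=255)    # skip unlabeled pixels

rgb = torch.rand(2, 3, 64, 64)
sparse_depth = torch.zeros(2, 1, 64, 64)
labels = torch.randint(0, num_classes, (2, 64, 64))

logits = seg_net(rgb, sparse_depth)                  # (2, num_classes, 64, 64)
loss = criterion(logits, labels)
```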
Implications and Future Directions
This research has substantial implications for autonomous systems and robotics, where sensors often provide diverse and sparse data. The network's demonstrated ability to handle sparse inputs and fuse them with dense data could benefit applications such as autonomous navigation, 3D object detection, and real-time environmental mapping.
The findings suggest promising future directions, including extending the method to other vision tasks and sensor types. In particular, exploring architectures optimized for different sensor modalities and densities could pave the way toward more comprehensive and unified approaches to sensor-data fusion in computer vision. Extending the method to additional data sources and validating it in real-world scenarios with variable input characteristics remain significant areas for future research and practical deployment.