In Defense of Pre-trained ImageNet Architectures for Real-time Semantic Segmentation of Road-driving Images (1903.08469v2)

Published 20 Mar 2019 in cs.CV

Abstract: Recent success of semantic segmentation approaches on demanding road driving datasets has spurred interest in many related application fields. Many of these applications involve real-time prediction on mobile platforms such as cars, drones and various kinds of robots. Real-time setup is challenging due to extraordinary computational complexity involved. Many previous works address the challenge with custom lightweight architectures which decrease computational complexity by reducing depth, width and layer capacity with respect to general purpose architectures. We propose an alternative approach which achieves a significantly better performance across a wide range of computing budgets. First, we rely on a light-weight general purpose architecture as the main recognition engine. Then, we leverage light-weight upsampling with lateral connections as the most cost-effective solution to restore the prediction resolution. Finally, we propose to enlarge the receptive field by fusing shared features at multiple resolutions in a novel fashion. Experiments on several road driving datasets show a substantial advantage of the proposed approach, either with ImageNet pre-trained parameters or when we learn from scratch. Our Cityscapes test submission entitled SwiftNetRN-18 delivers 75.5% MIoU and achieves 39.9 Hz on 1024x2048 images on GTX1080Ti.

Authors (4)

Marin Oršić (12 papers)
Ivan Krešo (7 papers)
Petra Bevandić (17 papers)
Siniša Šegvić (38 papers)

Citations (322)

Summary

The paper demonstrates that using pre-trained ImageNet architectures like ResNet-18 and MobileNet V2 provides an effective balance of speed and accuracy for real-time segmentation.
It introduces a lightweight decoder with ladder-style upsampling and lateral connections that significantly reduces computational complexity.
The method employs a novel resolution pyramid to enhance receptive fields, achieving 75.5% mIoU at 39.9 Hz on the Cityscapes benchmark.
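
To make the two headline numbers concrete, the following minimal sketch shows how a mean IoU is commonly computed from a confusion matrix and what a 39.9 Hz rate implies per frame. The function names (`mean_iou`, `frame_time_ms`) and the toy confusion matrix are illustrative assumptions, not taken from the paper, and the per-class formula TP / (TP + FP + FN) is the standard definition rather than the authors' evaluation code.

```python
def mean_iou(confusion):
    """Mean IoU from a square confusion matrix.

    Rows index the ground-truth class, columns the predicted class.
    Classes that never occur (zero denominator) are skipped.
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                       # missed pixels of class c
        fp = sum(confusion[r][c] for r in range(n)) - tp  # pixels wrongly labeled c
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


def frame_time_ms(rate_hz):
    """Time budget per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / rate_hz


if __name__ == "__main__":
    # Toy two-class confusion matrix (hypothetical pixel counts).
    toy = [[8, 2],
           [1, 9]]
    print("toy mIoU:", mean_iou(toy))

    # Figures reported in the abstract: 39.9 Hz on 1024x2048 images.
    print("ms per frame:", frame_time_ms(39.9))
    print("pixels per second:", 1024 * 2048 * 39.9)
```

At 39.9 Hz the per-frame budget is roughly 25.1 ms for about 2.1 million pixels, which is the constraint the architecture choices in the paper are designed around.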

Overview of the Semantic Segmentation Method for Real-time Road-driving Images

The paper "In Defense of Pre-trained ImageNet Architectures for Real-time Semantic Segmentation of Road-driving Images" examines pre-trained ImageNet models as encoders for real-time semantic segmentation of road-driving images. Its aim is to balance computational efficiency against prediction accuracy by reusing existing general-purpose architectures, rather than designing custom lightweight models and training them from scratch.

Methodology and Contributions

The paper argues for the use of pre-trained ImageNet architectures for semantic segmentation tasks, particularly highlighting their adequacy for real-time applications. The major contributions can be summarized as follows:

Adoption of Lightweight General Purpose Architecture: It proposes the use of compact ImageNet pre-trained models, specifically ResNet-18 and MobileNet V2, as encoders within the segmentation model. These models are well-suited for fine-tuning due to their pre-trained parameters and moderate depth, offering a manageable balance between computational complexity and practical utility.
Efficient Upsampling Strategies: A lightweight decoder with lateral connections is utilized. The authors emphasize ladder-style upsampling, which integrates bilinear interpolation and subsequent blending with convolutional operations, drastically reducing the complexity compared to alternative upsampling methods.
Enhancement of Receptive Fields: The paper also presents a novel resolution pyramid approach to augment the receptive field of the model, implemented by combining multi-resolution representations, allowing the model to handle varying object scales effectively without inducing excessive computational overhead or risking overfitting.
Empirical Evaluation and Results: On the Cityscapes test set, the SwiftNetRN-18 submission achieves 75.5% mean Intersection over Union (mIoU) at 39.9 Hz on 1024×2048 images on a GTX1080Ti, a strong combination of speed and accuracy among real-time approaches.
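
The ladder-style upsampling step described above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it works on a single channel, fuses the upsampled and lateral features by elementwise addition (an assumption, since the summary only mentions "blending"), and stands in a fixed 3×3 kernel for the learned convolution. The names `bilinear_upsample`, `conv3x3` and `ladder_upsample` are hypothetical.

```python
def bilinear_upsample(feat, out_h, out_w):
    """Bilinearly resize a 2D list `feat` to out_h x out_w (pixel-center alignment)."""
    in_h, in_w = len(feat), len(feat[0])
    out = []
    for i in range(out_h):
        # Map the output pixel center back to input coordinates, clamped to the grid.
        y = min(max((i + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1.0)
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = min(max((j + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1.0)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            top = feat[y0][x0] * (1 - wx) + feat[y0][x1] * wx
            bottom = feat[y1][x0] * (1 - wx) + feat[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def conv3x3(feat, kernel):
    """Single-channel 3x3 convolution (cross-correlation) with zero padding."""
    h, w = len(feat), len(feat[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:
                        acc += kernel[di + 1][dj + 1] * feat[ii][jj]
            out[i][j] = acc
    return out


def ladder_upsample(coarse, lateral, kernel):
    """Upsample `coarse` to the lateral resolution, add `lateral`, then blend."""
    h, w = len(lateral), len(lateral[0])
    up = bilinear_upsample(coarse, h, w)
    # Assumed fusion: elementwise sum of upsampled and lateral features.
    fused = [[up[i][j] + lateral[i][j] for j in range(w)] for i in range(h)]
    return conv3x3(fused, kernel)


if __name__ == "__main__":
    coarse = [[1.0, 1.0], [1.0, 1.0]]             # low-resolution decoder features
    lateral = [[2.0] * 4 for _ in range(4)]        # higher-resolution encoder features
    identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]   # placeholder for learned weights
    for row in ladder_upsample(coarse, lateral, identity):
        print(row)
```

Because the interpolation has no parameters and the blending convolution runs once per resolution level, a decoder built from such steps stays cheap relative to the encoder, which is the trade-off the paper's decoder design relies on.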

Implications and Future Directions

This research aligns with the ongoing trend of reusing existing, well-optimized neural network architectures for new tasks. By showing that compact pre-trained classification models work well as segmentation encoders, the work argues that real-time systems can gain both efficiency and accuracy from general-purpose backbones instead of purpose-built ones.

Practically, these findings are critical for deploying intelligent visual recognition systems in resource-constrained environments, such as autonomous vehicles or mobile platforms, where the balance between speed and accuracy is vital.

Future work could evaluate encoder architectures beyond the two tested here and adapt individual components of these general-purpose models more specifically to segmentation. The same techniques could also be applied to other domains that require real-time processing.

In conclusion, the paper supports the use of ImageNet pre-trained models for real-time semantic segmentation with measured results rather than argument alone, establishing them as a competitive choice for high-accuracy, real-time segmentation.
