
ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation (1606.02147v1)

Published 7 Jun 2016 in cs.CV

Abstract: The ability to perform pixel-wise semantic segmentation in real-time is of paramount importance in mobile applications. Recent deep neural networks aimed at this task have the disadvantage of requiring a large number of floating point operations and have long run-times that hinder their usability. In this paper, we propose a novel deep neural network architecture named ENet (efficient neural network), created specifically for tasks requiring low latency operation. ENet is up to 18$\times$ faster, requires 75$\times$ less FLOPs, has 79$\times$ less parameters, and provides similar or better accuracy to existing models. We have tested it on CamVid, Cityscapes and SUN datasets and report on comparisons with existing state-of-the-art methods, and the trade-offs between accuracy and processing time of a network. We present performance measurements of the proposed architecture on embedded systems and suggest possible software improvements that could make ENet even faster.

Citations (1,960)

Summary

  • The paper introduces ENet, a lightweight neural network that achieves real-time semantic segmentation on resource-constrained devices.
  • The model utilizes early downsampling, asymmetric and dilated convolutions, and bottleneck modules to drastically cut computational cost and memory use.
  • ENet attains competitive accuracy with up to 21 fps on NVIDIA Jetson TX1 and fits within 0.7 MB, making it ideal for mobile applications.

Overview of ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation

The paper "ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation" by Adam Paszke and collaborators introduces ENet, a neural network architecture for real-time pixel-wise semantic segmentation, designed for resource-constrained environments such as mobile devices.

Introduction and Background

The rising demand for real-time semantic segmentation in mobile applications such as augmented reality, robotics, and autonomous driving calls for efficient deep learning models. Existing segmentation networks built on large backbones such as VGG16, for example SegNet, carry substantial computational footprints and parameter counts, which hinder real-time performance on low-power devices.

ENet is proposed as a solution to this problem, leveraging recent advances in CNN architectures and optimization techniques to achieve significant reductions in computational complexity and memory usage while maintaining competitive accuracy.

Network Design

ENet is an encoder-decoder network optimized for speed and efficiency. The architecture is designed to minimize both floating point operations (FLOPs) and parameter count. Key features of ENet include:

  • Early Downsampling: The initial network blocks greatly reduce input size, capitalizing on the spatial redundancy in visual information. This choice enables early layers to act primarily as feature extractors and improves computational efficiency.
  • Asymmetric Convolutions and Dilated Convolutions: These maintain a wide receptive field without drastically increasing the computational load. An asymmetric convolution factors an n×n kernel into an n×1 convolution followed by a 1×n convolution, covering the same area with fewer weights. Dilated convolutions expand the receptive field further without additional downsampling, preserving spatial resolution.
  • Bottleneck Modules: Inspired by ResNets, ENet employs bottleneck modules for dimensionality reduction within each layer. These modules contain three convolutions: a projection, a main convolution, and an expansion, interspersed with batch normalization and PReLU activations.
  • Small Decoder: ENet employs a compact decoder that upsamples the low-resolution feature maps produced by the encoder, refining the segmentation map with minimal computational overhead.
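The cost argument behind asymmetric and dilated convolutions can be checked with simple arithmetic. The sketch below is illustrative only: the function names, the 5×5 kernel, the 128-channel width, and the dilation rates are assumptions chosen for the example, not values taken from this summary.

```python
def conv_weights(kernel_h, kernel_w, in_channels, out_channels):
    """Number of weights in a 2D convolution (bias terms ignored)."""
    return kernel_h * kernel_w * in_channels * out_channels


def asymmetric_conv_weights(n, channels):
    """Weights of an n x 1 convolution followed by a 1 x n convolution."""
    return (conv_weights(n, 1, channels, channels)
            + conv_weights(1, n, channels, channels))


def effective_kernel_size(kernel_size, dilation):
    """Span in pixels covered by one dilated kernel along a single axis."""
    return dilation * (kernel_size - 1) + 1


channels = 128  # hypothetical channel width, for illustration only

full = conv_weights(5, 5, channels, channels)   # one full 5x5 convolution
asym = asymmetric_conv_weights(5, channels)     # 5x1 followed by 1x5
print(full, asym, full / asym)  # 409600 163840 2.5

# A 3x3 kernel covers a wider span as the (hypothetical) dilation rate grows,
# while its weight count stays the same.
for dilation in (1, 2, 4, 8, 16):
    print(dilation, effective_kernel_size(3, dilation))
```

Under these assumed sizes, the factored pair needs 2.5 times fewer weights than the full kernel while covering the same 5×5 area, and dilation widens the span of a 3×3 kernel at no extra weight cost.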

Results and Performance

ENet is evaluated on three datasets: CamVid, Cityscapes, and SUN RGB-D. It reaches competitive accuracy while clearly outperforming heavier models in inference speed and resource efficiency.

  • Inference Time: ENet reports up to 21 frames per second (fps) on the embedded NVIDIA Jetson TX1 and over 100 fps on an NVIDIA Titan X, in both cases at the lower input resolutions tested.
  • Efficiency: According to the abstract, ENet requires up to 75× fewer FLOPs and has 79× fewer parameters than existing models, a reduction of nearly two orders of magnitude. ENet's entire parameter set fits into 0.7 MB, making it suitable for deployment on devices with limited memory.
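To put the speed and size figures above in perspective, a small back-of-the-envelope calculation can help. The helper names are hypothetical, and the 2 bytes per parameter (16-bit storage) is an assumption made for this sketch rather than something stated in this summary.

```python
def latency_ms(fps):
    """Per-frame processing time in milliseconds for a given frame rate."""
    return 1000.0 / fps


def parameter_count(model_size_mb, bytes_per_param=2):
    """Approximate parameter count for a model of the given size.

    bytes_per_param=2 assumes 16-bit storage (an assumption of this sketch).
    """
    return model_size_mb * 1_000_000 / bytes_per_param


print(round(latency_ms(21), 1))     # 47.6 ms per frame at 21 fps
print(round(latency_ms(100), 1))    # 10.0 ms per frame at 100 fps
print(round(parameter_count(0.7)))  # 350000 parameters under the 2-byte assumption
```

Under the 2-byte assumption, a 0.7 MB model corresponds to roughly a third of a million parameters, which gives a sense of how small the network is.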

Comparative Analysis

ENet is compared against SegNet, a well-known model in the semantic segmentation field. On the Cityscapes dataset, ENet achieves a class Intersection-over-Union (IoU) of 58.3%, outperforming SegNet’s 56.1%. Similarly, for the SUN RGB-D dataset, ENet achieves respectable class average accuracy, demonstrating its generalizability across different environments and scenarios.
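For readers unfamiliar with the metric, class IoU for one class is the number of pixels correctly assigned to that class divided by the number of pixels assigned to it in either the prediction or the ground truth. The following is a minimal sketch on toy data, not the official Cityscapes evaluation code; the function names and the six-pixel example are made up for illustration.

```python
def class_iou(pred, target, num_classes):
    """Per-class IoU = TP / (TP + FP + FN) over flat sequences of labels."""
    ious = []
    for c in range(num_classes):
        tp = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        fp = sum(1 for p, t in zip(pred, target) if p == c and t != c)
        fn = sum(1 for p, t in zip(pred, target) if p != c and t == c)
        union = tp + fp + fn
        # None marks a class that appears in neither prediction nor target.
        ious.append(tp / union if union else None)
    return ious


def mean_iou(ious):
    """Average IoU over the classes that actually occur."""
    present = [v for v in ious if v is not None]
    return sum(present) / len(present)


# Toy example: six pixels, three classes (hypothetical labels).
target = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]

ious = class_iou(pred, target, num_classes=3)
print(ious)            # per-class values: 1/3, 2/3, 1/2
print(mean_iou(ious))  # 0.5
```

Benchmark scores such as those quoted above are accumulated over an entire test set rather than a single image, so this sketch only illustrates the definition.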

Implications and Future Directions

The development of ENet has broad implications for the deployment of deep learning models in embedded systems and real-time applications. By substantially reducing computational and memory requirements, ENet enables real-time semantic segmentation on mobile devices, promoting advancements in areas like autonomous driving and augmented reality.

Future research could explore further optimizations, such as applying model compression techniques to shrink the memory footprint or exploiting newer hardware for faster inference. Extending ENet’s architecture to handle multi-modal inputs (e.g., RGB-D data) could also broaden its applicability in real-world scenarios.

Conclusion

ENet is a significant step towards real-time semantic segmentation in resource-constrained environments. By combining early downsampling, asymmetric and dilated convolutions, and a streamlined encoder-decoder architecture, it shows that segmentation can be made efficient enough for practical use on embedded and mobile hardware.
