DFANet: Deep Feature Aggregation for Real-Time Semantic Segmentation
The paper presents DFANet, a convolutional neural network (CNN) architecture for real-time semantic segmentation under resource constraints. It targets fast inference and high accuracy on high-resolution images, which matters for applications such as autonomous driving and robot sensing.
Core Contributions
DFANet combines a lightweight backbone with two feature aggregation strategies. Its core contributions are:
- Substantial Reduction in Computational Complexity: DFANet uses up to 8× fewer FLOPs and runs 2× faster than existing state-of-the-art real-time segmentation approaches while maintaining comparable accuracy, reaching 70.3% Mean Intersection over Union (Mean IOU) on the Cityscapes test set at only 1.7 GFLOPs.
- Innovative Feature Aggregation: The network employs two novel feature aggregation strategies:
  - Sub-network Aggregation: The high-level feature map produced by one sub-network is upsampled and fed into the next, so that high-level features are reused to refine the prediction.
  - Sub-stage Aggregation: Features from corresponding stages of successive sub-networks are fused, which strengthens the feature representation by combining high-level context with low-level spatial detail.
- Modification of Xception for Efficiency: The backbone is a modified Xception network built on depthwise separable convolutions, with a fully-connected attention module that enlarges the receptive field at little additional computational cost.
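The saving that depthwise separable convolutions provide can be illustrated with a simple operation count. The sketch below is not taken from the paper: the layer dimensions are arbitrary illustrative values, the function names are hypothetical, and it counts multiply-accumulates for a stride-1, same-padded layer without bias.

```python
def standard_conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulates for a k x k standard convolution (stride 1, same padding)."""
    return h * w * c_in * c_out * k * k


def separable_conv_macs(h, w, c_in, c_out, k):
    """Depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = h * w * c_in * k * k   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1 x 1 convolution mixes the channels
    return depthwise + pointwise


# Illustrative layer size (an assumption, not a DFANet layer).
h, w, c_in, c_out, k = 64, 128, 96, 96, 3
std = standard_conv_macs(h, w, c_in, c_out, k)
sep = separable_conv_macs(h, w, c_in, c_out, k)
ratio = sep / std  # algebraically equal to 1 / c_out + 1 / k**2

print(f"standard:  {std:,} MACs")
print(f"separable: {sep:,} MACs")
print(f"ratio:     {ratio:.3f}")
```

For a 3 × 3 kernel the ratio approaches 1/9 as the number of output channels grows, which is why this building block suits lightweight backbones.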
DFANet thus consists of a lightweight backbone arranged into cascaded sub-networks and sub-stages, whose aggregated features make fuller use of multi-scale receptive fields.
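How the two aggregation paths fit together can be sketched by tracking feature-map shapes only. This is a schematic under stated assumptions rather than the authors' implementation: the stage widths, the stem output shape, the ×4 upsampling factor, and all function names are hypothetical, and the attention module and decoder are omitted.

```python
from typing import List, Optional, Tuple

Shape = Tuple[int, int, int]  # (channels, height, width)

STAGE_CHANNELS = [48, 96, 192]  # hypothetical stage widths


def stage(x: Shape, out_channels: int) -> Shape:
    """One backbone stage: halves the spatial resolution and sets the channel count."""
    _, h, w = x
    return (out_channels, h // 2, w // 2)


def upsample(x: Shape, factor: int) -> Shape:
    c, h, w = x
    return (c, h * factor, w * factor)


def concat(a: Shape, b: Shape) -> Shape:
    """Channel-wise concatenation; spatial sizes must match."""
    assert a[1:] == b[1:], f"spatial mismatch: {a} vs {b}"
    return (a[0] + b[0], a[1], a[2])


def run_backbone(x: Shape, prev_stages: Optional[List[Shape]] = None) -> List[Shape]:
    outputs = []
    for i, channels in enumerate(STAGE_CHANNELS):
        if prev_stages is not None:
            # Sub-stage aggregation: fuse with the matching stage of the previous sub-network.
            x = concat(x, prev_stages[i])
        x = stage(x, channels)
        outputs.append(x)
    return outputs


def run_cascade(stem_out: Shape, num_backbones: int = 3) -> List[List[Shape]]:
    all_outputs = []
    x, prev = stem_out, None
    for _ in range(num_backbones):
        outs = run_backbone(x, prev)
        all_outputs.append(outs)
        # Sub-network aggregation: upsample the deepest feature map and feed it onward.
        x = upsample(outs[-1], 4)
        prev = outs
    return all_outputs


cascade = run_cascade((8, 512, 512))  # assumed stem output for a 1024 x 1024 image
for n, outs in enumerate(cascade, start=1):
    print(f"sub-network {n}: {outs}")
```

The assertion inside `concat` checks that every fused pair of feature maps has the same resolution, which is the constraint that determines the upsampling factor in this sketch.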
Experimental Evaluation
Experiments on the Cityscapes and CamVid datasets show that DFANet offers a favorable speed-accuracy trade-off, particularly where real-time processing is required:
- Cityscapes Dataset: On higher-resolution input, DFANet achieves 71.3% Mean IOU with 3.4 GFLOPs at 100 FPS on a Titan X card, setting a high standard for speed-accuracy trade-offs in real-time semantic segmentation.
- CamVid Dataset: On high-resolution CamVid images, DFANet is significantly faster than competing methods, with only a slight reduction in segmentation accuracy.
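The Mean IOU figures quoted above average the per-class intersection over union, IoU = TP / (TP + FP + FN). A minimal sketch of the metric follows, assuming flattened integer label maps. The function name and the toy labels are illustrative, and the sketch scores a single pair of label maps without any ignored "void" label, so it is not a substitute for a benchmark's own evaluation tooling.

```python
from typing import List, Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean IoU over the classes that occur in the prediction or the ground truth."""
    # confusion[t][p] counts pixels of ground-truth class t predicted as class p.
    confusion = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        confusion[t][p] += 1

    ious: List[float] = []
    for c in range(num_classes):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                  # class c pixels that were missed
        fp = sum(row[c] for row in confusion) - tp   # pixels wrongly labeled as c
        union = tp + fp + fn
        if union > 0:  # skip classes absent from both label maps
            ious.append(tp / union)
    if not ious:
        raise ValueError("no class present in prediction or target")
    return sum(ious) / len(ious)


# Toy example: six pixels, three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(f"mean IoU: {score:.3f}")
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, giving a mean of 0.5.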
Implications and Future Work
DFANet demonstrates that high-level feature aggregation and efficient network design are compatible with tight resource constraints. Its approach of integrating multi-stage and multi-network features could be extended to other areas of computer vision and to more complex tasks that require real-time processing.
Future work could explore adaptive feature aggregation techniques and alternative backbone networks to broaden DFANet's applicability. Optimizing the network for different hardware architectures may also open up broader real-time deployment. Overall, DFANet offers a framework that balances computational constraints with practical application needs in efficient semantic segmentation.