
GFF: Gated Fully Fusion for Semantic Segmentation (1904.01803v2)

Published 3 Apr 2019 in cs.CV

Abstract: Semantic segmentation generates comprehensive understanding of scenes through densely predicting the category for each pixel. High-level features from Deep Convolutional Neural Networks already demonstrate their effectiveness in semantic segmentation tasks, however the coarse resolution of high-level features often leads to inferior results for small/thin objects where detailed information is important. It is natural to consider importing low level features to compensate for the lost detailed information in high-level features. Unfortunately, simply combining multi-level features suffers from the semantic gap among them. In this paper, we propose a new architecture, named Gated Fully Fusion (GFF), to selectively fuse features from multiple levels using gates in a fully connected way. Specifically, features at each level are enhanced by higher-level features with stronger semantics and lower-level features with more details, and gates are used to control the propagation of useful information which significantly reduces the noises during fusion. We achieve the state of the art results on four challenging scene parsing datasets including Cityscapes, Pascal Context, COCO-stuff and ADE20K.

Citations (179)

Summary

  • The paper introduces a gated fully fusion (GFF) mechanism that integrates multi-level CNN features for improved semantic segmentation.
  • It employs a gating mechanism to selectively control feature flow, preserving high-level semantics and enhancing details for small objects.
  • GFF achieves state-of-the-art mIoU scores on datasets such as Cityscapes (82.3%), ADE20K, and COCO-stuff, demonstrating substantial performance gains.

Gated Fully Fusion for Semantic Segmentation

This paper introduces a novel architecture, Gated Fully Fusion (GFF), aimed at improving semantic segmentation tasks by effectively integrating multi-level features from convolutional neural networks (CNNs). The primary challenge in semantic segmentation is to attain high resolution and rich semantics per pixel, especially for small or thin objects, which conventional high-level features struggle to represent accurately. The authors' approach incorporates a gating mechanism to mitigate the semantic gap typically seen when fusing multi-level features in segmentation tasks, where high-level features are rich in semantic information but often lack detail, and low-level features contain more detailed information but lack semantic richness.

The GFF architecture specifically leverages gates to selectively control the flow of useful information between different feature levels. By doing so, it reduces noise during fusion and facilitates better segmentation performance, as evidenced by achieving state-of-the-art results on datasets such as Cityscapes, Pascal Context, COCO-stuff, and ADE20K.

Methodology

The method involves generating gate maps that regulate information transfer between features from different network layers, ensuring that only relevant information is passed through. While high-level features possess semantic depth beneficial for large-scale patterns, low-level features offer crucial details necessary to accurately segment small-scale patterns and boundaries. Therefore:

  • Gating Mechanism: The gates control propagation, allowing levels with high confidence (large gate values) to broadcast their features while levels with low confidence absorb complementary information from other layers. This yields a bidirectional flow of useful features.
  • Dense Feature Pyramid (DFP) Module: Complementing the GFF is a DFP module that further capitalizes on contextual encoding, reusing high-level features across varying scale levels to enhance segmentation accuracy on large objects.
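The fusion rule implied by the gating mechanism above can be sketched in a few lines. The following is a minimal NumPy illustration, not the authors' implementation: each level l is fused as (1 + G_l) · X_l + (1 − G_l) · Σ_{i≠l} G_i · X_i, where G_l is a per-pixel gate in (0, 1). The function name, the sigmoid parameterization of the gates, and the assumption that all feature maps are already resized to a common resolution are simplifications for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fully_fusion(features, gate_logits):
    """Gated fully-connected fusion of L feature maps (illustrative sketch).

    features    : list of arrays, all resized to a common spatial resolution.
    gate_logits : list of arrays of the same shapes; sigmoid maps them to (0, 1).

    Each level l receives:
        fused_l = (1 + G_l) * X_l + (1 - G_l) * sum_{i != l} G_i * X_i
    so a confident level (G_l near 1) keeps its own features, while an
    unconfident level (G_l near 0) is filled in by confident other levels.
    """
    gates = [sigmoid(g) for g in gate_logits]
    fused = []
    for l, (x_l, g_l) in enumerate(zip(features, gates)):
        # Contributions from all other levels, each weighted by its own gate.
        others = sum(g_i * x_i
                     for i, (x_i, g_i) in enumerate(zip(features, gates))
                     if i != l)
        fused.append((1 + g_l) * x_l + (1 - g_l) * others)
    return fused
```

Note how the gating degenerates gracefully: when every gate is near zero, each level simply keeps its own features, so fusion never injects noise from unconfident layers.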

Results and Performance

On the Cityscapes dataset, the proposed approach demonstrated significant improvements, achieving a mean Intersection over Union (mIoU) score of 82.3% on the test set, outperforming existing methods. The gains are particularly pronounced in small/thin-object categories such as poles and pedestrians, thanks to the GFF's effective regulation of detail-rich features from lower layers. The architecture also showed top performance across several other benchmarks, including ADE20K (45.33% mIoU with ResNet101) and COCO-stuff (39.2% mIoU).
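For readers unfamiliar with the metric used throughout these results, mIoU averages the per-class ratio of intersection to union between predicted and ground-truth masks. A minimal sketch (the function name and the choice to skip classes absent from both prediction and target are illustrative conventions, not taken from the paper):

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean Intersection-over-Union across classes.

    pred, target : integer class-label arrays of the same shape.
    Classes absent from both pred and target (union == 0) are skipped.
    """
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```

Because each class contributes equally regardless of pixel count, mIoU rewards exactly the small/thin categories (poles, pedestrians) where GFF's detail preservation helps most.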

Implications and Future Directions

The practical implications of this research are substantial for applications requiring precise scene understanding, such as autonomous driving and medical imaging. The theoretical implications highlight the importance of efficient feature integration across varying network depths—paving the way for more sophisticated scene parsing methods. Future work may delve further into the refinement of gating functions or explore the integration of these concepts into other domain-specific architectures, enhancing model adaptability and accuracy in different scenarios.

The advancements presented in this paper suggest that as neural network architectures grow more complex, managing semantic and detail information through constructs like GFF and DFP will be pivotal to achieving robust performance across varied image segmentation tasks.