- The paper introduces GALD, a framework that rebalances global and local feature integration in FCNs to better preserve small patterns and object boundaries.
- Its Local Distribution module uses depth-wise convolutions to predict per-channel mask maps that adaptively modulate the influence of global features according to local context.
- Experimental results demonstrate that GALD boosts scene understanding performance, achieving 83.3% mIoU on Cityscapes while enhancing object detection and segmentation on multiple datasets.
Overview of "Global Aggregation then Local Distribution in Fully Convolutional Networks"
The paper "Global Aggregation then Local Distribution in Fully Convolutional Networks" addresses challenges inherent in scene understanding tasks by refining feature aggregation strategies employed in Fully Convolutional Networks (FCNs). The authors propose a novel mechanism called Global Aggregation then Local Distribution (GALD), which optimizes the utilization of long-range dependencies by conditionally integrating global information based on local pattern characteristics.
Key Contributions
The core contribution is the GALD framework, which pairs a Global Aggregation (GA) module with a Local Distribution (LD) module. GA modules are effective at collecting long-range context, but they tend to over-smooth features, which especially hurts small patterns, boundaries, and small objects. GALD counters this by re-distributing the aggregated global features in a way that is sensitive to local pattern context, retaining the strengths of GA modules while mitigating their weaknesses. Concretely, per-channel mask maps adaptively adjust the influence of global features based on the estimated size and significance of the pattern at each position, as formalized below.
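One plausible way to formalize this mechanism (the notation below is ours, not the paper's, and the exact fusion rule, the sigmoid, and whether the depth-wise convolutions f_dw act on the local or the aggregated features are assumptions based on the description above): given a local feature map X, a GA module produces aggregated features G, and depth-wise convolutions followed by a sigmoid yield per-channel masks that gate how much of G is injected at each position.

```latex
G = \mathrm{GA}(X), \qquad
M = \sigma\!\left(f_{\mathrm{dw}}(X)\right) \in [0,1]^{C \times H \times W}, \qquad
\hat{X} = X + M \odot G
```

Here \odot denotes element-wise multiplication; positions whose local context indicates fine detail can push mask values toward zero, shielding boundaries from over-smoothed global features.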
Technical Insights
- Global Aggregation: GALD utilizes existing GA modules such as PSP, ASPP, Non-Local, and CGNL to gather comprehensive scene information. These modules extend the receptive field over large areas of the image.
- Local Distribution: Applied after global aggregation, the LD module uses depth-wise convolutions to predict mask maps that govern how the global features are applied at each spatial location. This gives finer-grained control over feature fusion, ensuring that detailed regions such as object boundaries retain their critical information (see the PyTorch sketch after this list).
- Integration and Performance: GALD is end-to-end trainable and can be dropped into existing FCNs, yielding improvements on scene understanding benchmarks. It delivers consistent gains across tasks including semantic segmentation and object detection, and in particular achieves state-of-the-art performance of 83.3% mIoU on the Cityscapes dataset.
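To make the pipeline concrete, here is a minimal PyTorch sketch of the GA-then-LD pattern. This is an illustration under stated assumptions, not the authors' implementation: GA is approximated by global average pooling plus a 1x1 convolution (the paper instead plugs in PSP, ASPP, Non-Local, or CGNL), the masks are predicted from the local features, the fusion mirrors the formalization above, and names such as `GALDHead` and `LocalDistribution` are hypothetical.

```python
import torch
import torch.nn as nn


class LocalDistribution(nn.Module):
    """Predict per-channel, per-position masks with depth-wise convolutions."""

    def __init__(self, channels: int, kernel_size: int = 3, depth: int = 2):
        super().__init__()
        layers = []
        for _ in range(depth):
            # groups=channels makes each convolution depth-wise: every channel
            # is filtered independently, so the masks stay channel-specific.
            layers += [
                nn.Conv2d(channels, channels, kernel_size,
                          padding=kernel_size // 2, groups=channels, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            ]
        layers.append(nn.Conv2d(channels, channels, 1, groups=channels))
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sigmoid bounds mask values to [0, 1]: 1 lets the global feature
        # through fully at that position/channel, 0 suppresses it.
        return torch.sigmoid(self.net(x))


class GALDHead(nn.Module):
    """Illustrative GA-then-LD head (global pooling stands in for GA)."""

    def __init__(self, channels: int):
        super().__init__()
        # Stand-in GA: global average pooling + 1x1 conv. The paper uses
        # stronger aggregators (PSP/ASPP/Non-Local/CGNL); any could be
        # swapped in here.
        self.ga = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
        )
        self.ld = LocalDistribution(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = self.ga(x).expand_as(x)   # broadcast global descriptor to all positions
        mask = self.ld(x)             # (N, C, H, W) local gating masks
        return x + mask * g           # distribute global info under local control


if __name__ == "__main__":
    feats = torch.randn(2, 64, 32, 32)   # backbone features (N, C, H, W)
    print(GALDHead(64)(feats).shape)     # torch.Size([2, 64, 32, 32])
```

The `groups=channels` argument is what makes the mask prediction depth-wise: each channel's mask depends only on that channel's local neighborhood, which keeps the gating per-channel as described above.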
Experimental Evaluation
The experiments validate GALD's effectiveness across several datasets:
- For semantic segmentation on Cityscapes, GALD achieves strong mIoU scores, surpassing other leading methods (the mIoU metric is recapped after this list). By intelligently distributing global aggregation results, GALD refines predictions, particularly in regions with small or intricate patterns, which are traditionally challenging for FCNs.
- For object detection and instance segmentation on datasets such as Pascal VOC and MS COCO, GALD consistently improves over baseline models, demonstrating its applicability and effectiveness across diverse computer vision tasks.
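Since the quantitative claims above are expressed in mIoU, a brief generic recap of the metric may be useful; this is the standard definition, not code from the paper's evaluation pipeline:

```python
import numpy as np


def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """Mean Intersection-over-Union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        pred_c, target_c = pred == c, target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both maps; skip rather than score it
        ious.append(np.logical_and(pred_c, target_c).sum() / union)
    return float(np.mean(ious))
```

Every class contributes to the mean on equal footing regardless of its pixel area, which is why detail- and boundary-preserving methods such as GALD tend to show up clearly in this metric.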
Implications and Future Directions
The paper suggests that combining global and local information in a dynamic and adaptable manner can significantly improve the granularity and accuracy of predictions in FCNs. The adaptability of GALD highlights a promising direction for further research, including exploration of more sophisticated mask generation techniques or integration into other network architectures.
Future work could explore extending GALD to additional vision tasks such as depth estimation or 3D scene reconstruction, wherein both global structure and local detail are crucial. The paper underlines the importance of balancing fine-grained local features with global contextual information for improved scene understanding in complex visual environments.
Overall, this research contributes valuable insights into the design of feature aggregation mechanisms and sets a framework for future developments aimed at enhancing the efficacy of deep learning models for scene understanding.