
Pyramid Attention Network for Semantic Segmentation (1805.10180v3)

Published 25 May 2018 in cs.CV

Abstract: A Pyramid Attention Network (PAN) is proposed to exploit the impact of global contextual information in semantic segmentation. Different from most existing works, we combine attention mechanism and spatial pyramid to extract precise dense features for pixel labeling instead of complicated dilated convolution and artificially designed decoder networks. Specifically, we introduce a Feature Pyramid Attention module to perform spatial pyramid attention structure on high-level output and combining global pooling to learn a better feature representation, and a Global Attention Upsample module on each decoder layer to provide global context as a guidance of low-level features to select category localization details. The proposed approach achieves state-of-the-art performance on PASCAL VOC 2012 and Cityscapes benchmarks with a new record of mIoU accuracy 84.0% on PASCAL VOC 2012, while training without COCO dataset.

Authors (4)
  1. Hanchao Li (2 papers)
  2. Pengfei Xiong (19 papers)
  3. Jie An (36 papers)
  4. Lingxue Wang (1 paper)
Citations (780)

Summary

Pyramid Attention Network for Semantic Segmentation

The paper "Pyramid Attention Network for Semantic Segmentation" introduces a novel approach for enhancing the performance of semantic segmentation tasks in computer vision. This work addresses significant challenges in pixel-wise classification by combining attention mechanisms with spatial pyramid structures, diverging from the traditional reliance on dilated convolutions and complex decoder networks.

Key Contributions

The proposed Pyramid Attention Network (PAN) integrates two new modules:

  1. Feature Pyramid Attention (FPA): This module leverages a spatial pyramid structure followed by an attention mechanism to capture precise dense features for pixel labeling.
  2. Global Attention Upsample (GAU): Serving as an effective decoder, this module enhances low-level feature representation with global context from high-level features, thereby improving pixel localization.

Methodology

The PAN architecture is predicated on the premise that existing models struggle with accurately classifying objects at multiple scales and maintaining spatial resolution. The approach includes:

  1. Feature Pyramid Attention (FPA): FPA merges multi-scale features using a spatial pyramid structure with convolution operations of varying kernel sizes (3×3, 5×5, and 7×7). This design circumvents the loss of pixel-level localization inherent in pooling-based methods such as PSPNet and ASPP. The module also incorporates global pooling to augment channel-wise feature attention.
  2. Global Attention Upsample (GAU): GAU replaces complex U-shaped decoder networks with a simpler yet effective upsampling mechanism. It uses high-level semantic features to recalibrate low-level, detail-rich features through multiplicative attention guidance and 3×3 convolutions, reducing computational burden while improving performance. A minimal sketch of both modules appears after this list.
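To make the two modules concrete, the following is a minimal PyTorch sketch of how FPA and GAU could be assembled from the description above. It is an illustration rather than the authors' implementation: channel widths, the conv-BN-ReLU ordering, bilinear upsampling, and the 1×1 projection of high-level features inside GAU are assumptions made for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_bn_relu(in_ch, out_ch, k, stride=1):
    """Conv + BatchNorm + ReLU block (an assumed building unit, not from the paper)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, stride=stride, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class FeaturePyramidAttention(nn.Module):
    """Pyramid of 7x7 / 5x5 / 3x3 convs producing an attention map over high-level features."""

    def __init__(self, channels):
        super().__init__()
        self.mid = conv_bn_relu(channels, channels, 1)           # main 1x1 branch
        self.down1 = conv_bn_relu(channels, channels, 7, stride=2)
        self.down2 = conv_bn_relu(channels, channels, 5, stride=2)
        self.down3 = conv_bn_relu(channels, channels, 3, stride=2)
        self.lat1 = conv_bn_relu(channels, channels, 7)
        self.lat2 = conv_bn_relu(channels, channels, 5)
        self.lat3 = conv_bn_relu(channels, channels, 3)
        self.global_conv = nn.Conv2d(channels, channels, 1)      # global-context branch

    def forward(self, x):
        h, w = x.shape[2:]
        mid = self.mid(x)

        # Build the pyramid by progressive striding, then merge it back up.
        d1 = self.down1(x)
        d2 = self.down2(d1)
        d3 = self.down3(d2)
        p3 = self.lat3(d3)
        p2 = self.lat2(d2) + F.interpolate(p3, size=d2.shape[2:], mode="bilinear", align_corners=False)
        p1 = self.lat1(d1) + F.interpolate(p2, size=d1.shape[2:], mode="bilinear", align_corners=False)
        attn = F.interpolate(p1, size=(h, w), mode="bilinear", align_corners=False)

        # Multiplicative pyramid attention plus an additive global-pooling term.
        gp = self.global_conv(F.adaptive_avg_pool2d(x, 1))       # (N, C, 1, 1), broadcasts
        return mid * attn + gp


class GlobalAttentionUpsample(nn.Module):
    """Global context from high-level features gates the low-level features channel-wise."""

    def __init__(self, low_ch, high_ch, out_ch):
        super().__init__()
        self.low_conv = conv_bn_relu(low_ch, out_ch, 3)          # 3x3 conv on low-level features
        self.gate = nn.Sequential(                               # global pool -> 1x1 conv gate
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(high_ch, out_ch, 1),
            nn.ReLU(inplace=True),
        )
        self.high_proj = conv_bn_relu(high_ch, out_ch, 1)        # assumed channel projection

    def forward(self, low, high):
        low = self.low_conv(low)
        gate = self.gate(high)                                   # (N, out_ch, 1, 1) channel weights
        high_up = F.interpolate(self.high_proj(high), size=low.shape[2:],
                                mode="bilinear", align_corners=False)
        return low * gate + high_up                              # guided low-level + upsampled high-level


# Example: fuse a 1/16-resolution high-level map with a 1/8-resolution low-level map.
fpa = FeaturePyramidAttention(512)
gau = GlobalAttentionUpsample(low_ch=256, high_ch=512, out_ch=512)
high = fpa(torch.randn(2, 512, 32, 32))
out = gau(torch.randn(2, 256, 64, 64), high)                     # -> (2, 512, 64, 64)
```

In a full network, FPA would sit on top of the backbone's final feature map, and one GAU would be attached to each decoder stage, combining that stage's low-level features with the upsampled output of the stage above, in line with the paper's description of a GAU on each decoder layer.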

Experimental Results

PAN establishes new benchmarks on PASCAL VOC 2012 and Cityscapes datasets without pre-training on the larger COCO dataset, achieving notable improvements:

  • PASCAL VOC 2012: PAN reaches a mean Intersection-over-Union (mIoU) of 84.0%, outperforming previous state-of-the-art methods such as PSPNet and DeepLabv3.
  • Cityscapes: The network achieves a high mIoU of 78.6%, affirming its robustness and efficacy in urban scene understanding.
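Both benchmarks are reported in mean Intersection-over-Union. As a reminder of what that number measures, the snippet below shows the standard confusion-matrix computation of mIoU; it is the conventional metric definition rather than evaluation code from the paper, and the `ignore_index=255` default follows the usual PASCAL VOC / Cityscapes labeling convention.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """pred and target are integer label maps of identical shape."""
    mask = target != ignore_index
    pred, target = pred[mask], target[mask]
    # Confusion matrix: rows = ground truth, columns = prediction.
    cm = np.bincount(num_classes * target + pred,
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    intersection = np.diag(cm)
    union = cm.sum(0) + cm.sum(1) - intersection
    iou = intersection / np.maximum(union, 1)
    return iou[union > 0].mean()  # average over classes present in the data
```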

Evaluation and Comparisons

A thorough ablation study demonstrates the advantages of the proposed modules:

  • FPA Module: It consistently enhances pixel-wise classification accuracy. For instance, configurations using average pooling (AVE) outperform those using max pooling.
  • GAU Module: Incorporating global context significantly bolsters performance, confirming that high-level semantic guidance is pivotal in resolving fine-grained details.

Implications and Future Directions

The PAN framework's robust performance implies substantial practical applications in real-time segmentation tasks, such as autonomous driving and medical imaging, where precise pixel-level classification is critical. The emphasis on efficient computation (avoiding dilated convolutions) and effective decoders positions PAN as a viable option for resource-constrained environments.

Future research could explore integrating PAN with transformer-based architectures, which might further enhance its ability to capture global dependencies. Additionally, extending the evaluation to datasets beyond PASCAL VOC and Cityscapes would validate its performance across a broader range of contexts.

In summary, the Pyramid Attention Network marks a significant advancement in semantic segmentation, demonstrating the potential of combining attention mechanisms with spatial pyramid structures to achieve superior performance.