- The paper introduces a novel axial-attention mechanism that factorizes 2D self-attention into sequential 1D attentions to extend receptive fields and reduce computation.
- It reports a 2.8% PQ improvement over the previous bottom-up state of the art on COCO and a smaller variant that is 3.8 times more parameter-efficient than prior models.
- The approach enables high-performing segmentation models to be deployed in resource-constrained environments like mobile and edge devices.
Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation
The paper "Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation" presents an innovative approach to employing axial-attention mechanisms for panoptic segmentation, offering a paradigm shift from traditional convolutional architectures towards fully attentional models. This work brings forth a method to enhance model efficiency and effectiveness in capturing long-range dependencies within visual tasks.
Core Contributions
The principal contribution is the axial-attention mechanism, which factorizes conventional 2D self-attention into two sequential 1D attentions, one along the height axis and one along the width axis. This factorization yields significant computational savings while extending receptive fields to the full image, enabling efficient global context modeling. The paper also introduces a position-sensitive attention design that adds learned relative positional encodings not only to the queries but also to the keys and values, so attention logits and outputs can depend on precise relative positions, a property that matters for dense prediction tasks such as segmentation.
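To make the savings concrete: global 2D self-attention over an h × w feature map attends hw tokens to hw tokens, costing O(h²w²), whereas two axis-aligned 1D attentions cost O(hw(h+w)). The sketch below restates this comparison and the position-sensitive attention along one axis as the paper describes it, with r^q, r^k, r^v denoting the learned relative positional encodings for queries, keys, and values (notation lightly simplified).

```latex
% Complexity of attending over an h x w feature map (hw tokens):
%   global 2D self-attention:  O((hw)^2) = O(h^2 w^2)
%   axial factorization:       O(hw \cdot h + hw \cdot w) = O(hw(h + w))
%
% Position-sensitive attention along one axis with span N, as
% described in the paper (notation lightly simplified):
y_o = \sum_{p \in N} \operatorname{softmax}_p\!\left(
        q_o^{\top} k_p + q_o^{\top} r^{q}_{p-o} + k_p^{\top} r^{k}_{p-o}
      \right) \left( v_p + r^{v}_{p-o} \right)
```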
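A minimal PyTorch sketch of the factorization follows. The module names, the `heads` parameter, and the omission of the positional terms above are simplifications for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class Attention1D(nn.Module):
    """Multi-head self-attention over a 1D sequence.

    Simplified sketch: the position-sensitive terms from the paper
    are omitted for brevity.
    """

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        assert dim % heads == 0
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, dim), one row or one column of the feature map
        b, n, d = x.shape
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        # Split into heads: (batch, heads, length, dim_per_head)
        q, k, v = (t.reshape(b, n, self.heads, -1).transpose(1, 2) for t in (q, k, v))
        attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.proj(out)


class AxialBlock(nn.Module):
    """Factorized 2D attention: attend along the height axis, then the width axis."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.height_attn = Attention1D(dim, heads)
        self.width_attn = Attention1D(dim, heads)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, height, width, dim)
        b, h, w, d = x.shape
        # Height axis: each of the w columns is an independent length-h sequence.
        x = x.permute(0, 2, 1, 3).reshape(b * w, h, d)
        x = self.height_attn(x)
        x = x.reshape(b, w, h, d).permute(0, 2, 1, 3)
        # Width axis: each of the h rows is an independent length-w sequence.
        x = x.reshape(b * h, w, d)
        x = self.width_attn(x)
        return x.reshape(b, h, w, d)


if __name__ == "__main__":
    block = AxialBlock(dim=64, heads=8)
    feats = torch.randn(2, 32, 32, 64)  # (batch, height, width, channels)
    print(block(feats).shape)  # torch.Size([2, 32, 32, 64])
```

Stacking the height-axis and width-axis attentions gives every output position a path to every input position, so global context is recovered at the cost of two 1D attentions.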
Axial-DeepLab is evaluated on ImageNet, COCO, Mapillary Vistas, and Cityscapes. Notably, it achieves the best reported accuracy among stand-alone self-attention models on ImageNet and improves on the previous bottom-up panoptic segmentation state of the art on COCO by 2.8% PQ.
Numerical Performance and Implications
Numerically, Axial-DeepLab achieves significant reductions in computation without compromising accuracy. A smaller model variant is reported to be 3.8 times more parameter-efficient and 27 times more computationally efficient (in terms of M-Adds) than the previous state of the art. These are not merely incremental gains: they make it practical to deploy state-of-the-art segmentation models in resource-constrained environments, a crucial consideration for real-world applications.
Theoretical and Practical Implications
The findings carry substantial theoretical and practical implications. Theoretically, the adoption of axial attention challenges the prevalent reliance on convolutional operations, offering a feasible alternative for future architectures that must capture extensive context efficiently. Practically, the reductions in computation and parameter count open pathways for deploying high-performing computer vision models across platforms, including mobile and edge devices, that traditionally struggle with computational constraints.
Speculation on Future Developments
Looking ahead, the results and methodology set the stage for further exploration of axial-attention-based models. Future research may optimize axial attention for different hardware setups, potentially leading to broader adoption. Its application in other domains, such as video processing or temporal sequence modeling, could also leverage its strengths in modeling context across both spatial and temporal dimensions.
In summary, "Axial-DeepLab" introduces an impactful methodology for improving panoptic segmentation through axial-attention. Its demonstrated efficiency in both parameter usage and computation positions it as a forward-looking contribution to computer vision research, paving the way for more expansive applications in AI.