- The paper introduces involution, an operator that inverts the spatial-agnostic, channel-specific properties of traditional convolution to enable dynamic modeling of spatial dependencies.
- It employs a spatial-specific, channel-agnostic design with dynamic kernel generation, leading to up to 1.6% higher top-1 accuracy and 34.1% less computation on ImageNet.
- Experimental results on COCO and Cityscapes validate its efficiency, marking significant gains in object detection and segmentation tasks.
Involution: Inverting the Inherence of Convolution for Visual Recognition
This paper introduces the concept of "involution," a novel operator designed to address limitations inherent in standard convolution as used in deep neural networks for visual tasks. Originating from a critical rethinking of convolution's spatial-agnostic and channel-specific properties, involution inverts these principles, offering a different paradigm for visual representation learning.
Core Contributions and Methodology
The involution operator is presented as a fundamental building block for new neural network models, challenging conventional convolution's spatial-agnostic nature. Unlike standard convolution, which applies the same kernel at every spatial location while keeping distinct filters per channel, involution is spatially-specific but channel-agnostic: its kernels adapt to the local spatial context while being shared across channels. Because the kernel cost no longer scales with the channel count, each output position can aggregate context over a wider neighborhood, allowing networks built on involution to capture intricate, long-range spatial dependencies efficiently.
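To make the contrast concrete, the operator can be written per output position. The formula below is a paraphrase in adapted notation rather than a quotation from the summary above; the symbols $\Delta_K$, $G$, $r$, $W_0$, and $W_1$ are introduced here for illustration:

$$
Y_{i,j,k} \;=\; \sum_{(u,v)\in\Delta_K} \mathcal{H}_{i,j,\,u+\lfloor K/2 \rfloor,\, v+\lfloor K/2 \rfloor,\,\lceil kG/C \rceil}\; X_{i+u,\,j+v,\,k},
\qquad
\mathcal{H}_{i,j} \;=\; \phi(X_{i,j}) \;=\; W_1\,\sigma(W_0\, X_{i,j}),
$$

where $\Delta_K$ enumerates the offsets of a $K \times K$ neighborhood, $C$ is the channel count, $G$ is the number of channel groups sharing a kernel, and $W_0 \in \mathbb{R}^{(C/r) \times C}$, $W_1 \in \mathbb{R}^{(K^2 G) \times (C/r)}$ are the two linear maps of the kernel-generation function $\phi$ with reduction ratio $r$. A convolution would instead use a single learned filter bank $\mathcal{F}$ that is independent of the position $(i,j)$ but distinct for every input-output channel pair.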
Key technical details include:
- Spatial-Specific and Channel-Agnostic Design: By tailoring involution kernels to specific spatial locations, the approach facilitates modeling of dynamic spatial interactions while sharing parameters across channels.
- Efficient Kernel Generation: Instead of using a static kernel shared across all inputs, the involution kernel at each location is generated dynamically from the input feature at that location, which promotes adaptability and efficiency (see the sketch after this list).
- Implementation: Building on the ResNet architecture, RedNet is introduced as a series of models utilizing involution. RedNet replaces convolution with involution at key positions in the network, delivering a favorable trade-off between accuracy and computational cost.
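As a concrete illustration of the design points above, the following is a minimal PyTorch-style sketch of a stride-1 involution layer. The class name Involution2d and the default hyperparameters (kernel_size, groups, reduction) are illustrative assumptions and may differ from the official RedNet implementation.

```python
import torch
import torch.nn as nn


class Involution2d(nn.Module):
    """Minimal stride-1 involution: spatially-specific, channel-agnostic kernels.

    Hypothetical sketch; hyperparameter defaults are illustrative, not the
    paper's official configuration.
    """

    def __init__(self, channels, kernel_size=7, groups=16, reduction=4):
        super().__init__()
        assert channels % groups == 0
        self.k, self.g = kernel_size, groups
        # Kernel generation: two pointwise convs map each pixel's feature to
        # K*K weights per group (shared by every channel within that group).
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.BatchNorm2d(channels // reduction),
            nn.ReLU(inplace=True),
        )
        self.span = nn.Conv2d(channels // reduction, kernel_size ** 2 * groups, 1)
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        # 1) Per-pixel, per-group kernels conditioned on the input feature.
        kernel = self.span(self.reduce(x)).view(b, self.g, self.k ** 2, h, w)
        # 2) K x K neighborhoods of the input, grouped along channels.
        patches = self.unfold(x).view(b, self.g, c // self.g, self.k ** 2, h, w)
        # 3) Location-specific weighted sum over each neighborhood.
        out = (kernel.unsqueeze(2) * patches).sum(dim=3)  # (B, G, C/G, H, W)
        return out.view(b, c, h, w)


# Example: a 7x7 involution on a ResNet-50-sized feature map.
x = torch.randn(2, 256, 56, 56)
y = Involution2d(256)(x)
print(y.shape)  # torch.Size([2, 256, 56, 56])
```

In a RedNet-style block, a module like this would stand in for the spatial (e.g., 3x3) convolution, while inexpensive pointwise convolutions continue to handle channel mixing.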
Experimental Results
The experimental validation across several benchmarks shows superior performance with reduced computational costs. Specifically:
- On ImageNet, the RedNet-50 model outperforms ResNet-50, achieving up to 1.6% higher top-1 accuracy with 34.1% less computational cost.
- For object detection and segmentation tasks on COCO and Cityscapes datasets, RedNet models showed significant improvements over their convolutional counterparts, achieving notable gains in key performance metrics such as bounding box AP and mean IoU.
- Involution demonstrated a marked capability in segmenting large objects, attributed to its ability to perform extended spatial interactions.
Theoretical and Practical Implications
The introduction of involution not only challenges existing frameworks dominated by convolution but also proposes a more flexible alternative that can potentially unify principles from self-attention mechanisms and convolution. It redefines the approach to designing neural network architectures with a focus on dynamic spatial specificity.
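One way to read the unification claim concretely (a hedged interpretation using the adapted notation from the formula above, not a statement quoted from this summary): multi-head self-attention can be viewed as an involution whose per-position kernel is produced by query-key affinities rather than by a small two-layer map of the query feature alone,

$$
\mathcal{H}_{i,j} \;=\; \operatorname{softmax}\!\bigl( (X_{i,j} W^{Q})\,(X W^{K})^{\top} \bigr),
$$

with attention heads playing the role of involution's channel groups. Involution keeps the spatially-varying kernel but forgoes explicit pixel-to-pixel affinity computation and positional encodings, which is largely where its efficiency advantage comes from.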
Practical Implications: The efficiency gains suggest potential scaling advantages in deploying models in resource-constrained environments or on large datasets.
Theoretical Potential: Involution opens new research avenues for optimizing feature extraction processes while maintaining high adaptability to varying spatial contexts. Future work could explore the integration of involution into broader neural architecture search mechanisms, potentially leading to the discovery of more refined models across different domains.
The work serves as a meaningful step towards re-evaluating foundational assumptions in deep learning architecture, particularly for vision-based tasks, and offers a perspective that intersects dynamically parameterized operations and spatial specificity.
Involution represents a vehicle for advancing the design of neural networks beyond conventional convolutional frameworks toward spatially-aware, efficient processing, setting the stage for future exploration in this vibrant area of research.