- The paper introduces dynamic convolution, which aggregates multiple convolution kernels via input-dependent attention to increase a CNN's representation power without increasing network depth or width.
- The method delivers clear gains: on MobileNetV3-Small, it raises ImageNet top-1 accuracy by 2.9% with only a 4% increase in FLOPs.
- Dynamic convolution offers a scalable design for efficient CNNs, enabling advanced model performance on mobile and resource-constrained devices.
Dynamic Convolution: Attention over Convolution Kernels
The paper "Dynamic Convolution: Attention over Convolution Kernels" by Yinpeng Chen et al. presents an innovative approach to enhancing the performance of lightweight convolutional neural networks (CNNs). Lightweight CNNs typically experience performance degradation due to constraints on their depth (number of convolution layers) and width (number of channels), which limits their representation capabilities. This paper introduces dynamic convolution as a method to increase model complexity and representation power without increasing the network’s depth or width.
Instead of using a single convolution kernel per layer, dynamic convolution keeps several parallel kernels and aggregates them per input, using attention weights computed from the input itself (a lightweight squeeze-and-excitation-style module followed by a softmax). Because the attention weights change with each input, the aggregation is non-linear, giving the layer greater representation power than a static convolution; and because the kernels are small and the attention module is cheap, the extra computation remains modest. A minimal sketch of this mechanism follows.
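The PyTorch sketch below illustrates the kernel-aggregation idea; it is not the authors' reference implementation, and the class name DynamicConv2d, the attention-module layout, and hyper-parameters such as num_kernels and temperature are illustrative assumptions. It mixes K parallel kernels with softmax attention weights derived from the input, then applies the aggregated kernel as a single convolution (handled per sample via a grouped convolution).

```python
# Minimal sketch of dynamic convolution: K parallel kernels mixed by
# input-dependent attention. Names and hyper-parameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size, num_kernels=4,
                 stride=1, padding=0, temperature=30.0):
        super().__init__()
        self.stride, self.padding = stride, padding
        self.temperature = temperature  # softens the softmax early in training
        # K parallel kernels and biases, aggregated per input sample.
        self.weight = nn.Parameter(
            torch.randn(num_kernels, out_ch, in_ch, kernel_size, kernel_size) * 0.01)
        self.bias = nn.Parameter(torch.zeros(num_kernels, out_ch))
        # Squeeze-and-excitation-style attention: global pool -> MLP -> K logits.
        hidden = max(in_ch // 4, 1)
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, num_kernels))

    def forward(self, x):
        b, c, h, w = x.shape
        # Per-sample attention over the K kernels; softmax makes weights sum to 1.
        pi = F.softmax(self.attn(x) / self.temperature, dim=1)          # (B, K)
        # Aggregate: W(x) = sum_k pi_k(x) * W_k, and likewise for the bias.
        k, oc, ic, kh, kw = self.weight.shape
        agg_w = torch.einsum('bk,koihw->boihw', pi, self.weight)        # (B, O, I, kh, kw)
        agg_b = torch.einsum('bk,ko->bo', pi, self.bias)                # (B, O)
        # Apply one convolution per sample using a grouped convolution trick.
        x = x.reshape(1, b * c, h, w)
        agg_w = agg_w.reshape(b * oc, ic, kh, kw)
        out = F.conv2d(x, agg_w, agg_b.reshape(-1), stride=self.stride,
                       padding=self.padding, groups=b)
        return out.reshape(b, oc, out.shape[-2], out.shape[-1])
```

The aggregation adds only a K-term weighted sum of small kernels plus a tiny attention MLP on top of the convolution itself, which is why the FLOPs overhead stays small.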
Numerical Results
The implementation of dynamic convolution in the state-of-the-art MobileNetV3 architecture demonstrates substantial improvements:
- MobileNetV3-Small: The top-1 accuracy of ImageNet classification is boosted by 2.9% with only a 4% increase in FLOPs.
- COCO Keypoint Detection: A gain of 2.9 AP is achieved on the challenging COCO keypoint detection task.
These improvements illustrate the efficacy of dynamic convolution in enhancing the performance of efficient CNNs without incurring a significant increase in computational cost.
Implications and Future Developments
The implications of this research are multifaceted:
- Efficient Model Design: Dynamic convolution provides a way to construct more complex models under strict computational budgets, which is crucial for deploying deep learning models on mobile devices and other resource-constrained environments.
- Transferability: Dynamic convolution can be easily integrated into existing architectures, making it a versatile drop-in tool for improving various models (see the usage sketch after this list).
- Theoretical Insights: The dynamic aggregation of convolution kernels adds a new dimension to our understanding of how CNNs can adaptively enhance their learning capacities based on input complexity.
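As a hypothetical illustration of that drop-in integration, the snippet below swaps the DynamicConv2d sketch defined earlier into a small block in place of a static nn.Conv2d; the layer's interface and output shape are unchanged.

```python
import torch
import torch.nn as nn

# Hypothetical drop-in replacement: a static 3x3 conv block rewritten to use
# the DynamicConv2d sketch above; the rest of the block is untouched.
block = nn.Sequential(
    DynamicConv2d(32, 64, kernel_size=3, padding=1, num_kernels=4),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)
y = block(torch.randn(8, 32, 56, 56))  # -> shape (8, 64, 56, 56)
```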
As for future developments, one can speculate that the principles behind dynamic convolution may be extended to other types of operations within neural networks. For instance, dynamic attention mechanisms could be explored within recurrent neural networks (RNNs) or transformer models, further enhancing the adaptive capabilities of these architectures.
Conclusion
Dynamic convolution represents a significant step forward in the design of efficient, high-performance CNNs. By increasing representation capacity through the non-linear aggregation of multiple convolution kernels, it achieves substantial performance improvements with minimal additional computational overhead. The method holds promise for a wide range of applications, particularly those that require deploying sophisticated deep learning models on devices with limited computational resources. The paper makes a compelling case for adopting dynamic convolution as a standard component in future CNN architectures.