
Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution (1904.05049v3)

Published 10 Apr 2019 in cs.CV

Abstract: In natural images, information is conveyed at different frequencies where higher frequencies are usually encoded with fine details and lower frequencies are usually encoded with global structures. Similarly, the output feature maps of a convolution layer can also be seen as a mixture of information at different frequencies. In this work, we propose to factorize the mixed feature maps by their frequencies, and design a novel Octave Convolution (OctConv) operation to store and process feature maps that vary spatially "slower" at a lower spatial resolution reducing both memory and computation cost. Unlike existing multi-scale methods, OctConv is formulated as a single, generic, plug-and-play convolutional unit that can be used as a direct replacement of (vanilla) convolutions without any adjustments in the network architecture. It is also orthogonal and complementary to methods that suggest better topologies or reduce channel-wise redundancy like group or depth-wise convolutions. We experimentally show that by simply replacing convolutions with OctConv, we can consistently boost accuracy for both image and video recognition tasks, while reducing memory and computational cost. An OctConv-equipped ResNet-152 can achieve 82.9% top-1 classification accuracy on ImageNet with merely 22.2 GFLOPs.

Citations (527)

Summary

  • The paper introduces Octave Convolution, a novel technique that factorizes feature maps into high- and low-frequency components to reduce computational cost and enhance efficiency.
  • The method is a plug-and-play replacement for standard convolutions and achieves up to 75% reduction in FLOPs while maintaining or improving classification accuracy.
  • Extensive ablation studies on architectures like ResNets, MobileNets, and DenseNets validate OctConv's broad applicability and effectiveness in both image and video recognition tasks.

An Overview of "Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution"

The paper "Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution" addresses the spatial redundancy endemic in convolutional neural networks (CNNs) by introducing a novel convolution operation, termed Octave Convolution (OctConv). This approach seeks to efficiently manage different spatial frequency components within the feature maps, offering an enhancement in both computational efficiency and model accuracy.

Technical Contributions and Insights

The core idea of OctConv involves factorizing the feature maps into high-frequency and low-frequency components. This aligns with scale-space theory, where lower frequencies capture global structures and higher frequencies represent finer details. By processing the low-frequency components at a reduced spatial resolution, OctConv reduces both the spatial redundancy and the computational resources required.
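Concretely, each feature tensor is split into a full-resolution high-frequency part and a half-resolution low-frequency part, with a ratio $\alpha \in [0, 1]$ controlling the fraction of channels assigned to the low-frequency path, and each output frequency receives contributions from both input frequencies. The equations below are a sketch reconstructed from this description (the paper uses average pooling for downsampling and nearest-neighbor interpolation for upsampling):

$$X = \{X^H, X^L\}, \qquad X^H \in \mathbb{R}^{(1-\alpha)c \times h \times w}, \quad X^L \in \mathbb{R}^{\alpha c \times \frac{h}{2} \times \frac{w}{2}}$$

$$Y^H = f\left(X^H; W^{H \to H}\right) + \mathrm{upsample}\left(f\left(X^L; W^{L \to H}\right), 2\right)$$

$$Y^L = f\left(X^L; W^{L \to L}\right) + f\left(\mathrm{pool}\left(X^H, 2\right); W^{H \to L}\right)$$

where $f(\cdot\,; W)$ denotes a vanilla convolution with weights $W$. Since the low-frequency path operates on maps with a quarter of the spatial locations, a back-of-the-envelope count (ignoring the cheap pooling and upsampling operations) puts the per-layer multiply-add cost at $1 - \frac{3}{4}\alpha(2 - \alpha)$ times that of a vanilla convolution, which approaches the 75% savings cited above as $\alpha \to 1$.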

  1. Octave Feature Representation: The innovation lies in representing features at two different spatial frequencies. The low-frequency features are stored at half the spatial resolution (one octave lower), which decreases memory usage and computational cost. Crucially, the two paths are not isolated: the inter-frequency terms in the update equations above let information flow between resolutions at every layer.
  2. Implementation of OctConv: OctConv is designed as a plug-and-play replacement for the vanilla convolution, requiring no changes to the surrounding network architecture, and it composes with channel-wise efficiency techniques such as group and depth-wise convolutions. This compatibility makes it a versatile building block for CNN architectures; a minimal sketch follows this list.
  3. Efficiency and Performance: The experimental results presented in the paper affirm OctConv's efficacy. The implementation achieves a reduction in FLOPs, up to 75% in certain configurations, while improving or maintaining classification accuracy across a variety of CNN architectures. A ResNet-152 equipped with OctConv achieved 82.9% top-1 accuracy on ImageNet using only 22.2 GFLOPs.
  4. Ablation Studies: Detailed experiments were conducted on popular CNN architectures, including ResNets, MobileNets, and DenseNets, validating the proposed method. These studies show that OctConv can effectively reduce computational cost without degrading accuracy.
  5. Video Action Recognition: Beyond image classification, OctConv's potential was tested on video action recognition tasks. Here again, OctConv demonstrated increased accuracy and reduced computational load, underscoring its applicability to 3D CNNs.
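To make points 1 and 2 concrete, below is a minimal PyTorch sketch of an OctConv layer. This is an illustrative reconstruction from the paper's description, not the authors' implementation: the class name OctConv2d and the parameters alpha_in/alpha_out are hypothetical, and nearest-neighbor upsampling plus average pooling are assumed as the inter-frequency resizing operators.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class OctConv2d(nn.Module):
    """Illustrative Octave Convolution sketch (names are hypothetical).

    alpha_in / alpha_out: fraction of input / output channels routed
    through the low-frequency (half-resolution) path.
    """

    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1,
                 alpha_in=0.5, alpha_out=0.5):
        super().__init__()
        lf_in, lf_out = int(alpha_in * in_ch), int(alpha_out * out_ch)
        hf_in, hf_out = in_ch - lf_in, out_ch - lf_out
        # Four convolutions cover the intra- and inter-frequency paths.
        self.hh = nn.Conv2d(hf_in, hf_out, kernel_size, padding=padding)  # H -> H
        self.hl = nn.Conv2d(hf_in, lf_out, kernel_size, padding=padding)  # H -> L
        self.lh = nn.Conv2d(lf_in, hf_out, kernel_size, padding=padding)  # L -> H
        self.ll = nn.Conv2d(lf_in, lf_out, kernel_size, padding=padding)  # L -> L

    def forward(self, x_h, x_l):
        # High-frequency output: full-resolution path plus upsampled low path.
        y_h = self.hh(x_h) + F.interpolate(
            self.lh(x_l), scale_factor=2, mode="nearest")
        # Low-frequency output: pooled high path plus half-resolution low path.
        y_l = self.ll(x_l) + self.hl(F.avg_pool2d(x_h, kernel_size=2))
        return y_h, y_l


# Usage: a 64-channel input split evenly between the two frequency paths.
x_h = torch.randn(1, 32, 32, 32)  # high-frequency tensor, full resolution
x_l = torch.randn(1, 32, 16, 16)  # low-frequency tensor, half resolution
y_h, y_l = OctConv2d(64, 64)(x_h, x_l)
print(y_h.shape, y_l.shape)  # (1, 32, 32, 32) and (1, 32, 16, 16)
```

In the full design, the first and last OctConv layers set the input or output ratio to 0 on one side so the network's external tensor interface is unchanged, which is what makes drop-in replacement possible (handling the resulting empty path is omitted above for brevity).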

Implications and Speculation

The OctConv method provides theoretical and empirical evidence that explicitly modeling the spatial-frequency structure of feature maps can yield substantial gains in both efficiency and accuracy. The plug-and-play nature of OctConv suggests potential for widespread adoption across CNN-based tasks, possibly influencing future CNN architecture designs.

Moving forward, we might anticipate further exploration of adaptive frequency decomposition in neural networks, possibly integrating more complex frequency transformations or nonlinear mappings. Combining OctConv with other architecture optimization techniques, such as neural architecture search or automated compression, could also yield even more efficient and powerful models.

In conclusion, the introduction of Octave Convolution represents a meaningful contribution to the field of efficient neural network design, with significant impacts on both theoretical understanding and practical application.
