
DiCENet: Dimension-wise Convolutions for Efficient Networks

Published 8 Jun 2019 in cs.CV, cs.LG, and eess.IV | arXiv:1906.03516v3

Abstract: We introduce a novel and generic convolutional unit, DiCE unit, that is built using dimension-wise convolutions and dimension-wise fusion. The dimension-wise convolutions apply light-weight convolutional filtering across each dimension of the input tensor while dimension-wise fusion efficiently combines these dimension-wise representations; allowing the DiCE unit to efficiently encode spatial and channel-wise information contained in the input tensor. The DiCE unit is simple and can be seamlessly integrated with any architecture to improve its efficiency and performance. Compared to depth-wise separable convolutions, the DiCE unit shows significant improvements across different architectures. When DiCE units are stacked to build the DiCENet model, we observe significant improvements over state-of-the-art models across various computer vision tasks including image classification, object detection, and semantic segmentation. On the ImageNet dataset, the DiCENet delivers 2-4% higher accuracy than state-of-the-art manually designed models (e.g., MobileNetv2 and ShuffleNetv2). Also, DiCENet generalizes better to tasks (e.g., object detection) that are often used in resource-constrained devices in comparison to state-of-the-art separable convolution-based efficient networks, including neural search-based methods (e.g., MobileNetv3 and MixNet). Our source code in PyTorch is open-source and is available at https://github.com/sacmehta/EdgeNets/


Summary

  • The paper presents the DiCE unit, which uses dimension-wise convolutions and fusion to efficiently capture spatial and channel information.
  • It decomposes convolutions across dimensions, achieving 2-4% higher ImageNet accuracy and improved performance in object detection and segmentation.
  • The approach reduces computational FLOPs, making it ideal for resource-constrained applications and future neural architecture search explorations.

Overview of DiCENet: Dimension-wise Convolutions for Efficient Networks

The paper introduces DiCENet, an innovative approach in convolutional neural network (CNN) design, which leverages dimension-wise convolutions (DimConv) and dimension-wise fusion (DimFuse) to enhance spatial and channel-wise representation efficiency. This approach addresses the computational burden of standard convolutional layers by decomposing convolutions across input tensor dimensions, presenting a viable alternative to the well-explored depth-wise separable convolutions commonly utilized in efficient network designs.
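The decomposition can be illustrated with a minimal PyTorch sketch, not the authors' implementation: each tensor dimension (channel, height, width) is permuted into the channel slot and filtered with a depth-wise convolution, and the three dimension-wise responses are concatenated. The fixed `height`/`width` constructor arguments are an assumption of this sketch made so the per-dimension group counts are known up front.

```python
import torch
import torch.nn as nn

class DimConv(nn.Module):
    """Sketch of dimension-wise convolutions (DimConv): a depth-wise
    k x k convolution is applied along each dimension of the input
    tensor by permuting that dimension into the channel slot."""

    def __init__(self, channels, height, width, k=3):
        super().__init__()
        p = k // 2
        # channel-wise: standard depth-wise conv over the (H, W) plane
        self.conv_c = nn.Conv2d(channels, channels, k, padding=p, groups=channels)
        # height-wise: depth-wise conv over the (C, W) plane
        self.conv_h = nn.Conv2d(height, height, k, padding=p, groups=height)
        # width-wise: depth-wise conv over the (C, H) plane
        self.conv_w = nn.Conv2d(width, width, k, padding=p, groups=width)

    def forward(self, x):  # x: (N, C, H, W)
        y_c = self.conv_c(x)
        # move H into the channel slot, filter, move it back
        y_h = self.conv_h(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)
        # move W into the channel slot, filter, move it back
        y_w = self.conv_w(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)
        # concatenate the three dimension-wise responses along channels
        return torch.cat([y_c, y_h, y_w], dim=1)  # (N, 3C, H, W)
```

On a C×H×W input this yields 3C dimension-wise feature maps, which the DimFuse step then combines in place of an expensive point-wise convolution.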

Contributions

  1. DiCE Unit: The central contribution lies in the DiCE unit, a convolutional mechanism that applies lightweight filtering across each dimension of the input tensor, effectively encoding spatial and channel-wise information without the heavy computational demands of point-wise operations.
  2. DimConv and DimFuse: By extending depth-wise convolutions to every dimension, DimConv enables nuanced, dimension-centric representation learning. DimFuse further optimizes this by combining dimension-wise features efficiently, minimizing reliance on costly point-wise convolutions.
  3. Performance and Efficiency: Integrating DiCE units into various architectural frameworks, notably MobileNet, ResNet, and ShuffleNetv2, yields substantial accuracy and efficiency gains on benchmarks such as ImageNet, including a 2-4% accuracy increase over manually designed models like MobileNetv2 and ShuffleNetv2.
  4. Task-Level Generalization: DiCENet shows superior task generalization in computer vision applications, notably outperforming state-of-the-art models in object detection and semantic segmentation tasks, which are critical for real-world deployment on resource-constrained devices.

Numerical Results

  • On the ImageNet dataset, DiCENet delivers 2-4% higher accuracy than comparable manually designed models, and on MS-COCO object detection with SSD it yields a 3% increase in mean average precision over MobileNetv3 and MixNet.
  • In terms of FLOPs, DiCENet offers significant reductions over separable-convolution baselines at comparable accuracy, balancing accuracy against computational cost.
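As a rough back-of-the-envelope comparison (my own arithmetic, not figures from the paper), the multiply-add counts of the three filtering schemes can be computed directly; note the dimension-wise figure counts filtering only and omits the DimFuse step:

```python
def conv_flops(c_in, c_out, h, w, k=3):
    """Multiply-adds for a standard k x k convolution."""
    return k * k * c_in * c_out * h * w

def separable_flops(c_in, c_out, h, w, k=3):
    """Depth-wise k x k convolution followed by a 1x1 point-wise convolution."""
    return k * k * c_in * h * w + c_in * c_out * h * w

def dimconv_flops(c, h, w, k=3):
    """Dimension-wise filtering only: one depth-wise k x k convolution per
    tensor dimension (channel, height, width); DimFuse is not counted."""
    return 3 * k * k * c * h * w

# example: a 56x56 feature map with 128 input/output channels
print(conv_flops(128, 128, 56, 56))       # ~462M mult-adds
print(separable_flops(128, 128, 56, 56))  # ~55M
print(dimconv_flops(128, 56, 56))         # ~10.8M (filtering only)
```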

Implications and Future Directions

DiCENet paves the way for more efficient CNN designs by addressing the computational limitations of traditional convolutional processes. The work is especially relevant for enhancing model performance on edge devices where computational and energy resources are limited.

The successful deployment of the DiCE unit suggests promising avenues for further research, particularly in neural architecture search (NAS) strategies. Incorporating DiCE units into NAS could potentially yield new architectures with optimized performance metrics.

Moreover, exploring diverse datasets and extending experiments to other application domains can solidify the utility of DiCENet across a broader spectrum of AI challenges.

In summary, DiCENet offers a significant advancement in CNN efficiency, providing a foundation for developing faster, more accurate, and resource-conscious neural models. Its practicality and potential for integration within NAS methodologies stand to influence next-generation CNN architectures that demand efficient operations across all tensor dimensions.
