- The paper introduces Position Aware Circular Convolution to efficiently capture both local and global features in vision tasks.
- The methodology integrates channel-wise attention in a ParC block to merge ConvNet efficiency with Transformer-style attention, achieving 78.6% top-1 accuracy on ImageNet with reduced model size.
- Experimental evaluations highlight ParC-Net’s advantages in object detection and segmentation, demonstrating its suitability for resource-constrained devices.
Overview of "ParC-Net: Position Aware Circular Convolution with Merits from ConvNets and Transformer"
The paper "ParC-Net: Position Aware Circular Convolution with Merits from ConvNets and Transformer" presents a novel model architecture that synthesizes characteristics of Convolutional Neural Networks (ConvNets) and Vision Transformers (ViTs) to enhance performance on resource-constrained devices. The paper acknowledges the current landscape where ViTs have achieved impressive outcomes in large-scale vision tasks, yet traditional ConvNets remain advantageous for small models due to their efficiency and ease of training.
ParC-Net Architecture
Position Aware Circular Convolution (ParC):
The core innovation in this work is the introduction of Position Aware Circular Convolutions (ParC), a lightweight convolution operation designed to provide a global receptive field while preserving the spatial sensitivity typical of local convolutions. ParC combines a base/instance kernel scheme with position embeddings so the operation can adapt to varying input sizes, offering an efficient way to capture both local and global features without the computational overhead of traditional self-attention.
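The idea can be illustrated with a minimal NumPy sketch of a one-dimensional circular convolution whose kernel spans the full spatial extent, with a position embedding added beforehand. This is a simplified illustration of the concept, not the paper's implementation: the function name, shapes, and the single-axis treatment are assumptions, and the real ParC operation interpolates a base kernel to an instance kernel per input size.

```python
import numpy as np

def parc_1d(x, kernel, pos_embed):
    """Sketch of a position-aware circular convolution along one axis.

    x, kernel, pos_embed: arrays of shape (C, N). The kernel spans the
    full spatial extent N, and the wrap-around (circular) indexing gives
    every output position a global receptive field; the added position
    embedding restores the spatial sensitivity that a purely circular
    convolution would otherwise lose.
    """
    x = x + pos_embed                 # inject absolute position information
    n = x.shape[1]
    out = np.empty_like(x)
    for i in range(n):
        # wrap-around inner product: every output sees all N positions
        out[:, i] = (kernel * np.roll(x, -i, axis=1)).sum(axis=1)
    return out
```

With a delta kernel (1 at index 0, 0 elsewhere) the operation reduces to the identity on `x + pos_embed`, which makes the circular indexing easy to verify by hand.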
ParC Block:
The ParC block builds on the ParC operation and adds a channel-wise attention module, forming a MetaFormer-like structure that follows the design principles of ViTs while retaining the efficient computational profile of ConvNets. The block is versatile, serving as a drop-in replacement for components in existing ConvNet and Transformer architectures.
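The channel-wise attention half of the block can be sketched as a squeeze-style gating step: pool each channel to a scalar, map it through a sigmoid, and reweight the channels. This is a deliberately simplified sketch; the actual block learns projection weights (SE-style FC layers), which are omitted here for brevity.

```python
import numpy as np

def channel_attention(x):
    """Minimal channel-wise attention gate for a (C, H, W) feature map.

    Squeeze: global average pool per channel -> one scalar of context.
    Gate: sigmoid maps that scalar into (0, 1).
    Excite: rescale each channel by its gate.
    """
    s = x.mean(axis=(1, 2))               # squeeze: per-channel global context
    gate = 1.0 / (1.0 + np.exp(-s))       # sigmoid gate in (0, 1)
    return x * gate[:, None, None]        # reweight channels
```

Because the gate lies strictly in (0, 1), the output never exceeds the input in magnitude on any channel, which is the attenuating behavior channel attention is meant to provide.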
Experimental Evaluation
The paper reports that ParC-Net demonstrates superior performance on various computer vision benchmarks compared to other lightweight architectures. Specifically:
- Image Classification: On ImageNet-1k, ParC-Net achieves 78.6% top-1 accuracy with approximately 5.0 million parameters, outperforming MobileViT by 0.2 percentage points while using 11% fewer parameters and 13% less computation.
- Object Detection and Segmentation: On MS-COCO and PASCAL VOC, ParC-Net achieves better mean Average Precision (mAP) and mean Intersection over Union (mIoU) than comparable models, while maintaining a significant advantage in model size and inference speed.
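As a quick sanity check on the classification numbers (treating the reported 11% figure as exact, which is an assumption), the implied size of the MobileViT baseline can be back-computed from ParC-Net's parameter count:

```python
# If ParC-Net's ~5.0M parameters represent an 11% reduction, the implied
# MobileViT baseline is 5.0 / (1 - 0.11) ≈ 5.6M parameters.
parc_params_m = 5.0
implied_baseline_m = parc_params_m / (1 - 0.11)
print(round(implied_baseline_m, 1))  # ≈ 5.6
```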
Implications and Future Directions
ParC-Net offers an intriguing approach to designing lightweight models for mobile and edge computing applications. By effectively unifying the strengths of ConvNets' efficient convolution operations and ViTs' attention mechanisms, ParC-Net reflects a promising direction in deploying deep learning models in resource-constrained environments.
There is potential for future exploration in extending ParC-Net to broader tasks beyond vision, considering its adaptable architecture. Moreover, the model's practical compatibility with existing hardware accelerators hints at improved deployment efficiencies, suggesting opportunities in various real-world applications such as autonomous vehicles, IoT devices, and embedded systems.
Conclusion
This paper presents a noteworthy contribution to the ongoing development of efficient neural network designs. By successfully integrating pivotal elements from both ConvNets and ViTs, ParC-Net provides a valuable framework for optimizing performance in mobile and resource-limited scenarios. The promising results and methodology offer a foundation for further innovations that could reshape lightweight model architectures across diverse applications.