ConTNet: Why not use convolution and transformer at the same time? (2104.13497v3)

Published 27 Apr 2021 in cs.CV

Abstract: Although convolutional networks (ConvNets) have enjoyed great success in computer vision (CV), they struggle to capture the global information crucial to dense prediction tasks such as object detection and segmentation. In this work, we innovatively propose ConTNet (Convolution-Transformer Network), combining transformers with ConvNet architectures to provide large receptive fields. Unlike the recently proposed transformer-based models (e.g., ViT, DeiT) that are sensitive to hyper-parameters and extremely dependent on a pile of data augmentations when trained from scratch on a midsize dataset (e.g., ImageNet1k), ConTNet can be optimized like normal ConvNets (e.g., ResNet) and preserves an outstanding robustness. It is also worth pointing out that, given identical strong data augmentations, the performance improvement of ConTNet is more remarkable than that of ResNet. We present its superiority and effectiveness on image classification and downstream tasks. For example, our ConTNet achieves 81.8% top-1 accuracy on ImageNet, the same as DeiT-B with less than 40% of its computational complexity. ConTNet-M also outperforms ResNet50 as the backbone of both Faster-RCNN (by 2.6%) and Mask-RCNN (by 3.2%) on the COCO2017 dataset. We hope that ConTNet could serve as a useful backbone for CV tasks and bring new ideas for model design.

Authors (6)
  1. Haotian Yan
  2. Zhe Li
  3. Weijian Li
  4. Changhu Wang
  5. Ming Wu
  6. Chuang Zhang
Citations (69)

Summary

  • The paper introduces a hybrid architecture that fuses convolutional layers with transformer encoders for effective local and global feature extraction.
  • The paper demonstrates competitive accuracy on ImageNet and improved performance on detection and segmentation tasks with lower computational demands.
  • The paper presents a simplified training strategy that leverages data augmentation to deliver robust and accessible transformer-based models.

An Overview of ConTNet: Integrating Convolution and Transformers in Computer Vision

The rapid evolution of deep learning in computer vision (CV) has been dominated by convolutional neural networks (ConvNets), which have successfully powered a wide range of CV applications. However, ConvNets struggle to capture global context due to their inherently localized operations, especially in tasks requiring dense predictions. Meanwhile, transformers, originally developed for natural language processing, have demonstrated an exceptional capability for modeling long-range dependencies, inspiring their use in vision tasks. In this context, the paper by Yan et al. introduces a novel architecture named ConTNet, which combines transformer layers with ConvNet architectures.

ConTNet aims to address two major challenges: the limited receptive field of ConvNets and the training complexities of transformer-based vision models. Unlike vision transformers, which require extensive data augmentation and hyper-parameter tuning, ConTNet offers a robust and straightforward training pipeline similar to that of standard ConvNets such as ResNet.

Key Contributions

  1. Hybrid Architecture: ConTNet integrates transformers with convolutional architectures by stacking Convolution-Transformer blocks. Each block includes a pair of standard transformer encoders (STEs) and a convolutional layer, blending local and global feature extraction (a minimal sketch follows this list).
  2. Efficiency and Robustness: In image classification, ConTNet achieves competitive accuracy with considerably lower computational complexity than transformers such as ViT and DeiT. For instance, ConTNet reaches 81.8% top-1 accuracy on ImageNet, matching DeiT-B with less than 40% of its computational cost.
  3. Transfer Learning and Downstream Tasks: When applied to tasks like object detection and segmentation, ConTNet demonstrates improved performance over ResNet backbones, as evidenced by its significant AP increases in Faster-RCNN and Mask-RCNN frameworks on the COCO2017 dataset.
  4. Data Augmentation and Training Practices: ConTNet benefits more from strong data augmentations than ResNet does; the added regularization offsets the higher overfitting tendency of its transformer components, so the same augmentation recipe yields a larger performance gain (see the augmentation example below).
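
Below is a minimal PyTorch sketch of the patch-wise transformer-plus-convolution idea behind a ConT block. The module names, patch size, head count, and layer ordering here are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class PatchWiseSTE(nn.Module):
    """Standard transformer encoder (STE) applied independently to
    non-overlapping P x P patches of a feature map (illustrative)."""
    def __init__(self, dim, patch_size=7, num_heads=4, mlp_ratio=4):
        super().__init__()
        self.patch_size = patch_size
        self.encoder = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads,
            dim_feedforward=dim * mlp_ratio, batch_first=True)

    def forward(self, x):                        # x: (B, C, H, W)
        B, C, H, W = x.shape
        P = self.patch_size
        # Split the feature map into non-overlapping P x P patches.
        x = x.unfold(2, P, P).unfold(3, P, P)    # (B, C, H/P, W/P, P, P)
        x = x.permute(0, 2, 3, 4, 5, 1)          # (B, H/P, W/P, P, P, C)
        x = x.reshape(-1, P * P, C)              # one token sequence per patch
        x = self.encoder(x)                      # self-attention within each patch
        # Fold the tokens back into a feature map.
        x = x.reshape(B, H // P, W // P, P, P, C)
        x = x.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)
        return x

class ConTBlock(nn.Module):
    """One Convolution-Transformer block: patch-wise STEs interleaved
    with a 3x3 convolution for local feature extraction."""
    def __init__(self, dim, patch_size=7):
        super().__init__()
        self.ste1 = PatchWiseSTE(dim, patch_size)
        self.conv = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True))
        self.ste2 = PatchWiseSTE(dim, patch_size)

    def forward(self, x):
        return self.ste2(self.conv(self.ste1(x)))

# Example: a 56x56 feature map with 64 channels passes through one block.
block = ConTBlock(dim=64, patch_size=7)
out = block(torch.randn(2, 64, 56, 56))          # -> torch.Size([2, 64, 56, 56])
```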

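The data-augmentation point above can likewise be illustrated with a generic strong-augmentation pipeline. This torchvision example is only in the spirit of the recipe described; the exact policies and magnitudes used in the paper may differ.

```python
from torchvision import transforms

# Illustrative strong-augmentation training pipeline (not the paper's exact recipe).
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandAugment(num_ops=2, magnitude=9),   # automated augmentation policy
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
    transforms.RandomErasing(p=0.25),                 # applied to the tensor image
])
```
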
Empirical Results

  • Image Classification: ConTNet-M surpasses ResNet50 by 1.6% in ImageNet top-1 accuracy with 25% less computational demand.
  • Object Detection: As a backbone for detection models such as Faster-RCNN, ConTNet-M improves AP by 2.6% compared to ResNet50 (a backbone-wiring sketch follows this list).
  • Instance Segmentation: Similarly, with Mask-RCNN, ConTNet-M improves over ResNet50 by 3.2% on COCO2017 instance segmentation.
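
As an illustration of how such a backbone plugs into a detection framework, the sketch below wires a generic feature extractor into torchvision's Faster R-CNN. ResNet-50 features stand in for ConTNet-M here, since the point is only the wiring; the paper's actual detection experiments were run with ConTNet-M in standard detection pipelines.

```python
import torch
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator

# Stand-in backbone: any feature extractor exposing `out_channels` works here.
# Replace with a ConTNet-M feature extractor to reproduce the paper's setting.
backbone = torchvision.models.resnet50(weights="DEFAULT")
backbone = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool/fc
backbone.out_channels = 2048

anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))
roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=["0"],
                                                output_size=7,
                                                sampling_ratio=2)

model = FasterRCNN(backbone, num_classes=91,       # COCO label space
                   rpn_anchor_generator=anchor_generator,
                   box_roi_pool=roi_pooler)
model.eval()
with torch.no_grad():
    predictions = model([torch.rand(3, 480, 640)])  # list of dicts: boxes, labels, scores
```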

Theoretical and Practical Implications

The integration of convolutional operations with transformers in ConTNet offers a promising direction for advancing hybrid architectures in CV. The approach leverages the strengths of both paradigms, providing an effective strategy for tasks requiring large receptive fields and global context awareness. The simplicity of ConTNet's training regime underscores its potential in broadening the accessibility and deployment of transformer-augmented models in various practical applications without the need for extensive computational resources or sophisticated hyper-parameter tuning.

Future Directions

Research into ConTNet and similar architectures could explore further optimizations, such as advanced transformer variants or novel convolution designs. Scaling strategies, efficient computation, and robustness to data variability will remain crucial for real-world applications.

In conclusion, ConTNet effectively demonstrates the harmonious coexistence of convolutions and transformers, opening avenues for new methodologies in CV and setting the stage for future innovations in hybrid neural architectures.