ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions (2403.07392v3)

Published 12 Mar 2024 in cs.CV

Abstract: Although Vision Transformer (ViT) has achieved significant success in computer vision, it does not perform well in dense prediction tasks due to the lack of inner-patch information interaction and the limited diversity of feature scale. Most existing studies are devoted to designing vision-specific transformers to solve the above problems, which introduce additional pre-training costs. Therefore, we present a plain, pre-training-free, and feature-enhanced ViT backbone with Convolutional Multi-scale feature interaction, named ViT-CoMer, which facilitates bidirectional interaction between CNN and transformer. Compared to the state-of-the-art, ViT-CoMer has the following advantages: (1) We inject spatial pyramid multi-receptive field convolutional features into the ViT architecture, which effectively alleviates the problems of limited local information interaction and single-feature representation in ViT. (2) We propose a simple and efficient CNN-Transformer bidirectional fusion interaction module that performs multi-scale fusion across hierarchical features, which is beneficial for handling dense prediction tasks. (3) We evaluate the performance of ViT-CoMer across various dense prediction tasks, different frameworks, and multiple advanced pre-training. Notably, our ViT-CoMer-L achieves 64.3% AP on COCO val2017 without extra training data, and 62.1% mIoU on ADE20K val, both of which are comparable to state-of-the-art methods. We hope ViT-CoMer can serve as a new backbone for dense prediction tasks to facilitate future research. The code will be released at https://github.com/Traffic-X/ViT-CoMer.

Overview of ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions

The paper introduces ViT-CoMer, a novel architecture designed to enhance the performance of Vision Transformers (ViTs) in dense prediction tasks. Dense prediction tasks, such as object detection, instance segmentation, and semantic segmentation, demand the capture of intricate and localized features from images. The traditional ViT, while successful in general vision tasks, struggles with dense predictions due to its inherent lack of local interaction within patches and limited feature scale diversity. This paper proposes a solution to these challenges without resorting to costly pre-training procedures typical of other transformer architectures in computer vision.

Key Innovations

  1. Integration of Convolutional Features: ViT-CoMer injects spatial pyramid, multi-receptive-field convolutional features into the ViT architecture. This addresses ViT's limited local information interaction and limited feature-scale diversity by exploiting the strength of convolutions at capturing local patterns (a simplified sketch of such a module follows this list).
  2. Bidirectional CNN-Transformer Interaction: The authors propose a simple, efficient module that fuses features bidirectionally between the CNN branch and the transformer. The interaction operates across multiple scales, so hierarchical features are combined in a way that directly benefits dense prediction tasks (see the second sketch below).
  3. Pre-training-free Framework: Because the ViT branch is left unchanged, ViT-CoMer can directly load open-source, advanced pre-trained ViT weights rather than requiring its own costly pre-training, saving time and compute without sacrificing performance.
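
To make innovation (1) concrete, the following is a minimal PyTorch sketch of a spatial-pyramid, multi-receptive-field convolution block. It illustrates the idea rather than the authors' implementation: the class name, kernel sizes, and depthwise design are assumptions, and details such as normalization and the exact pyramid levels are omitted.

```python
# Minimal sketch of a multi-receptive-field convolution block (illustrative only;
# names, kernel sizes, and the depthwise design are assumptions, not the paper's module).
import torch
import torch.nn as nn


class MultiReceptiveFieldConv(nn.Module):
    """Fuses parallel depthwise convolutions with different receptive fields."""

    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # One depthwise convolution per receptive field.
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
             for k in kernel_sizes]
        )
        # 1x1 projection mixes the concatenated multi-scale responses.
        self.proj = nn.Conv2d(channels * len(kernel_sizes), channels, 1)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map from one pyramid level.
        multi_scale = torch.cat([branch(x) for branch in self.branches], dim=1)
        # Residual connection keeps the original features intact.
        return x + self.act(self.proj(multi_scale))


if __name__ == "__main__":
    feat = torch.randn(1, 256, 32, 32)
    out = MultiReceptiveFieldConv(256)(feat)
    print(out.shape)  # torch.Size([1, 256, 32, 32])
```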
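The bidirectional interaction of innovation (2) can likewise be illustrated with a simplified stand-in. The paper's actual CNN-Transformer interaction module is more elaborate; this sketch only shows a two-way exchange between patch tokens and a CNN feature map at a single scale, and every name and shape in it is an assumption made for illustration. Because the ViT branch is untouched by this exchange, off-the-shelf pre-trained ViT weights can still be loaded directly, which is what enables the pre-training-free property of innovation (3).

```python
# Simplified stand-in for a bidirectional CNN-Transformer fusion step
# (illustrative assumption; not the paper's exact interaction module).
import torch
import torch.nn as nn


class BidirectionalFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.cnn_to_vit = nn.Linear(dim, dim)   # injects local CNN context into tokens
        self.vit_to_cnn = nn.Linear(dim, dim)   # feeds global ViT context back to the CNN path
        self.norm_tokens = nn.LayerNorm(dim)
        self.norm_feats = nn.LayerNorm(dim)

    def forward(self, vit_tokens: torch.Tensor, cnn_feats: torch.Tensor):
        # vit_tokens: (B, N, C) patch tokens; cnn_feats: (B, C, H, W) with H*W == N.
        b, c, h, w = cnn_feats.shape
        cnn_tokens = cnn_feats.flatten(2).transpose(1, 2)          # (B, N, C)

        # CNN -> ViT: enrich tokens with local, multi-receptive-field features.
        vit_tokens = vit_tokens + self.cnn_to_vit(self.norm_feats(cnn_tokens))
        # ViT -> CNN: propagate global context back into the convolutional branch.
        cnn_tokens = cnn_tokens + self.vit_to_cnn(self.norm_tokens(vit_tokens))

        cnn_feats = cnn_tokens.transpose(1, 2).reshape(b, c, h, w)
        return vit_tokens, cnn_feats


if __name__ == "__main__":
    tokens = torch.randn(2, 32 * 32, 256)
    feats = torch.randn(2, 256, 32, 32)
    t, f = BidirectionalFusion(256)(tokens, feats)
    print(t.shape, f.shape)  # (2, 1024, 256) and (2, 256, 32, 32)
```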

Experimental Results

ViT-CoMer demonstrates strong performance across several benchmark datasets:

  • In object detection tasks evaluated on the COCO val2017 dataset, ViT-CoMer-L achieved an Average Precision (AP) of 64.3% without extra training data, comparable to state-of-the-art methods.
  • For semantic segmentation on the ADE20K val dataset, the method reached 62.1% mIoU, again matching the performance of leading architectures.

The architecture’s flexibility is further underscored by evaluations under various pre-training schemes and across multiple dense prediction frameworks, showcasing its adaptability and robustness.

Implications and Future Directions

Practical Implications: ViT-CoMer presents a compelling option for practitioners seeking efficient, high-performance models for dense prediction tasks. Its integration of convolutional features with transformers offers a balanced approach, leveraging the best aspects of both methodologies. This makes it a viable choice in applications where dense predictions are critical, like autonomous driving and medical imaging.

Theoretical Insights: By addressing the interaction challenges within ViT architectures through convolutional enhancements, the paper introduces a pathway for reconciling the strengths of CNNs and transformers. This offers a blueprint for future research in hybrid architectures that seek to optimize the trade-offs between local and global feature extraction.

Speculation on Future AI Developments: The exploration of architectures like ViT-CoMer signals a broader trend in AI towards more integrated and hybrid models. Future developments could see even deeper integrations of various neural network paradigms, potentially leading to unified frameworks that negate the need for architecture-specific specializations in vision tasks.

Conclusion

ViT-CoMer is a strategic advancement in the quest to improve Vision Transformer performance for dense prediction tasks. By innovatively combining convolutional multi-scale feature interactions within the framework of a ViT, the authors present a model that not only elevates performance but also exhibits practicality in terms of pre-training and application scalability. This contributes a valuable perspective to the ongoing development of efficient, versatile models capable of handling complex vision tasks in real-world scenarios.

Authors (5)
  1. Chunlong Xia
  2. Xinliang Wang
  3. Feng Lv
  4. Xin Hao
  5. Yifeng Shi