TransXNet: Learning Both Global and Local Dynamics with a Dual Dynamic Token Mixer for Visual Recognition (2310.19380v2)

Published 30 Oct 2023 in cs.CV

Abstract: Recent studies have integrated convolution into transformers to introduce inductive bias and improve generalization performance. However, the static nature of conventional convolution prevents it from dynamically adapting to input variations, resulting in a representation discrepancy between convolution and self-attention as self-attention calculates attention matrices dynamically. Furthermore, when stacking token mixers that consist of convolution and self-attention to form a deep network, the static nature of convolution hinders the fusion of features previously generated by self-attention into convolution kernels. These two limitations result in a sub-optimal representation capacity of the constructed networks. To find a solution, we propose a lightweight Dual Dynamic Token Mixer (D-Mixer) that aggregates global information and local details in an input-dependent way. D-Mixer works by applying an efficient global attention module and an input-dependent depthwise convolution separately on evenly split feature segments, endowing the network with strong inductive bias and an enlarged effective receptive field. We use D-Mixer as the basic building block to design TransXNet, a novel hybrid CNN-Transformer vision backbone network that delivers compelling performance. In the ImageNet-1K image classification task, TransXNet-T surpasses Swin-T by 0.3% in top-1 accuracy while requiring less than half of the computational cost. Furthermore, TransXNet-S and TransXNet-B exhibit excellent model scalability, achieving top-1 accuracy of 83.8% and 84.6% respectively, with reasonable computational costs. Additionally, our proposed network architecture demonstrates strong generalization capabilities in various dense prediction tasks, outperforming other state-of-the-art networks while having lower computational costs. Code is available at https://github.com/LMMMEng/TransXNet.

An Examination of TransXNet: A Novel Approach to Visual Recognition with a Dual Dynamic Token Mixer

The paper "TransXNet: Learning Both Global and Local Dynamics with a Dual Dynamic Token Mixer for Visual Recognition" introduces an innovative neural network architecture designed to effectively blend the strengths of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). The authors propose TransXNet, a hybrid vision backbone that combines a CNN's inductive bias and ViTs' capability for long-range dependency modeling by employing a Dual Dynamic Token Mixer (D-Mixer).

Key Contributions

  1. Dual Dynamic Token Mixer (D-Mixer): The core contribution is the D-Mixer, which applies input-dependent depthwise convolution and global self-attention on evenly split channel groups. This mixer extracts local details and global context simultaneously, enhancing representation capacity (a minimal sketch of this split-and-mix scheme follows this list).
  2. Hybrid Network Architecture: TransXNet adopts a hybrid design that uses D-Mixer as its fundamental building block. This design addresses a key limitation of standard convolutions, namely their static, input-independent kernels, while retaining the dynamic behavior of transformer attention.
  3. Empirical Results: The network achieves impressive results on the ImageNet-1K, COCO, and ADE20K datasets for tasks like classification, detection, and segmentation. Notably, TransXNet demonstrates superior efficiency, achieving similar or better performance with less computational overhead compared to state-of-the-art methods like Swin Transformer.
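
To make the split-and-mix idea concrete, here is a minimal sketch, not the authors' implementation: the class names (`InputDependentDWConv`, `DMixerSketch`), the kernel-bank mixing head, and the use of plain multi-head attention in place of the paper's OSRA module are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InputDependentDWConv(nn.Module):
    """Hypothetical input-dependent depthwise conv: kernels are mixed per sample
    from a small bank, conditioned on globally pooled features."""
    def __init__(self, channels, kernel_size=3, num_banks=4):
        super().__init__()
        self.channels, self.k = channels, kernel_size
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.to_weights = nn.Sequential(
            nn.Conv2d(channels, channels // 4, 1), nn.GELU(),
            nn.Conv2d(channels // 4, num_banks, 1))
        # Bank of depthwise kernels: (banks, C, 1, k, k).
        self.banks = nn.Parameter(0.02 * torch.randn(num_banks, channels, 1, kernel_size, kernel_size))

    def forward(self, x):
        b, c, h, w = x.shape
        attn = self.to_weights(self.pool(x)).flatten(1).softmax(dim=1)   # (B, banks)
        kernels = torch.einsum('bg,gcokl->bcokl', attn, self.banks)      # (B, C, 1, k, k)
        # Per-sample depthwise conv via the grouped-conv reshape trick.
        out = F.conv2d(x.reshape(1, b * c, h, w),
                       kernels.reshape(b * c, 1, self.k, self.k),
                       padding=self.k // 2, groups=b * c)
        return out.reshape(b, c, h, w)

class DMixerSketch(nn.Module):
    """Split channels in half: dynamic depthwise conv on one half (local),
    global attention on the other half, then concatenate."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.local = InputDependentDWConv(dim // 2)
        # Plain global self-attention stands in for the paper's OSRA module.
        self.attn = nn.MultiheadAttention(dim // 2, num_heads, batch_first=True)

    def forward(self, x):                                    # x: (B, C, H, W)
        b, c, h, w = x.shape
        x_local, x_global = x.chunk(2, dim=1)
        local = self.local(x_local)
        tokens = x_global.flatten(2).transpose(1, 2)         # (B, HW, C/2)
        glob, _ = self.attn(tokens, tokens, tokens)
        glob = glob.transpose(1, 2).reshape(b, c // 2, h, w)
        return torch.cat([local, glob], dim=1)

# Example: a 64-channel feature map keeps its shape after mixing.
y = DMixerSketch(64)(torch.randn(2, 64, 14, 14))             # (2, 64, 14, 14)
```

Because the depthwise kernels are re-mixed for every input, the convolutional branch can adapt to features produced by earlier attention layers, which is the representation discrepancy the paper targets.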

Technical Analysis

  • Input-Dependent Operations: The D-Mixer harnesses input-dependent depthwise convolution and Overlapping Spatial Reduction Attention (OSRA) to allow dynamic feature extraction, overcoming the static nature of traditional convolution kernels.
  • Expanded Receptive Field: By integrating global attention across all stages and leveraging dynamic convolutions, TransXNet significantly extends the effective receptive field, thereby improving the model's capability to capture contextual information.
  • Multi-Scale Token Aggregation: The architecture includes a Multi-scale Feed-forward Network (MS-FFN) that aggregates tokens at multiple spatial scales, accommodating features of different sizes, which is crucial for tasks involving complex scenes (a rough sketch follows this list).
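
The sketch below illustrates one plausible way to realize such a multi-scale feed-forward block; the expansion ratio, the kernel sizes (1, 3, 5, 7), and the class name `MultiScaleFFNSketch` are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiScaleFFNSketch(nn.Module):
    """Feed-forward block whose hidden channels are split into groups,
    each processed by a depthwise conv with a different kernel size."""
    def __init__(self, dim, expansion=4, kernel_sizes=(1, 3, 5, 7)):
        super().__init__()
        hidden = dim * expansion
        assert hidden % len(kernel_sizes) == 0
        group = hidden // len(kernel_sizes)
        self.fc1 = nn.Conv2d(dim, hidden, 1)                  # channel expansion
        self.dwconvs = nn.ModuleList([
            nn.Conv2d(group, group, k, padding=k // 2, groups=group)
            for k in kernel_sizes])                           # multi-scale depthwise mixing
        self.act = nn.GELU()
        self.fc2 = nn.Conv2d(hidden, dim, 1)                  # channel projection

    def forward(self, x):                                     # x: (B, C, H, W)
        x = self.act(self.fc1(x))
        chunks = x.chunk(len(self.dwconvs), dim=1)
        x = torch.cat([conv(c) for conv, c in zip(self.dwconvs, chunks)], dim=1)
        return self.fc2(self.act(x))

# Example: a 64-channel map passes through with its shape unchanged.
out = MultiScaleFFNSketch(64)(torch.randn(2, 64, 14, 14))     # (2, 64, 14, 14)
```

Mixing kernel sizes inside the feed-forward path lets the network aggregate context at several spatial scales without adding a separate multi-branch stage.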

Experimental Evaluation

  • ImageNet-1K Classification: TransXNet-T reaches a top-1 accuracy of 81.6% while using less than half the computational resources required by Swin-T, reinforcing the efficiency of the proposed architecture.
  • COCO Detection and Segmentation: The model's performance in object detection and segmentation tasks demonstrates its strong generalization capabilities. It consistently surpasses contemporary models in both average precision and computational efficiency.
  • ADE20K Semantic Segmentation: On ADE20K, TransXNet maintains superior accuracy across model sizes, achieving significant improvements in mean Intersection over Union (mIoU).

Implications and Future Work

The dual dynamic mixing approach outlines a path for further integration of convolutional and attentional mechanisms, promising improvements in neural network performance on complex visual tasks. The architecture’s efficient handling of feature dynamics makes it a compelling choice for resource-constrained environments.

Future research directions might include leveraging Neural Architecture Search (NAS) to optimize the proposed architectural components further and exploring specialized implementations to enhance inference speed. Additionally, varying channel ratios and dynamically adjusting feature processing techniques across different network stages offer potential areas for expansion.

The work contributes to a growing body of research focused on bridging the gap between CNNs and Transformers, offering a foundation for more adaptable and context-aware vision models.

Authors (4)
  1. Meng Lou (6 papers)
  2. Hong-Yu Zhou (50 papers)
  3. Sibei Yang (61 papers)
  4. Yizhou Yu (148 papers)