TransXNet: Learning Both Global and Local Dynamics with a Dual Dynamic Token Mixer for Visual Recognition (2310.19380v4)

Published 30 Oct 2023 in cs.CV

Abstract: Recent studies have integrated convolutions into transformers to introduce inductive bias and improve generalization performance. However, the static nature of conventional convolution prevents it from dynamically adapting to input variations, resulting in a representation discrepancy between convolution and self-attention as the latter computes attention maps dynamically. Furthermore, when stacking token mixers that consist of convolution and self-attention to form a deep network, the static nature of convolution hinders the fusion of features previously generated by self-attention into convolution kernels. These two limitations result in a sub-optimal representation capacity of the entire network. To find a solution, we propose a lightweight Dual Dynamic Token Mixer (D-Mixer) to simultaneously learn global and local dynamics via computing input-dependent global and local aggregation weights. D-Mixer works by applying an efficient global attention module and an input-dependent depthwise convolution separately on evenly split feature segments, endowing the network with strong inductive bias and an enlarged receptive field. We use D-Mixer as the basic building block to design TransXNet, a novel hybrid CNN-Transformer vision backbone network that delivers compelling performance. In the ImageNet-1K classification, TransXNet-T surpasses Swin-T by 0.3% in top-1 accuracy while requiring less than half of the computational cost. Furthermore, TransXNet-S and TransXNet-B exhibit excellent model scalability, achieving top-1 accuracy of 83.8% and 84.6% respectively, with reasonable computational costs. Additionally, our proposed network architecture demonstrates strong generalization capabilities in various dense prediction tasks, outperforming other state-of-the-art networks while having lower computational costs. Code is publicly available at https://github.com/LMMMEng/TransXNet.

References (54)

Citations (18)

View on Semantic Scholar

Summary

The paper introduces a Dual Dynamic Token Mixer that combines dynamic depthwise convolution with global self-attention to capture both local details and global context.
It proposes a hybrid network architecture that leverages CNN inductive biases and Transformer long-range dependencies for efficient visual recognition.
Empirical results on ImageNet, COCO, and ADE20K showcase superior accuracy and reduced computational overhead compared to leading state-of-the-art models.

An Examination of TransXNet: A Novel Approach to Visual Recognition with Dual Dynamic Token Mixer

The paper "TransXNet: Learning Both Global and Local Dynamics with a Dual Dynamic Token Mixer for Visual Recognition" introduces an innovative neural network architecture designed to effectively blend the strengths of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). The authors propose TransXNet, a hybrid vision backbone that combines a CNN's inductive bias and ViTs' capability for long-range dependency modeling by employing a Dual Dynamic Token Mixer (D-Mixer).

Key Contributions

Dual Dynamic Token Mixer (D-Mixer): The core contribution is the D-Mixer, which integrates dynamic input-dependent depthwise convolution and global self-attention mechanisms. This mixer enables the extraction of both local details and global context simultaneously, thus enhancing representation capacity.
Hybrid Network Architecture: TransXNet advocates a hybrid architecture using D-Mixer as its fundamental block. This design successfully addresses the limitations of standard convolutions, such as static kernel dependencies, while retaining the dynamic advantages of transformers.
Empirical Results: The network achieves impressive results on the ImageNet-1K, COCO, and ADE20K datasets for tasks like classification, detection, and segmentation. Notably, TransXNet demonstrates superior efficiency, achieving similar or better performance with less computational overhead compared to state-of-the-art methods like Swin Transformer.

Technical Analysis

Input-Dependent Operations: The D-Mixer harnesses input-dependent depthwise convolution and Overlapping Spatial Reduction Attention (OSRA) to allow dynamic feature extraction, overcoming the static nature of traditional convolution kernels.
Expanded Receptive Field: By integrating global attention across all stages and leveraging dynamic convolutions, TransXNet significantly extends the effective receptive field, thereby improving the model's capability to capture contextual information.
Multi-Scale Token Aggregation: The architecture includes a Multi-scale Feed-forward Network (MS-FFN) allowing nuanced multi-scale token aggregation, accommodating features at different resolutions, which is crucial for tasks involving complex scenes.

Experimental Evaluation

ImageNet-1K Classification: TransXNet-T reaches a top-1 accuracy of 81.6% while using less than half the computational resources required by Swin-T, reinforcing the efficiency of the proposed architecture.
COCO Detection and Segmentation: The model's performance in object detection and segmentation tasks demonstrates its strong generalization capabilities. It consistently surpasses contemporary models in both average precision and computational efficiency.
ADE20K Semantic Segmentation: On ADE20K, TransXNet maintains superior accuracy across various model sizes, achieving significant improvements in mean Intersection over Union (mIoU) metrics.

Implications and Future Work

The dual dynamic mixing approach outlines a path for further integration of convolutional and attentional mechanisms, promising improvements in neural network performance on complex visual tasks. The architecture’s efficient handling of feature dynamics makes it a compelling choice for resource-constrained environments.

Future research directions might include leveraging Neural Architecture Search (NAS) to optimize the proposed architectural components further and exploring specialized implementations to enhance inference speed. Additionally, varying channel ratios and dynamically adjusting feature processing techniques across different network stages offer potential areas for expansion.

The work contributes to a growing body of research focused on bridging the gap between CNNs and Transformers, offering a foundation for more adaptable and context-aware vision models.

PDF Markdown

GitHub

GitHub - LMMMEng/TransXNet: TransXNet: Learning Both Global and Local Dynamics with a Dual Dynamic Token Mixer for Visual Recognition (167 stars)