Dual Vision Transformer (2207.04976v2)

Published 11 Jul 2022 in cs.CV and cs.AI

Abstract: Prior works have proposed several strategies to reduce the computational cost of the self-attention mechanism. Many of these works consider decomposing the self-attention procedure into regional and local feature extraction procedures that each incur a much smaller computational complexity. However, regional information is typically obtained only at the expense of undesirable information loss owing to down-sampling. In this paper, we propose a novel Transformer architecture that aims to mitigate the cost issue, named Dual Vision Transformer (Dual-ViT). The new architecture incorporates a critical semantic pathway that can more efficiently compress token vectors into global semantics with a reduced order of complexity. Such compressed global semantics then serve as useful prior information in learning finer pixel-level details, through another constructed pixel pathway. The semantic pathway and pixel pathway are then integrated together and jointly trained, spreading the enhanced self-attention information in parallel through both pathways. Dual-ViT is hence able to reduce the computational complexity without compromising much accuracy. We empirically demonstrate that Dual-ViT provides higher accuracy than SOTA Transformer architectures with reduced training complexity. Source code is available at https://github.com/YehLi/ImageNetModel.

Authors (6)
  1. Ting Yao (127 papers)
  2. Yehao Li (35 papers)
  3. Yingwei Pan (77 papers)
  4. Yu Wang (939 papers)
  5. Xiao-Ping Zhang (107 papers)
  6. Tao Mei (209 papers)
Citations (61)

Summary

  • The paper introduces a dual-pathway transformer architecture that decouples semantic token compression from fine pixel-level analysis to reduce computational complexity.
  • It achieves 85.7% top-1 accuracy on ImageNet with only 41.1% of the FLOPs and 37.8% of the parameters of VOLO, highlighting its efficiency.
  • On COCO dense prediction tasks, the model surpasses PVT by over 1.2% mAP while using 48.0% fewer parameters, demonstrating effective resource optimization.

Dual Vision Transformer: A Novel Approach to Vision Transformer Architectures

The paper "Dual Vision Transformer" introduces a new architectural innovation in the domain of Vision Transformers (ViTs), specifically addressing the computational inefficiencies commonly associated with existing self-attention mechanisms used in high-resolution input processing. This work proposes an architecture termed Dual Vision Transformer (Dual-ViT), which integrates a two-pathway design aimed at enhancing accuracy while significantly reducing computational complexity.

The Dual-ViT architecture introduces two distinct functional pathways: a semantic pathway and a pixel pathway. The semantic pathway compresses token vectors into global semantic representations with a reduced order of computational complexity. These semantic tokens serve as prior knowledge for the pixel pathway, which focuses on learning finer pixel-level details. The interactions between the two pathways are orchestrated to enhance the self-attention mechanism employed across the architecture. This dual-pathway structure reduces the computational burden with minimal impact on accuracy, an advantage over traditional ViT designs.
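A minimal PyTorch-style sketch of this two-pathway interaction is given below. It illustrates the idea described above rather than the authors' implementation (see the linked repository for that); the module name, the pooling-based token compression, and all dimensions are assumptions made for readability:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualPathwayBlock(nn.Module):
    """Illustrative two-pathway block: a semantic pathway attends over a
    small set of compressed tokens, and the pixel pathway then attends to
    those global semantics as prior information."""

    def __init__(self, dim: int, num_heads: int = 8, num_semantic: int = 49):
        super().__init__()
        self.num_semantic = num_semantic
        # Semantic pathway: self-attention over the M compressed tokens only,
        # which costs O(M^2) instead of O(N^2) for M << N pixel tokens.
        self.semantic_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Pixel pathway: pixel tokens cross-attend to the semantic tokens,
        # costing O(N * M) rather than O(N^2).
        self.pixel_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_sem = nn.LayerNorm(dim)
        self.norm_pix = nn.LayerNorm(dim)

    def forward(self, pixel_tokens: torch.Tensor) -> torch.Tensor:
        # pixel_tokens: (B, N, C)
        # Compress N pixel tokens into M semantic tokens (average pooling
        # here; the paper's compression scheme may differ).
        sem = F.adaptive_avg_pool1d(
            pixel_tokens.transpose(1, 2), self.num_semantic
        ).transpose(1, 2)                               # (B, M, C)
        sem = self.norm_sem(sem)
        sem, _ = self.semantic_attn(sem, sem, sem)      # global semantics
        # Pixel tokens query the semantic prior via cross-attention.
        pix = self.norm_pix(pixel_tokens)
        out, _ = self.pixel_attn(pix, sem, sem)
        return pixel_tokens + out                       # residual connection


x = torch.randn(2, 196, 384)             # 196 pixel tokens of width 384
y = DualPathwayBlock(dim=384)(x)
print(y.shape)                            # torch.Size([2, 196, 384])
```

The key design point the sketch captures is that the expensive quadratic interaction happens only among the few semantic tokens, while pixel tokens receive global context through a linear-in-N cross-attention.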

Empirical evaluation demonstrates the efficacy of Dual-ViT. On the ImageNet benchmark, Dual-ViT achieves a top-1 accuracy of 85.7% with only 41.1% of the FLOPs and 37.8% of the parameters required by the prevalent VOLO architecture. Furthermore, on dense prediction tasks such as object detection and instance segmentation on the COCO dataset, Dual-ViT outperforms PVT by more than 1.2% in mean Average Precision (mAP) while using 48.0% fewer parameters.

The implications of this work are multifaceted. Practically, the Dual-ViT architecture offers a viable solution for deploying transformer models in scenarios where computational resources or real-time processing requirements are constrained. Theoretically, this paper advances the design philosophy of transformers by illustrating the potential benefits of decoupling semantic comprehension from finer-grained visual analysis. This design principle may underpin future developments that exploit hierarchical and pathway-based processing in transformer models.

In summary, Dual-ViT successfully reduces the complexity of self-attention operations in transformers, crucially without sacrificing performance, and thereby reinforces the feasibility of transformers in diverse real-world applications. This work suggests a possible trajectory for future research in the development of transformers, potentially exploring more nuanced interactions between different levels of feature abstractions and further optimizing the computational efficiency of such models.
