An Overview of FasterViT: Fast Vision Transformers with Hierarchical Attention
The paper "FasterViT: Fast Vision Transformers with Hierarchical Attention" introduces a hybrid neural network architecture designed to optimize image processing tasks across a spectrum of computer vision applications. The research employs a novel approach, integrating Convolutional Neural Networks (CNNs) with Vision Transformers (ViTs) to exploit the strengths of both local and global information processing paradigms. This architecture leverages a newly conceived Hierarchical Attention mechanism, a multi-level attention structure that deciphers self-attention complexity and enhances computational efficiency.
Core Contributions
The authors deliver FasterViT, a model that balances accuracy against throughput and achieves a state-of-the-art accuracy-versus-throughput Pareto front for image processing tasks. The model's Hierarchical Attention (HAT) module is particularly notable: it introduces dedicated carrier tokens that summarize each local window, supporting both local and global representation learning. This approach reduces the quadratic complexity of full self-attention while providing effective cross-window communication, which is especially valuable for high-resolution image tasks.
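To make the mechanism concrete, here is a minimal PyTorch sketch of the two-step attention pattern, assuming a single carrier token per window and off-the-shelf multi-head attention. The class name, head count, and tensor layout are illustrative choices, not the paper's implementation (the paper, for instance, derives a small grid of carrier tokens per window by pooling):

```python
import torch
import torch.nn as nn

class HierarchicalAttentionSketch(nn.Module):
    """Two-level attention: global exchange via carrier tokens, then local window attention."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.ct_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)   # carrier tokens
        self.win_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # window tokens

    def forward(self, windows, carriers):
        # windows:  (B * n_win, w * w, dim) -- patch tokens of each local window
        # carriers: (B, n_win, dim)         -- one summary token per window (an assumption
        #                                      of this sketch; see lead-in above)
        B, n_win, dim = carriers.shape

        # Step 1 (global): carrier tokens attend to one another across all windows,
        # so information crosses window boundaries without full-resolution attention.
        carriers = self.ct_attn(carriers, carriers, carriers)[0]

        # Step 2 (local): prepend each window's carrier token and attend within the
        # window, letting local tokens read the globally mixed summary.
        ct = carriers.reshape(B * n_win, 1, dim)
        tokens = torch.cat([ct, windows], dim=1)
        tokens = self.win_attn(tokens, tokens, tokens)[0]

        # Split back; updated carriers propagate global context to the next block.
        return tokens[:, 1:], tokens[:, :1].reshape(B, n_win, dim)
```

For example, with a 28x28 feature map split into 16 windows of 7x7 tokens, `windows` has shape `(B*16, 49, dim)` and `carriers` has shape `(B, 16, dim)`; each attention call then scales with the window size and window count rather than with the full token count.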
Technical Features
- Hierarchical Attention (HAT):
- The HAT approach sidesteps the quadratic complexity of full self-attention in ViTs: attention is first computed among the carrier tokens for global exchange, then within each window augmented by its carrier token for local detail (as in the sketch above). This two-step scheme enables efficient local and global information sharing while reducing computational overhead and maintaining accuracy.
- Hybrid Architecture:
- FasterViT employs convolutional blocks in the early, high-resolution stages followed by transformer blocks in the later stages. The convolutions rapidly downsample the input and generate high-level tokens, so the subsequent attention layers operate on a much smaller token grid, which boosts throughput (see the sketch after this list).
- Performance Validation:
- Extensive validation across datasets and tasks, such as ImageNet-1K classification and COCO object detection, demonstrates FasterViT's strong accuracy-throughput trade-off. The model's scalability is further established by pre-training on the larger ImageNet-21K dataset.
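As a rough illustration of the hybrid layout described above, the following sketch shows a conv-then-transformer backbone in which attention only ever runs on a heavily downsampled token grid. The depths, channel widths, and strides here are placeholder assumptions, not FasterViT's actual configuration:

```python
import torch
import torch.nn as nn

class HybridBackboneSketch(nn.Module):
    """Convolutional early stages, transformer late stages (illustrative dimensions)."""

    def __init__(self, dim=96):
        super().__init__()
        # Early stages: plain strided convolutions take a 224x224 image down to a
        # 14x14 feature map before any attention is computed.
        self.conv_stages = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=4, stride=4),            # 224 -> 56
            nn.GELU(),
            nn.Conv2d(dim, dim * 2, kernel_size=2, stride=2),      # 56 -> 28
            nn.GELU(),
            nn.Conv2d(dim * 2, dim * 4, kernel_size=2, stride=2),  # 28 -> 14
        )
        # Late stages: attention runs over only 14 * 14 = 196 tokens, keeping the
        # quadratic cost of self-attention small.
        layer = nn.TransformerEncoderLayer(d_model=dim * 4, nhead=8, batch_first=True)
        self.transformer_stages = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):
        feats = self.conv_stages(x)                # (B, 4*dim, 14, 14)
        tokens = feats.flatten(2).transpose(1, 2)  # (B, 196, 4*dim)
        return self.transformer_stages(tokens)

out = HybridBackboneSketch()(torch.randn(1, 3, 224, 224))  # -> (1, 196, 384)
```

The design choice the sketch highlights is that self-attention cost grows quadratically with token count, so spending cheap convolutions early to shrink the grid pays for itself in the transformer stages.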
Quantitative Results and Evaluation
The experimental analyses indicate that FasterViT outperforms existing models on the joint measure of throughput and accuracy. Benchmarks report higher ImageNet-1K Top-1 accuracy at comparable or better computational cost than models such as ConvNeXt and Swin Transformer, along with significant gains in GPU image throughput, including when inference is optimized with TensorRT.
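For context on how such throughput figures are typically measured, below is a simple plain-PyTorch GPU throughput estimator (it omits the TensorRT export step). The function name, batch size, warmup count, and iteration count are arbitrary assumptions for illustration:

```python
import time
import torch

@torch.no_grad()
def images_per_second(model, batch_size=64, iters=50, img_size=224):
    """Rough images/sec estimate on a CUDA device (requires a GPU)."""
    model = model.eval().cuda()
    x = torch.randn(batch_size, 3, img_size, img_size, device="cuda")
    for _ in range(10):            # warmup: exclude one-time allocation costs
        model(x)
    torch.cuda.synchronize()       # ensure queued kernels finish before timing
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    return batch_size * iters / (time.time() - start)
```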
Implications and Future Directions
This research offers insights into scalable vision transformer architectures, particularly for latency-sensitive, high-speed computer vision applications. The Hierarchical Attention mechanism points to pathways for further refinement in both hybrid and pure transformer architectures. Future work inspired by this paper could optimize the attention mechanism and refine carrier-token strategies to push closer to real-time processing in deployed AI systems.
In summary, the FasterViT model exemplifies a sophisticated balance of performance and computational efficiency, paving the way for advancements in high-resolution image processing within the evolving landscape of AI and machine learning. The implications of such scalable and efficient models extend into practical applications where rapid and accurate image analysis is paramount.