Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles (2306.00989v1)

Published 1 Jun 2023 in cs.CV and cs.LG

Abstract: Modern hierarchical vision transformers have added several vision-specific components in the pursuit of supervised classification performance. While these components lead to effective accuracies and attractive FLOP counts, the added complexity actually makes these transformers slower than their vanilla ViT counterparts. In this paper, we argue that this additional bulk is unnecessary. By pretraining with a strong visual pretext task (MAE), we can strip out all the bells-and-whistles from a state-of-the-art multi-stage vision transformer without losing accuracy. In the process, we create Hiera, an extremely simple hierarchical vision transformer that is more accurate than previous models while being significantly faster both at inference and during training. We evaluate Hiera on a variety of tasks for image and video recognition. Our code and models are available at https://github.com/facebookresearch/hiera.

Authors (13)
  1. Chaitanya Ryali (4 papers)
  2. Yuan-Ting Hu (12 papers)
  3. Daniel Bolya (14 papers)
  4. Chen Wei (72 papers)
  5. Haoqi Fan (33 papers)
  6. Po-Yao Huang (31 papers)
  7. Vaibhav Aggarwal (8 papers)
  8. Arkabandhu Chowdhury (8 papers)
  9. Omid Poursaeed (19 papers)
  10. Judy Hoffman (75 papers)
  11. Jitendra Malik (211 papers)
  12. Yanghao Li (43 papers)
  13. Christoph Feichtenhofer (52 papers)
Citations (99)

Summary

Overview of "Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles"

Introduction

The paper presents Hiera, a simplified hierarchical vision transformer that removes much of the complexity prevalent in state-of-the-art models by relying on Masked Autoencoder (MAE) pretraining. This strategy lets the model learn spatial biases from data, eliminating the need for the specialized architectural components that often slow these models down.
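
To make this concrete, the sketch below illustrates the generic MAE recipe the pretraining relies on: hide a large fraction of patch tokens, encode only the visible ones, and compute a reconstruction loss on the hidden ones. It is a minimal sketch, not the paper's implementation; the module and parameter names (`TinyMAE`, the small `nn.TransformerEncoder` stacks, `mask_ratio=0.6`) are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal sketch of an MAE-style pretraining objective (illustrative only,
# not the paper's code). The encoder only ever sees the visible tokens; the
# loss is the reconstruction error on the tokens that were masked out.
class TinyMAE(nn.Module):
    def __init__(self, dim: int = 192, mask_ratio: float = 0.6):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Stand-ins for the real transformer encoder/decoder.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=1)
        self.head = nn.Linear(dim, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        B, N, D = tokens.shape
        n_keep = int(N * (1.0 - self.mask_ratio))

        # Random per-sample permutation: keep the first n_keep tokens, mask the rest.
        order = torch.rand(B, N, device=tokens.device).argsort(dim=1)
        restore = order.argsort(dim=1)
        shuffled = torch.gather(tokens, 1, order.unsqueeze(-1).expand(-1, -1, D))
        visible = shuffled[:, :n_keep]

        # Encode only the visible tokens (this is where sparsity saves compute).
        latent = self.encoder(visible)

        # Fill the masked slots with a learned mask token and decode everything.
        filled = torch.cat([latent, self.mask_token.expand(B, N - n_keep, D)], dim=1)
        filled = torch.gather(filled, 1, restore.unsqueeze(-1).expand(-1, -1, D))
        pred = self.head(self.decoder(filled))

        # Reconstruction loss on the masked positions only.
        masked = restore >= n_keep
        return ((pred - tokens) ** 2).mean(-1)[masked].mean()
```

For example, `TinyMAE()(torch.randn(2, 196, 192))` returns a scalar reconstruction loss computed only over the masked positions.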

Key Contributions

  • Simplicity and Efficiency: Hiera is designed to eschew the cumbersome modules found in models like MViT and Swin. By leveraging MAE pretraining, Hiera maintains competitive accuracy while being significantly faster during both inference and training.
  • Sparse MAE Pretraining: Extensive experiments demonstrate the efficacy of sparse MAE pretraining. To keep masking compatible with a multi-stage architecture, the authors mask whole "mask units" (contiguous blocks of tokens) rather than individual tokens, so the hierarchical pooling stages still operate on intact local neighborhoods despite the sparse input (see the sketch after this list).
  • Model Simplification: The paper systematically removes architectural complexities such as relative position embeddings, attention residuals, and convolutional layers, illustrating that these are unnecessary when spatial biases are learned through pretraining.
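
As a concrete illustration of the mask-unit idea referenced above, the sketch below decides visibility per unit (a contiguous block of tokens) and broadcasts that decision to every token in the unit, so the kept regions remain spatially contiguous for the later pooling stages. The grid size, unit size, and function name are illustrative assumptions, not values or code from the paper.

```python
import torch

# Sketch of mask-unit masking for a hierarchical model (illustrative only;
# the 16x16 token grid and 4x4-token units are assumed values).
def mask_unit_keep_mask(batch: int, grid: int = 16, unit: int = 4,
                        mask_ratio: float = 0.6,
                        device: str = "cpu") -> torch.Tensor:
    """Return a (batch, grid, grid) boolean map: True = token stays visible."""
    units_per_side = grid // unit
    n_units = units_per_side ** 2
    n_keep = int(n_units * (1.0 - mask_ratio))

    # Decide visibility per mask unit, not per token: rank random noise and
    # keep the n_keep lowest-ranked units in each sample.
    noise = torch.rand(batch, n_units, device=device)
    ranks = noise.argsort(dim=1).argsort(dim=1)
    keep_units = (ranks < n_keep).view(batch, units_per_side, units_per_side)

    # Broadcast each unit's decision to all of its tokens, so every kept
    # region is a contiguous block that pooling or windowed attention can
    # still operate on.
    return keep_units.repeat_interleave(unit, dim=1).repeat_interleave(unit, dim=2)
```

For example, `mask_unit_keep_mask(2)` returns a (2, 16, 16) boolean map in which each 4x4 block of tokens is either entirely visible or entirely masked.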

Empirical Results

  • Strong Numerical Performance: Hiera surpasses the state of the art on a range of image recognition tasks and delivers substantial improvements over prior work on video recognition benchmarks such as Kinetics-400 and Something-Something-v2.
  • Speed Improvements: Hiera runs up to 2.4x faster than MViTv2 on image tasks and significantly reduces inference time on video tasks, making it well suited to practical deployment.

Implications and Future Directions

The results suggest that the removal of complex components from vision transformers is feasible, provided that robust pretraining techniques like MAE are employed. This raises questions about the necessity of certain architectural features traditionally deemed essential in transformer-based models.

Future developments could focus on further integrating Hiera-style architectures with additional pretraining paradigms, exploring the generalizability of this simplification strategy across other domains in AI.

Conclusion

"Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles" takes a bold step towards simplifying vision transformers without compromising performance. By demonstrating that learned biases can effectively replace manually designed modules, this research opens avenues for more efficient model designs in computer vision and beyond.
