- The paper introduces a streamlined hierarchical vision transformer that uses sparse MAE pretraining to inherently learn spatial biases while eliminating complex modules.
- It demonstrates significant speed improvements, running up to 2.4x faster at inference on images, and surpasses prior results on video benchmarks such as Kinetics-400 and Something-Something-v2.
- The study provides empirical evidence that removing architectural complexities in favor of robust pretraining can maintain high accuracy and efficiency.
Introduction
The paper presents Hiera, a simplified hierarchical vision transformer that strips away the architectural complexity common in many state-of-the-art models and relies instead on Masked Autoencoder (MAE) pretraining. This strategy lets the model learn spatial biases from data, eliminating the need for specialized architectural components that often slow these models down.
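To make the pretraining objective concrete, the sketch below shows the standard MAE reconstruction loss that this strategy builds on: the decoder predicts pixel patches for the whole image, but the loss is computed only on patches hidden from the encoder. This is a minimal PyTorch illustration; the function name and tensor layout are assumptions for exposition, not the authors' code.

```python
import torch
import torch.nn.functional as F

def mae_reconstruction_loss(pred: torch.Tensor,
                            target_patches: torch.Tensor,
                            mask: torch.Tensor) -> torch.Tensor:
    """Standard MAE objective (illustrative, not Hiera-specific code).

    pred:           (B, N, patch_dim) decoder predictions per patch
    target_patches: (B, N, patch_dim) ground-truth pixel patches
    mask:           (B, N) with 1 where the patch was masked, 0 where visible
    """
    mask = mask.float()
    # Per-patch mean squared error between predicted and true pixels.
    per_patch = F.mse_loss(pred, target_patches, reduction="none").mean(dim=-1)
    # Average only over the masked patches, so visible patches carry no loss.
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)
```

Because the encoder never sees the masked patches, it must infer spatial structure from context, which is how the pretraining supplies the biases that hand-designed modules would otherwise provide.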
Key Contributions
- Simplicity and Efficiency: Hiera is designed to eschew the cumbersome modules found in models like MViT and Swin. By leveraging MAE pretraining, Hiera maintains competitive accuracy while being significantly faster during both inference and training.
- Sparse MAE Pretraining: Extensive ablations demonstrate the efficacy of sparse MAE pretraining. To reconcile MAE's sparse, patch-dropping approach with a hierarchical model that operates on a rigid 2D grid, the authors decouple masking from individual tokens and instead mask larger "mask units" (see the sketch after this list).
- Model Simplification: The paper systematically removes architectural complexities such as relative position embeddings, attention residuals, and convolutional layers, illustrating that these are unnecessary when spatial biases are learned through pretraining.
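The sketch below illustrates the mask-unit idea referenced above as a hypothetical PyTorch helper; the function name, tensor layout, and the 8x8-token unit size are assumptions for illustration, not the authors' implementation. Tokens are grouped into larger mask units, whole units are dropped at random, and only the visible units are handed to the encoder, which is what keeps the pretraining sparse.

```python
import torch

def sample_visible_mask_units(x: torch.Tensor,
                              unit_size: int = 8,
                              mask_ratio: float = 0.6):
    """Group tokens into mask units and keep a random subset (illustrative).

    x: (B, H, W, C) grid of tokens after patch embedding.
    Returns the visible units, shape (B, num_keep, unit_size**2, C),
    and the indices of the kept units.
    """
    B, H, W, C = x.shape
    hu, wu = H // unit_size, W // unit_size  # mask-unit grid size
    # Reshape the token grid into (B, num_units, tokens_per_unit, C).
    units = (
        x.reshape(B, hu, unit_size, wu, unit_size, C)
         .permute(0, 1, 3, 2, 4, 5)
         .reshape(B, hu * wu, unit_size * unit_size, C)
    )
    num_keep = max(1, int(hu * wu * (1 - mask_ratio)))
    # Shuffle units independently per sample and keep a random subset.
    noise = torch.rand(B, hu * wu, device=x.device)
    keep_idx = noise.argsort(dim=1)[:, :num_keep]  # (B, num_keep)
    visible = torch.gather(
        units, 1,
        keep_idx[:, :, None, None].expand(-1, -1, units.shape[2], C),
    )
    return visible, keep_idx
```

Because masking happens at the unit level rather than per token, pooling and local attention inside a unit never straddle a masked/visible boundary, and entire masked units can simply be skipped by the encoder.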
Empirical Results
- Strong Numerical Performance: Hiera surpasses the prior state of the art on various image recognition tasks and performs especially well on video recognition, with substantial improvements over previous work on Kinetics-400 and Something-Something-v2.
- Speed Improvements: Hiera runs up to 2.4x faster than MViTv2 on images and significantly reduces inference time on video tasks, making it well suited for practical deployment.
Implications and Future Directions
The results suggest that the removal of complex components from vision transformers is feasible, provided that robust pretraining techniques like MAE are employed. This raises questions about the necessity of certain architectural features traditionally deemed essential in transformer-based models.
Future work could combine Hiera-style architectures with other pretraining paradigms and explore how well this simplification strategy generalizes to other domains in AI.
Conclusion
"Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles" takes a bold step towards simplifying vision transformers without compromising performance. By demonstrating that learned biases can effectively replace manually designed modules, this research opens avenues for more efficient model designs in computer vision and beyond.