- The paper introduces a streamlined hierarchical vision transformer that uses sparse MAE pretraining to inherently learn spatial biases while eliminating complex modules.
- It demonstrates significant speed improvements, running up to 2.4x faster at inference on images, and surpasses prior results on video benchmarks such as Kinetics-400 and Something-Something-v2.
- The study provides empirical evidence that removing architectural complexities in favor of robust pretraining can maintain high accuracy and efficiency.
Introduction
The paper presents Hiera, a simplified hierarchical vision transformer that strips away the architectural complexity common in many state-of-the-art models and relies instead on Masked Autoencoder (MAE) pretraining. This strategy lets the model learn spatial biases from data, eliminating the need for specialized architectural components that often slow these models down.
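To make the pretraining objective concrete, the sketch below shows the standard MAE reconstruction loss that this strategy builds on: the decoder predicts pixel patches for the whole image, but the loss is computed only on patches hidden from the encoder. This is a minimal PyTorch illustration; the function name and tensor layout are assumptions for exposition, not the authors' code.

```python
import torch
import torch.nn.functional as F

def mae_reconstruction_loss(pred: torch.Tensor,
                            target_patches: torch.Tensor,
                            mask: torch.Tensor) -> torch.Tensor:
    """Standard MAE objective (illustrative, not Hiera-specific code).

    pred:           (B, N, patch_dim) decoder predictions per patch
    target_patches: (B, N, patch_dim) ground-truth pixel patches
    mask:           (B, N) with 1 where the patch was masked, 0 where visible
    """
    mask = mask.float()
    # Per-patch mean squared error between predicted and true pixels.
    per_patch = F.mse_loss(pred, target_patches, reduction="none").mean(dim=-1)
    # Average only over the masked patches, so visible patches carry no loss.
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)
```

Because the encoder never sees the masked patches, it must infer spatial structure from context, which is how the pretraining supplies the biases that hand-designed modules would otherwise provide.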
Key Contributions
- Simplicity and Efficiency: Hiera is designed to eschew the cumbersome modules found in models like MViT and Swin. By leveraging MAE pretraining, Hiera maintains competitive accuracy while being significantly faster during both inference and training.
- Sparse MAE Pretraining: Extensive ablations demonstrate the efficacy of sparse MAE pretraining. To reconcile MAE's sparse, patch-dropping approach with a hierarchical model that operates on a rigid 2D grid, the authors decouple masking from individual tokens and instead mask larger "mask units" (see the sketch after this list).
- Model Simplification: The paper systematically removes architectural complexities such as relative position embeddings, attention residuals, and convolutional layers, illustrating that these are unnecessary when spatial biases are learned through pretraining.
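The sketch below illustrates the mask-unit idea referenced above as a hypothetical PyTorch helper; the function name, tensor layout, and the 8x8-token unit size are assumptions for illustration, not the authors' implementation. Tokens are grouped into larger mask units, whole units are dropped at random, and only the visible units are handed to the encoder, which is what keeps the pretraining sparse.

```python
import torch

def sample_visible_mask_units(x: torch.Tensor,
                              unit_size: int = 8,
                              mask_ratio: float = 0.6):
    """Group tokens into mask units and keep a random subset (illustrative).

    x: (B, H, W, C) grid of tokens after patch embedding.
    Returns the visible units, shape (B, num_keep, unit_size**2, C),
    and the indices of the kept units.
    """
    B, H, W, C = x.shape
    hu, wu = H // unit_size, W // unit_size  # mask-unit grid size
    # Reshape the token grid into (B, num_units, tokens_per_unit, C).
    units = (
        x.reshape(B, hu, unit_size, wu, unit_size, C)
         .permute(0, 1, 3, 2, 4, 5)
         .reshape(B, hu * wu, unit_size * unit_size, C)
    )
    num_keep = max(1, int(hu * wu * (1 - mask_ratio)))
    # Shuffle units independently per sample and keep a random subset.
    noise = torch.rand(B, hu * wu, device=x.device)
    keep_idx = noise.argsort(dim=1)[:, :num_keep]  # (B, num_keep)
    visible = torch.gather(
        units, 1,
        keep_idx[:, :, None, None].expand(-1, -1, units.shape[2], C),
    )
    return visible, keep_idx
```

Because masking happens at the unit level rather than per token, pooling and local attention inside a unit never straddle a masked/visible boundary, and entire masked units can simply be skipped by the encoder.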
Empirical Results
- Strong Numerical Performance: Hiera surpasses the prior state of the art on various image recognition tasks and performs especially well on video recognition, with substantial improvements over previous work on Kinetics-400 and Something-Something-v2.
- Speed Improvements: Hiera runs up to 2.4x faster than MViTv2 on images and significantly reduces inference time on video tasks, making it well suited for practical deployment.
Implications and Future Directions
The results suggest that the removal of complex components from vision transformers is feasible, provided that robust pretraining techniques like MAE are employed. This raises questions about the necessity of certain architectural features traditionally deemed essential in transformer-based models.
Future work could combine Hiera-style architectures with other pretraining paradigms and explore how well this simplification strategy generalizes to other domains in AI.
Conclusion
"Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles" takes a bold step towards simplifying vision transformers without compromising performance. By demonstrating that learned biases can effectively replace manually designed modules, this research opens avenues for more efficient model designs in computer vision and beyond.