- The paper shows that MetaFormer architectures equipped with even trivial token mixers perform robustly: IdentityFormer and RandFormer reach over 80% and 81% top-1 accuracy on ImageNet-1K, respectively.
- The methodology employs diverse token mixers such as identity mapping, random matrices, and convolutions to validate MetaFormer’s adaptability and efficiency on vision tasks.
- The study introduces StarReLU, an activation function that reduces the activation's FLOPs by 71% compared to GELU, highlighting a practical route to improved computational efficiency.
Insightful Overview of the "MetaFormer Baselines for Vision" Paper
The paper "MetaFormer Baselines for Vision" examines and validates the hypothesis that MetaFormer, the general architecture abstracted from Transformers, is the main driver of strong performance on vision tasks. Departing from the traditional focus on elaborate token mixers, the paper deliberately instantiates the architecture with basic or well-established token mixing operations, demonstrating that the architectural skeleton alone can deliver impressive performance.
Because MetaFormer treats the token mixer as an interchangeable component, it serves as a reliable scaffold for competitive vision models. This is illustrated through several baseline models, each built around a different token mixer (a minimal sketch of this pluggable design follows the list):
- IdentityFormer and RandFormer: These models use the simplest possible token mixers to probe MetaFormer's lower bound on performance and its tolerance for arbitrary mixing operations. IdentityFormer employs identity mapping as the token mixer and still exceeds 80% top-1 accuracy on ImageNet-1K, establishing a solid performance floor. RandFormer, which mixes tokens with a frozen random matrix, does even better, surpassing 81% accuracy and showing that MetaFormer accommodates arbitrary token mixers.
- ConvFormer and CAFormer: These models use well-established token mixers. ConvFormer, built solely on depthwise separable convolutions, outperforms the strong pure-CNN model ConvNeXt. CAFormer, which uses convolutions in the lower stages and vanilla self-attention in the upper stages, reaches 85.5% top-1 accuracy on ImageNet-1K at 224x224 resolution, a new record under the normal supervised setting without external data or distillation.
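To make the abstraction concrete, here is a minimal PyTorch sketch of a MetaFormer block with a pluggable token mixer, instantiated in the spirit of IdentityFormer and RandFormer. The module names (`MetaFormerBlock`, `RandomMixing`), the layer sizes, and the plain LayerNorm/GELU choices are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class MetaFormerBlock(nn.Module):
    """One MetaFormer block: the token mixer is a pluggable sub-module.

    Input/output shape: (batch, num_tokens, dim). Layer sizes are
    illustrative, not the paper's exact configurations.
    """
    def __init__(self, dim, token_mixer, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mixer = token_mixer           # e.g. identity, random mixing, attention
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(                # channel MLP, shared by all variants
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):
        x = x + self.token_mixer(self.norm1(x))  # token mixing sub-block
        x = x + self.mlp(self.norm2(x))          # channel MLP sub-block
        return x

class RandomMixing(nn.Module):
    """Token mixing with a frozen random matrix over the token axis."""
    def __init__(self, num_tokens):
        super().__init__()
        matrix = torch.softmax(torch.rand(num_tokens, num_tokens), dim=-1)
        self.register_buffer("mix", matrix)      # fixed after initialization

    def forward(self, x):                        # x: (batch, tokens, dim)
        return torch.einsum("ts,bsd->btd", self.mix, x)

# IdentityFormer-style block: the token mixer is a no-op
ident_block = MetaFormerBlock(dim=64, token_mixer=nn.Identity())
# RandFormer-style block: the token mixer is a frozen random matrix
rand_block = MetaFormerBlock(dim=64, token_mixer=RandomMixing(num_tokens=196))

x = torch.randn(2, 196, 64)
assert ident_block(x).shape == rand_block(x).shape == x.shape
```

Swapping `nn.Identity()` for a separable convolution or self-attention module yields ConvFormer- or CAFormer-style blocks without touching the rest of the architecture, which is precisely the point the paper makes.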
The significant findings include:
- MetaFormer acts as an adaptable backbone that guarantees a solid performance floor regardless of the token mixer's sophistication, as evidenced by IdentityFormer.
- With well-established token mixers, MetaFormer readily yields strong models: ConvFormer and CAFormer match or surpass leading architectures, in some cases setting state-of-the-art results.
The investigation also presents StarReLU, an activation function that computes s * ReLU(x)^2 + b with learnable scalars s and b. It reduces the activation's FLOPs by 71% compared to GELU while maintaining, or even improving, model accuracy, particularly within MetaFormer-like architectures.
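A minimal sketch of StarReLU as the paper defines it is given below; the parameter names (`scale_init`, `bias_init`) are illustrative, while the initial values s = 0.8944 and b = -0.4472 follow the paper's derivation, which normalizes the output to zero mean and unit variance under a standard-normal input assumption.

```python
import torch
import torch.nn as nn

class StarReLU(nn.Module):
    """StarReLU: s * relu(x)**2 + b, with learnable scalars s and b.

    Defaults: s = 1/sqrt(1.25) ~ 0.8944 and b = -0.5/sqrt(1.25) ~ -0.4472,
    which give zero-mean, unit-variance outputs for standard-normal inputs.
    """
    def __init__(self, scale_init=0.8944, bias_init=-0.4472):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(scale_init))
        self.bias = nn.Parameter(torch.tensor(bias_init))
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.scale * self.relu(x) ** 2 + self.bias
```

Per element, StarReLU costs about 4 FLOPs (ReLU, square, scale, bias) versus roughly 14 for the common tanh-based GELU approximation, which is where the 71% reduction comes from.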
Implications and Future Prospects
The practical implication is that future architecture design may gain more from the underlying MetaFormer framework than from ever more intricate token-mixing strategies. The ease of swapping token mixers offers scalability and flexibility, making these models strong candidates for varied vision applications and potentially a more resource-efficient path to high accuracy.
On the theoretical side, this exploration motivates further research into the intrinsic qualities of MetaFormer that account for its adaptability and its guaranteed performance floor. Such understanding could lead to more precise interpretations and novel architectures inspired by MetaFormer principles.
Pairing the MetaFormer architecture with more advanced token mixers may well surpass current performance records. StarReLU likewise merits exploration across a wider range of neural network architectures, where it could deliver similar computational savings without compromising accuracy.
By underscoring the potential of an abstracted architecture like MetaFormer, this paper lays a foundation for future scalable, efficiency-driven model development in computer vision, guided by architectural design rather than token mixer complexity.