An Expert Review of "EVA-02: A Visual Representation for Neon Genesis"
The paper presents an approach to learning a Transformer-based visual representation through masked image modeling (MIM). The model is built on a plain Vision Transformer (ViT) architecture and is pre-trained with a giant CLIP vision encoder as its teacher. It achieves competitive performance across a variety of vision tasks with a significantly smaller model size and lower computational demands than existing state-of-the-art models.
Core Contributions
The research addresses a prevalent issue in computer vision: large-scale models remain inaccessible to much of the research community because of the extensive computational resources they require. The proposed method not only achieves strong performance with fewer parameters but also relies entirely on publicly accessible training data, underscoring the importance of democratizing AI research.
Methodology
The paper's approach rests on two key components: architectural innovations and an advanced pre-training strategy. The architecture integrates SwiGLU feed-forward networks (FFNs) and 2D rotary position embeddings (RoPE). These choices strike a balance between simplicity and effectiveness, aligning with recent trends in LLM architectures.
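To make the FFN change concrete, here is a minimal PyTorch sketch of a SwiGLU feed-forward block; the class name, hidden width, and inclusion of biases are illustrative assumptions rather than the paper's exact configuration, and the 2D RoPE applied inside attention is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Gated FFN: project the input twice, gate one branch with SiLU, project back."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim)   # gate branch
        self.w_value = nn.Linear(dim, hidden_dim)  # value branch
        self.w_out = nn.Linear(hidden_dim, dim)    # projection back to the model width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_out(F.silu(self.w_gate(x)) * self.w_value(x))

# Toy usage: a batch of 2 images, 196 patch tokens each, 1024-dim embeddings (ViT-L-like width).
tokens = torch.randn(2, 196, 1024)
ffn = SwiGLUFFN(dim=1024, hidden_dim=2730)  # hidden width chosen for illustration only
print(ffn(tokens).shape)  # torch.Size([2, 196, 1024])
```

Compared with the standard two-layer MLP, the gating adds a third projection, which is why SwiGLU FFNs are typically given a somewhat narrower hidden width to keep the parameter count comparable.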
The MIM strategy uses a powerful CLIP vision encoder as the teacher: the student ViT is trained to predict the teacher's features for masked-out image patches. This language-aligned target yields efficient learning and strong downstream performance across benchmarks including ImageNet-1K, COCO, and ADE20K.
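A rough sketch of such an objective, under the assumption that the student regresses the frozen teacher's patch features at masked positions with a negative cosine-similarity loss; the function name, shapes, and masking ratio are illustrative, and the paper's exact loss formulation may differ.

```python
import torch
import torch.nn.functional as F

def mim_distillation_loss(student_tokens: torch.Tensor,
                          teacher_tokens: torch.Tensor,
                          mask: torch.Tensor) -> torch.Tensor:
    """Negative cosine similarity between student and teacher features at masked patches.

    student_tokens: (B, N, D) features from the ViT being pre-trained
    teacher_tokens: (B, N, D) features from the frozen CLIP vision encoder
    mask:           (B, N) boolean tensor, True where a patch was masked out
    """
    s = F.normalize(student_tokens[mask], dim=-1)
    t = F.normalize(teacher_tokens[mask], dim=-1)
    return (1.0 - (s * t).sum(dim=-1)).mean()

# Toy shapes: 2 images, 196 patches, 1024-dim features, ~40% of patches masked.
student = torch.randn(2, 196, 1024, requires_grad=True)
teacher = torch.randn(2, 196, 1024)        # in practice, the frozen CLIP teacher's output
mask = torch.rand(2, 196) < 0.4
loss = mim_distillation_loss(student, teacher, mask)
loss.backward()
```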
Notable Results
Quantitatively, the paper reports strong results. The model achieves 90.0% top-1 accuracy on ImageNet-1K with only 304 million parameters, giving it a competitive edge over much larger models. In zero-shot image classification, the EVA-02-CLIP variant attains 80.4% top-1 accuracy on ImageNet-1K, outperforming its predecessors.
Furthermore, the model's transfer-learning capabilities are rigorously evaluated across multiple visual tasks. It consistently outperforms both comparably sized and larger models on object detection and segmentation benchmarks such as COCO and LVIS.
Theoretical and Practical Implications
The findings of this research have significant implications both theoretically and practically. Theoretically, it underscores the potential of leveraging language-aligned vision features for enhanced visual recognition and transferability. Practically, it democratizes state-of-the-art performance by making sophisticated models accessible without necessitating prohibitive computational or financial investments.
Future Directions
The paper suggests that alternating between MIM pre-training and CLIP training can create a feedback loop that improves both: the MIM-pre-trained student initializes a stronger CLIP vision tower, which in turn serves as a better teacher for the next round of MIM. This insight presents a promising avenue for developing modular and scalable AI systems; a sketch of the loop follows.
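A schematic of that loop, with stub functions standing in for full training runs; the names and round count are placeholders for illustration, not APIs or settings from the paper.

```python
def train_mim_student(clip_teacher: str) -> str:
    """Stub: pre-train a ViT by regressing the current teacher's masked patch features."""
    return f"mim_student[{clip_teacher}]"

def train_clip_from(vision_init: str) -> str:
    """Stub: train a new CLIP model whose vision tower is initialized from `vision_init`."""
    return f"clip[{vision_init}]"

teacher = "initial_clip_teacher"
for round_idx in range(2):                  # number of rounds is arbitrary here
    student = train_mim_student(teacher)    # MIM stage: distill the current CLIP teacher
    teacher = train_clip_from(student)      # CLIP stage: the student becomes the new vision tower
print(teacher)  # each round's teacher is built on the previous round's student
```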
Conclusion
In conclusion, the research makes a compelling case for efficient and accessible visual representation learning. By bridging the gap between cutting-edge performance and affordable resources, it paves the way for broader participation in AI advancements and enriches the wider research community. The work also sets the stage for continued exploration of scalable and versatile model design, pointing toward more integrated and comprehensive AI ecosystems.