MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis (2211.09117v2)

Published 16 Nov 2022 in cs.CV

Abstract: Generative modeling and representation learning are two key tasks in computer vision. However, these models are typically trained independently, which ignores the potential for each task to help the other, and leads to training and model maintenance overheads. In this work, we propose MAsked Generative Encoder (MAGE), the first framework to unify SOTA image generation and self-supervised representation learning. Our key insight is that using variable masking ratios in masked image modeling pre-training can allow generative training (very high masking ratio) and representation learning (lower masking ratio) under the same training framework. Inspired by previous generative models, MAGE uses semantic tokens learned by a vector-quantized GAN at inputs and outputs, combining this with masking. We can further improve the representation by adding a contrastive loss to the encoder output. We extensively evaluate the generation and representation learning capabilities of MAGE. On ImageNet-1K, a single MAGE ViT-L model obtains 9.10 FID in the task of class-unconditional image generation and 78.9% top-1 accuracy for linear probing, achieving state-of-the-art performance in both image generation and representation learning. Code is available at https://github.com/LTH14/mage.

Summary

  • The paper introduces MAGE, which unifies image synthesis and representation learning via variable masking ratios to balance generative and recognition tasks.
  • It leverages semantic tokens from a vector-quantized GAN and an optional contrastive loss to enhance semantic robustness and image quality.
  • Experimental results on ImageNet-1K show a 9.10 FID for image generation and 78.9% linear probing accuracy, demonstrating state-of-the-art performance.

MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis

The paper "MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis" addresses the long-standing challenge of unifying generative modeling and representation learning within a singular architectural framework in computer vision. Typically, these models have been developed independently, potentially overlooking synergies that could enhance both tasks.

Core Contributions

MAGE, or MAsked Generative Encoder, is introduced as the first framework to effectively integrate state-of-the-art (SOTA) image generation capabilities with self-supervised representation learning. The central concept involves the innovative use of variable masking ratios in masked image modeling (MIM) pre-training. A high masking ratio supports generative model training, while a lower one facilitates representation learning, allowing both processes to occur under a unified framework.
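
A minimal sketch of this idea is shown below: a masking ratio is sampled per image from a truncated Gaussian and applied to a sequence of discrete tokens. The distribution parameters here are illustrative assumptions rather than the paper's exact values.

```python
import torch

def sample_variable_mask(token_ids, mean=0.55, std=0.25, lo=0.5, hi=1.0):
    """Mask a variable fraction of token positions (illustrative sketch).

    token_ids: (B, N) tensor of discrete token ids.
    Returns a boolean mask of shape (B, N), True where a token is masked.
    The truncated-Gaussian parameters are assumptions, not the paper's exact values.
    """
    B, N = token_ids.shape
    # One masking ratio per image; clamping approximates truncation to [lo, hi].
    ratio = (torch.randn(B) * std + mean).clamp(lo, hi)
    num_masked = (ratio * N).long()                    # number of masked tokens per image

    # Choose a random subset of positions of that size for each image.
    scores = torch.rand(B, N)
    ranks = scores.argsort(dim=1).argsort(dim=1)       # random rank of each position
    return ranks < num_masked.unsqueeze(1)
```

With ratios near 1.0 the model is trained essentially as a generator that reconstructs tokens from almost nothing, while ratios closer to 0.5 leave enough visible context for the encoder to learn transferable representations.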

Methodology and Architecture

The architecture operates on semantic tokens produced by a vector-quantized GAN (VQGAN) at both its inputs and outputs, combined with token masking. This enables high-quality image generation and semantic-level representation learning from a single model. To further improve representation quality, a contrastive loss can optionally be applied to the encoder output, enhancing the semantic robustness of the learned features.
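
To make the training objective concrete, the hedged sketch below combines the pieces described above: a frozen VQGAN tokenizer, a transformer encoder-decoder over masked token sequences, a cross-entropy reconstruction loss on masked positions, and an optional InfoNCE-style contrastive loss on pooled encoder features from two differently masked views. All interfaces (`vqgan.encode_to_ids`, `encoder`, `decoder`, `proj_head`) are placeholder assumptions, not the released repository's API, and the contrastive term follows a SimCLR-style formulation rather than necessarily matching the paper's exact variant.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.2):
    """Symmetric InfoNCE loss; positives are matching rows of z1 and z2."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature
    labels = torch.arange(z1.size(0), device=z1.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def mage_style_loss(images, vqgan, encoder, decoder, proj_head,
                    mask_token_id, contrastive_weight=0.1):
    """One training step of a MAGE-style objective (illustrative sketch only)."""
    with torch.no_grad():
        token_ids = vqgan.encode_to_ids(images)        # (B, N) discrete semantic tokens

    def forward_view():
        mask = sample_variable_mask(token_ids)         # variable-ratio mask (see earlier sketch)
        masked = token_ids.masked_fill(mask, mask_token_id)
        feats = encoder(masked)                        # (B, N, D) encoder features
        logits = decoder(feats)                        # (B, N, vocab) logits over the codebook
        rec = F.cross_entropy(logits[mask], token_ids[mask])   # reconstruct masked tokens only
        pooled = proj_head(feats.mean(dim=1))          # global feature for the contrastive term
        return rec, pooled

    rec1, z1 = forward_view()
    rec2, z2 = forward_view()                          # second view = a different random mask
    return 0.5 * (rec1 + rec2) + contrastive_weight * info_nce(z1, z2)
```

Because the reconstruction target is a discrete codebook index, the decoder's output is a classification over the VQGAN vocabulary rather than a pixel-space regression, which is what lets the same model serve both generation and recognition.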

The encoder-decoder structure employed by MAGE unifies both tasks through the systematic application of variable masking ratios during training. This strategy exploits the high-level semantic understanding that generative and recognition tasks both require.
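
At inference time, the same encoder-decoder can synthesize images from scratch by starting from a fully masked token sequence and filling it in over several iterations, in the spirit of iterative masked-token decoding. The sketch below is an illustrative, assumption-laden version of such a procedure (cosine schedule, greedy sampling, placeholder interfaces); it is not the repository's actual generation routine.

```python
import math
import torch

@torch.no_grad()
def generate_unconditional(encoder, decoder, vqgan, mask_token_id,
                           num_tokens=256, steps=12, batch_size=4, device="cpu"):
    """Class-unconditional generation by iterative unmasking (illustrative sketch)."""
    ids = torch.full((batch_size, num_tokens), mask_token_id,
                     dtype=torch.long, device=device)
    for step in range(steps):
        probs = decoder(encoder(ids)).softmax(dim=-1)  # (B, N, vocab)
        sampled = probs.argmax(dim=-1)                 # greedy; temperature sampling also possible
        conf = probs.max(dim=-1).values                # confidence of each prediction

        still_masked = ids == mask_token_id
        ids = torch.where(still_masked, sampled, ids)  # fill every currently masked position

        # Cosine schedule: fraction of tokens to re-mask for the next iteration.
        frac = math.cos(math.pi / 2 * (step + 1) / steps)
        num_remask = int(frac * num_tokens)
        if num_remask > 0:
            # Keep the most confident predictions; re-mask the least confident ones.
            conf = conf.masked_fill(~still_masked, float("inf"))
            remask_idx = conf.topk(num_remask, dim=-1, largest=False).indices
            ids.scatter_(1, remask_idx, mask_token_id)

    return vqgan.decode_from_ids(ids)                  # map token ids back to images (assumed API)
```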

Results and Evaluation

Extensive evaluations demonstrate MAGE's strong performance. On ImageNet-1K, a single ViT-L model achieves a 9.10 FID for class-unconditional image generation and 78.9% top-1 accuracy under linear probing, SOTA results for both image synthesis and representation learning. With ViT-B, the model reaches 11.11 FID, a substantial improvement over the prior best result of 20.68. Additionally, with only weak augmentations, the model further reduces the FID score.

Implications and Future Work

The integration of generative modeling with representation learning implies potential practical applications in domains requiring both processes, such as enhanced photo-editing tools and advanced augmented reality systems. Theoretically, this research contributes to the understanding of how such tasks can reinforce one another once effectively unified.

Looking forward, extending MAGE to larger-scale datasets such as JFT-300M may reveal further potential for bridging visual comprehension and generation in artificial intelligence systems. Fine-tuning the balance between masking ratios and adding contextual learning layers could enhance the robustness and applicability of future iterations of such models.

In conclusion, MAGE represents a significant step toward versatile, unified computer vision models capable of pushing the frontiers of what cohesive generative and interpretive systems can achieve. The findings also open new avenues for advancements in cross-domain AI tasks that require sophisticated visual understanding and synthesis capabilities.