MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis (2211.09117v2)

Published 16 Nov 2022 in cs.CV

Abstract: Generative modeling and representation learning are two key tasks in computer vision. However, these models are typically trained independently, which ignores the potential for each task to help the other, and leads to training and model maintenance overheads. In this work, we propose MAsked Generative Encoder (MAGE), the first framework to unify SOTA image generation and self-supervised representation learning. Our key insight is that using variable masking ratios in masked image modeling pre-training can allow generative training (very high masking ratio) and representation learning (lower masking ratio) under the same training framework. Inspired by previous generative models, MAGE uses semantic tokens learned by a vector-quantized GAN at inputs and outputs, combining this with masking. We can further improve the representation by adding a contrastive loss to the encoder output. We extensively evaluate the generation and representation learning capabilities of MAGE. On ImageNet-1K, a single MAGE ViT-L model obtains 9.10 FID in the task of class-unconditional image generation and 78.9% top-1 accuracy for linear probing, achieving state-of-the-art performance in both image generation and representation learning. Code is available at https://github.com/LTH14/mage.

Summary

  • The paper introduces MAGE, which unifies image synthesis and representation learning via variable masking ratios to balance generative and recognition tasks.
  • It leverages semantic tokens from a vector-quantized GAN and an optional contrastive loss to enhance semantic robustness and image quality.
  • Experimental results on ImageNet-1K show a 9.10 FID for image generation and 78.9% linear probing accuracy, demonstrating state-of-the-art performance.

MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis

The paper "MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis" addresses the long-standing challenge of unifying generative modeling and representation learning within a singular architectural framework in computer vision. Typically, these models have been developed independently, potentially overlooking synergies that could enhance both tasks.

Core Contributions

MAGE, or MAsked Generative Encoder, is introduced as the first framework to effectively integrate state-of-the-art (SOTA) image generation capabilities with self-supervised representation learning. The central concept involves the innovative use of variable masking ratios in masked image modeling (MIM) pre-training. A high masking ratio supports generative model training, while a lower one facilitates representation learning, allowing both processes to occur under a unified framework.
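
A minimal sketch of this idea is shown below: a masking ratio is sampled per image from a truncated Gaussian and applied to a sequence of discrete tokens. The distribution parameters here are illustrative assumptions rather than the paper's exact values.

```python
import torch

def sample_variable_mask(token_ids, mean=0.55, std=0.25, lo=0.5, hi=1.0):
    """Mask a variable fraction of token positions (illustrative sketch).

    token_ids: (B, N) tensor of discrete token ids.
    Returns a boolean mask of shape (B, N), True where a token is masked.
    The truncated-Gaussian parameters are assumptions, not the paper's exact values.
    """
    B, N = token_ids.shape
    # One masking ratio per image; clamping approximates truncation to [lo, hi].
    ratio = (torch.randn(B) * std + mean).clamp(lo, hi)
    num_masked = (ratio * N).long()                    # number of masked tokens per image

    # Choose a random subset of positions of that size for each image.
    scores = torch.rand(B, N)
    ranks = scores.argsort(dim=1).argsort(dim=1)       # random rank of each position
    return ranks < num_masked.unsqueeze(1)
```

With ratios near 1.0 the model is trained essentially as a generator that reconstructs tokens from almost nothing, while ratios closer to 0.5 leave enough visible context for the encoder to learn transferable representations.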

Methodology and Architecture

The architecture operates on semantic tokens produced by a vector-quantized GAN (VQGAN) at both its inputs and outputs, combined with token masking. This enables high-quality image generation and semantic-level representation learning from a single model. To further improve representation quality, a contrastive loss can optionally be applied to the encoder output, enhancing the semantic robustness of the learned features.
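
To make the training objective concrete, the hedged sketch below combines the pieces described above: a frozen VQGAN tokenizer, a transformer encoder-decoder over masked token sequences, a cross-entropy reconstruction loss on masked positions, and an optional InfoNCE-style contrastive loss on pooled encoder features from two differently masked views. All interfaces (`vqgan.encode_to_ids`, `encoder`, `decoder`, `proj_head`) are placeholder assumptions, not the released repository's API, and the contrastive term follows a SimCLR-style formulation rather than necessarily matching the paper's exact variant.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.2):
    """Symmetric InfoNCE loss; positives are matching rows of z1 and z2."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature
    labels = torch.arange(z1.size(0), device=z1.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def mage_style_loss(images, vqgan, encoder, decoder, proj_head,
                    mask_token_id, contrastive_weight=0.1):
    """One training step of a MAGE-style objective (illustrative sketch only)."""
    with torch.no_grad():
        token_ids = vqgan.encode_to_ids(images)        # (B, N) discrete semantic tokens

    def forward_view():
        mask = sample_variable_mask(token_ids)         # variable-ratio mask (see earlier sketch)
        masked = token_ids.masked_fill(mask, mask_token_id)
        feats = encoder(masked)                        # (B, N, D) encoder features
        logits = decoder(feats)                        # (B, N, vocab) logits over the codebook
        rec = F.cross_entropy(logits[mask], token_ids[mask])   # reconstruct masked tokens only
        pooled = proj_head(feats.mean(dim=1))          # global feature for the contrastive term
        return rec, pooled

    rec1, z1 = forward_view()
    rec2, z2 = forward_view()                          # second view = a different random mask
    return 0.5 * (rec1 + rec2) + contrastive_weight * info_nce(z1, z2)
```

Because the reconstruction target is a discrete codebook index, the decoder's output is a classification over the VQGAN vocabulary rather than a pixel-space regression, which is what lets the same model serve both generation and recognition.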

The encoder-decoder structure employed by MAGE unifies both tasks through the systematic application of variable masking ratios during training. This strategy exploits the high-level semantic understanding that generative and recognition tasks both require.
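
At inference time, the same encoder-decoder can synthesize images from scratch by starting from a fully masked token sequence and filling it in over several iterations, in the spirit of iterative masked-token decoding. The sketch below is an illustrative, assumption-laden version of such a procedure (cosine schedule, greedy sampling, placeholder interfaces); it is not the repository's actual generation routine.

```python
import math
import torch

@torch.no_grad()
def generate_unconditional(encoder, decoder, vqgan, mask_token_id,
                           num_tokens=256, steps=12, batch_size=4, device="cpu"):
    """Class-unconditional generation by iterative unmasking (illustrative sketch)."""
    ids = torch.full((batch_size, num_tokens), mask_token_id,
                     dtype=torch.long, device=device)
    for step in range(steps):
        probs = decoder(encoder(ids)).softmax(dim=-1)  # (B, N, vocab)
        sampled = probs.argmax(dim=-1)                 # greedy; temperature sampling also possible
        conf = probs.max(dim=-1).values                # confidence of each prediction

        still_masked = ids == mask_token_id
        ids = torch.where(still_masked, sampled, ids)  # fill every currently masked position

        # Cosine schedule: fraction of tokens to re-mask for the next iteration.
        frac = math.cos(math.pi / 2 * (step + 1) / steps)
        num_remask = int(frac * num_tokens)
        if num_remask > 0:
            # Keep the most confident predictions; re-mask the least confident ones.
            conf = conf.masked_fill(~still_masked, float("inf"))
            remask_idx = conf.topk(num_remask, dim=-1, largest=False).indices
            ids.scatter_(1, remask_idx, mask_token_id)

    return vqgan.decode_from_ids(ids)                  # map token ids back to images (assumed API)
```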

Results and Evaluation

Extensive evaluations demonstrate MAGE's strong performance. On ImageNet-1K, a single ViT-L model achieves a 9.10 FID for class-unconditional image generation and 78.9% top-1 accuracy under linear probing, SOTA results for both image synthesis and representation learning. With ViT-B, the model reaches 11.11 FID, a substantial improvement over the prior best result of 20.68. Additionally, with only weak augmentations, the model further reduces the FID score.

Implications and Future Work

The integration of generative modeling with representation learning implies potential practical applications in domains requiring both processes, such as enhanced photo-editing tools and advanced augmented reality systems. Theoretically, this research contributes to the understanding of how such tasks can reinforce one another once effectively unified.

Looking forward, extending MAGE to larger-scale datasets such as JFT-300M may reveal further potential for bridging visual comprehension and generation in artificial intelligence systems. Fine-tuning the balance between masking ratios and adding contextual learning layers could enhance the robustness and applicability of future iterations of such models.

In conclusion, MAGE represents a significant step toward versatile, unified computer vision models capable of pushing the frontiers of what cohesive generative and interpretive systems can achieve. The findings also open new avenues for advancements in cross-domain AI tasks that require sophisticated visual understanding and synthesis capabilities.