Denoising Autoregressive Representation Learning (2403.05196v2)
Abstract: In this paper, we explore a new generative approach for learning visual representations. Our method, DARL, employs a decoder-only Transformer to predict image patches autoregressively. We find that training with Mean Squared Error (MSE) alone leads to strong representations. To enhance the image generation ability, we replace the MSE loss with the diffusion objective by using a denoising patch decoder. We show that the learned representation can be improved by using tailored noise schedules and longer training in larger models. Notably, the optimal schedule differs significantly from the typical ones used in standard image diffusion models. Overall, despite its simple architecture, DARL delivers performance remarkably close to state-of-the-art masked prediction models under the fine-tuning protocol. This marks an important step towards a unified model capable of both visual perception and generation, effectively combining the strengths of autoregressive and denoising diffusion models.
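To make the setup described above concrete, below is a minimal PyTorch sketch (not the paper's code) of the two objectives the abstract mentions: a decoder-only Transformer that predicts image patches autoregressively with an MSE loss, and a variant in which the next patch is corrupted with Gaussian noise and reconstructed by a small denoising patch decoder conditioned on the causal context. All module names, layer sizes, and the continuous noise schedule are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn


class CausalPatchTransformer(nn.Module):
    """Decoder-only Transformer over flattened image patches in raster order."""

    def __init__(self, patch_dim=768, dim=512, depth=6, heads=8, max_patches=256):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)                     # embed flattened patches
        self.pos = nn.Parameter(torch.zeros(1, max_patches, dim))  # learned positional embedding
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, patches):
        # patches: (B, N, patch_dim)
        N = patches.size(1)
        x = self.embed(patches) + self.pos[:, :N]
        mask = nn.Transformer.generate_square_subsequent_mask(N).to(x.device)
        return self.blocks(x, mask=mask)  # (B, N, dim); position t attends only to patches <= t


def mse_next_patch_loss(backbone, head, patches):
    """Plain autoregressive objective: predict patch t+1 from patches 1..t with MSE."""
    feats = backbone(patches[:, :-1])
    pred = head(feats)                      # linear patch decoder
    return ((pred - patches[:, 1:]) ** 2).mean()


def denoising_next_patch_loss(backbone, denoiser, patches):
    """Diffusion-style variant (assumed parameterisation): the next patch is corrupted
    with Gaussian noise at a random level t, and a small patch decoder reconstructs it
    conditioned on the causal context features."""
    feats = backbone(patches[:, :-1])
    target = patches[:, 1:]
    t = torch.rand(target.size(0), 1, 1)                           # per-image noise level in (0, 1)
    noised = (1 - t).sqrt() * target + t.sqrt() * torch.randn_like(target)
    pred = denoiser(torch.cat([noised, feats], dim=-1))            # conditioned denoising patch decoder
    return ((pred - target) ** 2).mean()


# Usage sketch (sizes are illustrative): 14x14 grid of 16x16x3 patches flattened to 768 dims.
backbone = CausalPatchTransformer(patch_dim=768, dim=512)
mse_head = nn.Linear(512, 768)
denoiser = nn.Sequential(nn.Linear(768 + 512, 512), nn.GELU(), nn.Linear(512, 768))
patches = torch.randn(2, 196, 768)
loss_mse = mse_next_patch_loss(backbone, mse_head, patches)
loss_denoise = denoising_next_patch_loss(backbone, denoiser, patches)
```

In this sketch the backbone is shared between the two objectives; only the patch decoder and target change, which mirrors the abstract's description of swapping the MSE loss for a denoising objective while keeping the decoder-only architecture.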