Denoising Autoregressive Representation Learning (2403.05196v2)

Published 8 Mar 2024 in cs.LG and cs.CV

Abstract: In this paper, we explore a new generative approach for learning visual representations. Our method, DARL, employs a decoder-only Transformer to predict image patches autoregressively. We find that training with Mean Squared Error (MSE) alone leads to strong representations. To enhance the image generation ability, we replace the MSE loss with the diffusion objective by using a denoising patch decoder. We show that the learned representation can be improved by using tailored noise schedules and longer training in larger models. Notably, the optimal schedule differs significantly from the typical ones used in standard image diffusion models. Overall, despite its simple architecture, DARL delivers performance remarkably close to state-of-the-art masked prediction models under the fine-tuning protocol. This marks an important step towards a unified model capable of both visual perception and generation, effectively combining the strengths of autoregressive and denoising diffusion models.


Summary

  • The paper introduces DARL, which trains a decoder-only Transformer to predict image patches autoregressively; MSE training alone already yields representations close to state-of-the-art masked prediction models, and a denoising diffusion objective adds generative capability.
  • The paper leverages a decoder-only Transformer enhanced by a novel 2D Rotary Positional Embedding to improve autoregressive modeling of image patches.
  • The paper demonstrates that scaling model size and extending training duration with optimized noise schedules bolsters both representation and generative capabilities.

Exploring DARL: A Unified Model for Visual Representation and Generation

Introduction

In the pursuit of stronger generative pre-training for computer vision, this paper introduces Denoising Autoregressive Representation Learning (DARL), an approach that combines the strengths of autoregressive and denoising diffusion models within a unified architecture. DARL employs a decoder-only Transformer tasked with predicting image patches autoregressively. Trained with a plain Mean Squared Error (MSE) loss, it learns representations that come remarkably close to state-of-the-art masked prediction models under the fine-tuning protocol. Replacing the MSE loss with a diffusion objective, implemented through a denoising patch decoder, further improves image generation, marking a step toward versatile models capable of both visual perception and generation.
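To make the core recipe concrete, the following is a minimal sketch of autoregressive patch prediction trained with MSE, in the spirit of DARL. It is an illustrative reconstruction, not the authors' code: the module sizes, the patchify helper, and the learned positional embedding are assumptions (the paper itself advocates a decomposed 2D RoPE, discussed below).

```python
# Illustrative sketch only: a decoder-only (causal) Transformer that predicts
# the next image patch from all previous patches and is trained with MSE.
# Module sizes, the patchify helper, and the learned positional embedding are
# assumptions for exposition; the paper favors a decomposed 2D RoPE instead.
import torch
import torch.nn as nn

class CausalPatchTransformer(nn.Module):
    def __init__(self, patch_dim=768, dim=512, depth=6, heads=8, num_patches=196):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)                     # raw patch pixels -> tokens
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))  # simple learned positions
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.to_pixels = nn.Linear(dim, patch_dim)                 # predict pixels of the next patch

    def forward(self, patches):                                    # patches: (B, N, patch_dim), raster order
        n = patches.shape[1]
        x = self.embed(patches) + self.pos[:, :n]
        causal = torch.triu(torch.full((n, n), float("-inf")), diagonal=1).to(patches.device)
        return self.to_pixels(self.blocks(x, mask=causal))         # token i only attends to patches <= i

def patchify(imgs, p=16):                                          # (B, 3, H, W) -> (B, N, 3*p*p)
    b, c, h, w = imgs.shape
    x = imgs.unfold(2, p, p).unfold(3, p, p)                       # (B, C, H/p, W/p, p, p)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)

patches = patchify(torch.randn(4, 3, 224, 224))                    # (4, 196, 768)
model = CausalPatchTransformer()
pred = model(patches[:, :-1])                                      # predict patch t+1 from patches <= t
loss = nn.functional.mse_loss(pred, patches[:, 1:])                # MSE alone already gives strong features
loss.backward()
```

Under this simple recipe, the paper reports that the fine-tuned features already land close to those of masked prediction models.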

Key Contributions

The paper makes several noteworthy contributions to the field of visual representation learning:

  • Denoising Autoregressive Learning: DARL combines autoregressive patch prediction with a denoising diffusion objective. This hybrid approach learns visual representations that reach near-parity with leading masked prediction models under fine-tuning evaluation.
  • Positional Encoding Insights: Extensive experiments show that a decomposed 2D Rotary Positional Embedding (RoPE) suits causal Transformers on images, outperforming standard positional encoding schemes and particularly benefiting autoregressive models.
  • Model Scaling and Noise Schedules: The paper studies how model size, training duration, and noise schedules affect learning. Larger models, longer training, and tailored noise schedules all improve performance, and the best schedules differ markedly from those used in standard image diffusion models.
  • Efficacy of MSE and Diffusion Objectives: The paper compares MSE loss against diffusion objectives for pre-training. MSE alone already yields strong representations, while the diffusion objective, implemented with a denoising patch decoder, improves generative capability, especially with tailored noise schedules and longer training (a minimal sketch of such a decoder follows this list).
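The sketch below illustrates what such a denoising patch decoder could look like: the next patch is corrupted with Gaussian noise drawn from a schedule, and a small decoder predicts the clean patch from the causal context, the noisy patch, and the noise level. The MLP decoder, the baseline cosine schedule, and the choice of predicting the clean patch (rather than the noise) are assumptions made here for brevity; the paper's best-performing schedules notably differ from the standard ones used in image diffusion.

```python
# Hedged sketch of a denoising patch objective on top of the causal backbone
# above. The MLP decoder, the baseline cosine schedule, and clean-patch
# prediction are illustrative assumptions, not the paper's exact choices.
import math
import torch
import torch.nn as nn

def cosine_alpha_bar(t):                                  # t in [0, 1]; standard cosine schedule baseline
    return torch.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2

class DenoisingPatchDecoder(nn.Module):
    def __init__(self, dim=512, patch_dim=768, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + patch_dim + 1, hidden), nn.GELU(),
            nn.Linear(hidden, patch_dim),                 # predict the clean next patch
        )

    def forward(self, context, noisy_patch, t):
        # context: (B, N, dim) causal features; noisy_patch: (B, N, patch_dim); t: (B, N, 1)
        return self.net(torch.cat([context, noisy_patch, t], dim=-1))

def denoising_loss(context, target_patches, decoder):
    t = torch.rand(*target_patches.shape[:2], 1)          # per-patch noise level in [0, 1)
    a = cosine_alpha_bar(t)
    eps = torch.randn_like(target_patches)
    noisy = a.sqrt() * target_patches + (1 - a).sqrt() * eps   # forward-diffuse the target patch
    return nn.functional.mse_loss(decoder(context, noisy, t), target_patches)
```

In a training step this would replace the plain MSE loss above, with the decoder consuming the backbone's causal features for each position alongside the noised next patch.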

Theoretical and Practical Implications

From a theoretical standpoint, DARL prompts a reconsideration of what generative pre-training can achieve in vision: a single unified model can handle both representation learning and image generation with little loss in representation quality. Practically, this work points toward more flexible, generalizable visual models that can be fine-tuned for a variety of downstream tasks with minimal performance loss, broadening the applicability of generative models in real-world scenarios.
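As a rough illustration of how such fine-tuning might look in practice, the sketch below reuses a pretrained backbone and trains a linear head on pooled patch features. The class, the mean pooling, and dropping the causal mask during classification are assumptions for exposition, not details taken from the paper.

```python
# Hypothetical fine-tuning sketch: reuse the pretrained backbone's layers and
# train a linear classification head end-to-end. Dropping the causal mask and
# mean-pooling patch tokens are simplifying assumptions made here.
import torch
import torch.nn as nn

class FineTuneClassifier(nn.Module):
    def __init__(self, backbone, dim=512, num_classes=1000):
        super().__init__()
        self.backbone = backbone                          # e.g. the CausalPatchTransformer above
        self.head = nn.Linear(dim, num_classes)

    def forward(self, patches):                           # patches: (B, N, patch_dim)
        x = self.backbone.embed(patches) + self.backbone.pos[:, :patches.shape[1]]
        h = self.backbone.blocks(x)                       # no causal mask during classification
        return self.head(h.mean(dim=1))                   # global average pool over patch tokens
```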

Future research could delve into refining the noise schedule and extending the model's capabilities to encompass more complex, multi-modal tasks. The insights regarding positional encoding also open avenues for further enhancing the Transformer architecture's applicability across various data types beyond images.
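To ground the positional-encoding point, here is a minimal sketch of one way to build a decomposed ("axial") 2D rotary embedding: half of each attention head's channels are rotated according to a patch's row index and the other half according to its column index. The frequency choice and the even/odd channel pairing follow the common 1D RoPE convention and are assumptions here; the paper's exact formulation may differ.

```python
# Minimal sketch of a decomposed ("axial") 2D rotary embedding: half of each
# head's channels are rotated by the patch's row index, the other half by its
# column index. Frequencies and channel split are illustrative assumptions.
import torch

def rope_1d(x, pos, base=10000.0):
    # x: (..., N, d) with d even; pos: (N,) integer positions
    d = x.shape[-1]
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)  # (d/2,)
    angles = pos[:, None].float() * freqs[None, :]                     # (N, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin                               # rotate each (x1, x2) pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x, rows, cols):
    # x: (..., N, d); rotate the first half of channels by row, the second by column
    d = x.shape[-1]
    return torch.cat([rope_1d(x[..., : d // 2], rows),
                      rope_1d(x[..., d // 2 :], cols)], dim=-1)

# Example: queries for a 14x14 grid of patches, 64 channels per head
grid = torch.arange(14)
rows = grid.repeat_interleave(14)      # (196,) row index of each patch in raster order
cols = grid.repeat(14)                 # (196,) column index
q = torch.randn(8, 196, 64)            # (heads, patches, head_dim)
q_rot = rope_2d(q, rows, cols)
```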

Conclusion

DARL marks a significant step towards realizing the full potential of generative pre-training in vision. By blending autoregressive prediction with a denoising diffusion objective in a single framework, DARL achieves representation quality remarkably close to that of state-of-the-art masked prediction models while retaining the ability to generate images. By shedding light on how model components, noise schedules, and training objectives interact, the paper contributes foundational knowledge that will inform the development of more advanced, versatile generative models.