SD-DiT: Unleashing the Power of Self-supervised Discrimination in Diffusion Transformer (2403.17004v1)

Published 25 Mar 2024 in cs.CV and cs.MM

Abstract: The Diffusion Transformer (DiT) has emerged as the new trend in generative diffusion models for image generation. Given the extremely slow convergence of the typical DiT, recent breakthroughs have been driven by mask strategies that significantly improve training efficiency through additional intra-image contextual learning. Despite this progress, the mask strategy still suffers from two inherent limitations: (a) a training-inference discrepancy and (b) a fuzzy relation between mask reconstruction and the generative diffusion process, which together result in sub-optimal DiT training. In this work, we address these limitations by unleashing self-supervised discriminative knowledge to boost DiT training. Technically, we frame our DiT in a teacher-student manner. The teacher-student discriminative pairs are built on diffusion noises along the same Probability Flow Ordinary Differential Equation (PF-ODE). Instead of applying a mask reconstruction loss over both the DiT encoder and decoder, we decouple the encoder and decoder to tackle the discriminative and generative objectives separately. In particular, by encoding discriminative pairs with the student and teacher DiT encoders, a new discriminative loss is designed to encourage inter-image alignment in the self-supervised embedding space. The student samples are then fed into the student DiT decoder to perform the typical generative diffusion task. Extensive experiments on the ImageNet dataset show that our method achieves a competitive balance between training cost and generative capacity.
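
The abstract describes a training loop with three moving parts: discriminative pairs formed from two noise levels along the same PF-ODE trajectory, a teacher-student encoder pair whose embeddings are aligned by a discriminative loss, and a student decoder that alone carries the generative denoising objective. The sketch below illustrates that loop in PyTorch. It is a minimal sketch, not the paper's implementation: the tiny MLP encoder/decoder stand-ins, the noise-pairing rule, the cosine alignment loss, the 0.1 loss weight, and the EMA rate are all assumed for illustration.

```python
# Minimal sketch of an SD-DiT-style teacher-student training step.
# All module shapes, the noising scheme, and loss weights are illustrative assumptions.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDiTEncoder(nn.Module):
    """Stand-in for the DiT encoder (a real DiT uses transformer blocks on patch tokens)."""
    def __init__(self, dim=64, embed_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, embed_dim))
    def forward(self, x):
        return self.net(x)

class TinyDiTDecoder(nn.Module):
    """Stand-in for the DiT decoder that predicts the diffusion target (here, the noise)."""
    def __init__(self, embed_dim=32, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(embed_dim, dim), nn.GELU(), nn.Linear(dim, dim))
    def forward(self, z):
        return self.net(z)

student_enc, student_dec = TinyDiTEncoder(), TinyDiTDecoder()
teacher_enc = copy.deepcopy(student_enc)   # teacher is an EMA copy of the student encoder
for p in teacher_enc.parameters():
    p.requires_grad_(False)

opt = torch.optim.AdamW(
    list(student_enc.parameters()) + list(student_dec.parameters()), lr=1e-4
)

def training_step(x0):
    # Two noise levels along the same trajectory of each image: the teacher sees
    # a lightly noised view, the student a heavier one (pairing rule assumed).
    t_student = torch.rand(x0.size(0), 1)
    t_teacher = t_student * 0.5
    eps = torch.randn_like(x0)
    x_student = x0 + t_student * eps       # simplified variance-exploding noising
    x_teacher = x0 + t_teacher * eps

    # Discriminative objective: align student/teacher embeddings of the pair.
    z_s = F.normalize(student_enc(x_student), dim=-1)
    with torch.no_grad():
        z_t = F.normalize(teacher_enc(x_teacher), dim=-1)
    loss_disc = (1 - (z_s * z_t).sum(dim=-1)).mean()   # cosine alignment loss

    # Generative objective: only the student decoder performs denoising.
    eps_pred = student_dec(student_enc(x_student))
    loss_gen = F.mse_loss(eps_pred, eps)

    loss = loss_gen + 0.1 * loss_disc      # assumed weighting between objectives
    opt.zero_grad(); loss.backward(); opt.step()

    # EMA update keeps the teacher a slow-moving average of the student.
    with torch.no_grad():
        for pt, ps in zip(teacher_enc.parameters(), student_enc.parameters()):
            pt.mul_(0.996).add_(ps, alpha=0.004)
    return loss.item()

print(training_step(torch.randn(8, 64)))
```

The key design point the sketch captures is the decoupling: the alignment loss touches only the encoders, while the reconstruction loss flows through encoder and decoder, so the encoder learns discriminative structure without forcing the decoder into a mask-reconstruction objective.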

Authors (7)
  1. Rui Zhu (138 papers)
  2. Yingwei Pan (77 papers)
  3. Yehao Li (35 papers)
  4. Ting Yao (127 papers)
  5. Zhenglong Sun (13 papers)
  6. Tao Mei (209 papers)
  7. Chang Wen Chen (58 papers)
Citations (9)
