Return of Unconditional Generation: A Self-supervised Representation Generation Method (2312.03701v4)

Published 6 Dec 2023 in cs.CV

Abstract: Unconditional generation -- the problem of modeling data distribution without relying on human-annotated labels -- is a long-standing and fundamental challenge in generative models, creating a potential of learning from large-scale unlabeled data. In the literature, the generation quality of an unconditional method has been much worse than that of its conditional counterpart. This gap can be attributed to the lack of semantic information provided by labels. In this work, we show that one can close this gap by generating semantic representations in the representation space produced by a self-supervised encoder. These representations can be used to condition the image generator. This framework, called Representation-Conditioned Generation (RCG), provides an effective solution to the unconditional generation problem without using labels. Through comprehensive experiments, we observe that RCG significantly improves unconditional generation quality: e.g., it achieves a new state-of-the-art FID of 2.15 on ImageNet 256x256, largely reducing the previous best of 5.91 by a relative 64%. Our unconditional results are situated in the same tier as the leading class-conditional ones. We hope these encouraging observations will attract the community's attention to the fundamental problem of unconditional generation. Code is available at https://github.com/LTH14/rcg.

References (65)
  1. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  2. All are worth words: a vit backbone for score-based diffusion models. In NeurIPS 2022 Workshop on Score-Based Methods, 2022.
  3. Why are conditional generative models better than unconditional ones? arXiv preprint arXiv:2212.00362, 2022.
  4. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.
  5. High fidelity visualization of what your self-supervised representation knows about. arXiv preprint arXiv:2112.09164, 2021.
  6. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
  7. Large scale GAN training for high fidelity natural image synthesis. In Int. Conf. on Learning Representations (ICLR), 2019.
  8. Deep clustering for unsupervised learning of visual features. In Proceedings of the European conference on computer vision (ECCV), pages 132–149, 2018.
  9. Emerging properties in self-supervised vision transformers. In Int. Conference on Computer Vision (ICCV), pages 9650–9660, 2021.
  10. Instance-conditioned gan. Advances in Neural Information Processing Systems, 34:27517–27529, 2021.
  11. Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704, 2023.
  12. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11315–11325, 2022.
  13. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning (ICML), pages 1597–1607. PMLR, 2020.
  14. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
  15. Big self-supervised models are strong semi-supervised learners. Advances in Neural Information Processing Systems, 33, 2020.
  16. An empirical study of training self-supervised vision transformers. In Int. Conference on Computer Vision (ICCV), pages 9640–9649, 2021.
  17. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
  18. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
  19. Large scale adversarial representation learning. Advances in neural information processing systems, 32, 2019.
  20. An image is worth 16x16 words: Transformers for image recognition at scale. In Int. Conf. on Learning Representations (ICLR), 2021.
  21. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural networks, 107:3–11, 2018.
  22. Masked diffusion transformer is a strong image synthesizer. arXiv preprint arXiv:2303.14389, 2023.
  23. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.
  24. Generative adversarial nets. In Advances in Neural Information Processing Systems (NeurIPS), 2014.
  25. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733, 2020.
  26. Masked autoencoders are scalable vision learners. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 16000–16009, June 2022.
  27. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.
  28. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
  29. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  30. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
  31. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, pages 6840–6851, 2020.
  32. Cascaded diffusion models for high fidelity image generation. The Journal of Machine Learning Research, 23(1):2249–2281, 2022.
  33. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  34. Self-guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18413–18422, June 2023.
  35. Contrastive masked autoencoders are stronger vision learners. arXiv:2207.13532v1, 2022.
  36. A style-based generator architecture for generative adversarial networks. In CVPR, 2019.
  37. Autoregressive image generation using residual quantization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  38. Targeted supervised contrastive learning for long-tailed recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6918–6928, 2022.
  39. Mage: Masked generative encoder to unify representation learning and image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2142–2152, 2023.
  40. Diverse image generation via self-conditioned gans. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14286–14295, 2020.
  41. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
  42. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  43. High-fidelity image generation with fewer labels. In International conference on machine learning, pages 4183–4192. PMLR, 2019.
  44. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pages 69–84. Springer, 2016.
  45. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  46. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2536–2544, 2016.
  47. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
  48. Beit v2: Masked image modeling with vector-quantized visual tokenizers. arXiv preprint arXiv:2208.06366, 2022.
  49. Diffusion autoencoders: Toward a meaningful and decodable representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10619–10629, 2022.
  50. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  51. Generating diverse high-fidelity images with vq-vae-2. Advances in neural information processing systems, 32, 2019.
  52. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
  53. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016.
  54. Dave Salvator. Nvidia developer blog. https://developer.nvidia.com/blog/getting-immediate-speedups-with-a100-tf32, 2020.
  55. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  56. Score-based generative modeling through stochastic differential equations. In Int. Conf. on Learning Representations (ICLR), 2021.
  57. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
  58. Addp: Learning general representations for image recognition and generation with alternating denoising diffusion process. arXiv preprint arXiv:2306.05423, 2023.
  59. Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pages 10347–10357. PMLR, 2021.
  60. Neural discrete representation learning. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
  61. Vector-quantized image modeling with improved vqgan. arXiv preprint arXiv:2110.04627, 2021.
  62. Self-attention generative adversarial networks. In Int. Conference on Machine Learning (ICML), pages 7354–7363, 2019.
  63. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
  64. Through-wall human pose estimation using radio signals. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7356–7365, 2018.
  65. ibot: Image bert pre-training with online tokenizer. arXiv preprint arXiv:2111.07832, 2021.

Summary

  • The paper presents RCG, a framework that leverages a self-supervised image encoder and a diffusion-based representation generator to bolster unconditional image generation.
  • It achieves significant FID improvements on ImageNet 256x256 (e.g., reducing LDM-8 from 39.13 to 11.30) and matches or surpasses leading class-conditional methods.
  • The framework’s modular design enables efficient training, low computational overhead, and seamless extension to conditional generation tasks.

The paper "Return of Unconditional Generation: A Self-supervised Representation Generation Method" (2312.03701) introduces Representation-Conditioned Generation (RCG), a framework designed to address the significant performance gap between unconditional and conditional image generation, particularly on complex datasets like ImageNet. The core idea is to leverage the rich semantic information captured by self-supervised representations without relying on human-annotated labels.

RCG decomposes the challenging task of unconditional image generation into two more manageable steps:

  1. Generate a semantic representation in a low-dimensional, compact representation space.
  2. Generate the image conditioned on this generated representation.

This framework consists of three key components:

  1. Pre-trained Self-supervised Image Encoder: An off-the-shelf encoder (e.g., MoCo v3 [chen2021empirical] pre-trained on ImageNet) maps images to a representation space and is kept frozen throughout RCG training. The representations should be semantically rich yet low-dimensional enough for their distribution to be modeled easily. The paper uses Vision Transformers (ViT) of various sizes (S, B, L) pre-trained with contrastive methods or even supervised learning. The output dimensionality of the projection head matters, with 256 dimensions performing well in experiments; each image's representation is normalized by its own mean and variance.
  2. Representation Generator: This component learns to sample from the distribution of self-supervised representations. The paper implements it as a lightweight diffusion model, the Representation Diffusion Model (RDM). The RDM backbone is a fully-connected network composed of residual blocks, each containing LayerNorm [ba2016layer], SiLU [elfwing2018sigmoid], and linear layers, with timestep embeddings incorporated into every block (a minimal block sketch follows this list). The RDM is trained following the DDIM [song2020denoising] protocol to denoise representations mixed with Gaussian noise; at inference, it samples representations from noise via DDIM sampling. Because it is a fully-connected network operating on low-dimensional representations, the RDM adds marginal computational overhead compared to the image generator. Ablations show that performance benefits from increased depth and width up to a point (12 blocks and 1536 hidden dimensions work well) and from sufficient training epochs.
  3. Image Generator: This component takes a representation (generated by the RDM or extracted from a real image) as conditioning information and generates an image. The RCG framework is flexible and can utilize various existing conditional image generative models. The paper demonstrates its effectiveness with different diffusion models like ADM [dhariwal2021diffusion], LDM [rombach2022high], DiT [peebles2023scalable], and a masked generative model like MAGE [li2023mage]. The image generator is trained to reconstruct or denoise an image conditioned on its self-supervised representation. During inference, it takes a representation sampled from the RDM and generates an image.
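The following is a minimal sketch of one RDM residual block, assuming a PyTorch-style implementation; the exact layer sizes and the way the timestep embedding enters each block are illustrative choices, not the authors' released code.

```python
import torch
import torch.nn as nn

class RDMBlock(nn.Module):
    """One fully-connected residual block: LayerNorm, SiLU, linear layers,
    with a timestep embedding injected into the block (assumed additive)."""
    def __init__(self, hidden_dim: int = 1536, time_dim: int = 256):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Project the timestep embedding to the hidden width (assumption).
        self.time_proj = nn.Linear(time_dim, hidden_dim)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # Residual connection around the normalized, time-conditioned MLP.
        h = self.norm(x) + self.time_proj(t_emb)
        return x + self.mlp(h)
```

A full RDM would stack around 12 such blocks between an input and output projection on the 256-dimensional representation.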

Training Process:

The framework involves training the representation generator and the image generator. The self-supervised encoder is pre-trained and kept frozen.

  • RDM Training: Images are passed through the frozen encoder to obtain representations, which are corrupted with noise according to a diffusion process. The RDM is trained to predict the original representation (or the added noise) from the noisy representation, conditioned on the diffusion timestep; a minimal training-step sketch follows this list.
  • Image Generator Training: Images are processed by the fixed encoder to obtain representations. The image generator is trained to reconstruct the image from a corrupted version (e.g., masked or noisy), conditioned on the representation derived from the same image. For MAGE, this involves reconstructing masked tokens conditioned on the representation, which replaces a learned class token.
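Here is a minimal sketch of one RDM training step under a standard epsilon-prediction objective. The names `encoder`, `rdm`, and `alphas_cumprod` are assumed to exist, and the loss target (predicting noise rather than the clean representation) follows common DDPM-style practice rather than the authors' exact recipe.

```python
import torch
import torch.nn.functional as F

def rdm_training_step(images, encoder, rdm, alphas_cumprod, num_timesteps=1000):
    with torch.no_grad():
        rep = encoder(images)  # frozen self-supervised encoder
        # Normalize each image's representation by its own mean and std.
        rep = (rep - rep.mean(dim=-1, keepdim=True)) / rep.std(dim=-1, keepdim=True)

    # Sample a random diffusion timestep per example and corrupt the representation.
    t = torch.randint(0, num_timesteps, (rep.size(0),), device=rep.device)
    noise = torch.randn_like(rep)
    a_bar = alphas_cumprod[t].unsqueeze(-1)  # cumulative noise schedule, shape (B, 1)
    noisy_rep = a_bar.sqrt() * rep + (1 - a_bar).sqrt() * noise

    # The RDM predicts the noise from the noisy representation and the timestep.
    pred_noise = rdm(noisy_rep, t)
    return F.mse_loss(pred_noise, noise)
```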

Inference Process:

  1. Sample a representation from the trained RDM.
  2. Feed this representation to the trained image generator.
  3. The image generator produces an image conditioned on this representation (a minimal end-to-end sketch follows).
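A minimal sketch of this two-step sampling path is below; `rdm_sample` and `image_generator` stand in for the trained representation diffusion model (with DDIM sampling) and the representation-conditioned image generator, whichever backbone is used.

```python
import torch

@torch.no_grad()
def rcg_sample(rdm_sample, image_generator, batch_size=16, rep_dim=256, device="cuda"):
    # Step 1: sample semantic representations from pure noise with the RDM.
    noise = torch.randn(batch_size, rep_dim, device=device)
    reps = rdm_sample(noise)  # e.g., 250 DDIM steps

    # Step 2: generate images conditioned on the sampled representations
    # (e.g., MAGE-L iterative decoding, or ADM/LDM/DiT sampling).
    images = image_generator(reps)
    return images
```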

Implementation Details:

  • The self-supervised encoder is typically MoCo v3 ViT-B/L with a 256-dimensional projection head.
  • The RDM is a fully-connected network with ~63M parameters (12 blocks, 1536 hidden dimensions). It is trained with the AdamW optimizer for up to 200 epochs with a batch size of 512; DDIM sampling uses 250 steps.
  • The Image Generator (e.g., MAGE-L) is trained conditioned on the representation. Training can take up to 800 epochs with large batch sizes (e.g., 4096) and AdamW optimizer. The representation is used as a conditioning signal, e.g., replacing the class token in MAGE. Data augmentation includes resizing, random cropping, and horizontal flipping.
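For reference, the settings quoted above can be summarized as a configuration sketch; treat this as an illustrative restatement of the reported numbers, not a verified training script.

```python
# Illustrative hyperparameter summary compiled from the settings listed above.
rdm_config = dict(
    blocks=12, hidden_dim=1536, params="~63M",
    optimizer="AdamW", epochs=200, batch_size=512,
    ddim_sampling_steps=250,
)
image_generator_config = dict(
    model="MAGE-L",
    conditioning="self-supervised representation (replaces the class token)",
    optimizer="AdamW", epochs=800, batch_size=4096,
    augmentation=["resize", "random crop", "horizontal flip"],
)
```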

Key Results and Practical Implications:

  • Improved Unconditional Generation: RCG significantly boosts the performance of existing image generators in the unconditional setting on ImageNet 256x256. For example, it reduces the FID for LDM-8 from 39.13 to 11.30, ADM from 26.21 to 6.24, DiT-XL/2 from 27.32 to 4.89, and MAGE-L from 7.04 to 3.44 (without guidance) or 2.15 (with guidance). This demonstrates that self-supervised representations are effective conditioning signals even without labels.
  • State-of-the-Art: RCG achieves a new state-of-the-art FID of 2.15 for unconditional generation on ImageNet 256x256, surpassing previous methods by a large margin.
  • Bridging the Gap: RCG's unconditional performance is competitive with, and in some cases surpasses, leading class-conditional generation methods on ImageNet, effectively closing the historical performance gap.
  • Computational Efficiency: RCG can achieve impressive results with lower overall training costs compared to training powerful generative models directly for unconditional generation. The lightweight RDM contributes minimally to the total training and inference cost.
  • Guidance: The representation conditioning allows for incorporating guidance mechanisms similar to classifier-free guidance used in conditional models. This further improves generation quality (e.g., FID drops from 3.44 to 2.15 for MAGE-L with guidance).
  • Class-Conditional Extension: RCG can be easily extended to class-conditional generation by training a conditional RDM that incorporates class embeddings. This allows specifying the class of the generated image without retraining the image generator, enabling efficient adaptation to labeled tasks.
  • Semantic Control: Representations provide a semantically smooth space for controlling generation; interpolating between the representations of two images yields generated images that transition smoothly between their semantics (see the interpolation sketch after this list).
  • Applicability Beyond Images: The approach of modeling representation distributions could potentially be extended to other modalities where human annotation is difficult or impossible.
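As a sketch of the semantic-control point above, representations of two images can be blended before conditioning the image generator. Linear interpolation is assumed here for simplicity; the paper's exact interpolation scheme may differ.

```python
import torch

@torch.no_grad()
def interpolate_generate(img_a, img_b, encoder, image_generator, steps=8):
    # Encode both images with the frozen self-supervised encoder.
    rep_a, rep_b = encoder(img_a), encoder(img_b)
    outputs = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        # Blend the two semantic representations and generate from the mixture.
        rep = (1 - alpha) * rep_a + alpha * rep_b
        outputs.append(image_generator(rep))
    return outputs
```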

Potential Limitations/Failure Cases:

Like other generative models trained on ImageNet, RCG can still face difficulties generating certain types of content, such as legible text, highly regular geometric shapes (e.g., keyboards, wheels), and realistic human faces, which are common failure modes.

In summary, RCG provides a practical and effective paradigm for high-quality unconditional image generation by leveraging self-supervised representations. It achieves state-of-the-art performance, bridges the gap with conditional models, is computationally efficient, and allows for guidance and easy extension to conditional tasks, opening up possibilities for leveraging large-scale unlabeled data for generative modeling.
