Return of Unconditional Generation: A Self-supervised Representation Generation Method (2312.03701v4)

Published 6 Dec 2023 in cs.CV

Abstract: Unconditional generation -- the problem of modeling data distribution without relying on human-annotated labels -- is a long-standing and fundamental challenge in generative models, creating a potential of learning from large-scale unlabeled data. In the literature, the generation quality of an unconditional method has been much worse than that of its conditional counterpart. This gap can be attributed to the lack of semantic information provided by labels. In this work, we show that one can close this gap by generating semantic representations in the representation space produced by a self-supervised encoder. These representations can be used to condition the image generator. This framework, called Representation-Conditioned Generation (RCG), provides an effective solution to the unconditional generation problem without using labels. Through comprehensive experiments, we observe that RCG significantly improves unconditional generation quality: e.g., it achieves a new state-of-the-art FID of 2.15 on ImageNet 256x256, largely reducing the previous best of 5.91 by a relative 64%. Our unconditional results are situated in the same tier as the leading class-conditional ones. We hope these encouraging observations will attract the community's attention to the fundamental problem of unconditional generation. Code is available at https://github.com/LTH14/rcg.

References (65)
  1. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  2. All are worth words: a vit backbone for score-based diffusion models. In NeurIPS 2022 Workshop on Score-Based Methods, 2022.
  3. Why are conditional generative models better than unconditional ones? arXiv preprint arXiv:2212.00362, 2022.
  4. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.
  5. High fidelity visualization of what your self-supervised representation knows about. arXiv preprint arXiv:2112.09164, 2021.
  6. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
  7. Large scale GAN training for high fidelity natural image synthesis. In Int. Conf. on Learning Representations (ICLR), 2019.
  8. Deep clustering for unsupervised learning of visual features. In Proceedings of the European conference on computer vision (ECCV), pages 132–149, 2018.
  9. Emerging properties in self-supervised vision transformers. In Int. Conference on Computer Vision (ICCV), pages 9650–9660, 2021.
  10. Instance-conditioned gan. Advances in Neural Information Processing Systems, 34:27517–27529, 2021.
  11. Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704, 2023.
  12. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11315–11325, 2022.
  13. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning (ICML), pages 1597–1607. PMLR, 2020.
  14. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
  15. Big self-supervised models are strong semi-supervised learners. Advances in Neural Information Processing Systems, 33, 2020.
  16. An empirical study of training self-supervised vision transformers. In Int. Conference on Computer Vision (ICCV), pages 9640–9649, 2021.
  17. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
  18. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
  19. Large scale adversarial representation learning. Advances in neural information processing systems, 32, 2019.
  20. An image is worth 16x16 words: Transformers for image recognition at scale. In Int. Conf. on Learning Representations (ICLR), 2021.
  21. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural networks, 107:3–11, 2018.
  22. Masked diffusion transformer is a strong image synthesizer. arXiv preprint arXiv:2303.14389, 2023.
  23. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.
  24. Generative adversarial nets. In Advances in Neural Information Processing Systems (NeurIPS), 2014.
  25. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733, 2020.
  26. Masked autoencoders are scalable vision learners. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 16000–16009, June 2022.
  27. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.
  28. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
  29. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  30. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
  31. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, pages 6840–6851, 2020.
  32. Cascaded diffusion models for high fidelity image generation. The Journal of Machine Learning Research, 23(1):2249–2281, 2022.
  33. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  34. Self-guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18413–18422, June 2023.
  35. Contrastive masked autoencoders are stronger vision learners. arXiv:2207.13532v1, 2022.
  36. A style-based generator architecture for generative adversarial networks. In CVPR, 2019.
  37. Autoregressive image generation using residual quantization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  38. Targeted supervised contrastive learning for long-tailed recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6918–6928, 2022.
  39. Mage: Masked generative encoder to unify representation learning and image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2142–2152, 2023.
  40. Diverse image generation via self-conditioned gans. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14286–14295, 2020.
  41. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
  42. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  43. High-fidelity image generation with fewer labels. In International conference on machine learning, pages 4183–4192. PMLR, 2019.
  44. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pages 69–84. Springer, 2016.
  45. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  46. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2536–2544, 2016.
  47. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
  48. Beit v2: Masked image modeling with vector-quantized visual tokenizers. arXiv preprint arXiv:2208.06366, 2022.
  49. Diffusion autoencoders: Toward a meaningful and decodable representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10619–10629, 2022.
  50. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  51. Generating diverse high-fidelity images with vq-vae-2. Advances in neural information processing systems, 32, 2019.
  52. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
  53. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016.
  54. Dave Salvator. Nvidia developer blog. https://developer.nvidia.com/blog/getting-immediate-speedups-with-a100-tf32, 2020.
  55. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  56. Score-based generative modeling through stochastic differential equations. In Int. Conf. on Learning Representations (ICLR), 2021.
  57. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
  58. Addp: Learning general representations for image recognition and generation with alternating denoising diffusion process. arXiv preprint arXiv:2306.05423, 2023.
  59. Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pages 10347–10357. PMLR, 2021.
  60. Neural discrete representation learning. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
  61. Vector-quantized image modeling with improved vqgan. arXiv preprint arXiv:2110.04627, 2021.
  62. Self-attention generative adversarial networks. In Int. Conference on Machine Learning (ICML), pages 7354–7363, 2019.
  63. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
  64. Through-wall human pose estimation using radio signals. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7356–7365, 2018.
  65. ibot: Image bert pre-training with online tokenizer. arXiv preprint arXiv:2111.07832, 2021.

Summary

  • The paper presents RCG, a framework that leverages a self-supervised image encoder and a diffusion-based representation generator to bolster unconditional image generation.
  • It achieves significant FID improvements on ImageNet 256x256 (e.g., reducing LDM-8 from 39.13 to 11.30) and matches or surpasses leading class-conditional methods.
  • The framework’s modular design enables efficient training, low computational overhead, and seamless extension to conditional generation tasks.

The paper "Return of Unconditional Generation: A Self-supervised Representation Generation Method" (2312.03701) introduces Representation-Conditioned Generation (RCG), a framework designed to address the significant performance gap between unconditional and conditional image generation, particularly on complex datasets like ImageNet. The core idea is to leverage the rich semantic information captured by self-supervised representations without relying on human-annotated labels.

RCG decomposes the challenging task of unconditional image generation into two more manageable steps:

  1. Generate a semantic representation in a low-dimensional, compact representation space.
  2. Generate the image conditioned on this generated representation.

This framework consists of three key components:

  1. Pre-trained Self-supervised Image Encoder: An off-the-shelf encoder (e.g., MoCo v3 [chen2021empirical] pre-trained on ImageNet) maps images to a representation space and is kept frozen throughout RCG training. The representations should be semantically rich yet low-dimensional enough for their distribution to be modeled easily. The paper uses Vision Transformers (ViT) of various sizes (S, B, L) pre-trained with contrastive methods or even supervised learning. The output dimensionality of the projection head matters, with 256 dimensions performing well in experiments; each image's representation is normalized by its own mean and variance.
  2. Representation Generator: This component learns to sample from the distribution of self-supervised representations. The paper implements it as a lightweight diffusion model, the Representation Diffusion Model (RDM). The RDM backbone is a fully-connected network composed of residual blocks, each containing LayerNorm [ba2016layer], SiLU [elfwing2018sigmoid], and linear layers, with timestep embeddings incorporated into every block (a minimal block sketch follows this list). The RDM is trained following the DDIM [song2020denoising] protocol to denoise representations mixed with Gaussian noise; at inference, it samples representations from noise via DDIM sampling. Because it is a fully-connected network operating on low-dimensional representations, the RDM adds marginal computational overhead compared to the image generator. Ablations show that performance benefits from increased depth and width up to a point (12 blocks and 1536 hidden dimensions work well) and from sufficient training epochs.
  3. Image Generator: This component takes a representation (generated by the RDM or extracted from a real image) as conditioning information and generates an image. The RCG framework is flexible and can utilize various existing conditional image generative models. The paper demonstrates its effectiveness with different diffusion models like ADM [dhariwal2021diffusion], LDM [rombach2022high], DiT [peebles2023scalable], and a masked generative model like MAGE [li2023mage]. The image generator is trained to reconstruct or denoise an image conditioned on its self-supervised representation. During inference, it takes a representation sampled from the RDM and generates an image.
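The following is a minimal sketch of one RDM residual block, assuming a PyTorch-style implementation; the exact layer sizes and the way the timestep embedding enters each block are illustrative choices, not the authors' released code.

```python
import torch
import torch.nn as nn

class RDMBlock(nn.Module):
    """One fully-connected residual block: LayerNorm, SiLU, linear layers,
    with a timestep embedding injected into the block (assumed additive)."""
    def __init__(self, hidden_dim: int = 1536, time_dim: int = 256):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Project the timestep embedding to the hidden width (assumption).
        self.time_proj = nn.Linear(time_dim, hidden_dim)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # Residual connection around the normalized, time-conditioned MLP.
        h = self.norm(x) + self.time_proj(t_emb)
        return x + self.mlp(h)
```

A full RDM would stack around 12 such blocks between an input and output projection on the 256-dimensional representation.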

Training Process:

The framework involves training the representation generator and the image generator. The self-supervised encoder is pre-trained and kept frozen.

  • RDM Training: Images are passed through the frozen encoder to obtain representations, which are corrupted with noise according to a diffusion process. The RDM is trained to predict the original representation (or the added noise) from the noisy representation, conditioned on the diffusion timestep; a minimal training-step sketch follows this list.
  • Image Generator Training: Images are processed by the fixed encoder to obtain representations. The image generator is trained to reconstruct the image from a corrupted version (e.g., masked or noisy), conditioned on the representation derived from the same image. For MAGE, this involves reconstructing masked tokens conditioned on the representation, which replaces a learned class token.
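Here is a minimal sketch of one RDM training step under a standard epsilon-prediction objective. The names `encoder`, `rdm`, and `alphas_cumprod` are assumed to exist, and the loss target (predicting noise rather than the clean representation) follows common DDPM-style practice rather than the authors' exact recipe.

```python
import torch
import torch.nn.functional as F

def rdm_training_step(images, encoder, rdm, alphas_cumprod, num_timesteps=1000):
    with torch.no_grad():
        rep = encoder(images)  # frozen self-supervised encoder
        # Normalize each image's representation by its own mean and std.
        rep = (rep - rep.mean(dim=-1, keepdim=True)) / rep.std(dim=-1, keepdim=True)

    # Sample a random diffusion timestep per example and corrupt the representation.
    t = torch.randint(0, num_timesteps, (rep.size(0),), device=rep.device)
    noise = torch.randn_like(rep)
    a_bar = alphas_cumprod[t].unsqueeze(-1)  # cumulative noise schedule, shape (B, 1)
    noisy_rep = a_bar.sqrt() * rep + (1 - a_bar).sqrt() * noise

    # The RDM predicts the noise from the noisy representation and the timestep.
    pred_noise = rdm(noisy_rep, t)
    return F.mse_loss(pred_noise, noise)
```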

Inference Process:

  1. Sample a representation from the trained RDM.
  2. Feed this representation to the trained image generator.
  3. The image generator produces an image conditioned on this representation (a minimal end-to-end sketch follows).
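A minimal sketch of this two-step sampling path is below; `rdm_sample` and `image_generator` stand in for the trained representation diffusion model (with DDIM sampling) and the representation-conditioned image generator, whichever backbone is used.

```python
import torch

@torch.no_grad()
def rcg_sample(rdm_sample, image_generator, batch_size=16, rep_dim=256, device="cuda"):
    # Step 1: sample semantic representations from pure noise with the RDM.
    noise = torch.randn(batch_size, rep_dim, device=device)
    reps = rdm_sample(noise)  # e.g., 250 DDIM steps

    # Step 2: generate images conditioned on the sampled representations
    # (e.g., MAGE-L iterative decoding, or ADM/LDM/DiT sampling).
    images = image_generator(reps)
    return images
```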

Implementation Details:

  • The self-supervised encoder is typically MoCo v3 ViT-B/L with a 256-dimensional projection head.
  • The RDM is a fully-connected network with ~63M parameters (12 blocks, 1536 hidden dimensions). It is trained with the AdamW optimizer for up to 200 epochs with a batch size of 512; DDIM sampling uses 250 steps.
  • The Image Generator (e.g., MAGE-L) is trained conditioned on the representation. Training can take up to 800 epochs with large batch sizes (e.g., 4096) and AdamW optimizer. The representation is used as a conditioning signal, e.g., replacing the class token in MAGE. Data augmentation includes resizing, random cropping, and horizontal flipping.
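For reference, the settings quoted above can be summarized as a configuration sketch; treat this as an illustrative restatement of the reported numbers, not a verified training script.

```python
# Illustrative hyperparameter summary compiled from the settings listed above.
rdm_config = dict(
    blocks=12, hidden_dim=1536, params="~63M",
    optimizer="AdamW", epochs=200, batch_size=512,
    ddim_sampling_steps=250,
)
image_generator_config = dict(
    model="MAGE-L",
    conditioning="self-supervised representation (replaces the class token)",
    optimizer="AdamW", epochs=800, batch_size=4096,
    augmentation=["resize", "random crop", "horizontal flip"],
)
```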

Key Results and Practical Implications:

  • Improved Unconditional Generation: RCG significantly boosts the performance of existing image generators in the unconditional setting on ImageNet 256x256. For example, it reduces the FID for LDM-8 from 39.13 to 11.30, ADM from 26.21 to 6.24, DiT-XL/2 from 27.32 to 4.89, and MAGE-L from 7.04 to 3.44 (without guidance) or 2.15 (with guidance). This demonstrates that self-supervised representations are effective conditioning signals even without labels.
  • State-of-the-Art: RCG achieves a new state-of-the-art FID of 2.15 for unconditional generation on ImageNet 256x256, surpassing previous methods by a large margin.
  • Bridging the Gap: RCG's unconditional performance is competitive with, and in some cases surpasses, leading class-conditional generation methods on ImageNet, effectively closing the historical performance gap.
  • Computational Efficiency: RCG can achieve impressive results with lower overall training costs compared to training powerful generative models directly for unconditional generation. The lightweight RDM contributes minimally to the total training and inference cost.
  • Guidance: The representation conditioning allows for incorporating guidance mechanisms similar to classifier-free guidance used in conditional models. This further improves generation quality (e.g., FID drops from 3.44 to 2.15 for MAGE-L with guidance).
  • Class-Conditional Extension: RCG can be easily extended to class-conditional generation by training a conditional RDM that incorporates class embeddings. This allows specifying the class of the generated image without retraining the image generator, enabling efficient adaptation to labeled tasks.
  • Semantic Control: Representations provide a semantically smooth space for controlling generation; interpolating between the representations of two images yields generated images that transition smoothly between their semantics (see the interpolation sketch after this list).
  • Applicability Beyond Images: The approach of modeling representation distributions could potentially be extended to other modalities where human annotation is difficult or impossible.
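As a sketch of the semantic-control point above, representations of two images can be blended before conditioning the image generator. Linear interpolation is assumed here for simplicity; the paper's exact interpolation scheme may differ.

```python
import torch

@torch.no_grad()
def interpolate_generate(img_a, img_b, encoder, image_generator, steps=8):
    # Encode both images with the frozen self-supervised encoder.
    rep_a, rep_b = encoder(img_a), encoder(img_b)
    outputs = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        # Blend the two semantic representations and generate from the mixture.
        rep = (1 - alpha) * rep_a + alpha * rep_b
        outputs.append(image_generator(rep))
    return outputs
```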

Potential Limitations/Failure Cases:

Like other generative models trained on ImageNet, RCG can still face difficulties generating certain types of content, such as legible text, highly regular geometric shapes (e.g., keyboards, wheels), and realistic human faces, which are common failure modes.

In summary, RCG provides a practical and effective paradigm for high-quality unconditional image generation by leveraging self-supervised representations. It achieves state-of-the-art performance, bridges the gap with conditional models, is computationally efficient, and allows for guidance and easy extension to conditional tasks, opening up possibilities for leveraging large-scale unlabeled data for generative modeling.
