Unlocking Pre-trained Image Backbones for Semantic Image Synthesis (2312.13314v2)
Abstract: Semantic image synthesis, i.e., generating images from user-provided semantic label maps, is an important conditional image generation task because it allows control over both the content and the spatial layout of generated images. Although diffusion models have pushed the state of the art in generative image modeling, the iterative nature of their inference process makes them computationally demanding. Other approaches such as GANs are more efficient, as they need only a single feed-forward pass for generation, but their image quality tends to suffer on large and diverse datasets. In this work, we propose a new class of GAN discriminators for semantic image synthesis that enables the generation of highly realistic images by exploiting feature backbone networks pre-trained for tasks such as image classification. We also introduce a new generator architecture with better context modeling that uses cross-attention to inject noise into latent variables, leading to more diverse generated images. Our model, which we dub DP-SIMS, achieves state-of-the-art results in terms of image quality and consistency with the input label maps on ADE-20K, COCO-Stuff, and Cityscapes, surpassing recent diffusion models while requiring two orders of magnitude less compute for inference.
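To make the idea of injecting noise through cross-attention concrete, here is a minimal sketch in plain Python. It is not the DP-SIMS generator: the abstract does not specify the layer design, so the residual form, the single attention head, the projection matrices `w_q`, `w_k`, `w_v`, and all sizes below are assumptions chosen for illustration. Latent features at each spatial position act as queries, and a small set of random noise tokens supplies the keys and values.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_inject(latents, noise, w_q, w_k, w_v):
    """Hypothetical residual cross-attention: latents (N x d) attend to noise tokens (M x d)."""
    q = matmul(latents, w_q)  # queries from the latent features, N x d
    k = matmul(noise, w_k)    # keys from the noise tokens, M x d
    v = matmul(noise, w_v)    # values from the noise tokens, M x d
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of k, d x M
    scores = matmul(q, k_t)               # N x M similarity scores
    attn = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    update = matmul(attn, v)              # noise-dependent update, N x d
    # Residual connection: the label-conditioned latents are kept and perturbed.
    out = [[l + u for l, u in zip(lrow, urow)] for lrow, urow in zip(latents, update)]
    return out, attn


def random_matrix(rows, cols, rng, scale=1.0):
    """Matrix of independent Gaussian samples (illustrative initialisation only)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, n_positions, n_noise = 4, 6, 3  # assumed toy sizes
latents = random_matrix(n_positions, d, rng)
w_q, w_k, w_v = (random_matrix(d, d, rng, 0.5) for _ in range(3))

# Same latents, two different noise draws: the outputs differ.
out_a, attn_a = cross_attention_inject(latents, random_matrix(n_noise, d, rng), w_q, w_k, w_v)
out_b, attn_b = cross_attention_inject(latents, random_matrix(n_noise, d, rng), w_q, w_k, w_v)
```

In this sketch, each spatial position mixes the noise tokens with its own attention weights, so a single noise draw can affect different regions differently. This is one plausible way such a mechanism could yield varied outputs for a fixed label map.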
Authors: Tariq Berrada, Jakob Verbeek, Camille Couprie, Karteek Alahari