Generating images of rare concepts using pre-trained diffusion models (2304.14530v3)
Abstract: Text-to-image diffusion models can synthesize high-quality images, but they have various limitations. Here we highlight a common failure mode of these models: generating uncommon concepts and structured concepts, such as hand palms. We show that this limitation stems partly from the long-tail nature of their training data: web-crawled datasets are strongly unbalanced, causing models to under-represent concepts from the tail of the distribution. We characterize the effect of unbalanced training data on text-to-image models and offer a remedy. We show that rare concepts can be generated correctly by carefully selecting suitable generation seeds in the noise space, guided by a small reference set of images, a technique we call SeedSelect. SeedSelect requires no retraining or fine-tuning of the diffusion model. We assess the faithfulness, quality, and diversity of SeedSelect in creating rare objects and in generating complex formations such as hand images, and find that it consistently achieves superior performance. We further show the advantage of SeedSelect for semantic data augmentation: generating semantically appropriate images improves performance on few-shot recognition benchmarks, for classes from both the head and the tail of the diffusion model's training data.
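To make the core idea concrete, below is a minimal, illustrative sketch of choosing a generation seed with the help of a small reference set. Note that SeedSelect itself optimizes the seed in the noise space; this simplified stand-in only scores a handful of candidate seeds by CLIP image similarity to the references. The model checkpoints, the prompt, the reference image paths, and the size of the candidate pool are all assumptions for illustration, not the authors' exact procedure.

```python
# Sketch: pick the diffusion seed whose output best matches a few reference images.
# Assumes the `diffusers` and `transformers` libraries; reference paths are hypothetical.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompt = "a photo of an oxygen mask"                      # example tail concept
reference_paths = ["ref1.jpg", "ref2.jpg", "ref3.jpg"]    # hypothetical reference set

@torch.no_grad()
def clip_embed(images):
    # Encode a list of PIL images into L2-normalized CLIP image features.
    inputs = proc(images=images, return_tensors="pt").to(device)
    feats = clip.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

# Mean direction of the reference embeddings, re-normalized for cosine similarity.
ref_feats = torch.nn.functional.normalize(
    clip_embed([Image.open(p) for p in reference_paths]).mean(0, keepdim=True), dim=-1
)

best_seed, best_score = None, -1.0
for seed in range(8):                                     # small candidate pool (assumption)
    gen = torch.Generator(device).manual_seed(seed)
    image = pipe(prompt, generator=gen, num_inference_steps=30).images[0]
    score = (clip_embed([image]) @ ref_feats.T).item()    # similarity to reference mean
    if score > best_score:
        best_seed, best_score = seed, score

print(f"best seed: {best_seed} (similarity {best_score:.3f})")
```

In the paper's setting the seed is refined iteratively rather than chosen from a fixed pool, but the sketch captures the key point: the pre-trained model is left untouched, and only the starting noise is adapted using a few reference images.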
Authors: Dvir Samuel, Rami Ben-Ari, Simon Raviv, Nir Darshan, Gal Chechik