
Scaling Concept With Text-Guided Diffusion Models (2410.24151v1)

Published 31 Oct 2024 in cs.CV and cs.CL

Abstract: Text-guided diffusion models have revolutionized generative tasks by producing high-fidelity content from text descriptions. They have also enabled an editing paradigm where concepts can be replaced through text conditioning (e.g., a dog to a tiger). In this work, we explore a novel approach: instead of replacing a concept, can we enhance or suppress the concept itself? Through an empirical study, we identify a trend where concepts can be decomposed in text-guided diffusion models. Leveraging this insight, we introduce ScalingConcept, a simple yet effective method to scale decomposed concepts up or down in real inputs without introducing new elements. To systematically evaluate our approach, we present the WeakConcept-10 dataset, where concepts are imperfect and need to be enhanced. More importantly, ScalingConcept enables a variety of novel zero-shot applications across image and audio domains, including tasks such as canonical pose generation and generative sound highlighting or removal.
