Diff-Plugin: Revitalizing Details for Diffusion-based Low-level Tasks (2403.00644v4)

Published 1 Mar 2024 in cs.CV

Abstract: Diffusion models trained on large-scale datasets have achieved remarkable progress in image synthesis. However, due to the randomness in the diffusion process, they often struggle with handling diverse low-level tasks that require detail preservation. To overcome this limitation, we present a new Diff-Plugin framework to enable a single pre-trained diffusion model to generate high-fidelity results across a variety of low-level tasks. Specifically, we first propose a lightweight Task-Plugin module with a dual-branch design to provide task-specific priors, guiding the diffusion process in preserving image content. We then propose a Plugin-Selector that can automatically select different Task-Plugins based on the text instruction, allowing users to edit images by indicating multiple low-level tasks with natural language. We conduct extensive experiments on 8 low-level vision tasks. The results demonstrate the superiority of Diff-Plugin over existing methods, particularly in real-world scenarios. Our ablations further validate that Diff-Plugin is stable, schedulable, and supports robust training across different dataset sizes.

Summary

  • The paper introduces a Diff-Plugin framework that injects task-specific priors into pre-trained diffusion models for enhanced low-level vision performance.
  • It pairs a dual-branch Task-Plugin module, which supplies task-specific priors and preserves spatial details, with a contrastive learning-based Plugin-Selector that chooses plugins from natural language instructions.
  • Empirical evaluations across eight low-level vision tasks show improved fidelity over existing methods, particularly in real-world scenarios, along with stable training across different dataset sizes.

Enhancing Low-level Vision Tasks with Diff-Plugin: A Novel Framework for Pre-trained Diffusion Models

Introduction

The advent of diffusion models has opened a new chapter in image synthesis, owing to their ability to generate high-fidelity images. Trained on extensive datasets, these models capture a broad range of visual attributes and have been adapted to a myriad of downstream tasks. Their application to low-level vision tasks, however, has been hampered by the inherent randomness of the diffusion process, which often distorts image content. To address this challenge, we introduce the Diff-Plugin framework, designed to enable a single pre-trained diffusion model to excel across diverse low-level vision tasks without sacrificing its generative capabilities.

Key Contributions

  • Diff-Plugin Framework: At the core of our approach is the Diff-Plugin framework, which attaches to pre-trained diffusion models to improve their performance on low-level vision tasks. Through a Task-Plugin module and a Plugin-Selector, Diff-Plugin injects task-specific priors and supports user-driven task selection through natural language inputs.
  • Task-Plugin Module: This lightweight, dual-branch module extracts task-specific priors to guide the diffusion process. It comprises a Task-Prompt Branch (TPB) for distilling task-guidance information and a Spatial Complement Branch (SCB) for preserving spatial details, thus ensuring content fidelity; a minimal code sketch follows this list.
  • Plugin-Selector: The Plugin-Selector enables dynamic selection of Task-Plugins from textual instructions. It uses a contrastive learning approach to align visual embeddings with task-specific text inputs, making the framework robust and versatile; a selection sketch also follows this list.
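
To make the dual-branch design concrete, here is a minimal PyTorch sketch of what a Task-Plugin could look like. All class names, channel sizes, and token counts are illustrative assumptions, not the authors' reference implementation, whose architecture and injection points into the diffusion U-Net differ.

```python
# A minimal sketch of a dual-branch Task-Plugin. All names and sizes
# here are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class TaskPromptBranch(nn.Module):
    """TPB: distills a compact task-guidance embedding from an image feature."""

    def __init__(self, in_dim=768, prompt_dim=768, n_tokens=4):
        super().__init__()
        self.n_tokens = n_tokens
        self.prompt_dim = prompt_dim
        self.proj = nn.Sequential(
            nn.Linear(in_dim, prompt_dim),
            nn.GELU(),
            nn.Linear(prompt_dim, prompt_dim * n_tokens),
        )

    def forward(self, image_feat):  # image_feat: (B, in_dim) pooled feature
        p = self.proj(image_feat)
        # A few prompt tokens to condition the denoiser's cross-attention.
        return p.view(-1, self.n_tokens, self.prompt_dim)


class SpatialComplementBranch(nn.Module):
    """SCB: keeps a spatial feature map so fine details survive sampling."""

    def __init__(self, in_ch=4, out_ch=320):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(in_ch, 64, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(64, out_ch, kernel_size=3, padding=1),
        )

    def forward(self, latent):  # latent: (B, in_ch, H, W) image latent
        return self.encode(latent)  # spatial prior: (B, out_ch, H, W)


class TaskPlugin(nn.Module):
    """Lightweight plugin supplying both priors to a frozen diffusion model."""

    def __init__(self):
        super().__init__()
        self.tpb = TaskPromptBranch()
        self.scb = SpatialComplementBranch()

    def forward(self, image_feat, latent):
        return self.tpb(image_feat), self.scb(latent)
```

In a design like this, the TPB output would join the cross-attention conditioning of the frozen U-Net, while the SCB map would be fused into an early feature level, adapter-style, so that spatial detail bypasses the noisy sampling trajectory.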
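The Plugin-Selector can likewise be pictured as similarity-based routing over learned plugin embeddings. The sketch below assumes CLIP-style embeddings and a score threshold for multi-task instructions; the encoder choice, threshold, and function names are assumptions, since the summary only states that contrastive learning aligns visual embeddings with task-specific texts.

```python
# A minimal sketch of text-driven plugin selection. Encoder choice,
# threshold, and names are assumptions; only the contrastive alignment
# of embeddings is specified by the summary above.
import torch
import torch.nn.functional as F


def select_plugins(text_emb, plugin_embs, plugin_names, threshold=0.25):
    """Return every Task-Plugin whose learned embedding aligns with the
    user's instruction; a single sentence may request several tasks.

    text_emb:    (D,) embedding of the instruction, e.g. from a CLIP text encoder
    plugin_embs: (N, D) learned embeddings, one per Task-Plugin
    """
    sims = F.cosine_similarity(text_emb.unsqueeze(0), plugin_embs, dim=-1)
    chosen = [n for n, s in zip(plugin_names, sims.tolist()) if s > threshold]
    return chosen, sims


# Example: "remove the rain and brighten the photo" should score highest
# against the deraining and low-light-enhancement plugins.
names = ["derain", "dehaze", "lowlight", "desnow"]
text_emb = torch.randn(512)                 # stand-in for a real text embedding
plugin_embs = torch.randn(len(names), 512)  # stand-ins for learned embeddings
print(select_plugins(text_emb, plugin_embs, names))
```

A contrastive, InfoNCE-style training objective would pull each plugin's embedding toward instructions describing its task and push it away from the others, which is what makes this thresholded routing reliable.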

Theoretical and Practical Implications

The Diff-Plugin framework advances diffusion-based low-level vision, demonstrating consistent improvements over existing methods. By retaining the generative capacity of pre-trained diffusion models while preserving high-fidelity detail, our approach strengthens task-specific image synthesis. Moreover, the ability to harness textual instructions for task selection opens up new avenues for intuitive, user-centric image editing.

Notably, our methodology scales and adapts across datasets of different sizes, and it remains effective in real-world scenarios. This marks a step toward generalized models that can handle a wide array of low-level vision tasks efficiently.

Future Outlook

While our framework marks a significant stride in applying diffusion models to low-level vision tasks, it also paves the way for further exploration. One area of potential development is locality-sensitive editing, enabling precise manipulations within specific image regions. Additionally, integrating LLMs could further refine the interaction between text-driven task specifications and visual output generation, improving both the accuracy and the user experience of model-guided image editing.

Conclusion

In essence, the Diff-Plugin framework represents a notable step for generative models in meeting the detail-preservation requirements of low-level vision tasks. By combining the generative strength of diffusion models with task-specific detail preservation and intuitive text-based task selection, our approach broadens the applicational scope of these models and enriches the landscape of image synthesis research. As this framework is explored and refined further, it promises continued progress in image editing and synthesis for visual media.
