
Controlling Vision-Language Models for Multi-Task Image Restoration (2310.01018v2)

Published 2 Oct 2023 in cs.CV

Abstract: Vision-language models such as CLIP have shown great impact on diverse downstream tasks for zero-shot or label-free predictions. However, when it comes to low-level vision such as image restoration, their performance deteriorates dramatically due to corrupted inputs. In this paper, we present a degradation-aware vision-language model (DA-CLIP) to better transfer pretrained vision-language models to low-level vision tasks as a multi-task framework for image restoration. More specifically, DA-CLIP trains an additional controller that adapts the fixed CLIP image encoder to predict high-quality feature embeddings. By integrating the embedding into an image restoration network via cross-attention, we are able to pilot the model to learn a high-fidelity image reconstruction. The controller itself will also output a degradation feature that matches the real corruptions of the input, yielding a natural classifier for different degradation types. In addition, we construct a mixed degradation dataset with synthetic captions for DA-CLIP training. Our approach advances state-of-the-art performance on both degradation-specific and unified image restoration tasks, showing a promising direction of prompting image restoration with large-scale pretrained vision-language models. Our code is available at https://github.com/Algolzw/daclip-uir.

References (57)
  1. NTIRE 2017 challenge on single image super-resolution: dataset and study. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp.  126–135, 2017.
  2. InstructPix2Pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  18392–18402, 2023.
  3. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  4. Gated context aggregation network for image dehazing and deraining. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp.  1375–1383. IEEE, 2019.
  5. Simple baselines for image restoration. In European Conference on Computer Vision, pp.  17–33. Springer, 2022.
  6. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  7. Open-vocabulary object detection via vision and language knowledge distillation. In International Conference on Learning Representations, 2021.
  8. GANs trained by a two time-scale update rule converge to a local nash equilibrium. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), volume 30, 2017.
  9. Denoising diffusion probabilistic models. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), volume 33, pp.  6840–6851, 2020.
  10. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pp.  4904–4916. PMLR, 2021.
  11. EnlightenGAN: Deep light enhancement without paired supervision. IEEE Transactions on Image Processing, 30:2340–2349, 2021.
  12. DeblurGAN: Blind motion deblurring using conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.  8183–8192, 2018.
  13. DeblurGAN-v2: deblurring (orders-of-magnitude) faster and better. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp.  8878–8887, 2019.
  14. All-in-one image restoration for unknown corruption. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  17452–17462, 2022a.
  15. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pp.  12888–12900. PMLR, 2022b.
  16. Single image deraining: A comprehensive benchmark analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  3838–3847, 2019.
  17. DiffBIR: Towards blind image restoration with generative diffusion prior. arXiv preprint arXiv:2308.15070, 2023.
  18. GridDehazeNet: Attention-based multi-scale network for image dehazing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  7314–7323, 2019.
  19. DesnowNet: Context-aware deep network for snow removal. IEEE Transactions on Image Processing, 27(6):3064–3073, 2018.
  20. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  21. RePaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  11461–11471, 2022.
  22. Image restoration with mean-reverting stochastic differential equations. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pp.  23045–23066. PMLR, 2023a.
  23. Refusion: Enabling large-size realistic image restoration with latent-space diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  1680–1691, 2023b.
  24. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the 18th IEEE International Conference on Computer Vision (ICCV), volume 2, pp.  416–423. IEEE, 2001.
  25. Deep multi-scale convolutional neural network for dynamic scene deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.  3883–3891, 2017.
  26. NUWA-LIP: Language-guided image inpainting with defect-free VQGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  14183–14192, 2023.
  27. PromptIR: Prompting for all-in-one blind image restoration. Advances in Neural Information Processing Systems (NeurIPS), 2023.
  28. Attentive generative adversarial network for raindrop removal from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  2482–2491, 2018.
  29. FFA-Net: Feature fusion attention network for single image dehazing. In Proceedings of the AAAI Conference on Artificial Intelligence, pp.  11908–11915, 2020.
  30. DeshadowNet: A multi-context embedding deep network for shadow removal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  4067–4075, 2017.
  31. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp.  8748–8763. PMLR, 2021.
  32. Progressive image deraining networks: a better and simpler baseline. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  3937–3946, 2019.
  33. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  10684–10695, 2022.
  34. LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
  35. H. Sheikh. LIVE image quality assessment database release 2. http://live.ece.utexas.edu/research/quality, 2005.
  36. Vision transformers for single image dehazing. IEEE Transactions on Image Processing, 32:1927–1941, 2023.
  37. Contrastive multiview coding. In European Conference on Computer Vision, pp.  776–794. Springer, 2020.
  38. NTIRE 2017 challenge on single image super-resolution: methods and results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp.  114–125, 2017.
  39. MAXIM: Multi-axis MLP for image processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  5769–5780, 2022.
  40. Exploiting diffusion prior for real-world image super-resolution. arXiv preprint arXiv:2305.07015, 2023.
  41. Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  1905–1914, 2021.
  42. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
  43. Deep retinex decomposition for low-light enhancement. arXiv preprint arXiv:1808.04560, 2018.
  44. URetinex-Net: Retinex-based deep unfolding network for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  5901–5910, 2022.
  45. Finding discriminative filters for specific degradations in blind super-resolution. Advances in Neural Information Processing Systems, 34:51–61, 2021.
  46. Deep joint rain detection and removal from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  1357–1366, 2017.
  47. Joint rain detection and removal from a single image with contextualized deep networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(6):1377–1393, 2019.
  48. Learning enriched features for real image restoration and enhancement. In Proceedings of the 16th European Conference on Computer Vision, pp.  492–511. Springer, 2020.
  49. Multi-stage progressive image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  14821–14831, 2021.
  50. Restormer: efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  5728–5739, 2022.
  51. Vision-language models for vision tasks: A survey. arXiv preprint arXiv:2304.00685, 2023.
  52. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing, 26(7):3142–3155, 2017.
  53. Designing a practical degradation model for deep blind image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  4791–4800, 2021.
  54. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023.
  55. Tip-Adapter: Training-free adaption of CLIP for few-shot classification. In European Conference on Computer Vision, pp.  493–510. Springer, 2022.
  56. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.  586–595, 2018.
  57. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022.

Summary

  • The paper presents DA-CLIP, a degradation-aware adaptation of CLIP that significantly improves multi-task image restoration performance.
  • It employs an auxiliary image controller and cross-attention mechanism to adjust encoder outputs for enhanced reconstruction fidelity.
  • Experimental results show superior performance over existing methods on both degradation-specific and unified image restoration tasks.

Overview of "Controlling Vision-Language Models for Multi-Task Image Restoration"

This paper addresses the challenge of adapting large-scale vision-language models (VLMs), such as CLIP, to multi-task image restoration. While VLMs have demonstrated remarkable capabilities on diverse high-level vision tasks, their application to low-level vision tasks like image restoration remains underexplored: performance deteriorates because the models struggle to handle corrupted image inputs.

Proposed Approach: DA-CLIP

The paper introduces a novel framework dubbed Degradation-Aware CLIP (DA-CLIP), designed to enhance image restoration capabilities by leveraging the pretrained strengths of VLMs. DA-CLIP integrates an additional controller with the existing CLIP architecture to adapt its image encoder. The main components and contributions include:

  1. Image Controller:
    • An auxiliary controller adapts the output of the frozen CLIP image encoder: it predicts a degradation embedding for the input and adjusts the encoder output into a high-quality content embedding (a minimal sketch follows this list).
  2. Cross-Attention Mechanism:
    • The high-quality content embedding is injected into the image restoration network via cross-attention, improving reconstruction fidelity (see the second sketch after this list).
  3. Degradation Feature Classifier:
    • The degradation embedding produced by the controller matches the real corruption of the input, acting as a natural classifier over degradation types.
  4. Dataset Construction:
    • A mixed degradation dataset with synthetic captions is curated for DA-CLIP training, encompassing diverse degradation types such as blur, noise, and compression, among others.
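
To make the controller and the degradation classifier concrete, here is a minimal PyTorch sketch of the idea. It is an illustration, not the authors' code (see the linked repository for that): the names are invented, the controller is treated as a black box, and the adjustment it applies is collapsed into a single additive correction to the encoder output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DegradationAwareEncoder(nn.Module):
    """Illustrative DA-CLIP-style encoder: frozen CLIP plus a trainable controller.

    The controller looks at the corrupted input and predicts (i) a degradation
    embedding describing the corruption and (ii) a correction that steers the
    frozen encoder's output toward a clean-image ("content") embedding.
    """

    def __init__(self, clip_image_encoder: nn.Module, controller: nn.Module):
        super().__init__()
        self.clip_encoder = clip_image_encoder
        for p in self.clip_encoder.parameters():
            p.requires_grad = False          # CLIP itself stays fixed
        self.controller = controller         # only this part is trained

    def forward(self, lq_image: torch.Tensor):
        degradation_emb, correction = self.controller(lq_image)
        with torch.no_grad():
            base_emb = self.clip_encoder(lq_image)  # features of the corrupted image
        content_emb = base_emb + correction         # adjusted high-quality features
        return content_emb, degradation_emb

def classify_degradation(degradation_emb: torch.Tensor,
                         text_emb: torch.Tensor) -> torch.Tensor:
    """Zero-shot degradation classification by cosine similarity.

    `text_emb` holds CLIP text embeddings of degradation names
    (e.g. "noisy", "blurry", "rainy"), shape (K, D).
    """
    sim = F.normalize(degradation_emb, dim=-1) @ F.normalize(text_emb, dim=-1).T
    return sim.argmax(dim=-1)  # index of the best-matching degradation type
```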
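
The cross-attention integration can be sketched the same way: spatial positions of a restoration network's feature map query the single content embedding, and the result is added back residually. This is a generic cross-attention block under assumed shapes, not the paper's exact module:

```python
import torch
import torch.nn as nn

class EmbeddingCrossAttention(nn.Module):
    """Injects a content embedding into a feature map via cross-attention.

    Feature-map positions act as queries; the single content embedding
    supplies the keys and values.
    """

    def __init__(self, channels: int, emb_dim: int, num_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(
            channels, num_heads, kdim=emb_dim, vdim=emb_dim, batch_first=True
        )

    def forward(self, feat: torch.Tensor, content_emb: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat.shape
        q = self.norm(feat.flatten(2).transpose(1, 2))  # (B, H*W, C) queries
        kv = content_emb.unsqueeze(1)                   # (B, 1, D) key/value
        out, _ = self.attn(q, kv, kv)                   # attend to the embedding
        return feat + out.transpose(1, 2).reshape(b, c, h, w)  # residual add
```

A U-Net-style restorer, such as the IR-SDE backbone the paper builds on, could place a block like this at each resolution level so that every stage is conditioned on the content embedding.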

Experimental Evaluation

The experimental evaluation of DA-CLIP demonstrates significant advances over existing methods on both degradation-specific and unified image restoration tasks. Key results indicate that integrating DA-CLIP into diffusion-based models such as IR-SDE improves performance across metrics including PSNR, SSIM, LPIPS, and FID (a small metric-computation example follows the list below).

  • Degradation-specific Tasks: DA-CLIP achieved superior perceptual and distortion scores across various datasets compared to state-of-the-art methods.
  • Unified Image Restoration: DA-CLIP effectively handled multiple degradation types within a single model, outperforming other unified restoration frameworks such as AirNet and PromptIR.
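
As a hedged illustration of the per-image metrics named above, the snippet below computes PSNR and SSIM with scikit-image and LPIPS with the `lpips` package; FID is a dataset-level statistic and is typically computed separately over the whole test set (e.g. with `pytorch-fid`).

```python
import numpy as np
import torch
import lpips  # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(restored: np.ndarray, target: np.ndarray) -> dict:
    """Per-image restoration metrics for HxWx3 uint8 arrays."""
    psnr = peak_signal_noise_ratio(target, restored, data_range=255)
    ssim = structural_similarity(target, restored, channel_axis=-1, data_range=255)
    # LPIPS expects float tensors in [-1, 1], shaped (N, 3, H, W).
    to_tensor = lambda a: (
        torch.from_numpy(a).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    )
    lpips_fn = lpips.LPIPS(net="alex")  # AlexNet-based perceptual distance
    lpips_val = lpips_fn(to_tensor(restored), to_tensor(target)).item()
    return {"PSNR": psnr, "SSIM": ssim, "LPIPS": lpips_val}
```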

Implications and Future Directions

The work offers substantial implications:

  • Practical Applications:
    • Leveraging VLMs in low-level tasks lets restoration networks draw on the semantic understanding of large pretrained models, yielding markedly better reconstructions.
  • Theoretical Insights:
    • The paper establishes the merit of cross-modal embeddings in tasks beyond classification, offering a versatile multimodal approach to a range of image processing challenges.

Future Prospects in AI

Future research could explore:

  • Applying this framework to real-world, mixed degradation scenarios to assess robustness and effectiveness.
  • Investigating the integration of alternative vision-LLMs to further validate and possibly surpass the results obtained with CLIP.
  • Enhancing the prompt learning module to refine multi-task performance even further.

In conclusion, the paper presents a compelling framework that bridges high-level and low-level vision tasks using VLMs and sets a strong foundation for future advances in multi-task image restoration.
