
Controlling Vision-Language Models for Multi-Task Image Restoration (2310.01018v2)

Published 2 Oct 2023 in cs.CV

Abstract: Vision-language models such as CLIP have shown great impact on diverse downstream tasks for zero-shot or label-free predictions. However, when it comes to low-level vision such as image restoration, their performance deteriorates dramatically due to corrupted inputs. In this paper, we present a degradation-aware vision-language model (DA-CLIP) to better transfer pretrained vision-language models to low-level vision tasks as a multi-task framework for image restoration. More specifically, DA-CLIP trains an additional controller that adapts the fixed CLIP image encoder to predict high-quality feature embeddings. By integrating the embedding into an image restoration network via cross-attention, we are able to pilot the model to learn a high-fidelity image reconstruction. The controller itself will also output a degradation feature that matches the real corruptions of the input, yielding a natural classifier for different degradation types. In addition, we construct a mixed degradation dataset with synthetic captions for DA-CLIP training. Our approach advances state-of-the-art performance on both degradation-specific and unified image restoration tasks, showing a promising direction of prompting image restoration with large-scale pretrained vision-language models. Our code is available at https://github.com/Algolzw/daclip-uir.

References (57)
  1. NTIRE 2017 challenge on single image super-resolution: dataset and study. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp.  126–135, 2017.
  2. InstructPix2Pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  18392–18402, 2023.
  3. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  4. Gated context aggregation network for image dehazing and deraining. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp.  1375–1383. IEEE, 2019.
  5. Simple baselines for image restoration. In European Conference on Computer Vision, pp.  17–33. Springer, 2022.
  6. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  7. Open-vocabulary object detection via vision and language knowledge distillation. In International Conference on Learning Representations, 2021.
  8. GANs trained by a two time-scale update rule converge to a local nash equilibrium. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), volume 30, 2017.
  9. Denoising diffusion probabilistic models. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), volume 33, pp.  6840–6851, 2020.
  10. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pp.  4904–4916. PMLR, 2021.
  11. EnlightenGAN: Deep light enhancement without paired supervision. IEEE Transactions on Image Processing, 30:2340–2349, 2021.
  12. DeblurGAN: Blind motion deblurring using conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.  8183–8192, 2018.
  13. DeblurGAN-v2: deblurring (orders-of-magnitude) faster and better. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp.  8878–8887, 2019.
  14. All-in-one image restoration for unknown corruption. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  17452–17462, 2022a.
  15. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pp.  12888–12900. PMLR, 2022b.
  16. Single image deraining: A comprehensive benchmark analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  3838–3847, 2019.
  17. DiffBIR: Towards blind image restoration with generative diffusion prior. arXiv preprint arXiv:2308.15070, 2023.
  18. GridDehazeNet: Attention-based multi-scale network for image dehazing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  7314–7323, 2019.
  19. DesnowNet: Context-aware deep network for snow removal. IEEE Transactions on Image Processing, 27(6):3064–3073, 2018.
  20. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  21. RePaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  11461–11471, 2022.
  22. Image restoration with mean-reverting stochastic differential equations. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pp.  23045–23066. PMLR, 2023a.
  23. Refusion: Enabling large-size realistic image restoration with latent-space diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  1680–1691, 2023b.
  24. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the 18th IEEE International Conference on Computer Vision (ICCV), volume 2, pp.  416–423. IEEE, 2001.
  25. Deep multi-scale convolutional neural network for dynamic scene deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.  3883–3891, 2017.
  26. NUWA-LIP: Language-guided image inpainting with defect-free VQGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  14183–14192, 2023.
  27. PromptIR: Prompting for all-in-one blind image restoration. Advances in Neural Information Processing Systems (NeurIPS), 2023.
  28. Attentive generative adversarial network for raindrop removal from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  2482–2491, 2018.
  29. FFA-Net: Feature fusion attention network for single image dehazing. In Proceedings of the AAAI Conference on Artificial Intelligence, pp.  11908–11915, 2020.
  30. DeshadowNet: A multi-context embedding deep network for shadow removal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  4067–4075, 2017.
  31. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp.  8748–8763. PMLR, 2021.
  32. Progressive image deraining networks: a better and simpler baseline. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  3937–3946, 2019.
  33. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  10684–10695, 2022.
  34. LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
  35. H. Sheikh. LIVE image quality assessment database release 2. http://live.ece.utexas.edu/research/quality, 2005.
  36. Vision transformers for single image dehazing. IEEE Transactions on Image Processing, 32:1927–1941, 2023.
  37. Contrastive multiview coding. In European Conference on Computer Vision, pp.  776–794. Springer, 2020.
  38. NTIRE 2017 challenge on single image super-resolution: methods and results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp.  114–125, 2017.
  39. MAXIM: Multi-axis MLP for image processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  5769–5780, 2022.
  40. Exploiting diffusion prior for real-world image super-resolution. arXiv preprint arXiv:2305.07015, 2023.
  41. Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  1905–1914, 2021.
  42. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
  43. Deep retinex decomposition for low-light enhancement. arXiv preprint arXiv:1808.04560, 2018.
  44. URetinex-Net: Retinex-based deep unfolding network for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  5901–5910, 2022.
  45. Finding discriminative filters for specific degradations in blind super-resolution. Advances in Neural Information Processing Systems, 34:51–61, 2021.
  46. Deep joint rain detection and removal from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  1357–1366, 2017.
  47. Joint rain detection and removal from a single image with contextualized deep networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(6):1377–1393, 2019.
  48. Learning enriched features for real image restoration and enhancement. In Proceedings of the 16th European Conference on Computer Vision, pp.  492–511. Springer, 2020.
  49. Multi-stage progressive image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  14821–14831, 2021.
  50. Restormer: efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  5728–5739, 2022.
  51. Vision-language models for vision tasks: A survey. arXiv preprint arXiv:2304.00685, 2023.
  52. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing, 26(7):3142–3155, 2017.
  53. Designing a practical degradation model for deep blind image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  4791–4800, 2021.
  54. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023.
  55. Tip-Adapter: Training-free adaption of CLIP for few-shot classification. In European Conference on Computer Vision, pp.  493–510. Springer, 2022.
  56. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.  586–595, 2018.
  57. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022.

Summary

  • The paper presents DA-CLIP, a degradation-aware adaptation of CLIP that significantly improves multi-task image restoration performance.
  • It employs an auxiliary image controller and cross-attention mechanism to adjust encoder outputs for enhanced reconstruction fidelity.
  • Experimental results show superior performance over existing methods on both degradation-specific and unified image restoration tasks.

Overview of "Controlling Vision-Language Models for Multi-Task Image Restoration"

This paper addresses the challenge of adapting large-scale vision-language models (VLMs), such as CLIP, to multi-task image restoration. While VLMs have demonstrated remarkable capabilities on diverse high-level vision tasks, their application to low-level vision tasks like image restoration remains underexplored: performance deteriorates because the models struggle to handle corrupted image inputs.

Proposed Approach: DA-CLIP

The paper introduces a novel framework dubbed Degradation-Aware CLIP (DA-CLIP), designed to enhance image restoration capabilities by leveraging the pretrained strengths of VLMs. DA-CLIP integrates an additional controller with the existing CLIP architecture to adapt its image encoder. The main components and contributions include:

  1. Image Controller:
    • An auxiliary controller adapts the output of the frozen CLIP image encoder: it predicts a degradation embedding for the input and adjusts the encoder output into a high-quality content embedding (a minimal sketch follows this list).
  2. Cross-Attention Mechanism:
    • The high-quality content embedding is injected into the image restoration network via cross-attention, improving reconstruction fidelity (see the second sketch after this list).
  3. Degradation Feature Classifier:
    • The degradation embedding produced by the controller matches the real corruption of the input, acting as a natural classifier over degradation types.
  4. Dataset Construction:
    • A mixed degradation dataset with synthetic captions is curated for DA-CLIP training, encompassing diverse degradation types such as blur, noise, and compression, among others.
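
To make the controller and the degradation classifier concrete, here is a minimal PyTorch sketch of the idea. It is an illustration, not the authors' code (see the linked repository for that): the names are invented, the controller is treated as a black box, and the adjustment it applies is collapsed into a single additive correction to the encoder output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DegradationAwareEncoder(nn.Module):
    """Illustrative DA-CLIP-style encoder: frozen CLIP plus a trainable controller.

    The controller looks at the corrupted input and predicts (i) a degradation
    embedding describing the corruption and (ii) a correction that steers the
    frozen encoder's output toward a clean-image ("content") embedding.
    """

    def __init__(self, clip_image_encoder: nn.Module, controller: nn.Module):
        super().__init__()
        self.clip_encoder = clip_image_encoder
        for p in self.clip_encoder.parameters():
            p.requires_grad = False          # CLIP itself stays fixed
        self.controller = controller         # only this part is trained

    def forward(self, lq_image: torch.Tensor):
        degradation_emb, correction = self.controller(lq_image)
        with torch.no_grad():
            base_emb = self.clip_encoder(lq_image)  # features of the corrupted image
        content_emb = base_emb + correction         # adjusted high-quality features
        return content_emb, degradation_emb

def classify_degradation(degradation_emb: torch.Tensor,
                         text_emb: torch.Tensor) -> torch.Tensor:
    """Zero-shot degradation classification by cosine similarity.

    `text_emb` holds CLIP text embeddings of degradation names
    (e.g. "noisy", "blurry", "rainy"), shape (K, D).
    """
    sim = F.normalize(degradation_emb, dim=-1) @ F.normalize(text_emb, dim=-1).T
    return sim.argmax(dim=-1)  # index of the best-matching degradation type
```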
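
The cross-attention integration can be sketched the same way: spatial positions of a restoration network's feature map query the single content embedding, and the result is added back residually. This is a generic cross-attention block under assumed shapes, not the paper's exact module:

```python
import torch
import torch.nn as nn

class EmbeddingCrossAttention(nn.Module):
    """Injects a content embedding into a feature map via cross-attention.

    Feature-map positions act as queries; the single content embedding
    supplies the keys and values.
    """

    def __init__(self, channels: int, emb_dim: int, num_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(
            channels, num_heads, kdim=emb_dim, vdim=emb_dim, batch_first=True
        )

    def forward(self, feat: torch.Tensor, content_emb: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat.shape
        q = self.norm(feat.flatten(2).transpose(1, 2))  # (B, H*W, C) queries
        kv = content_emb.unsqueeze(1)                   # (B, 1, D) key/value
        out, _ = self.attn(q, kv, kv)                   # attend to the embedding
        return feat + out.transpose(1, 2).reshape(b, c, h, w)  # residual add
```

A U-Net-style restorer, such as the IR-SDE backbone the paper builds on, could place a block like this at each resolution level so that every stage is conditioned on the content embedding.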

Experimental Evaluation

The experimental evaluation of DA-CLIP demonstrates significant advances over existing methods on both degradation-specific and unified image restoration tasks. Key results indicate that integrating DA-CLIP into diffusion-based models such as IR-SDE improves performance across metrics including PSNR, SSIM, LPIPS, and FID (a small metric-computation example follows the list below).

  • Degradation-specific Tasks: DA-CLIP achieved superior perceptual and distortion scores across various datasets compared to state-of-the-art methods.
  • Unified Image Restoration: DA-CLIP effectively handled multiple degradation types within a single model, outperforming other unified restoration frameworks such as AirNet and PromptIR.
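
As a hedged illustration of the per-image metrics named above, the snippet below computes PSNR and SSIM with scikit-image and LPIPS with the `lpips` package; FID is a dataset-level statistic and is typically computed separately over the whole test set (e.g. with `pytorch-fid`).

```python
import numpy as np
import torch
import lpips  # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(restored: np.ndarray, target: np.ndarray) -> dict:
    """Per-image restoration metrics for HxWx3 uint8 arrays."""
    psnr = peak_signal_noise_ratio(target, restored, data_range=255)
    ssim = structural_similarity(target, restored, channel_axis=-1, data_range=255)
    # LPIPS expects float tensors in [-1, 1], shaped (N, 3, H, W).
    to_tensor = lambda a: (
        torch.from_numpy(a).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    )
    lpips_fn = lpips.LPIPS(net="alex")  # AlexNet-based perceptual distance
    lpips_val = lpips_fn(to_tensor(restored), to_tensor(target)).item()
    return {"PSNR": psnr, "SSIM": ssim, "LPIPS": lpips_val}
```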

Implications and Future Directions

The work offers substantial implications:

  • Practical Applications:
    • Leveraging VLMs in low-level tasks lets restoration networks draw on the semantic understanding of large pretrained models, yielding markedly better reconstructions.
  • Theoretical Insights:
    • The paper establishes the merit of cross-modal embeddings in tasks beyond classification, offering a versatile multimodal approach to a range of image processing challenges.

Future Prospects in AI

Future research could explore:

  • Applying this framework to real-world, mixed degradation scenarios to assess robustness and effectiveness.
  • Investigating the integration of alternative vision-LLMs to further validate and possibly surpass the results obtained with CLIP.
  • Enhancing the prompt learning module to refine multi-task performance even further.

In conclusion, the paper presents a compelling framework that bridges high-level and low-level vision tasks using VLMs and sets a strong foundation for future advances in multi-task image restoration.
