- The paper introduces the VPIP framework that integrates a visual prompt encoder with a cross-attention mechanism to unify 30 low-level vision tasks.
- Experimental results show that the GenLV model significantly outperforms state-of-the-art methods in image restoration and enhancement tasks.
- The approach streamlines multi-task processing by handling diverse input-target domains through visual prompts, while incurring less computational overhead than MAE-based alternatives.
Learning A Low-Level Vision Generalist via Visual Task Prompt
The paper "Learning A Low-Level Vision Generalist via Visual Task Prompt" by Xiangyu Chen et al. introduces a framework called Visual task Prompt-based Image Processing (VPIP), aimed at training a single model capable of performing a variety of low-level vision tasks. It highlights several challenges faced by existing multi-task restoration approaches and proposes a novel method that addresses these challenges effectively.
Overview of the VPIP Framework
The VPIP framework includes three main components:
- An end-to-end image processing network.
- A prompt encoder sub-network.
- An information interaction mechanism for task-specific processing.
The authors adopt the X-Restormer architecture for the main network, a backbone designed for general image restoration. Input-target image pairs serve as visual prompts that represent different tasks, which enables the model to handle tasks whose input and target domains differ.
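As a rough illustration of how these pieces fit together, here is a minimal PyTorch sketch; all module shapes and layer choices are assumptions made for readability, not the paper's implementation (the actual backbone is X-Restormer):

```python
import torch
import torch.nn as nn

class VPIPSketch(nn.Module):
    """Illustrative VPIP skeleton: a main image-processing network plus a
    prompt encoder whose latents represent the task."""

    def __init__(self, dim=48):
        super().__init__()
        # Stand-in for the X-Restormer backbone (end-to-end image network).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(dim, 3, 3, padding=1),
        )
        # Prompt encoder sub-network: consumes a concatenated input-target
        # prompt pair (6 channels) and yields task-specific latents.
        self.prompt_encoder = nn.Sequential(
            nn.Conv2d(6, dim, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1),
        )

    def forward(self, x, prompt_input, prompt_target):
        pair = torch.cat([prompt_input, prompt_target], dim=1)
        task_latent = self.prompt_encoder(pair)
        # In VPIP the latent conditions the backbone through prompt
        # cross-attention (sketched under Key Innovations below); this
        # skeleton simply returns both pieces.
        return self.backbone(x), task_latent
```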
Key Innovations
Prompt Cross-Attention: A new prompt cross-attention mechanism facilitates interaction between input features and prompt information, enhancing the model's ability to handle diverse tasks. It efficiently injects task-specific latent information into the main network, reducing the computational burden compared to traditional MAE-based models.
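A minimal PyTorch sketch of what such a cross-attention step could look like, assuming flattened spatial features and standard multi-head attention (the paper's exact layer design may differ):

```python
import torch
import torch.nn as nn

class PromptCrossAttention(nn.Module):
    """Sketch of prompt cross-attention: backbone features act as queries,
    prompt latents as keys/values. Dimensions are illustrative."""

    def __init__(self, dim=48, heads=4):
        super().__init__()
        self.norm_feat = nn.LayerNorm(dim)
        self.norm_prompt = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat, prompt_latent):
        # feat: (B, C, H, W) main-network features
        # prompt_latent: (B, C, h, w) task latents from the prompt encoder
        B, C, H, W = feat.shape
        q = self.norm_feat(feat.flatten(2).transpose(1, 2))              # (B, HW, C)
        kv = self.norm_prompt(prompt_latent.flatten(2).transpose(1, 2))  # (B, hw, C)
        out, _ = self.attn(q, kv, kv)   # each pixel attends to task latents
        return feat + out.transpose(1, 2).reshape(B, C, H, W)  # residual add
```

Because the prompt latents are far fewer than the input pixels, attending to them is much cheaper than jointly processing stitched prompt-and-query canvases as MAE-style approaches do.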
Task Representation via Visual Prompts: By using input-target image pairs as visual task prompts, the framework addresses the challenge of varying input-target domains. This approach improves robustness to the specific content of the prompt images and offers flexibility in choosing backbone networks suited to different low-level vision tasks.
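To make the task-switching behavior concrete, here is a hypothetical usage example reusing the VPIPSketch class from the overview sketch above; random tensors stand in for real prompt pairs (e.g. a noisy image and its clean ground truth):

```python
model = VPIPSketch()
x = torch.rand(1, 3, 128, 128)                 # query image to process

denoise_prompt = (torch.rand(1, 3, 128, 128),  # noisy example (input)
                  torch.rand(1, 3, 128, 128))  # clean example (target)
deblur_prompt = (torch.rand(1, 3, 128, 128),   # blurry example (input)
                 torch.rand(1, 3, 128, 128))   # sharp example (target)

out_denoise, _ = model(x, *denoise_prompt)  # model acts as a denoiser
out_deblur, _ = model(x, *deblur_prompt)    # same weights, deblurring task
```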
Experimental Evaluation and Results
The model was trained on 30 diverse tasks spanning image restoration, enhancement, edge detection, and stylization. Notable tasks include Gaussian noise reduction, JPEG compression artifact removal, low-light enhancement, underwater image correction, and various stylization techniques. Results show that the proposed model, GenLV, outperforms existing methods both quantitatively and qualitatively.
- Image Restoration: In this domain, GenLV significantly outperformed state-of-the-art methods including Real-ESRGAN and PromptIR, achieving higher PSNR values (see the PSNR sketch after this list) on tasks such as Gaussian noise reduction, JPEG compression artifact removal, and image dehazing.
- Image Enhancement: GenLV delivered superior performance on low-light enhancement and underwater image correction, achieving the highest PSNR values compared to models like Painter and PromptGIP.
- Edge Detection and Stylization: The model also showed promising results in edge detection and stylization tasks, outperforming existing methods by a significant margin.
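Since PSNR is the metric quoted throughout these comparisons, here is its standard definition as a short sketch (not taken from the paper):

```python
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]:
    PSNR = 10 * log10(max_val^2 / MSE). Higher is better."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```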
Implications and Future Directions
The research holds significant implications for the development of generalist models in low-level vision. By introducing the VPIP framework, the paper demonstrates that it is possible to train a single model to effectively handle a wide range of tasks, simplifying the practical application of low-level vision techniques.
Practically, the adoption of such a generalist model could streamline workflows in various domains like photography, video enhancement, and automated image correction systems. Theoretically, the work lays the groundwork for further exploration into integrating versatility and robustness into computer vision models.
Future research could address the limitations noted by the authors, such as handling unseen, out-of-distribution tasks. Another promising direction is scaling the model, in both size and task diversity, to further strengthen its generalization capabilities.
Conclusion
In conclusion, the paper presents a compelling framework for creating a low-level vision generalist model using visual task prompts. The strong empirical results across 30 diverse tasks, facilitated by innovations such as prompt cross-attention, underscore the potential and efficacy of this approach. This research not only advances the field of multi-task learning in low-level vision but also sets the stage for future developments towards more comprehensive and versatile vision models.