
Iterative Prompt Learning for Unsupervised Backlit Image Enhancement (2303.17569v2)

Published 30 Mar 2023 in cs.CV

Abstract: We propose a novel unsupervised backlit image enhancement method, abbreviated as CLIP-LIT, by exploring the potential of Contrastive Language-Image Pre-Training (CLIP) for pixel-level image enhancement. We show that the open-world CLIP prior not only aids in distinguishing between backlit and well-lit images, but also in perceiving heterogeneous regions with different luminance, facilitating the optimization of the enhancement network. Unlike high-level and image manipulation tasks, directly applying CLIP to enhancement tasks is non-trivial, owing to the difficulty in finding accurate prompts. To solve this issue, we devise a prompt learning framework that first learns an initial prompt pair by constraining the text-image similarity between the prompt (negative/positive sample) and the corresponding image (backlit image/well-lit image) in the CLIP latent space. Then, we train the enhancement network based on the text-image similarity between the enhanced result and the initial prompt pair. To further improve the accuracy of the initial prompt pair, we iteratively fine-tune the prompt learning framework to reduce the distribution gaps between the backlit images, enhanced results, and well-lit images via rank learning, boosting the enhancement performance. Our method alternates between updating the prompt learning framework and enhancement network until visually pleasing results are achieved. Extensive experiments demonstrate that our method outperforms state-of-the-art methods in terms of visual quality and generalization ability, without requiring any paired data.

Citations (75)

Summary

  • The paper introduces CLIP-LIT, an unsupervised method using iterative prompt learning to enhance backlit images without relying on paired data.
  • It leverages CLIP’s language-image capabilities to initialize and iteratively refine prompts, aligning text-image similarity at the pixel level.
  • Experimental results on the BAID and Backlit300 datasets show that CLIP-LIT outperforms state-of-the-art methods as measured by PSNR, SSIM, and LPIPS.

Iterative Prompt Learning for Unsupervised Backlit Image Enhancement

The paper "Iterative Prompt Learning for Unsupervised Backlit Image Enhancement" presents an innovative approach to enhancing backlit images utilizing a novel method abbreviated as CLIP-LIT. This approach leverages the Contrastive Language-Image Pre-Training (CLIP) model to facilitate pixel-level image enhancement without the need for paired datasets. This paper outlines a comprehensive framework that incorporates prompt learning and iterative refinement, distinguishing it from other state-of-the-art light enhancement techniques.

The paper highlights the difficulty of enhancing backlit images, which often exhibit extreme luminance disparities that challenge existing methods, both supervised and unsupervised. Unlike traditional approaches that rely heavily on labeled datasets or on strong physical assumptions, the proposed method adopts an unsupervised paradigm. By exploiting the rich priors of a pre-trained CLIP model, the authors avoid the need for paired data, a notable advantage over existing methods that struggle to generalize.

Methodology and Key Innovations

Prompt Learning and Enhancement Network

The key innovation lies in a two-stage scheme built around prompt learning. The first stage initializes a positive/negative prompt pair by constraining the text-image similarity between each prompt and the corresponding well-lit or backlit images in CLIP's latent space, using CLIP's image and text encoders; the initial prompts are then used to train an enhancement network by aligning text-image similarity in the CLIP embedding space. The second stage iteratively fine-tunes the prompts via rank learning to narrow the distribution gaps among backlit, enhanced, and well-lit images, improving the robustness of the enhancement.
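
To make the first stage concrete, below is a minimal sketch of prompt initialization, assuming a frozen OpenAI CLIP (ViT-B/32) backbone. For brevity it models the prompt pair directly as two learnable vectors in CLIP's joint embedding space, whereas the paper learns token embeddings passed through CLIP's frozen text encoder; names such as `prompt_init_step` and the data loaders are illustrative, not from the official code.

```python
import torch
import torch.nn.functional as F
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model.eval()
for p in model.parameters():          # CLIP stays frozen throughout
    p.requires_grad_(False)

embed_dim = model.text_projection.shape[1]
# Row 0: "negative" (backlit) prompt, row 1: "positive" (well-lit) prompt.
prompts = torch.nn.Parameter(0.02 * torch.randn(2, embed_dim, device=device))
prompt_opt = torch.optim.Adam([prompts], lr=1e-4)
logit_scale = model.logit_scale.exp().detach()

def similarity_logits(images, keep_graph=False):
    """Scaled cosine similarity between image features and the two prompts."""
    ctx = torch.enable_grad() if keep_graph else torch.no_grad()
    with ctx:
        img_feat = F.normalize(model.encode_image(images).float(), dim=-1)
    prm_feat = F.normalize(prompts, dim=-1)
    return logit_scale * (img_feat @ prm_feat.t())  # shape: (batch, 2)

def prompt_init_step(backlit_imgs, welllit_imgs):
    """Initialize the prompt pair by classifying backlit vs. well-lit images."""
    logits = torch.cat([similarity_logits(backlit_imgs),
                        similarity_logits(welllit_imgs)])
    labels = torch.cat([torch.zeros(len(backlit_imgs), dtype=torch.long),
                        torch.ones(len(welllit_imgs), dtype=torch.long)]).to(device)
    loss = F.cross_entropy(logits, labels)
    prompt_opt.zero_grad(); loss.backward(); prompt_opt.step()
    return loss.item()
```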

The framework alternates training between prompt refinement and the enhancement network, iteratively updating both so that outputs approach well-lit conditions without over-correcting regions that are already well lit. This alternating scheme improves the algorithm's ability to generalize across scenarios, particularly those with heterogeneous lighting.
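
The alternation can be sketched as follows, continuing the snippet above (it reuses `model`, `prompts`, `prompt_opt`, `similarity_logits`, `logit_scale`, and `device`). `EnhanceNet` is only a placeholder image-to-image network, and the margin, learning rates, and schedule are illustrative; the paper's actual architecture, rank-learning losses, and training details differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnhanceNet(nn.Module):
    """Placeholder image-to-image network (the paper uses its own design)."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid())
    def forward(self, x):
        return self.body(x)

enhance_net = EnhanceNet().to(device)
enh_opt = torch.optim.Adam(enhance_net.parameters(), lr=1e-4)

def enhancement_step(backlit_imgs):
    """Update the enhancement network so its outputs match the positive prompt.
    For simplicity, resizing/normalization to CLIP's expected input statistics
    is omitted here."""
    enhanced = enhance_net(backlit_imgs)
    logits = similarity_logits(enhanced, keep_graph=True)  # prompts held fixed
    target = torch.ones(len(enhanced), dtype=torch.long, device=device)
    loss = F.cross_entropy(logits, target)
    enh_opt.zero_grad(); loss.backward(); enh_opt.step()
    return loss.item()

def prompt_refine_step(backlit_imgs, enhanced_imgs, welllit_imgs, margin=0.2):
    """Rank-style refinement: against the positive prompt, well-lit images
    should score higher than enhanced results, which in turn should score
    higher than the original backlit inputs."""
    pos = lambda x: similarity_logits(x)[:, 1] / logit_scale  # raw cosine sim
    s_b, s_e, s_w = pos(backlit_imgs), pos(enhanced_imgs), pos(welllit_imgs)
    loss = (F.relu(margin + s_e - s_w).mean() +
            F.relu(margin + s_b - s_e).mean())
    prompt_opt.zero_grad(); loss.backward(); prompt_opt.step()
    return loss.item()

def train(backlit_loader, welllit_loader, num_rounds=10):
    """Alternate: train the enhancement network, then refine the prompts on
    (backlit, enhanced, well-lit) triplets, and repeat."""
    for _ in range(num_rounds):
        for backlit in backlit_loader:            # unpaired backlit images
            enhancement_step(backlit.to(device))
        for backlit, welllit in zip(backlit_loader, welllit_loader):
            backlit, welllit = backlit.to(device), welllit.to(device)
            with torch.no_grad():
                enhanced = enhance_net(backlit)
            prompt_refine_step(backlit, enhanced, welllit)
```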

Numerical and Experimental Results

The authors present extensive experimental validation on diverse datasets, including the BAID test set and a newly collected Backlit300 dataset. The results indicate that CLIP-LIT surpasses various state-of-the-art methods both qualitatively and quantitatively, as measured by PSNR, SSIM, LPIPS, and the no-reference image quality metric MUSIQ. Its ability to enhance backlit images without overexposure and to produce artifact-free results is noteworthy.
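
For reference, full-reference scores of this kind are typically computed along the following lines. This is a generic sketch using scikit-image and the `lpips` package rather than the paper's exact evaluation code, and the no-reference MUSIQ score is omitted; the actual protocol (resolution, cropping, color space) follows the paper.

```python
import torch
import lpips                                   # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net='alex')             # AlexNet-based LPIPS

def evaluate_pair(enhanced, reference):
    """`enhanced` and `reference` are HxWx3 uint8 arrays of the same size."""
    psnr = peak_signal_noise_ratio(reference, enhanced, data_range=255)
    ssim = structural_similarity(reference, enhanced,
                                 channel_axis=-1, data_range=255)  # skimage >= 0.19
    # LPIPS expects NCHW tensors scaled to [-1, 1].
    to_tensor = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    lp = lpips_fn(to_tensor(enhanced), to_tensor(reference)).item()
    return psnr, ssim, lp
```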

A further highlight is the demonstration of generalization: the model effectively enhanced images from a variety of previously unseen backlit scenes. This was corroborated not only by objective metrics but also by subjective evaluations, in which CLIP-LIT's results were consistently preferred by human evaluators over those of the best-performing existing methods.

Implications and Future Directions

The implications of the CLIP-LIT method are wide-ranging. Practically, it offers an image preprocessing tool for consumer photography and autonomous vehicle vision systems, where backlit conditions are frequent. Theoretically, the integration of CLIP with low-level vision tasks opens new avenues for research, encouraging further exploration of vision-language models beyond high-level semantics and the incorporation of text-based guidance into pixel-level image restoration.

Moreover, iterative prompt learning provides a general framework applicable to other vision tasks that require unsupervised fine-tuning without extensive labeled data. Future work could extend the approach to real-time applications or further optimize CLIP-LIT to reduce its computational overhead, making it viable for edge computing environments.

In conclusion, the paper successfully demonstrates a complex yet effective unsupervised approach to solving the challenge of backlit image enhancement, pioneering a path for broader application of language-vision models in image processing tasks. The iterative learning strategy coupled with CLIP's language-image synergy presents a compelling case for the evolution of unsupervised learning frameworks in computer vision.
