Overview of "Unleashing Text-to-Image Diffusion Models for Visual Perception"
The paper "Unleashing Text-to-Image Diffusion Models for Visual Perception" presents a novel framework, VPD (Visual Perception with a pre-trained Diffusion model), that leverages the high-level semantic knowledge from text-to-image diffusion models to enhance visual perception tasks. By utilizing the capabilities of diffusion models, pre-trained on extensive image-text datasets, the authors show significant improvements in tasks like semantic segmentation, referring image segmentation, and depth estimation.
Core Contributions
- Transfer Learning from Vision-Language Pre-trained Diffusion Models: The paper exploits the semantic comprehension that text-to-image diffusion models acquire from pre-training on large image-text datasets such as LAION-5B. This is contrasted with traditional visual pre-training methods, which typically rely on image-only data. Thanks to their vision-language training, diffusion models offer a rich reservoir of both high-level and low-level knowledge that transfers to downstream tasks.
- Innovative Framework (VPD): VPD uses the denoising UNet of a pre-trained text-to-image diffusion model (Stable Diffusion) as its backbone: the input image is encoded into the latent space, passed through a single forward pass of the denoising network together with a text prompt, and the resulting features feed task-specific heads. Semantic knowledge is injected in two complementary ways (a minimal code sketch follows this list):
- Implicit Guidance: Text prompts built from the task's class names, refined by a lightweight text adapter, condition the denoising network and prompt it to produce semantically richer visual features.
- Explicit Guidance: The cross-attention maps between the visual features and the text embeddings are extracted and used as explicit semantic guidance, strengthening the alignment between the extracted features and the prompt.
- Performance and Adaptability: The framework adapts to downstream tasks more readily than prior pre-training approaches. VPD sets new state-of-the-art results, including 0.254 RMSE on NYUv2 depth estimation and 73.3% oIoU on the RefCOCO validation set for referring image segmentation, underscoring its efficacy across tasks that demand different levels of visual and semantic comprehension.
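To make the two forms of guidance concrete, the sketch below builds prompts from class names, refines their embeddings with a small adapter, and computes cross-attention maps between visual features and the refined text embeddings. It is a minimal, self-contained illustration: the `TextAdapter` and `cross_attention_maps` names, the feature dimensions, and the random stand-in tensors are assumptions made for this sketch, not VPD's actual modules.

```python
# Minimal sketch of VPD-style implicit and explicit guidance (illustrative only).
# All names, dimensions, and the random stand-in tensors are assumptions; in VPD
# the visual features and attention maps come from one forward pass through
# Stable Diffusion's denoising UNet.
import torch
import torch.nn as nn


class TextAdapter(nn.Module):
    """Small MLP that refines class-name embeddings (implicit guidance)."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, text_embeds: torch.Tensor) -> torch.Tensor:
        # Residual refinement keeps the embeddings close to the pre-trained space.
        return text_embeds + self.mlp(text_embeds)


def cross_attention_maps(visual_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
    """Explicit guidance: attention maps between spatial features and text embeddings.

    visual_feats: (B, C, H, W) features from the denoising backbone.
    text_embeds:  (K, C) one refined embedding per class (or referring expression).
    Returns (B, K, H, W) maps that can be concatenated to the visual features.
    """
    b, c, h, w = visual_feats.shape
    queries = visual_feats.flatten(2).transpose(1, 2)                 # (B, H*W, C)
    logits = torch.einsum("bnc,kc->bnk", queries, text_embeds) / c**0.5
    attn = logits.softmax(dim=-1)                                     # normalize over prompts
    return attn.transpose(1, 2).reshape(b, -1, h, w)                  # (B, K, H, W)


if __name__ == "__main__":
    # Stand-ins for UNet features and CLIP-style embeddings of prompts
    # such as "a photo of a {class}".
    feats = torch.randn(2, 768, 32, 32)
    class_embeds = torch.randn(150, 768)
    maps = cross_attention_maps(feats, TextAdapter(dim=768)(class_embeds))
    guided = torch.cat([feats, maps], dim=1)   # features + explicit semantic guidance
    print(guided.shape)                        # torch.Size([2, 918, 32, 32])
```

VPD itself reads these maps out of the UNet's cross-attention layers rather than recomputing them; the standalone version above only mirrors the idea.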
Noteworthy Results
- Semantic Segmentation: With a lightweight Semantic FPN head, VPD reaches 54.6% mIoU on ADE20K under a shorter training schedule, outperforming backbones such as ConvNeXt-XL that require far more training iterations; a sketch of such a lightweight decode head follows this list.
- Referring Image Segmentation: On RefCOCO, VPD surpasses the previous state of the art by using the referring expression itself as the text prompt, illustrating the benefit of the vision-language interaction that diffusion models provide.
- Depth Estimation: Establishing new state-of-the-art results on NYUv2, the VPD model not only improves accuracy metrics but also demonstrates faster convergence compared to models trained with conventional visual pre-training.
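As an illustration of how light the decode head can be, the following is a minimal FPN-style segmentation head over four hypothetical feature scales, in the spirit of the Semantic FPN head mentioned for semantic segmentation; the channel widths, scale count, and class count are assumptions made for the sketch, not the paper's exact configuration.

```python
# Minimal FPN-style segmentation head over multi-scale backbone features
# (illustrative; channel widths, scale count, and class count are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleFPNHead(nn.Module):
    def __init__(self, in_channels=(320, 640, 1280, 1280), mid_channels=256, num_classes=150):
        super().__init__()
        # 1x1 lateral convolutions project every scale to a common width.
        self.lateral = nn.ModuleList(nn.Conv2d(c, mid_channels, kernel_size=1) for c in in_channels)
        self.classifier = nn.Conv2d(mid_channels, num_classes, kernel_size=1)

    def forward(self, feats):
        # feats: list of (B, C_i, H_i, W_i) maps, highest resolution first.
        target_size = feats[0].shape[-2:]
        fused = sum(
            F.interpolate(lat(f), size=target_size, mode="bilinear", align_corners=False)
            for lat, f in zip(self.lateral, feats)
        )
        return self.classifier(fused)          # (B, num_classes, H_0, W_0)


if __name__ == "__main__":
    # Random tensors standing in for multi-scale features from the backbone.
    feats = [
        torch.randn(1, 320, 64, 64),
        torch.randn(1, 640, 32, 32),
        torch.randn(1, 1280, 16, 16),
        torch.randn(1, 1280, 8, 8),
    ]
    logits = SimpleFPNHead()(feats)
    print(logits.shape)                        # torch.Size([1, 150, 64, 64])
```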
Implications and Future Directions
The success of VPD suggests that leveraging large text-to-image diffusion models could revolutionize training and performance benchmarks for visual perception tasks. The ability to harness both high-level semantic understanding and low-level image details makes these models highly versatile and efficient learners.
Future developments could focus on exploring lightweight models to address the computational costs associated with diffusion models, branching into other generative models beyond diffusion, and further integrating generative and recognition capabilities. Research could also delve into leveraging multi-modal data more extensively, potentially broadening the applicability of such models across various domains of AI and computer vision.
In conclusion, the paper presents a compelling case for the use of diffusion models as a robust alternative for visual perception tasks, paving the way for innovative approaches in integrating generative pre-training with perceptual tasks.