A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence

Published 24 May 2023 in cs.CV | (2305.15347v2)

Abstract: Text-to-image diffusion models have made significant advances in generating and editing high-quality images. As a result, numerous approaches have explored the ability of diffusion model features to understand and process single images for downstream tasks, e.g., classification, semantic segmentation, and stylization. However, significantly less is known about what these features reveal across multiple, different images and objects. In this work, we exploit Stable Diffusion (SD) features for semantic and dense correspondence and discover that with simple post-processing, SD features can perform quantitatively similar to SOTA representations. Interestingly, the qualitative analysis reveals that SD features have very different properties compared to existing representation learning features, such as the recently released DINOv2: while DINOv2 provides sparse but accurate matches, SD features provide high-quality spatial information but sometimes inaccurate semantic matches. We demonstrate that a simple fusion of these two features works surprisingly well, and a zero-shot evaluation using nearest neighbors on these fused features provides a significant performance gain over state-of-the-art methods on benchmark datasets, e.g., SPair-71k, PF-Pascal, and TSS. We also show that these correspondences can enable interesting applications such as instance swapping in two images.


Summary

  • The paper introduces a novel fusion strategy that combines Stable Diffusion’s rich spatial details with DINOv2’s precise semantic matches.
  • It demonstrates state-of-the-art performance on SPair-71k, PF-Pascal, and TSS datasets using an effective feature fusion method.
  • The study explores practical applications like image editing and instance swapping while addressing limitations such as low fused resolution and higher computational cost.

Exploiting Stable Diffusion and DINO Features for Semantic Correspondence

The paper "A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence" (2305.15347) introduces a novel approach to semantic correspondence by leveraging and fusing features from Stable Diffusion (SD) and DINOv2 models. It demonstrates that SD features, when combined with DINOv2 features, can achieve state-of-the-art results in zero-shot semantic correspondence tasks. The work highlights the complementary nature of these features, where SD provides high-quality spatial information and DINOv2 offers sparse but accurate semantic matches. The implications of this research span across various computer vision applications, including image editing, object swapping, and visual understanding.

Feature Analysis and Properties

The paper explores the properties of Stable Diffusion (SD) features for semantic correspondence. Stable Diffusion, known for its ability to generate high-quality images from text inputs, possesses a robust internal representation of images, capturing both content and layout. The architecture of SD comprises an encoder $\mathcal{E}$, a decoder $\mathcal{D}$, and a denoising U-Net $\mathcal{U}$. The process involves projecting an input image into the latent space, adding Gaussian noise, and extracting features from the U-Net.
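A minimal sketch of this extraction pipeline is shown below, assuming the Hugging Face `diffusers` API; the model id, noise timestep, and decoder-layer indices are illustrative choices, not necessarily the paper's exact settings.

```python
import torch
from diffusers import StableDiffusionPipeline

# Illustrative settings; the paper's exact timestep and layers may differ.
MODEL_ID = "runwayml/stable-diffusion-v1-5"
TIMESTEP = 100
DECODER_LAYERS = [1, 2]  # indices into unet.up_blocks

pipe = StableDiffusionPipeline.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16
).to("cuda")

features = {}

def make_hook(name):
    def hook(module, inputs, output):
        features[name] = output  # spatial feature map from this decoder block
    return hook

hooks = [pipe.unet.up_blocks[i].register_forward_hook(make_hook(f"up_{i}"))
         for i in DECODER_LAYERS]

@torch.no_grad()
def extract_sd_features(image, prompt=""):
    """image: float tensor in [-1, 1], shape (1, 3, H, W), on the pipeline's device."""
    # 1) Project the image into the latent space with the VAE encoder.
    latents = pipe.vae.encode(image.half()).latent_dist.sample()
    latents = latents * pipe.vae.config.scaling_factor
    # 2) Add Gaussian noise at the chosen timestep.
    t = torch.tensor([TIMESTEP], device=latents.device)
    noisy = pipe.scheduler.add_noise(latents, torch.randn_like(latents), t)
    # 3) Run the U-Net once; the hooks capture the decoder activations.
    tokens = pipe.tokenizer(prompt, padding="max_length",
                            max_length=pipe.tokenizer.model_max_length,
                            return_tensors="pt").input_ids.to(latents.device)
    text_emb = pipe.text_encoder(tokens)[0]
    pipe.unet(noisy, t, encoder_hidden_states=text_emb)
    return dict(features)
```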

To understand the properties of these features, the authors perform principal component analysis (PCA) on various layers of the U-Net decoder. The analysis reveals that early layers focus more on semantics and structures, while later layers concentrate on detailed texture and appearance. K-means clustering further illustrates that different parts of objects are clustered and matched across instance pairs, indicating consistent semantic information across intra-class examples. This motivates the fusion of features from different levels to capture both semantic and local details (Figure 1).

Figure 1: Analysis of features from different decoder layers in SD, visualizing PCA-computed features and K-Means clustering results.
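As a rough illustration of this kind of PCA visualization (not the paper's exact code), one can project each spatial feature map onto its top three principal components and render them as RGB:

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_rgb(feature_map: np.ndarray) -> np.ndarray:
    """feature_map: (C, H, W) activations from one decoder layer.
    Returns an (H, W, 3) image whose channels are the top-3 PCA components,
    rescaled to [0, 1] for visualization."""
    c, h, w = feature_map.shape
    flat = feature_map.reshape(c, -1).T              # (H*W, C): one sample per pixel
    comps = PCA(n_components=3).fit_transform(flat)  # (H*W, 3)
    comps -= comps.min(axis=0)
    comps /= comps.max(axis=0) + 1e-8
    return comps.reshape(h, w, 3)
```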

The paper also examines the evolution of DINO features for semantic correspondence, noting the significant improvements of DINOv2 over DINOv1. Through experimentation, the "token" facet from the last layer of DINOv2 is found to yield the best performance. A comparison of SD and DINO features highlights their complementary strengths and weaknesses. SD features excel in spatial layout and generate smooth correspondences, but can sometimes be inaccurate at the pixel level. DINOv2, on the other hand, provides sparse but accurate matches.
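A sketch of pulling patch-token features from the last DINOv2 layer via `torch.hub` follows; the model size and preprocessing assumptions here are illustrative, not the paper's exact setup.

```python
import torch

# Load DINOv2 ViT-B/14 from torch.hub (model name as published in the DINOv2 repo).
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()

@torch.no_grad()
def extract_dino_tokens(image: torch.Tensor) -> torch.Tensor:
    """image: (1, 3, H, W), ImageNet-normalized, with H and W multiples of 14.
    Returns a (1, C, H/14, W/14) patch-token feature map."""
    out = dino.forward_features(image)
    tokens = out["x_norm_patchtokens"]        # (1, N, C), one token per 14x14 patch
    b, n, c = tokens.shape
    h14, w14 = image.shape[-2] // 14, image.shape[-1] // 14
    return tokens.permute(0, 2, 1).reshape(b, c, h14, w14)
```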

Feature Fusion Strategy

The core contribution of this work is a simple yet effective strategy for aligning and fusing SD and DINOv2 features. The approach involves independently normalizing both features and then concatenating them:

$\mathcal{F}_{\text{FUSE}} = \big(\alpha\, \|\mathcal{F}_{\text{SD}}\|_2,\ (1-\alpha)\, \|\mathcal{F}_{\text{DINO}}\|_2\big)$

where $\|\cdot\|_2$ denotes L2 normalization and $\alpha$ is a hyperparameter controlling the relative weight of the two features. Empirically, $\alpha=0.5$ is found to provide a good balance (Figure 2).

Figure 2: Semantic correspondence achieved by fusing Stable Diffusion and DINO features, enabling pixel-level instance swapping and plausible instance generation.

This fusion strategy leverages the strengths of both feature types, resulting in enhanced precision, reduced noise, and smoother transitions in correspondences. The fused features demonstrate superior performance in challenging cases, outperforming either feature alone.
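A minimal sketch of this fusion, assuming both feature maps have already been resized to a common spatial resolution (variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def fuse_features(f_sd: torch.Tensor, f_dino: torch.Tensor,
                  alpha: float = 0.5) -> torch.Tensor:
    """f_sd: (B, C_sd, H, W), f_dino: (B, C_dino, H, W), same spatial size.
    Each map is L2-normalized per location, weighted by alpha / (1 - alpha),
    and the two are concatenated along the channel dimension."""
    f_sd = F.normalize(f_sd, p=2, dim=1)
    f_dino = F.normalize(f_dino, p=2, dim=1)
    return torch.cat([alpha * f_sd, (1.0 - alpha) * f_dino], dim=1)
```

Because each feature is normalized before weighting, $\alpha$ directly trades off how much the nearest-neighbor distances are dominated by SD versus DINOv2 channels.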

Experimental Results

The paper presents extensive experiments on public benchmark datasets, including SPair-71k, PF-Pascal, and TSS. The zero-shot method (Fuse-ViT-B/14) significantly outperforms existing methods on the SPair-71k dataset, achieving a leading average PCK score. The fusion strategy also substantially improves the performance of the DINOv2 baseline, highlighting its effectiveness.

On the PF-Pascal dataset, the method consistently outperforms all unsupervised methods, achieving the highest average PCK scores. For dense correspondence evaluation on the TSS dataset, the fusion approach outperforms all unsupervised nearest-neighbor-search-based methods. These results confirm the effectiveness of the fusion strategy and the complementary nature of SD and DINOv2 features.
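For reference, PCK (percentage of correct keypoints) counts a predicted keypoint as correct when it lies within $\alpha$ times a reference size (commonly the object bounding box for SPair-71k and the image for PF-Pascal) of the ground truth. A hedged sketch of the metric, with the exact threshold convention left as an assumption:

```python
import numpy as np

def pck(pred_kps: np.ndarray, gt_kps: np.ndarray,
        ref_size: float, alpha: float = 0.1) -> float:
    """pred_kps, gt_kps: (N, 2) keypoint arrays in pixels.
    ref_size: max(height, width) of the reference bounding box or image.
    Returns the fraction of predictions within alpha * ref_size of ground truth."""
    dists = np.linalg.norm(pred_kps - gt_kps, axis=1)
    return float((dists <= alpha * ref_size).mean())
```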

The paper also introduces a novel application of instance swapping, where objects in two images are swapped using estimated dense correspondence. By leveraging the high-quality dense correspondence obtained by the proposed method, plausible instance swapping is achieved through a straightforward process of pixel-level swapping followed by Stable Diffusion-based refinement (Figure 3).

Figure 3: Qualitative comparison of instance swapping with different features, showing that fused features balance spatial smoothness and fine details.
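A rough sketch of the pixel-level swapping step, under simplifying assumptions (fused feature maps already upsampled to image resolution, no object mask, and no diffusion-based refinement):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def swap_pixels(src_img: torch.Tensor, src_feat: torch.Tensor,
                tgt_feat: torch.Tensor) -> torch.Tensor:
    """src_img: (3, H, W) source image; src_feat, tgt_feat: (C, H, W) fused
    feature maps for source and target at image resolution.
    For every target pixel, copy the source pixel whose fused feature is its
    nearest neighbor under cosine similarity."""
    c, h, w = src_feat.shape
    src = F.normalize(src_feat.reshape(c, -1), dim=0)   # (C, H*W)
    tgt = F.normalize(tgt_feat.reshape(c, -1), dim=0)   # (C, H*W)
    sim = tgt.T @ src                                    # (H*W, H*W) similarity matrix
    nn_idx = sim.argmax(dim=1)                           # best source location per target pixel
    swapped = src_img.reshape(3, -1)[:, nn_idx]          # gather matched source pixels
    return swapped.reshape(3, h, w)
```

The full similarity matrix is memory-hungry at high resolution; in practice the matching would be restricted to the object region or computed in tiles.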

Limitations and Discussion

The paper acknowledges certain limitations, including the relatively low resolution of the fused features and the increased computational cost due to the integration of the Stable Diffusion model. The low resolution of the fused features can impede the construction of precise matches, which are particularly important for dense correspondence tasks.

The distinct behaviors observed in Stable Diffusion (SD) and DINOv2 features raise questions about their underlying causes. The training paradigms and architectural differences between DINO and SD may contribute to their differing properties. Identifying the causes of these differences remains an open question and a promising direction for future exploration (Figure 4).

Figure 4: Semantic flow maps using different features, where SD features yield smoother flow fields compared to DINOv2's isolated outliers.

Conclusion

The paper "A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence" (2305.15347) makes a significant contribution by exploring the potential of Stable Diffusion features for semantic correspondence and demonstrating their complementary nature with DINOv2 features. The proposed fusion strategy and its empirical validation on multiple benchmark datasets highlight the benefits of pursuing better features for visual correspondence. The work opens up new avenues for research in image editing, object swapping, and visual understanding, leveraging the strengths of generative and discriminative models.
