A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence

Published 24 May 2023 in cs.CV | (2305.15347v2)

Abstract: Text-to-image diffusion models have made significant advances in generating and editing high-quality images. As a result, numerous approaches have explored the ability of diffusion model features to understand and process single images for downstream tasks, e.g., classification, semantic segmentation, and stylization. However, significantly less is known about what these features reveal across multiple, different images and objects. In this work, we exploit Stable Diffusion (SD) features for semantic and dense correspondence and discover that with simple post-processing, SD features can perform quantitatively similar to SOTA representations. Interestingly, the qualitative analysis reveals that SD features have very different properties compared to existing representation learning features, such as the recently released DINOv2: while DINOv2 provides sparse but accurate matches, SD features provide high-quality spatial information but sometimes inaccurate semantic matches. We demonstrate that a simple fusion of these two features works surprisingly well, and a zero-shot evaluation using nearest neighbors on these fused features provides a significant performance gain over state-of-the-art methods on benchmark datasets, e.g., SPair-71k, PF-Pascal, and TSS. We also show that these correspondences can enable interesting applications such as instance swapping in two images.


Summary

  • The paper introduces a novel fusion strategy that combines Stable Diffusion’s rich spatial details with DINOv2’s precise semantic matches.
  • It demonstrates state-of-the-art performance on SPair-71k, PF-Pascal, and TSS datasets using an effective feature fusion method.
  • The study explores practical applications like image editing and instance swapping while addressing limitations such as low fused resolution and higher computational cost.

Exploiting Stable Diffusion and DINO Features for Semantic Correspondence

The paper "A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence" (2305.15347) introduces a novel approach to semantic correspondence by leveraging and fusing features from Stable Diffusion (SD) and DINOv2 models. It demonstrates that SD features, when combined with DINOv2 features, can achieve state-of-the-art results in zero-shot semantic correspondence tasks. The work highlights the complementary nature of these features, where SD provides high-quality spatial information and DINOv2 offers sparse but accurate semantic matches. The implications of this research span across various computer vision applications, including image editing, object swapping, and visual understanding.

Feature Analysis and Properties

The paper explores the properties of Stable Diffusion (SD) features for semantic correspondence. Stable Diffusion, known for its ability to generate high-quality images from text inputs, possesses a robust internal representation of images, capturing both content and layout. The architecture of SD comprises an encoder $\mathcal{E}$, a decoder $\mathcal{D}$, and a denoising U-Net $\mathcal{U}$. The process involves projecting an input image into the latent space, adding Gaussian noise, and extracting features from the U-Net.
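A minimal sketch of this extraction pipeline is shown below, assuming the Hugging Face `diffusers` API; the model id, noise timestep, and decoder-layer indices are illustrative choices, not necessarily the paper's exact settings.

```python
import torch
from diffusers import StableDiffusionPipeline

# Illustrative settings; the paper's exact timestep and layers may differ.
MODEL_ID = "runwayml/stable-diffusion-v1-5"
TIMESTEP = 100
DECODER_LAYERS = [1, 2]  # indices into unet.up_blocks

pipe = StableDiffusionPipeline.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16
).to("cuda")

features = {}

def make_hook(name):
    def hook(module, inputs, output):
        features[name] = output  # spatial feature map from this decoder block
    return hook

hooks = [pipe.unet.up_blocks[i].register_forward_hook(make_hook(f"up_{i}"))
         for i in DECODER_LAYERS]

@torch.no_grad()
def extract_sd_features(image, prompt=""):
    """image: float tensor in [-1, 1], shape (1, 3, H, W), on the pipeline's device."""
    # 1) Project the image into the latent space with the VAE encoder.
    latents = pipe.vae.encode(image.half()).latent_dist.sample()
    latents = latents * pipe.vae.config.scaling_factor
    # 2) Add Gaussian noise at the chosen timestep.
    t = torch.tensor([TIMESTEP], device=latents.device)
    noisy = pipe.scheduler.add_noise(latents, torch.randn_like(latents), t)
    # 3) Run the U-Net once; the hooks capture the decoder activations.
    tokens = pipe.tokenizer(prompt, padding="max_length",
                            max_length=pipe.tokenizer.model_max_length,
                            return_tensors="pt").input_ids.to(latents.device)
    text_emb = pipe.text_encoder(tokens)[0]
    pipe.unet(noisy, t, encoder_hidden_states=text_emb)
    return dict(features)
```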

To understand the properties of these features, the authors perform principal component analysis (PCA) on various layers of the U-Net decoder. The analysis reveals that early layers focus more on semantics and structures, while later layers concentrate on detailed texture and appearance. K-means clustering further illustrates that different parts of objects are clustered and matched across instance pairs, indicating consistent semantic information across intra-class examples. This motivates the fusion of features from different levels to capture both semantic and local details (Figure 1).

Figure 1: Analysis of features from different decoder layers in SD, visualizing PCA-computed features and K-Means clustering results.
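As a rough illustration of this kind of PCA visualization (not the paper's exact code), one can project each spatial feature map onto its top three principal components and render them as RGB:

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_rgb(feature_map: np.ndarray) -> np.ndarray:
    """feature_map: (C, H, W) activations from one decoder layer.
    Returns an (H, W, 3) image whose channels are the top-3 PCA components,
    rescaled to [0, 1] for visualization."""
    c, h, w = feature_map.shape
    flat = feature_map.reshape(c, -1).T              # (H*W, C): one sample per pixel
    comps = PCA(n_components=3).fit_transform(flat)  # (H*W, 3)
    comps -= comps.min(axis=0)
    comps /= comps.max(axis=0) + 1e-8
    return comps.reshape(h, w, 3)
```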

The paper also examines the evolution of DINO features for semantic correspondence, noting the significant improvements of DINOv2 over DINOv1. Through experimentation, the "token" facet from the last layer of DINOv2 is found to yield the best performance. A comparison of SD and DINO features highlights their complementary strengths and weaknesses. SD features excel in spatial layout and generate smooth correspondences, but can sometimes be inaccurate at the pixel level. DINOv2, on the other hand, provides sparse but accurate matches.
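A sketch of pulling patch-token features from the last DINOv2 layer via `torch.hub` follows; the model size and preprocessing assumptions here are illustrative, not the paper's exact setup.

```python
import torch

# Load DINOv2 ViT-B/14 from torch.hub (model name as published in the DINOv2 repo).
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()

@torch.no_grad()
def extract_dino_tokens(image: torch.Tensor) -> torch.Tensor:
    """image: (1, 3, H, W), ImageNet-normalized, with H and W multiples of 14.
    Returns a (1, C, H/14, W/14) patch-token feature map."""
    out = dino.forward_features(image)
    tokens = out["x_norm_patchtokens"]        # (1, N, C), one token per 14x14 patch
    b, n, c = tokens.shape
    h14, w14 = image.shape[-2] // 14, image.shape[-1] // 14
    return tokens.permute(0, 2, 1).reshape(b, c, h14, w14)
```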

Feature Fusion Strategy

The core contribution of this work is a simple yet effective strategy for aligning and fusing SD and DINOv2 features. The approach involves independently normalizing both features and then concatenating them:

$\mathcal{F}_{\text{FUSE}} = \big(\alpha\, \|\mathcal{F}_{\text{SD}}\|_2,\ (1-\alpha)\, \|\mathcal{F}_{\text{DINO}}\|_2\big)$

where $\|\cdot\|_2$ denotes L2 normalization and $\alpha$ is a hyperparameter controlling the relative weight of the two features. Empirically, $\alpha=0.5$ is found to provide a good balance (Figure 2).

Figure 2: Semantic correspondence achieved by fusing Stable Diffusion and DINO features, enabling pixel-level instance swapping and plausible instance generation.

This fusion strategy leverages the strengths of both feature types, resulting in enhanced precision, reduced noise, and smoother transitions in correspondences. The fused features demonstrate superior performance in challenging cases, outperforming either feature alone.
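A minimal sketch of this fusion, assuming both feature maps have already been resized to a common spatial resolution (variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def fuse_features(f_sd: torch.Tensor, f_dino: torch.Tensor,
                  alpha: float = 0.5) -> torch.Tensor:
    """f_sd: (B, C_sd, H, W), f_dino: (B, C_dino, H, W), same spatial size.
    Each map is L2-normalized per location, weighted by alpha / (1 - alpha),
    and the two are concatenated along the channel dimension."""
    f_sd = F.normalize(f_sd, p=2, dim=1)
    f_dino = F.normalize(f_dino, p=2, dim=1)
    return torch.cat([alpha * f_sd, (1.0 - alpha) * f_dino], dim=1)
```

Because each feature is normalized before weighting, $\alpha$ directly trades off how much the nearest-neighbor distances are dominated by SD versus DINOv2 channels.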

Experimental Results

The paper presents extensive experiments on public benchmark datasets, including SPair-71k, PF-Pascal, and TSS. The zero-shot method (Fuse-ViT-B/14) significantly outperforms existing methods on the SPair-71k dataset, achieving a leading average PCK score. The fusion strategy also substantially improves the performance of the DINOv2 baseline, highlighting its effectiveness.

On the PF-Pascal dataset, the method consistently outperforms all unsupervised methods, achieving the highest average PCK scores. For dense correspondence evaluation on the TSS dataset, the fusion approach outperforms all unsupervised nearest-neighbor-search-based methods. These results confirm the effectiveness of the fusion strategy and the complementary nature of SD and DINOv2 features.
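For reference, PCK (percentage of correct keypoints) counts a predicted keypoint as correct when it lies within $\alpha$ times a reference size (commonly the object bounding box for SPair-71k and the image for PF-Pascal) of the ground truth. A hedged sketch of the metric, with the exact threshold convention left as an assumption:

```python
import numpy as np

def pck(pred_kps: np.ndarray, gt_kps: np.ndarray,
        ref_size: float, alpha: float = 0.1) -> float:
    """pred_kps, gt_kps: (N, 2) keypoint arrays in pixels.
    ref_size: max(height, width) of the reference bounding box or image.
    Returns the fraction of predictions within alpha * ref_size of ground truth."""
    dists = np.linalg.norm(pred_kps - gt_kps, axis=1)
    return float((dists <= alpha * ref_size).mean())
```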

The paper also introduces a novel application of instance swapping, where objects in two images are swapped using estimated dense correspondence. By leveraging the high-quality dense correspondence obtained by the proposed method, plausible instance swapping is achieved through a straightforward process of pixel-level swapping followed by Stable Diffusion-based refinement (Figure 3).

Figure 3: Qualitative comparison of instance swapping with different features, showing that fused features balance spatial smoothness and fine details.
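A rough sketch of the pixel-level swapping step, under simplifying assumptions (fused feature maps already upsampled to image resolution, no object mask, and no diffusion-based refinement):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def swap_pixels(src_img: torch.Tensor, src_feat: torch.Tensor,
                tgt_feat: torch.Tensor) -> torch.Tensor:
    """src_img: (3, H, W) source image; src_feat, tgt_feat: (C, H, W) fused
    feature maps for source and target at image resolution.
    For every target pixel, copy the source pixel whose fused feature is its
    nearest neighbor under cosine similarity."""
    c, h, w = src_feat.shape
    src = F.normalize(src_feat.reshape(c, -1), dim=0)   # (C, H*W)
    tgt = F.normalize(tgt_feat.reshape(c, -1), dim=0)   # (C, H*W)
    sim = tgt.T @ src                                    # (H*W, H*W) similarity matrix
    nn_idx = sim.argmax(dim=1)                           # best source location per target pixel
    swapped = src_img.reshape(3, -1)[:, nn_idx]          # gather matched source pixels
    return swapped.reshape(3, h, w)
```

The full similarity matrix is memory-hungry at high resolution; in practice the matching would be restricted to the object region or computed in tiles.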

Limitations and Discussion

The paper acknowledges certain limitations, including the relatively low resolution of the fused features and the increased computational cost due to the integration of the Stable Diffusion model. The low resolution of the fused features can impede the construction of precise matches, which are particularly important for dense correspondence tasks.

The distinct behaviors observed in Stable Diffusion (SD) and DINOv2 features raise questions about their underlying causes. The training paradigms and architectural differences between DINO and SD may contribute to their differing properties. Identifying the causes of these differences remains an open question and a promising direction for future exploration (Figure 4).

Figure 4: Semantic flow maps using different features, where SD features yield smoother flow fields compared to DINOv2's isolated outliers.

Conclusion

The paper "A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence" (2305.15347) makes a significant contribution by exploring the potential of Stable Diffusion features for semantic correspondence and demonstrating their complementary nature with DINOv2 features. The proposed fusion strategy and its empirical validation on multiple benchmark datasets highlight the benefits of pursuing better features for visual correspondence. The work opens up new avenues for research in image editing, object swapping, and visual understanding, leveraging the strengths of generative and discriminative models.
