A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence
Abstract: Text-to-image diffusion models have made significant advances in generating and editing high-quality images. As a result, numerous approaches have explored the ability of diffusion model features to understand and process single images for downstream tasks, e.g., classification, semantic segmentation, and stylization. However, significantly less is known about what these features reveal across multiple, different images and objects. In this work, we exploit Stable Diffusion (SD) features for semantic and dense correspondence and discover that with simple post-processing, SD features can perform quantitatively similar to SOTA representations. Interestingly, the qualitative analysis reveals that SD features have very different properties compared to existing representation learning features, such as the recently released DINOv2: while DINOv2 provides sparse but accurate matches, SD features provide high-quality spatial information but sometimes inaccurate semantic matches. We demonstrate that a simple fusion of these two features works surprisingly well, and a zero-shot evaluation using nearest neighbors on these fused features provides a significant performance gain over state-of-the-art methods on benchmark datasets, e.g., SPair-71k, PF-Pascal, and TSS. We also show that these correspondences can enable interesting applications such as instance swapping in two images.
- Neural best-buddies: Sparse cross-domain correspondence. ACM Transitions on Graphics, 37(4):1–14, 2018.
- Deep ViT features as dense visual descriptors. In European Conference on Computer Vision Workshops, 2022.
- SegDiff: Image segmentation with diffusion probabilistic models. arXiv:2112.00390, 2021.
- Label-efficient semantic segmentation with diffusion models. In International Conference on Learning Representations, 2022.
- Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, pages 9912–9924, 2020.
- Emerging properties in self-supervised vision transformers. In International Conference Computer Vision, pages 9650–9660, 2021.
- DiffusionDet: Diffusion model for object detection. In International Conference Computer Vision, 2023a.
- A generalist framework for panoptic segmentation of images and videos. In International Conference Computer Vision, 2023b.
- Cats: Cost aggregation transformers for visual correspondence. Advances in Neural Information Processing Systems, 34:9011–9023, 2021.
- Cats++: Boosting cost aggregation with convolutions and transformers. arXiv preprint arXiv:2202.06817, 2022.
- Text-to-image diffusion models are zero-shot classifiers. arXiv preprint arXiv:2303.15233, 2023.
- DiffEdit: Diffusion-based semantic image editing with mask guidance. In International Conference on Learning Representations, 2023.
- Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, pages 8780–8794, 2021.
- DiffusionDepth: Diffusion denoising approach for monocular depth estimation. arXiv:2303.05021, 2023.
- D2-Net: A trainable CNN for joint description and detection of local features. In IEEE Conference on Computer Vision and Pattern Recogition, pages 8092–8101, 2019.
- Taming transformers for high-resolution image synthesis. In IEEE Conference on Computer Vision and Pattern Recogition, pages 12873–12883, 2021.
- PAIR-Diffusion: Object-level image editing with structure-and-appearance paired diffusion models. arXiv preprint arXiv:2303.17546, 2023.
- ASIC: Aligning sparse in-the-wild image collections. arXiv preprint arXiv:2303.16201, 2023.
- Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2):217–288, 2011.
- Proposal Flow: Semantic correspondences from object proposals. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(7):1711–1725, 2017.
- Semantic contours from inverse detectors. In International Conference Computer Vision, pages 991–998, 2011.
- Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
- Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, pages 6840–6851, 2020.
- Learning semantic correspondence with sparse annotations. In European Conference Computer Vision, pages 267–284, 2022.
- PARN: Pyramidal affine regression networks for dense semantic correspondence. In European Conference Computer Vision, pages 351–366, 2018.
- DifNet: Semantic segmentation by diffusion networks. In Advances in Neural Information Processing Systems, 2018.
- Imagic: Text-based real image editing with diffusion models. In IEEE Conference on Computer Vision and Pattern Recogition, 2023.
- Recurrent transformer networks for semantic correspondence. In Advances in Neural Information Processing Systems, 2018.
- Erik G. Learned-Miller. Data driven image models through continuous joint alignment. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(2):236–250, 2005.
- PatchMatch-based neighborhood consensus for semantic correspondence. In IEEE Conference on Computer Vision and Pattern Recogition, pages 13153–13163, 2021.
- Your diffusion model is secretly a zero-shot classifier. arXiv preprint arXiv:2303.16203, 2023.
- Jointly optimizing 3D model fitting and fine-grained classification. In European Conference Computer Vision, pages 466–480, 2014.
- SIFT Flow: Dense correspondence across scenes and its applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):978–994, 2010.
- Semantic correspondence as an optimal transport problem. In IEEE Conference on Computer Vision and Pattern Recogition, pages 4463–4472, 2020.
- Diffusion hyperfeatures: Searching through time and space for semantic correspondence. arXiv preprint arXiv:2305.14334, 2023.
- SPair-71k: A large-scale benchmark for semantic correspondence. arXiv:1908.10543, 2019.
- GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In International Confernece on Machine Learning, 2022.
- Improved denoising diffusion probabilistic models. In International Confernece on Machine Learning, pages 8162–8171, 2021.
- Neural Congealing: Aligning images to a joint semantic atlas. In IEEE Conference on Computer Vision and Pattern Recogition, 2023.
- LF-Net: Learning local features from images. In Advances in Neural Information Processing Systems, 2018.
- DINOv2: Learning robust visual features without supervision. arXiv:2304.07193, 2023.
- GAN-supervised dense visual alignment. In IEEE Conference on Computer Vision and Pattern Recogition, pages 13470–13481, 2022.
- A benchmark dataset and evaluation methodology for video object segmentation. In IEEE Conference on Computer Vision and Pattern Recogition, pages 724–732, 2016.
- Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
- R2D2: Reliable and repeatable detector and descriptor. In Advances in Neural Information Processing Systems, 2019.
- Convolutional neural network architecture for geometric matching. In IEEE Conference on Computer Vision and Pattern Recogition, pages 6148–6157, 2017.
- End-to-end weakly-supervised semantic alignment. In IEEE Conference on Computer Vision and Pattern Recogition, pages 6917–6925, 2018.
- High-resolution image synthesis with latent diffusion models. In IEEE Conference on Computer Vision and Pattern Recogition, pages 10684–10695, 2022.
- Unsupervised joint object discovery and segmentation in internet images. In IEEE Conference on Computer Vision and Pattern Recogition, pages 1939–1946, 2013.
- Palette: Image-to-image diffusion models. In ACM SIGGRAPH, pages 1–10, 2022a.
- Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Processing Systems, pages 36479–36494, 2022b.
- SuperGlue: Learning feature matching with graph neural networks. In IEEE Conference on Computer Vision and Pattern Recogition, pages 4938–4947, 2020.
- Monocular depth estimation using diffusion models. arXiv:2302.14816, 2023.
- Attentive semantic alignment with offset-aware correlation kernels. In European Conference Computer Vision, pages 349–364, 2018.
- Deep unsupervised learning using nonequilibrium thermodynamics. In International Confernece on Machine Learning, pages 2256–2265, 2015.
- Denoising diffusion implicit models. In International Conference on Learning Representations, 2021.
- Improved techniques for training score-based generative models. In Advances in Neural Information Processing Systems, pages 12438–12448, 2020.
- Semantic diffusion network for semantic segmentation. In Advances in Neural Information Processing Systems, pages 8702–8716, 2022.
- Joint recovery of dense correspondence and cosegmentation in two images. In IEEE Conference on Computer Vision and Pattern Recogition, pages 4246–4255, 2016.
- GLU-Net: Global-local universal network for dense flow and correspondences. In IEEE Conference on Computer Vision and Pattern Recogition, pages 6258–6268, 2020a.
- GOCor: Bringing globally optimized correspondence volumes into your neural network. In Advances in Neural Information Processing Systems, pages 14278–14290, 2020b.
- Learning accurate dense correspondences and when to trust them. In IEEE Conference on Computer Vision and Pattern Recogition, pages 5714–5724, 2021a.
- Warp consistency for unsupervised learning of dense correspondences. In International Conference Computer Vision, pages 10346–10356, 2021b.
- Probabilistic warp consistency for weakly-supervised semantic correspondences. In IEEE Conference on Computer Vision and Pattern Recogition, pages 8708–8718, 2022.
- Plug-and-play diffusion features for text-driven image-to-image translation. In IEEE Conference on Computer Vision and Pattern Recogition, 2023.
- DISK: Learning local features with policy gradient. In Advances in Neural Information Processing Systems, pages 14254–14265, 2020.
- Diffusion models for implicit image segmentation ensembles. arXiv:2112.03145, 2021.
- Open-vocabulary panoptic segmentation with text-to-image diffusion models. In IEEE Conference on Computer Vision and Pattern Recogition, 2023.
- Paint by Example: Exemplar-based image editing with diffusion models. arXiv:2211.13227, 2022.
- Yi Yang and Deva Ramanan. Articulated human detection with flexible mixtures of parts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2878–2890, 2012.
- LIFT: Learned invariant feature transform. In European Conference Computer Vision, pages 467–483, 2016.
- Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023.
- Unleashing text-to-image diffusion models for visual perception. arXiv:2303.02153, 2023.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Collections
Sign up for free to add this paper to one or more collections.