DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data (2306.09344v3)

Published 15 Jun 2023 in cs.CV and cs.LG

Abstract: Current perceptual similarity metrics operate at the level of pixels and patches. These metrics compare images in terms of their low-level colors and textures, but fail to capture mid-level similarities and differences in image layout, object pose, and semantic content. In this paper, we develop a perceptual metric that assesses images holistically. Our first step is to collect a new dataset of human similarity judgments over image pairs that are alike in diverse ways. Critical to this dataset is that judgments are nearly automatic and shared by all observers. To achieve this we use recent text-to-image models to create synthetic pairs that are perturbed along various dimensions. We observe that popular perceptual metrics fall short of explaining our new data, and we introduce a new metric, DreamSim, tuned to better align with human perception. We analyze how our metric is affected by different visual attributes, and find that it focuses heavily on foreground objects and semantic content while also being sensitive to color and layout. Notably, despite being trained on synthetic data, our metric generalizes to real images, giving strong results on retrieval and reconstruction tasks. Furthermore, our metric outperforms both prior learned metrics and recent large vision models on these tasks.
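
Below is a minimal, illustrative sketch of the kind of metric the abstract describes: each image is encoded holistically by a pretrained vision backbone, and the two embeddings are compared by cosine distance. The choice of torchvision's ViT-B/16 and a plain cosine distance are assumptions made for illustration; they are not the paper's exact DreamSim backbone or training recipe.

```python
# Sketch of an embedding-space perceptual distance (assumed setup, not the
# paper's exact DreamSim configuration): encode each image holistically with
# a pretrained ViT and compare the pooled embeddings by cosine distance.
import torch
import torch.nn.functional as F
from torchvision.models import vit_b_16, ViT_B_16_Weights

weights = ViT_B_16_Weights.DEFAULT
backbone = vit_b_16(weights=weights)
backbone.heads = torch.nn.Identity()  # drop the classifier head, keep the embedding
backbone.eval()
preprocess = weights.transforms()     # resize/normalize as the backbone expects

@torch.no_grad()
def perceptual_distance(img_a, img_b):
    """Cosine distance between holistic image embeddings (lower = more similar)."""
    batch = torch.stack([preprocess(img_a), preprocess(img_b)])
    feats = F.normalize(backbone(batch), dim=-1)
    return 1.0 - float((feats[0] * feats[1]).sum())
```

DreamSim itself tunes such a representation on the paper's human similarity judgments over synthetic image pairs; the sketch above only shows the inference-time shape of an embedding-based metric.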

Authors (7)
  1. Stephanie Fu (11 papers)
  2. Netanel Tamir (1 paper)
  3. Shobhita Sundaram (7 papers)
  4. Lucy Chai (11 papers)
  5. Richard Zhang (61 papers)
  6. Tali Dekel (40 papers)
  7. Phillip Isola (84 papers)
Citations (68)
