Intriguing properties of generative classifiers (2309.16779v2)

Published 28 Sep 2023 in cs.CV, cs.AI, cs.LG, q-bio.NC, and stat.ML

Abstract: What is the best paradigm to recognize objects -- discriminative inference (fast but potentially prone to shortcut learning) or using a generative model (slow but potentially more robust)? We build on recent advances in generative modeling that turn text-to-image models into classifiers. This allows us to study their behavior and to compare them against discriminative models and human psychophysical data. We report four intriguing emergent properties of generative classifiers: they show a record-breaking human-like shape bias (99% for Imagen), near human-level out-of-distribution accuracy, state-of-the-art alignment with human classification errors, and they understand certain perceptual illusions. Our results indicate that while the current dominant paradigm for modeling human object recognition is discriminative inference, zero-shot generative models approximate human object recognition data surprisingly well.

Summary

  • The paper demonstrates that zero-shot generative classifiers exhibit a striking human-like shape bias and near-human-level accuracy in out-of-distribution tasks.
  • It employs state-of-the-art text-to-image models like Imagen, Stable Diffusion, and Parti, using Bayes' rule for object classification across 17 challenging OOD datasets.
  • The study finds that these models align closely with human error patterns and can even interpret perceptual illusions, challenging the dominance of texture-biased discriminative models.

An Analytical Perspective on Generative Classifiers and Their Alignment with Human Visual Perception

The paper investigates the capabilities of generative classifiers in object recognition, with an emphasis on their alignment with human visual perception. It contrasts the discriminative and generative paradigms by leveraging recent advances in text-to-image generative modeling to build zero-shot classifiers, then compares these classifiers against discriminative models and human psychophysical data, revealing significant insights into their emergent properties.

The paper identifies four emergent properties of zero-shot generative classifiers: a remarkably human-like shape bias, near-human-level accuracy on out-of-distribution (OOD) tasks, state-of-the-art alignment with human classification errors, and an understanding of certain perceptual illusions. These findings underscore the potential of generative models to approximate human object recognition more closely than the currently dominant discriminative models.

Methodology and Comparative Analysis

The research utilizes prominent text-to-image generative models—namely, Stable Diffusion, Imagen, and Parti—to function as zero-shot classifiers across 17 challenging OOD datasets. These models are evaluated against an extensive benchmark of 52 discriminative models and human data, utilizing metrics like shape bias, OOD accuracy, and error consistency.
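
To make the behavioural metrics concrete, the sketch below shows how shape bias and error consistency are commonly computed, following the definitions introduced by Geirhos et al.; the function names and the simple array-based interface are illustrative rather than the authors' actual evaluation code.

```python
# Illustrative sketch of two behavioural metrics, not the paper's evaluation code.
import numpy as np

def shape_bias(predictions, shape_labels, texture_labels):
    """Fraction of shape-consistent decisions among cue-conflict trials that were
    decided in favour of either the shape or the texture category."""
    preds = np.asarray(predictions)
    shape_hit = preds == np.asarray(shape_labels)
    texture_hit = preds == np.asarray(texture_labels)
    decided = shape_hit | texture_hit
    return float(shape_hit[decided].mean()) if decided.any() else float("nan")

def error_consistency(correct_a, correct_b):
    """Cohen's-kappa-style agreement of two observers' error patterns, beyond the
    overlap expected from their accuracies alone."""
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    c_obs = float(np.mean(a == b))                # observed overlap: both right or both wrong
    p_a, p_b = a.mean(), b.mean()
    c_exp = p_a * p_b + (1 - p_a) * (1 - p_b)     # overlap expected by chance
    return float((c_obs - c_exp) / (1 - c_exp))
```

A kappa of 1 means two observers err on exactly the same trials, 0 means their errors overlap only as much as chance predicts from their accuracies.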

For classification, a generative classifier evaluates how likely an image is under the generative model when conditioned on each candidate class's text prompt, and then selects the most probable class via Bayes' rule. The findings reveal that all three generative models demonstrate a shape bias comparable to humans, with Imagen exhibiting a near-perfect 99% shape bias. In contrast, many discriminative models remain predominantly texture-biased, highlighting a fundamental difference in the underlying representations captured by the two types of models.
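
A minimal sketch of this classification rule for a diffusion-based model is given below. It assumes a uniform class prior, so Bayes' rule reduces to picking the class prompt under which the conditional denoising (ELBO) loss is lowest; `encode_prompt`, `denoise_fn`, and the cosine noise schedule are hypothetical stand-ins for the pretrained text-to-image model's components, not the paper's exact pipeline.

```python
# A minimal sketch, not the authors' implementation. Assumes a uniform class prior,
# a cosine noise schedule, and hypothetical helpers `encode_prompt` (text encoder)
# and `denoise_fn` (pretrained text-conditional noise predictor).
import math
from typing import Callable, Sequence

import torch

def classify_with_diffusion(
    image: torch.Tensor,                     # preprocessed image, shape (C, H, W)
    class_prompts: Sequence[str],            # e.g. "a photo of a cat", "a photo of a dog", ...
    encode_prompt: Callable[[str], torch.Tensor],
    denoise_fn: Callable[[torch.Tensor, torch.Tensor, torch.Tensor], torch.Tensor],
    num_samples: int = 32,
    num_timesteps: int = 1000,
) -> int:
    """Return the index of the prompt whose conditional denoising loss is lowest.

    With a uniform prior p(c), argmax_c p(c | x) = argmax_c p(x | c), and the
    diffusion ELBO (a reweighted denoising loss) serves as a proxy for log p(x | c).
    """
    losses = []
    for prompt in class_prompts:
        cond = encode_prompt(prompt)
        total = 0.0
        for _ in range(num_samples):
            t = torch.randint(0, num_timesteps, (1,))
            noise = torch.randn_like(image)
            # Forward diffusion: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps.
            alpha_bar = math.cos(0.5 * math.pi * t.item() / num_timesteps) ** 2
            x_t = alpha_bar ** 0.5 * image + (1 - alpha_bar) ** 0.5 * noise
            eps_pred = denoise_fn(x_t.unsqueeze(0), t, cond)
            total += torch.mean((eps_pred.squeeze(0) - noise) ** 2).item()
        losses.append(total / num_samples)
    return int(torch.tensor(losses).argmin())
```

In practice the noise levels and noise samples are typically shared across class prompts to reduce the variance of the comparison; the loop structure above is kept simple for clarity.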

Results: Alignment and Understanding of Human Perception

The generative models show near-human-level accuracy on OOD tasks, with Imagen and Stable Diffusion achieving impressive out-of-distribution robustness despite their zero-shot nature. They not only achieve competitive accuracy but also demonstrate state-of-the-art error consistency with human observers; Imagen in particular aligns most closely with human error patterns.

One notable highlight is their ability to understand perceptual illusions by reconstructing ambiguous images in a manner that aligns with human interpretations. This qualitative analysis illustrates their capability to capture higher-order semantic information, akin to human visual perception.

Implications and Future Considerations

The implications of this research extend to both theoretical and practical domains. Theoretically, it challenges the traditional focus on discriminative inference as the model of human object recognition, suggesting that generative models offer a compelling alternative thanks to their robust object representations and sensitivity to visual nuances. Practically, the findings advocate for generative pre-training as a potent approach to improving performance on computer vision tasks, especially in challenging, out-of-distribution real-world scenarios.

The paper opens new avenues for integrating generative and discriminative processes, addressing what the authors refer to as "the deep mystery in vision." Future exploration could examine the architectural and training differences further, or experiment with hybrid approaches that leverage the strengths of both paradigms.

Conclusion

The investigation into generative classifiers reveals properties closely analogous to human perception, contributing significantly to our understanding of the potential of generative models in visual recognition. This research not only clarifies the current capabilities of generative classifiers but also sets the stage for future work on AI systems that are more human-like in their perceptual processes. Such progress would mark a pivotal step towards robust, versatile AI across a spectrum of vision-based applications.
