Visual Enumeration is Challenging for Large-scale Generative AI (2402.03328v2)
Abstract: Humans can readily judge the number of objects in a visual scene, even without counting, and such a skill has been documented in many animal species and babies prior to language development and formal schooling. Numerical judgments are error-free for small sets, while for larger collections responses become approximate, with variability increasing proportionally to the target number. This response pattern is observed for items of all kinds, despite variation in object features (such as color or shape), suggesting that our visual number sense relies on abstract representations of numerosity. Here, we investigate whether large-scale generative AI systems have a human-like number sense, which should allow them to reliably name the number of objects in simple visual stimuli or generate images containing a target number of items in the 1-10 range. Surprisingly, most of the foundation models considered have a poor number sense: They make striking errors even with small numbers, the response variability does not increase in a systematic way, and the pattern of errors depends on object category. Only the most recent proprietary systems exhibit signatures of a visual number sense. Our findings demonstrate that having an intuitive visual understanding of number remains challenging for foundation models, which in turn might be detrimental to the perceptual grounding of numeracy that in humans is crucial for mathematical learning.
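The human response pattern described above (exact answers for small sets, then variability that grows in proportion to the target number) can be checked by summarizing responses per target number. The following is a minimal sketch, not the authors' analysis code: the function name `variability_profile` and the toy responses are hypothetical and invented purely for illustration. It computes the mean response, the standard deviation, and the coefficient of variation (standard deviation divided by mean) for each target. Under proportional (scalar) variability, the coefficient of variation would stay roughly constant for targets beyond the error-free range.

```python
from statistics import mean, pstdev


def variability_profile(responses_by_target):
    """Return {target: (mean, std, coefficient_of_variation)} for each target number.

    `responses_by_target` maps a target number to the list of numerical
    responses given for that target (by a person or a model).
    """
    profile = {}
    for target, responses in sorted(responses_by_target.items()):
        m = mean(responses)
        sd = pstdev(responses)  # population standard deviation of the responses
        cv = sd / m if m else float("nan")  # coefficient of variation
        profile[target] = (m, sd, cv)
    return profile


# Hypothetical responses, made up for illustration only (not data from the paper).
toy_responses = {
    2: [2, 2, 2, 2],  # small set: every response is exact
    4: [4, 4, 4, 4],
    8: [7, 8, 9, 8],  # larger set: responses scatter around the target
}

if __name__ == "__main__":
    for target, (m, sd, cv) in variability_profile(toy_responses).items():
        print(f"target={target}  mean={m:.2f}  std={sd:.3f}  cv={cv:.3f}")
```

A model lacking a human-like number sense, in the abstract's terms, would show nonzero standard deviation even for the smallest targets, or a coefficient of variation that changes unsystematically across targets or object categories.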