Scaling the dual-stream model to naturalistic images

Determine whether the dual-stream recurrent neural network, which integrates foveated glimpse contents (ventral stream) with gaze positions (dorsal stream), learns a spatial target map, and reads out numerosity, can be used to help count objects in naturalistic images such as 3D tabletop scenes.

Background

The paper introduces a biologically inspired dual-stream recurrent neural network that processes foveated glimpses of an image (what) alongside their spatial positions (where). Trained on synthetic letter arrays, the model achieves robust zero-shot generalization in counting across out-of-distribution shapes and luminances, forms spatial response fields and log-normal numerosity codes reminiscent of macaque posterior parietal cortex, and predicts patterns of human performance under free vs. fixed gaze.
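To make the "what" and "where" integration concrete, the following is a minimal sketch of a dual-stream recurrent update, not the authors' implementation. It assumes each glimpse arrives as a flat feature vector and each gaze position as an (x, y) pair, uses untrained random weights, and omits the intermediate spatial target map. The names DualStreamRNN, step, and count_logits are hypothetical and chosen only for illustration.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a trained model would learn these instead."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class DualStreamRNN:
    """Hypothetical illustration: fuse glimpse content and gaze position over time."""

    def __init__(self, glimpse_dim, hidden_dim, max_count, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        self.W_what = rand_matrix(hidden_dim, glimpse_dim, rng)   # ventral input
        self.W_where = rand_matrix(hidden_dim, 2, rng)            # dorsal input (x, y)
        self.W_rec = rand_matrix(hidden_dim, hidden_dim, rng)     # recurrence
        self.W_out = rand_matrix(max_count + 1, hidden_dim, rng)  # count readout

    def step(self, h, glimpse, position):
        """One recurrent update combining both streams with the previous state."""
        what = matvec(self.W_what, glimpse)
        where = matvec(self.W_where, position)
        rec = matvec(self.W_rec, h)
        return [math.tanh(a + b + c) for a, b, c in zip(what, where, rec)]

    def count_logits(self, glimpses, positions):
        """Run over a glimpse sequence and return one logit per count 0..max_count."""
        h = [0.0] * self.hidden_dim
        for glimpse, position in zip(glimpses, positions):
            h = self.step(h, glimpse, position)
        return matvec(self.W_out, h)


# Usage: three glimpses of a 4-dimensional feature vector at three gaze positions.
model = DualStreamRNN(glimpse_dim=4, hidden_dim=8, max_count=5)
glimpses = [[0.1, 0.2, 0.3, 0.4]] * 3
positions = [[0.0, 0.0], [0.5, -0.5], [-0.2, 0.3]]
logits = model.count_logits(glimpses, positions)
predicted_count = max(range(len(logits)), key=lambda k: logits[k])
```

Because the weights are random, the predicted count here is meaningless; the sketch only shows how the two input streams can enter a shared recurrent state before a numerosity readout.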

While successful on controlled synthetic stimuli, the authors explicitly leave open whether this approach can be applied to more complex, naturalistic visual inputs, such as 3D tabletop scenes, where object appearance, depth, clutter, and occlusion pose additional challenges for counting.

References

One key outstanding question, which we leave for future work, is whether the approach described here could be used to help counting in naturalistic images (e.g. 3D tabletop scenes [69]).

— Zero-shot counting with a dual-stream neural network model (2405.09953 - Thompson et al., 16 May 2024) in Discussion, final paragraph
