Incorporating simulated spatial context information improves the effectiveness of contrastive learning models (2401.15120v2)
Abstract: Visual learning often occurs in a specific context, where an agent acquires skills through exploration and tracking of its location in a consistent environment. The historical spatial context of the agent provides a similarity signal for self-supervised contrastive learning. We present a unique approach, termed Environmental Spatial Similarity (ESS), that complements existing contrastive learning methods. Using images from simulated, photorealistic environments as an experimental setting, we demonstrate that ESS outperforms traditional instance discrimination approaches. Moreover, sampling additional data from the same environment substantially improves accuracy and provides new augmentations. ESS achieves remarkable proficiency in room classification and spatial prediction tasks, especially in unfamiliar environments. This learning paradigm has the potential to enable rapid visual learning in agents operating in new environments with unique visual characteristics. Potentially transformative applications range from robotics to space exploration. Our proof of concept demonstrates improved efficiency over methods that rely on extensive, disconnected datasets.
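The abstract describes ESS only at a high level: spatial proximity along the agent's trajectory supplies the similarity signal for contrastive learning. As a rough illustration of that idea, and not the paper's actual formulation, the PyTorch sketch below treats frames captured within a fixed radius of one another in the simulated environment as positive pairs inside a SupCon-style loss. The function name `ess_contrastive_loss`, the `radius` threshold, and the `positions` input are hypothetical placeholders introduced here for clarity.

```python
import torch
import torch.nn.functional as F

def ess_contrastive_loss(embeddings, positions, temperature=0.1, radius=1.0):
    """Spatially informed InfoNCE-style loss (illustrative sketch only).

    embeddings: (N, D) features for frames captured by the agent.
    positions:  (N, 3) simulated agent coordinates at capture time.
    Frames recorded within `radius` units of each other are treated as
    positives, standing in for the "historical spatial context" similarity
    signal described in the abstract; all other frames act as negatives.
    """
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / temperature                      # (N, N) cosine similarities
    dist = torch.cdist(positions, positions)           # pairwise spatial distances
    eye = torch.eye(len(z), device=z.device)
    pos_mask = (dist < radius).float() * (1.0 - eye)   # nearby frames, excluding self

    # Softmax cross-entropy over non-self pairs, averaged per anchor
    # over that anchor's spatial positives (SupCon-style).
    logits = sim - 1e9 * eye                           # mask self-similarity
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_count = pos_mask.sum(dim=1).clamp(min=1.0)
    loss = -(pos_mask * log_prob).sum(dim=1) / pos_count
    return loss[pos_mask.sum(dim=1) > 0].mean()        # anchors without positives are skipped
```

In this sketch the spatial threshold plays the role that instance identity plays in standard instance discrimination; how ESS actually combines the spatial signal with image augmentations is specified in the paper itself.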