Vision transformers (ViTs) are a state-of-the-art approach in computer vision, outperforming earlier models on tasks ranging from object recognition to visual navigation. ViTs also show deep computational similarities to human and animal brains: their image classifications and error patterns often resemble those of biological vision. However, there is a concern in the research community about ViTs' dependence on vast amounts of training data, far more than biological systems seem to require. This apparent data requirement has earned ViTs a reputation for being "data hungry," prompting skepticism about their value as models of biological learning.
The paper aims to bridge this gap by directly comparing the learning abilities of ViTs and a biological system: newborn chicks. It takes a digital twin approach, constructing virtual animal chambers in a video game engine that mirror the controlled visual environments used to rear the chicks. First-person images recorded as an agent moved through this virtual space were used to train self-supervised ViTs that use time as a teaching signal, treating temporally adjacent views as views of the same object. This training setup closely models the learning conditions experienced by newborn chicks.
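To make the "time as a teaching signal" idea concrete, here is a minimal sketch of contrastive learning through time in PyTorch: temporally adjacent frames from the simulated chamber are treated as positive pairs, and the other frames in the batch act as negatives. The encoder size, image resolution, and hyperparameters below are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: contrastive learning through time with a tiny ViT-style encoder.
# Adjacent frames are positives; other frames in the batch are negatives.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyViTEncoder(nn.Module):
    """Minimal ViT-style encoder: patchify, transformer layers, mean-pool."""
    def __init__(self, img_size=64, patch=8, dim=128, depth=3, heads=4):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        n_patches = (img_size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, depth)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        tokens = self.patchify(x).flatten(2).transpose(1, 2) + self.pos
        return self.proj(self.transformer(tokens).mean(dim=1))

def temporal_contrastive_loss(z_t, z_next, temperature=0.1):
    """InfoNCE loss pairing each frame embedding with its next-frame embedding."""
    z_t = F.normalize(z_t, dim=1)
    z_next = F.normalize(z_next, dim=1)
    logits = z_t @ z_next.T / temperature      # (B, B) similarity matrix
    targets = torch.arange(z_t.size(0))        # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Toy training step on random stand-in frames (real input would be the
# first-person video stream rendered in the virtual chamber).
encoder = TinyViTEncoder()
opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)
frames_t, frames_next = torch.randn(16, 3, 64, 64), torch.randn(16, 3, 64, 64)
opt.zero_grad()
loss = temporal_contrastive_loss(encoder(frames_t), encoder(frames_next))
loss.backward()
opt.step()
```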
The ViTs trained this way were then evaluated on view-invariant object recognition, identifying an object across novel viewpoints, mirroring the tests conducted with newborn chicks. When trained on these "through-the-eyes-of-a-chick" data streams, the ViTs solved the task and developed animal-like object recognition. Both newborn chicks and ViTs learned robust visual features from the same impoverished conditions, suggesting that, contrary to the data-hungry critique, transformers can be roughly as data efficient as biological learning systems when exposed to rich temporal visual streams.
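A common way to measure view-invariant recognition in such models is to freeze the trained encoder and fit a simple linear probe on embeddings from some viewpoints, then test it on held-out viewpoints. The sketch below assumes that kind of protocol; the class count, splits, and random stand-in embeddings are placeholders, not the paper's exact evaluation.

```python
# Sketch: linear-probe test of view-invariant recognition on frozen embeddings.
import torch
import torch.nn.functional as F

def linear_probe(train_z, train_y, test_z, test_y, n_classes=2, epochs=200):
    """Fit a linear classifier on frozen embeddings and report test accuracy."""
    probe = torch.nn.Linear(train_z.size(1), n_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.cross_entropy(probe(train_z), train_y)
        loss.backward()
        opt.step()
    preds = probe(test_z).argmax(dim=1)
    return (preds == test_y).float().mean().item()

# Train the probe on viewpoints seen during rearing, test on novel viewpoints.
# (Random tensors stand in for embeddings of rendered views from the encoder.)
seen_z, seen_y = torch.randn(200, 128), torch.randint(0, 2, (200,))
novel_z, novel_y = torch.randn(80, 128), torch.randint(0, 2, (80,))
accuracy = linear_probe(seen_z, seen_y, novel_z, novel_y)
print(f"view-invariant recognition accuracy: {accuracy:.2f}")
```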
The paper goes beyond a single ViT variant, testing different architectures: a Vision Transformer trained with Contrastive Learning through Time (ViT-CoT) and a Video Masked Autoencoder (VideoMAE). Each model was trained and evaluated on the same visual tasks, and both learned efficiently despite the limited object exposure of the rearing environments. The study also varied the amount of training data and the size of the ViTs, finding that larger ViTs were not necessarily more data hungry than smaller ones in this embodied visual context.
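For the masked-autoencoding branch, the sketch below illustrates only the core objective: hide most patches of a frame and train the transformer to reconstruct the missing pixels. The actual VideoMAE recipe (tube masking across frame sequences, an asymmetric encoder/decoder) is more elaborate; the patch size, mask ratio, and model width here are illustrative assumptions.

```python
# Sketch: masked-autoencoder objective on single frames (simplified, VideoMAE-style).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMAE(nn.Module):
    def __init__(self, img_size=64, patch=8, dim=128, depth=3, heads=4):
        super().__init__()
        self.patch = patch
        self.n_patches = (img_size // patch) ** 2
        self.to_tokens = nn.Linear(3 * patch * patch, dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, self.n_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)
        self.to_pixels = nn.Linear(dim, 3 * patch * patch)

    def forward(self, frames, mask_ratio=0.75):
        b = frames.size(0)
        # Flatten each frame into non-overlapping patches.
        patches = F.unfold(frames, kernel_size=self.patch, stride=self.patch)
        patches = patches.transpose(1, 2)                    # (B, N, 3*p*p)
        tokens = self.to_tokens(patches) + self.pos
        # Randomly hide most patches, replacing them with a learned mask token.
        mask = torch.rand(b, self.n_patches) < mask_ratio    # (B, N) bool
        tokens = torch.where(mask.unsqueeze(-1),
                             self.mask_token.expand_as(tokens), tokens)
        recon = self.to_pixels(self.backbone(tokens))        # (B, N, 3*p*p)
        # Reconstruction error is averaged over the masked patches only.
        per_patch = F.mse_loss(recon, patches, reduction="none").mean(dim=-1)
        return (per_patch * mask).sum() / mask.sum().clamp(min=1)

model = TinyMAE()
loss = model(torch.randn(8, 3, 64, 64))   # stand-in frames from the chamber
loss.backward()
```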
The findings bear on the long-standing debate about the development of cognition, suggesting that a generic learning mechanism, rather than a collection of domain-specific systems, may suffice for acquiring high-level visual abilities. This points toward artificial intelligence that is closer to biological plausibility, with future AI systems that might learn more flexibly and rapidly, in a manner closer to animal cognition.
While the results are compelling, the paper acknowledges limitations, such as the models being trained passively on pre-recorded data rather than collecting data interactively, an essential aspect of active learning in biological systems. Addressing these constraints could help usher in a new era of "naturally intelligent" AI systems inspired directly by the learning mechanisms inherent in animal cognition.