- The paper demonstrates that separately processing body and background cues through a brain-inspired two-stream architecture significantly enhances action recognition.
- The methodology integrates YOLO v8 for segmentation and MotionNet for optical flow, validated on the Human-Centric Atomic Actions 500 dataset.
- Results show that the DomainNet model achieves near-human accuracy, particularly with body-only inputs, suggesting new avenues for neural network design.
Improving Action Classification with Brain-Inspired Deep Networks
Abstract
The paper "Improving action classification with brain-inspired deep networks" (arXiv:2512.07729) investigates action recognition from visual input using brain-inspired deep network architectures. The authors highlight the cognitive importance of perceiving others' actions, which is essential for inferring goals, emotions, and traits, and for applications such as robotics and healthcare monitoring. The study examines how humans and artificial systems use body and background information for action recognition, and proposes a novel architecture, inspired by the human brain's category selectivity, that processes body and scene information separately.
Introduction
Humans recognize actions reliably across varied visual contexts, guided by specialized brain regions. Deep neural networks also perform action recognition, but they typically process body and background information jointly and may neglect body-specific cues when actions correlate with backgrounds in the training data. This paper measures how effectively humans extract body and background information separately and introduces brain-inspired architectures that mimic this selective processing.
The authors demonstrate that traditional deep networks trained on datasets such as Human-Centric Atomic Actions 500 perform adequately when both body and background are preserved but degrade substantially when either cue is isolated. Humans, by contrast, maintain high recognition accuracy even when only one cue is available, with a clear preference for body-specific information. The paper implements a two-stream network architecture, inspired by domain-specific pathways in the brain, to improve action classification performance.
The domain-specific architecture builds on established techniques such as Dropout, which improves generalization by training varied sub-networks, but Dropout imposes no explicit separation of body and background streams and therefore does not represent body and scene information in distinct channels. The paper also discusses pose-estimation models that leverage body kinematics for action recognition, while arguing that simpler architectures can achieve comparable gains by incorporating brain-inspired selectivity.
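For reference, the Dropout mechanism contrasted here, sampling a different sub-network on every forward pass, can be sketched in a few lines of NumPy. This is a generic illustration of inverted dropout (the variant used by most modern frameworks), not code from the paper:

```python
import numpy as np

def dropout(x, p=0.5, rng=None, train=True):
    """Inverted dropout: zero each unit with probability p during training,
    rescaling survivors by 1/(1-p) so the expected activation is unchanged."""
    if not train or p == 0.0:
        return x
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) >= p  # True = unit survives in this sub-network
    return x * mask / (1.0 - p)
```

Each call samples a fresh mask, i.e. a different random sub-network; the point made above is that this randomness carries no semantic structure, unlike dedicated body and background streams.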
Additionally, the authors connect their work with foundations laid by vision-related neural network architectures, such as two-stream convolutional networks, inspired by the ventral and dorsal differentiation observed in human vision. This paper extends the concept by introducing category-selective processing that mirrors neural functionalities for object and scene categorizations in humans.
Methods
The study employs the Human-Centric Atomic Actions 500 dataset, with each video rendered in three stimulus types: original, body-only, and background-only frames. YOLO v8 provides the body segmentation and MotionNet computes optical flow, enabling a comparative analysis between human participants and artificial networks. The methodology trains conventional neural networks alongside the proposed architecture, DomainNet, which processes body and background information in separate streams.
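The paper derives its body-only and background-only stimuli from YOLO v8 segmentation masks. As a minimal, hypothetical sketch of that step, assuming a binary person mask has already been obtained, the frame splitting could look like this (function name and zero fill value are illustrative assumptions):

```python
import numpy as np

def split_frame(frame, person_mask, fill=0):
    """Split an H x W x 3 frame into body-only and background-only versions,
    given an H x W binary person mask (e.g., from an instance-segmentation
    model such as YOLOv8-seg). Masked-out pixels are set to `fill`."""
    m = person_mask.astype(bool)[..., None]  # broadcast mask over color channels
    body_only = np.where(m, frame, fill)
    background_only = np.where(m, fill, frame)
    return body_only, background_only
```

Because the two masks are complementary, the body-only and background-only frames partition every pixel of the original between them.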
Both the Baseline and DomainNet models were evaluated on static and dynamic features, using a ResNet-50 backbone for frame processing. The DomainNet model sums cross-entropy loss terms over the body, background, and combined stream outputs, so that learning is driven by every input modality.
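A minimal NumPy sketch of the combined objective described here, assuming equal weighting of the three cross-entropy terms (the paper may use different coefficients):

```python
import numpy as np

def cross_entropy(logits, label):
    """Numerically stable cross-entropy for one example from raw logits."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def domainnet_loss(body_logits, bg_logits, combined_logits, label):
    """Sum of cross-entropy terms over the body, background, and combined
    stream outputs, as described for DomainNet (equal weights assumed)."""
    return (cross_entropy(body_logits, label)
            + cross_entropy(bg_logits, label)
            + cross_entropy(combined_logits, label))
```

Penalizing each stream separately prevents the network from relying on the combined output alone, forcing the body stream to become discriminative even when backgrounds are informative.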
Figure 1: Comparison between a network trained on original frames (Baseline:frames), human results, and the brain-inspired two-stream network (DomainNet:frames).
Results
Behavioral Results
Human participants displayed near-perfect accuracy in recognizing actions from original video inputs and maintained substantial accuracy even when restricted to body-only inputs. Conversely, recognition accuracy dropped when limited to background-only stimuli. These findings suggest a predominant reliance on body cues for action inference, aligning with the hypothesis of domain-specific processing pathways in the brain.
Modeling Results
Baseline Models: Traditional Baseline networks performed poorly on body-only stimuli, indicating a bias toward learning background-specific features; adding optic flow yielded mild improvements but did not resolve the deficit.
Domain-Specific Models: The brain-inspired architecture achieved higher accuracy across all stimulus types, with the largest gains on body-only recognition, yielding more human-like accuracy patterns. Integrating optic flow further improved recognition of body-centric actions.
Figure 2: Training dynamics for the networks tested; both Baseline and DomainNet models converge efficiently in fewer than 20 epochs.
Discussion
The authors show that processing body and background information in separate streams improves action classification in deep networks, mirroring human capabilities. This approach narrows the gap between neural networks and human perceptual mechanisms and offers insight into the evolutionary architecture of brain regions responsible for category selectivity. The implications extend to improving artificial vision systems and fostering collaboration between cognitive neuroscience and machine learning.
Conclusion
The research lays groundwork for future exploration into brain-inspired network architectures, urging the integration of category-selective pathways into advanced learning models such as transformers and foundation models. By employing cross-disciplinary perspectives, this study advances understanding of perception, both artificial and biological, and proposes paradigms for optimizing action recognition technology.