Multi-Scales Bi-Hemispheric Asymmetric Model
- The paper presents a novel CNN model that integrates multi-scale feature extraction with explicit hemispheric asymmetry to enhance classification accuracy.
- The architecture uses dual-path modules to separately process spatial and temporal features, mimicking neurophysiological lateralization with asymmetric computations.
- Experimental evaluations demonstrate significant accuracy improvements over equally sized baseline models, with state-of-the-art results in EEG emotion recognition and strong, near-ensemble performance in visual object classification.
The Multi-Scales Bi-hemispheric Asymmetric Model (MSBAM) is a convolutional neural network architecture designed to incorporate both multi-scale feature extraction and explicit modeling of hemispheric asymmetry, with demonstrated efficacy in domains such as emotion recognition from EEG and visual object recognition. MSBAM leverages the nonstationary and lateralized nature of biological neural signals, systematically extracting spatial and temporal features at multiple resolutions while computing differences across left and right hemispheric representations. Central to the MSBAM approach are its dual-path architectural motifs, each handling complementary aspects of the signal, and branch-integration strategies that capture auxiliary discriminative information rooted in neurophysiological principles (Wu et al., 2021; Rajagopalan et al., 2022).
1. Architectural Framework
MSBAM comprises a series of hierarchical modules, each crafted to operationalize a principle from neuroscience and signal processing. The overarching theme is division along hemispheric lines—real or conceptual—combined with the processing of signal characteristics at multiple scales.
In the EEG-based paradigm, as described in (Wu et al., 2021), MSBAM ingests a 3D array whose first two dimensions encode the spatial electrode topography and whose third spans a 1-second temporal window of samples. The processing proceeds through three principal blocks:
- Spatial Feature Extractor: Compresses the spatial configuration into a 40-dimensional vector via cascaded convolutional layers (an initial kernel spanning the full 128-sample temporal axis, then two $3 \times 3$ layers) with SELU activations, followed by a fully connected layer with feature normalization.
- Temporal Feature Extractor: Utilizes parallel branches, each built around a single 3D convolution with 16 kernels whose temporal extent $L$ is set to 128 samples (1 s) or 64 samples (0.5 s). Each branch splits the signal into left/right hemispheres, computes shared-weight convolutional feature maps $F_L$ and $F_R$, and an asymmetric component $F_A = F_L - F_R$. After normalization, the concatenated outputs form an 80-dimensional vector.
- Classification Block: Concatenates the 40-dimensional spatial and 80-dimensional temporal vectors into a 120-dimensional representation, applies dropout (rate 0.7), and projects onto the class space through a fully connected layer and softmax.
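The classification block can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the weights are random placeholders and the two-class output is only an example (e.g. high/low valence); the 40-, 80-, and 120-dimensional sizes and the 0.7 dropout rate follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder outputs of the spatial (40-d) and temporal (80-d) extractors.
spatial_feat = rng.standard_normal(40)
temporal_feat = rng.standard_normal(80)

# Classification block: concatenate -> dropout -> fully connected -> softmax.
fused = np.concatenate([spatial_feat, temporal_feat])  # 120-d representation

drop_rate = 0.7
mask = rng.random(fused.shape) >= drop_rate            # inverted dropout
dropped = fused * mask / (1.0 - drop_rate)             # rescale kept units

n_classes = 2                                          # e.g. high/low valence
W = rng.standard_normal((n_classes, fused.size)) * 0.01
logits = W @ dropped

probs = np.exp(logits - logits.max())
probs /= probs.sum()                                   # softmax over classes
```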
In the visual domain (e.g., CIFAR-100), the model instantiates dual “hemispheric” branches (left-local and right-global), each using an identical CNN backbone (ResNet-9 or VGG-11A) but trained on distinct supervisory signals to induce lateralized specialization (Rajagopalan et al., 2022). Branches are later fused via concatenation and a linear classifier head.
2. Mathematical Operations and Asymmetric Feature Computation
A defining operation within MSBAM is the explicit computation of asymmetric features. For each temporal scale, the input $X$ is split:
- $X_L$ (left) and $X_R$ (right, mirrored to align electrode positions).
- Both are processed by a shared-parameter convolution: $F_L = \mathrm{Conv}_\theta(X_L)$, $F_R = \mathrm{Conv}_\theta(X_R)$.
- The asymmetry is captured as $F_A = F_L - F_R$.
- The triplet $(F_L, F_R, F_A)$ is flattened and projected to a normalized 40-dimensional vector by a Feature Normalization Layer (FNL) incorporating batch normalization and softmax rescaling.
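The split/mirror/shared-convolution/difference steps can be sketched in NumPy under illustrative assumptions: a toy 9×8 electrode grid (one time slice), a single 3×3 kernel shared between hemispheres, and a simplified FNL that standardizes and softmax-rescales the flattened triplet.

```python
import numpy as np

def shared_conv(x, w):
    """Valid 2D correlation with one kernel (stands in for the shared-weight conv)."""
    kh, kw = w.shape
    out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

rng = np.random.default_rng(0)
X = rng.standard_normal((9, 8))   # toy electrode grid (one time slice)
w = rng.standard_normal((3, 3))   # the SAME kernel processes both hemispheres

X_left = X[:, :4]
X_right = X[:, 4:][:, ::-1]       # mirror the right half to align positions

F_left = shared_conv(X_left, w)
F_right = shared_conv(X_right, w)
F_asym = F_left - F_right         # explicit asymmetric component

# Simplified Feature Normalization Layer: standardize, then softmax-rescale.
triplet = np.concatenate([F_left.ravel(), F_right.ravel(), F_asym.ravel()])
z = (triplet - triplet.mean()) / (triplet.std() + 1e-5)
fnl_out = np.exp(z - z.max())
fnl_out /= fnl_out.sum()
```

In the actual model the FNL additionally projects to a fixed 40-dimensional vector; here the output length simply follows the toy feature-map sizes.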
This structural design emphasizes the importance of lateralized processing, directly targeting neurophysiological correlates of information differentiation across hemispheres (e.g., emotional lateralization in EEG, or local/global differences in vision).
3. Training Protocol and Optimization
MSBAM training involves supervised optimization via cross-entropy loss:
$$\mathcal{L} = -\sum_{i} y_i \log \hat{y}_i$$
where $y_i$ is the one-hot target and $\hat{y}_i$ the softmax probability for class $i$.
- Optimizer: Adam, with the initial learning rate decayed at epoch 30 for EEG and held fixed for vision; weight decay is applied where specified.
- Regularization: Dropout (EEG: 0.7 before classification; vision: 0.6 on classifier heads).
- Batching: EEG experiments employ full within-subject 10-fold cross-validation; vision experiments use batch size 256 over 180 epochs per phase.
In multi-phase training for visual recognition (Rajagopalan et al., 2022), Phase 1 independently specializes each hemisphere’s branch (fine vs. coarse label supervision), followed by Phase 2, in which branch weights are frozen, and the integration head is trained for joint classification.
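The phase-2 idea (branches frozen, only the integration head trained) can be illustrated with a toy NumPy logistic-regression head optimized by gradient descent on the cross-entropy loss. All data, branch weights, and sizes here are random placeholders; the sketch shows only the mechanics, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(p, y_onehot):
    return -np.sum(y_onehot * np.log(p + 1e-12)) / len(p)

# Toy data and binary labels.
X = rng.standard_normal((64, 10))
y = (X[:, 0] > 0).astype(int)
Y = np.eye(2)[y]

# Pretend phase-1 outputs: two frozen "hemispheric" branch projections.
W_left = rng.standard_normal((10, 4)) * 0.1
W_right = rng.standard_normal((10, 4)) * 0.1
feats = np.concatenate([X @ W_left, X @ W_right], axis=1)  # branches frozen

# Phase 2: only the integration head trains.
W_head = np.zeros((8, 2))
lr = 0.5
for _ in range(200):
    probs = softmax(feats @ W_head)
    grad = feats.T @ (probs - Y) / len(X)  # dL/dW_head for softmax + CE
    W_head -= lr * grad

initial_loss = np.log(2)  # loss at W_head = 0 (uniform predictions)
final_loss = cross_entropy(softmax(feats @ W_head), Y)
```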
4. Multi-Scale Design Rationale and Implications
Multi-scale convolutional processing addresses the nonstationarity of neural signals by extracting features corresponding to a range of temporal or spatial resolutions. For EEG, kernels covering both long (1 s) and short (0.5 s) windows enable the extraction of both slow-varying rhythms and transient spike-like features. For vision, repeated pooling and varying receptive fields within deep CNNs capture hierarchical structure, from fine edges to object-level configuration.
This promotes complementarity in the resulting representations, enabling later integration modules to selectively attend to the most discriminative features for a given task or input. The presence of such multi-resolution encoding is empirically validated; ablations removing a branch or normalization layer induce a significant drop in accuracy (Wu et al., 2021).
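The two temporal scales can be made concrete with a toy example at DEAP's 128 Hz sampling rate: a kernel spanning the full 1-second window summarizes slow activity, while a 0.5-second kernel yields a finer-grained response to transients. The moving-average kernels below are illustrative stand-ins for learned filters.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 128                              # DEAP sampling rate (Hz)
signal = rng.standard_normal(fs)      # one 1-second channel

def conv_valid(x, k):
    """'Valid' 1D correlation of signal x with kernel k."""
    n = len(x) - len(k) + 1
    return np.array([np.dot(x[i:i + len(k)], k) for i in range(n)])

long_kernel = np.ones(fs) / fs               # 128 samples: full 1-s window
short_kernel = np.ones(fs // 2) / (fs // 2)  # 64 samples: 0.5-s window

slow = conv_valid(signal, long_kernel)    # single summary of the whole window
fast = conv_valid(signal, short_kernel)   # 65 finer-resolution responses
```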
5. Experimental Evaluation and Performance
Performance on Emotion Recognition (EEG):
- Datasets: DEAP (32 subjects, 40 trials, 128 Hz, 76,800 samples); DREAMER (23 subjects, 14 channels, 85,744 samples).
- Tasks: Binary classification along the arousal, valence, dominance, and liking dimensions via subject-dependent 10-fold CV.
- Accuracies: DEAP—valence: 99.36 ± 0.46%, arousal: 99.37 ± 0.43%, dominance: 99.39 ± 0.41%, liking: 99.46 ± 0.46%; DREAMER—valence: 99.69 ± 0.24%, arousal: 99.76 ± 0.20%, dominance: 99.79 ± 0.22%.
- Comparison: These results substantially exceed baseline models, with SOTA on all dimensions (Wu et al., 2021).
Performance on Visual Recognition (CIFAR-100):
- Configurations: Both ResNet-9 and VGG-11A backbones evaluated; multiple training settings: unilateral (unspecialised/specialised), bilateral (unspecialised/specialised), and 2-model ensemble.
- Findings: Bilateral-specialised MSBAM surpasses equally-parameterized single-branch baselines in both fine- and coarse-label accuracy. The 2-model ensemble marginally surpasses MSBAM (by up to 1.5%) but requires running two full networks at inference time, whereas MSBAM executes a single dual-branch model.
- Interpretability: Grad-CAM analysis shows the left branch attends to localized details, the right to global context; the classifier head adjusts its weights according to input discriminativity (Rajagopalan et al., 2022).
Selected Results Table: CIFAR-100, ResNet-9 Backbone
| Model | #Params | Fine Acc. (%) | Coarse Acc. (%) |
|---|---|---|---|
| Unilateral-unspecialised | 6.63 M | 67.81 ± 0.46 | 78.25 ± 0.51 |
| Bilateral-specialised | 13.26 M | 71.35 ± 0.36 | 80.71 ± 0.48 |
| Ensemble (2× Unilateral) | 13.26 M | 72.15 ± 0.11 | 81.80 ± 0.24 |
6. Analysis: Mechanistic Insights and Comparative Context
MSBAM’s dual-branch, bi-hemispheric structure approximates the lateralization observed in animal and human brains, where different hemispheres preferentially process distinct aspects of information (e.g., emotion, spatial frequency). In EEG, direct computation of $F_A = F_L - F_R$ distills neurophysiologically relevant asymmetries into the feature space. In vision, explicit local/global specialization is induced through selective supervision.
Normalization layers ensure that outputs of multi-scale, multi-branch pathways are comparably weighted prior to classification, mitigating feature dominance from a particular spatial or temporal resolution. Analyses of representational similarity and attention allocation suggest that the fused head compensates for branch-specific weaknesses, improving robustness to class-invariant noise and ambiguous input.
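The balancing role of normalization can be seen in a small NumPy example: two branch outputs at very different magnitudes are brought to a comparable scale by per-feature batch normalization (a simplified stand-in for the model's normalization layers), so neither resolution dominates the fused representation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two branch outputs at very different scales (e.g. different kernel sizes).
branch_a = rng.standard_normal((32, 40)) * 10.0  # would dominate if fused raw
branch_b = rng.standard_normal((32, 40)) * 0.1   # would be drowned out

def batch_norm(x, eps=1e-5):
    """Per-feature standardization over the batch dimension."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

a_norm, b_norm = batch_norm(branch_a), batch_norm(branch_b)
# After normalization, both branches contribute at roughly unit scale.
```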
Comparatively, while conventional ensembles induce diversity via random initialization, MSBAM enforces structured, physiologically grounded diversity within a single, integrated architecture. This enables efficient inference, as only one model—albeit dual-branched—executes at prediction time.
7. Significance and Future Prospects
The MSBAM paradigm establishes a meta-architecture applicable to domains where lateralized processing and multiscale temporal/spatial dynamics are salient. Its direct encoding of domain knowledge—through both the asymmetry operator and the multi-scale convolutional structure—offers a robust framework for feature extraction in nonstationary, topographically organized signals. The model’s empirical performance in emotion recognition and robust feature fusion in vision suggest further utility in other cognitive and sensorimotor tasks exhibiting similar structural priors.
Future research directions may involve expansion to multi-class or multi-label targets, dynamic (input-dependent) attention integration, and integration with graph-based spatial representations (especially for irregular sensor layouts). A plausible implication is that such architectures may generalize to other bi-hemispheric or modular neural systems, providing a template for inductive bias in biologically inspired AI and BCI models (Wu et al., 2021; Rajagopalan et al., 2022).