ExpressNet-MoE: Hybrid Model for Robust FER
- The paper introduces ExpressNet-MoE, which combines parallel CNN feature extractors with a Mixture of Experts module and a residual backbone to enhance facial emotion recognition under diverse conditions.
- It employs adaptive expert selection through a gating network that dynamically fuses global, local, and mid-level features for improved performance across challenging real-world scenarios.
- Empirical results on benchmarks like AffectNet and RAF-DB validate its competitive accuracy and practical versatility in applications such as online education and healthcare.
ExpressNet-MoE is a hybrid deep learning architecture developed for robust facial emotion recognition under challenging real-world conditions such as pose variation, occlusion, illumination changes, and demographic diversity (Banerjee et al., 15 Oct 2025). The model integrates multiple parallel convolutional neural network (CNN) feature extractors with a Mixture of Experts (MoE) module, culminating in a residual backbone for deep facial representation learning. The MoE framework enables adaptive, sample-specific expert selection and flexible feature fusion, resulting in improved generalization and competitive accuracy across several public benchmarks.
1. Architecture and Component Design
ExpressNet-MoE comprises three main feature extraction streams:
- CNN Feature Extractor 1 (CNNFE1): Employs large initial kernels (e.g., 75×75) followed by progressively smaller kernels (down to 3×3) with increasing filter count (from 8 up to 256). Each convolution layer is followed by dropout, batch normalization, ReLU activation, and max-pooling, formalized as

$$Z_l = \mathrm{ReLU}\big(\mathrm{BN}(\mathrm{Dropout}(\mathrm{Conv}_l(X_{l-1})))\big)$$

and

$$X_l = \mathrm{MaxPool}(Z_l).$$

After flattening, a dense layer with 512 units refines features for downstream processing.
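A minimal Keras sketch of this stack, assuming an illustrative input size, dropout rate, and intermediate filter schedule (the paper specifies only the kernel-size range, 75×75 down to 3×3, and the filter-count endpoints, 8 up to 256):

```python
import tensorflow as tf
from tensorflow.keras import layers

def cnnfe1_block(x, filters, kernel_size, dropout_rate=0.2):
    # One CNNFE1 stage: Conv -> Dropout -> BatchNorm -> ReLU -> MaxPool.
    x = layers.Conv2D(filters, kernel_size, padding="same")(x)
    x = layers.Dropout(dropout_rate)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    return layers.MaxPooling2D(pool_size=2)(x)

inputs = tf.keras.Input(shape=(224, 224, 3))            # input size is an assumption
x = cnnfe1_block(inputs, filters=8, kernel_size=75)     # large kernel: coarse structure
x = cnnfe1_block(x, filters=32, kernel_size=15)
x = cnnfe1_block(x, filters=128, kernel_size=5)
x = cnnfe1_block(x, filters=256, kernel_size=3)         # small kernel: local detail
cnnfe1_out = layers.Dense(512, activation="relu")(layers.Flatten()(x))
```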
- CNN Feature Extractor 2 (CNNFE2): Begins with a 15×15 convolution (16 filters), cascading through 7×7, 5×5, and 3×3 kernels with filter counts incrementing to 256. Rather than flattening, it applies Global Average Pooling (GAP), reducing each feature map to a scalar:

$$\mathrm{GAP}(F_c) = \frac{1}{H \cdot W}\sum_{i=1}^{H}\sum_{j=1}^{W} F_c(i, j)$$
- Residual Backbone (ResNet-50): A pre-trained 50-layer residual network with its top classification head removed (include_top=False) and followed by dropout. This backbone, trained on VGGFace2, extracts high-level semantic features robust to identity and pose variance.
Features from CNNFE2 and ResNet-50 are concatenated and passed through a dense layer with 512 units. The output of CNNFE1 and the output of this dense layer then traverse distinct MoE modules, and their outputs are concatenated for the final prediction.
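Continuing the sketch above, a hedged functional-API rendering of this wiring; the intermediate CNNFE2 filter counts and dropout rate are assumptions, ImageNet weights stand in here for the paper's VGGFace2 pre-training, and `MoE` is the layer defined in the sketch that follows the next paragraph:

```python
def cnnfe2_stack(x):
    # CNNFE2: 15x15 -> 7x7 -> 5x5 -> 3x3 kernels, filters growing to 256
    # (intermediate filter counts are illustrative).
    for filters, k in [(16, 15), (64, 7), (128, 5), (256, 3)]:
        x = layers.Conv2D(filters, k, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)
    return layers.GlobalAveragePooling2D()(x)   # one scalar per feature map

backbone = tf.keras.applications.ResNet50(include_top=False,
                                          weights="imagenet",  # paper: VGGFace2
                                          pooling="avg")
resnet_feat = layers.Dropout(0.3)(backbone(inputs))  # dropout rate is an assumption

fused = layers.Dense(512, activation="relu")(
    layers.Concatenate()([cnnfe2_stack(inputs), resnet_feat]))

moe_a = MoE()(cnnfe1_out)    # MoE layer defined in the sketch below
moe_b = MoE()(fused)
logits = layers.Dense(7)(layers.Concatenate()([moe_a, moe_b]))  # 7 emotion classes
model = tf.keras.Model(inputs, logits)
```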
The MoE module comprises multiple parallel expert dense layers (e.g., 4), each using ReLU activation. A gating network computes softmax probabilities over expert logits:

$$g(x) = \mathrm{softmax}(W_g x + b_g)$$

The top-k (typically k=2) experts are selected, and their outputs are fused via weighted summation according to the gating probabilities:

$$y = \sum_{i \in \mathrm{TopK}(g(x),\,k)} g_i(x)\, E_i(x)$$
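A self-contained sketch of such a module as a Keras layer, assuming 4 experts of 256 units and k=2; renormalizing the surviving gate weights is one common design choice, not confirmed by the paper:

```python
class MoE(tf.keras.layers.Layer):
    # Mixture of Experts: softmax gating, top-k expert selection,
    # and weighted summation of the selected experts' outputs.
    def __init__(self, num_experts=4, expert_units=256, top_k=2, **kwargs):
        super().__init__(**kwargs)
        self.num_experts, self.top_k = num_experts, top_k
        self.experts = [tf.keras.layers.Dense(expert_units, activation="relu")
                        for _ in range(num_experts)]
        self.gate = tf.keras.layers.Dense(num_experts)   # produces expert logits

    def call(self, x):
        expert_out = tf.stack([e(x) for e in self.experts], axis=1)  # (B, E, U)
        probs = tf.nn.softmax(self.gate(x), axis=-1)                 # (B, E)
        _, top_idx = tf.math.top_k(probs, k=self.top_k)
        mask = tf.reduce_sum(tf.one_hot(top_idx, self.num_experts), axis=1)
        weights = probs * mask                                    # zero non-top-k gates
        weights /= tf.reduce_sum(weights, axis=-1, keepdims=True) # renormalize
        return tf.reduce_sum(expert_out * weights[..., None], axis=1)
```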
2. Multi-Scale and Hierarchical Feature Extraction
ExpressNet-MoE's architecture enables hierarchical feature learning at multiple scales:
- Global features are captured by initial large kernels in CNNFE1 and the hierarchical blocks of ResNet-50, targeting coarse facial structure and configuration.
- Local features emerge via small-kernel convolutions and deep layers, sensitive to subtle muscle movements and fine-grained emotion cues.
- Mid-level features from CNNFE2, efficiently summarized by GAP, capture textural variation.
Parallel extractors and adaptive fusion ensure that both local and global nuances are incorporated, improving robustness to real-world facial image perturbations.
3. Adaptive Expert Selection and Generalization
The Mixture of Experts module delivers adaptive feature selection:
- The gating network routes each input to the most relevant subset of experts, facilitating sample-specific processing and specialization.
- By allowing multiple distinct pathways across expert networks and fusion points, ExpressNet-MoE improves generalization across varied datasets and conditions (such as pose, occlusion, and demographic variation).
- Dynamic expert selection ensures that the model does not rigidly rely on a fixed feature map, but adaptively combines specialized and generalized representations per input.
4. Empirical Performance and Benchmark Results
ExpressNet-MoE was evaluated on multiple benchmark datasets for facial emotion recognition:
| Dataset | Accuracy (%) |
|---|---|
| AffectNet (v7) | 74.77 |
| AffectNet (v8) | 72.55 |
| RAF-DB | 84.29 |
| FER-2013 | 64.66 |
Additional metrics, including macro-averaged precision, recall, and F₁-scores, were reported. Notably, accuracy on AffectNet and FER-2013 outperforms or approaches that of contemporary models (e.g., ResEmoteNet, EfficientNet variants, EmoNeXt), while results on RAF-DB are competitive with domain-specialist models. Training/validation accuracy and loss curves indicate stable convergence and robustness under cross-domain testing.
5. Real-World Applications
ExpressNet-MoE is suited for end-to-end emotion recognition in several practical domains:
- Online Education: Enables real-time assessment of learner engagement and emotional state in virtual classrooms.
- Healthcare: Supports remote emotional analysis during telemedicine interactions for mental health or behavioral assessment.
- Human-Computer Interaction: Facilitates natural responsive interfaces for customer service, assistive technologies, and engagement monitoring.
The model's adaptivity to variable head pose, occlusion, and illumination, achieved through multi-scale learning and dynamic expert selection, makes it particularly advantageous for deployment in unconstrained environments.
6. Reproducibility and Implementation
The authors commit to full reproducibility, making available:
- Model code with detailed definitions for the CNN extractors, the MoE module, and the ResNet integration.
- Training scripts supporting multi-GPU scaling via TensorFlow's MirroredStrategy (a minimal sketch follows this list).
- Preprocessing pipelines using BlazeFace for consistent face detection and alignment.
- Datasets, training hyperparameters, and instructions for direct replication and extension.
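A minimal sketch of such a multi-GPU setup, assuming a hypothetical build_expressnet_moe() builder (e.g., the model assembled in the sketches above) and train_ds/val_ds as preprocessed tf.data pipelines; the optimizer, learning rate, and epoch count are placeholders:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # synchronous data parallelism on all GPUs
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Model and optimizer must be created inside the strategy scope
    # so their variables are mirrored across devices.
    model = build_expressnet_moe()  # hypothetical builder for the model above
    model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-4),  # learning rate is an assumption
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# train_ds / val_ds: placeholder tf.data.Dataset pipelines (BlazeFace-aligned faces).
model.fit(train_ds, validation_data=val_ds, epochs=50)
```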
All resources can be found at https://github.com/DeeptimaanB/ExpressNet-MoE, facilitating open benchmarking and future research.
7. Context and Comparative Significance
ExpressNet-MoE exemplifies a hybrid approach combining deep CNNs, sophisticated residual learning, and adaptive MoE modules for facial emotion recognition. The model leverages multi-stream, multi-expert fusion mechanisms to bridge accuracy gaps posed by real-world imaging complexity. Compared against state-of-the-art approaches, ExpressNet-MoE demonstrates enhanced generalization, competitive accuracy, and robust deployment characteristics for applied emotion recognition systems.