
GME-Net: CNN for Low-Res Facial Expression Recognition

Updated 13 November 2025
  • GME-Net is a CNN architecture designed to robustly recognize facial expressions from ultra-low resolution images by combining hybrid attention for local cues with multi-scale global feature extraction.
  • It employs a dual teacher–student network where the high-resolution branch guides the low-resolution branch via attention-similarity knowledge distillation.
  • Experimental evaluations on benchmarks like RAF-DB and FER2013 demonstrate state-of-the-art performance with minimal computational overhead.

The Global Multiple Extraction Network (GME-Net) is a convolutional neural network architecture designed to address the inherent challenges of facial expression recognition (FER) from very low-resolution images (12×12–14×14 pixels). By combining hybrid attention-based local feature extraction with multi-scale global feature extraction and introducing an attention-similarity knowledge distillation regime, GME-Net achieves state-of-the-art performance on several benchmarks with minimal computational overhead.

1. Motivation and Challenges

Accurate FER in real-world scenarios such as surveillance and teleconferencing is hindered by extreme downsizing of facial images. Low resolution eliminates critical fine-grained facial cues (e.g., wrinkles, subtle muscle dynamics), limiting the discriminative power of standard CNNs and attention mechanisms. Existing local attention models tend to overfit to noise and fail to integrate global context (e.g., head pose, occlusion), resulting in degraded recognition accuracy. GME-Net is formulated specifically to address these challenges and to enable robust FER on drastically downsampled inputs.

2. Network Architecture

GME-Net utilizes a dual-branch, teacher–student architecture:

  • High-Resolution Network (HR-Net, Teacher):
    • Input: 112×112 or original resolution images.
    • Backbone: ResNet-50, augmented with multiple Mixed-Attention Blocks (MABs).
    • Outputs: Per-block channel ($M_{C,T}$) and spatial ($M_{S,T}$) attention maps.
    • Objective: Trained with cross-entropy on high-resolution ground truth.
  • Low-Resolution Network (LR-Net, Student):
    • Input: Images downsampled to 12×12–14×14 (with optional Gaussian blur) and then bicubically upsampled to the network input size.
    • Backbone: Weight-sharing ResNet-50 with MAB and Mixed-Channel Blocks (MCB) identical to HR-Net.
    • Learning: Mimics attention maps from HR-Net via an attention-similarity distillation loss ($L_{kd}$) alongside standard cross-entropy ($L_{ce}$).

Final local and global feature maps (from MAB and MCB) are fused by element-wise addition and passed to a fully connected classifier. The total loss is $L = L_{ce} + \lambda_{kd} L_{kd}$, with $\lambda_{kd} = 5$.
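
A minimal PyTorch sketch of one training step under this objective is shown below; hr_net and lr_net are hypothetical modules assumed to return (logits, per-block attention maps), and attention_similarity_kd is the distillation term sketched in Section 3.3.

```python
import torch
import torch.nn.functional as F

def training_step(hr_net, lr_net, hr_images, lr_images, labels, lambda_kd=5.0):
    """One LR-Net (student) update; module names and signatures are illustrative only."""
    with torch.no_grad():                       # teacher attention is used as a fixed target
        _, teacher_attn = hr_net(hr_images)     # per-MAB channel/spatial attention maps
    logits, student_attn = lr_net(lr_images)

    ce_loss = F.cross_entropy(logits, labels)   # standard expression classification loss
    kd_loss = attention_similarity_kd(teacher_attn, student_attn)  # see Section 3.3 sketch
    return ce_loss + lambda_kd * kd_loss        # L = L_ce + lambda_kd * L_kd, lambda_kd = 5
```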

3. Core Components and Mathematical Formulations

3.1 Mixed-Attention Block (MAB)

  • Input: $F \in \mathbb{R}^{H \times W \times C}$
  • Two $3 \times 3$ convolutions → Depthwise Block Attention Mechanism (DBAM) → Residual addition
  • DBAM integrates:
    • Depthwise-Channel Attention Module (DCAM): Enhances discriminative channels via a depthwise-separable convolution, pooling, and shared MLP, followed by channel-wise gating.
    • Depthwise-Spatial Attention Module (DSAM): Generates spatial attention using pooled feature maps, concatenation, and gating.
  • Mathematical summary:

$O = \mathrm{DSAM}\left(\mathrm{DCAM}\left(\mathrm{Conv}_{3\times3}(\mathrm{Conv}_{3\times3}(F))\right)\right) \oplus F$

where $\oplus$ is element-wise addition.
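
The following PyTorch sketch illustrates this structure; the specific DCAM/DSAM layer choices (depthwise kernel size, reduction ratio, 7×7 spatial gate) are assumptions for illustration, not the exact published configuration.

```python
import torch
import torch.nn as nn

class MixedAttentionBlock(nn.Module):
    """Sketch of the MAB: two 3x3 convs -> DBAM (channel then spatial gating) -> residual."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
        )
        # DCAM-like part: depthwise conv + pooling + shared MLP -> channel gates
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # DSAM-like part: pooled maps concatenated along channels -> spatial gate
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        f = self.convs(x)
        d = self.depthwise(f)
        avg = d.mean(dim=(2, 3))                        # global average pooling
        mx = d.amax(dim=(2, 3))                         # global max pooling
        m_c = torch.sigmoid(self.mlp(avg) + self.mlp(mx)).view(*avg.shape, 1, 1)
        f = f * m_c                                     # channel-wise gating (DCAM)
        pooled = torch.cat([f.mean(1, keepdim=True), f.amax(1, keepdim=True)], dim=1)
        m_s = torch.sigmoid(self.spatial(pooled))       # spatial gating (DSAM)
        return f * m_s + x                              # residual addition (the ⊕ above)
```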

3.2 Mixed-Channel Feature Extraction Block (MCB)

  • Input: $F \in \mathbb{R}^{H \times W \times C}$
  • Branch 1: Replicates and fuses shallow features across 4 paths with depthwise convolutions.
  • Branch 2: Splits channel groups and progressively fuses them.
  • Output: Concatenates branch features and fuses with the residual:

$O = O_1 + O_2 + F$

This design ensures multi-scale contextual aggregation and reduces sensitivity to noise.
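
A rough PyTorch sketch of the two-branch design follows; the number of paths, kernel sizes, and fusion layers are illustrative assumptions, retained only to show the $O = O_1 + O_2 + F$ fusion.

```python
import torch
import torch.nn as nn

class MixedChannelBlock(nn.Module):
    """Sketch of the MCB's two global feature branches plus residual fusion."""
    def __init__(self, channels, paths=4):
        super().__init__()
        # Branch 1: parallel depthwise paths over the full feature map, then fused
        self.branch1 = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels) for _ in range(paths)
        )
        # Branch 2: split channels into groups and fuse them progressively
        assert channels % paths == 0
        self.group_size = channels // paths
        self.branch2 = nn.ModuleList(
            nn.Conv2d(self.group_size, self.group_size, 3, padding=1) for _ in range(paths)
        )
        self.fuse = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        o1 = sum(conv(x) for conv in self.branch1)        # replicate-and-fuse paths
        groups = torch.split(x, self.group_size, dim=1)
        outs, prev = [], 0
        for conv, g in zip(self.branch2, groups):
            prev = conv(g + prev)                          # progressive group fusion
            outs.append(prev)
        o2 = self.fuse(torch.cat(outs, dim=1))
        return o1 + o2 + x                                 # O = O_1 + O_2 + F
```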

3.3 Attention-Similarity Knowledge Distillation

  • At each MAB, channel and spatial attention maps from the teacher ($M_{C,T}$, $M_{S,T}$) and the student ($M_{C,S}$, $M_{S,S}$) are compared.
  • Cosine similarity enforces mimicry:

$L_{kd} = 1 - \dfrac{\mathrm{sim}_c + \mathrm{sim}_s}{2}$

where $\mathrm{sim}_c$ and $\mathrm{sim}_s$ are the cosine similarities between the teacher and student channel and spatial attention maps, respectively.
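
A minimal PyTorch sketch of this loss, assuming each network exposes a list of per-MAB (channel map, spatial map) pairs:

```python
import torch
import torch.nn.functional as F

def attention_similarity_kd(teacher_maps, student_maps):
    """L_kd = 1 - (sim_c + sim_s) / 2, averaged over the MABs."""
    losses = []
    for (t_c, t_s), (s_c, s_s) in zip(teacher_maps, student_maps):
        sim_c = F.cosine_similarity(t_c.flatten(1), s_c.flatten(1), dim=1).mean()
        sim_s = F.cosine_similarity(t_s.flatten(1), s_s.flatten(1), dim=1).mean()
        losses.append(1.0 - (sim_c + sim_s) / 2.0)
    return torch.stack(losses).mean()
```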

3.4 Classification and Loss

  • Standard cross-entropy is used for expression classification.
  • Total loss aggregates cross-entropy and knowledge distillation loss.

4. Training Protocols and Implementation Details

Dataset Preparation

  • RAF-DB: 14k HR faces, bicubic downsampled to 14×14.
  • ExpW: ~87k samples, landmark aligned, then 14×14 downsampling.
  • FER2013 / FERPlus: 48×48 resized to 12×12. Gaussian blur simulates realistic LR degradation.
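
A minimal Pillow-based sketch of this LR simulation; the blur radius and the blur-before-downsampling order are assumptions rather than values reported in the paper.

```python
from PIL import Image, ImageFilter

def simulate_low_resolution(img: Image.Image, lr_size=14, net_size=112, blur_radius=1.0):
    """Downsample to the LR target (e.g. 12x12 or 14x14), optionally blur,
    then bicubically upsample back to the network input size."""
    img = img.filter(ImageFilter.GaussianBlur(radius=blur_radius))   # optional degradation
    lr = img.resize((lr_size, lr_size), Image.BICUBIC)               # bicubic downsampling
    return lr.resize((net_size, net_size), Image.BICUBIC)            # upsample for the LR-Net
```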

Hyperparameter Configuration

  • Backbone initial channels: 32.
  • Optimizer: SGD, momentum 0.9, initial learning rate 0.1, decay ×0.4 every 20 epochs.
  • Batch size: 64, total epochs: 100.
  • Distillation coefficient: $\lambda_{kd} = 5$.
  • Hardware: NVIDIA RTX 3090.
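
In PyTorch terms, the reported schedule corresponds roughly to the sketch below, where the placeholder model stands in for the actual network.

```python
import torch

model = torch.nn.Linear(512, 7)  # placeholder module so the snippet runs standalone
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.4)  # x0.4 every 20 epochs

for epoch in range(100):          # 100 epochs, batch size 64 per the paper
    # ... run one epoch of training with the combined loss from Section 2 ...
    scheduler.step()
```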

Data Augmentation

  • HR-Net: Random horizontal flip, crop with padding, color jitter.
  • LR-Net: Augmentation performed post-upsampling.
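
One way to express these pipelines with torchvision transforms; the jitter strength, crop padding, and the 14-pixel LR target are illustrative assumptions.

```python
from torchvision import transforms

# HR-Net (teacher) augmentation
hr_aug = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(112, padding=8),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

# LR-Net (student): degrade and upsample first, then augment (post-upsampling)
lr_aug = transforms.Compose([
    transforms.Resize(14, interpolation=transforms.InterpolationMode.BICUBIC),   # simulate LR
    transforms.Resize(112, interpolation=transforms.InterpolationMode.BICUBIC),  # bicubic upsample
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(112, padding=8),
    transforms.ToTensor(),
])
```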

5. Experimental Evaluation

Benchmark Results

| Method | RAF-DB (14×14) | FER2013 (12×12) | FERPlus (12×12) | ExpW (14×14) | GFLOPs | Params (M) |
|---|---|---|---|---|---|---|
| GME-Net | 75.52 | 56.62 | 70.57 | 67.45 | 2.99 | 18.75 |
| POSTER | 74.09 | – | – | 66.56 | – | – |
| Heidari et al. | 74.67 | – | 71.01 | 65.70 | – | – |
| MA-Net | 72.43 | – | – | 66.56 | – | – |
| EAC | 69.85 | – | – | – | – | – |
| Ada-CM | 59.32 | – | – | – | – | – |

Ablation Analysis

| Model | RAF-DB (%) | FER2013 (%) |
|---|---|---|
| Baseline (ResNet-50) | 71.07 | 50.13 |
| + CBAM | 73.73 | 52.52 |
| + DBAM | 74.29 | 54.75 |
| + Global Module (MCB) | 71.84 | 50.33 |
| DBAM + MCB (no KD) | 71.54 | 50.96 |
| Full GME-Net | 75.52 | 56.62 |
  • DBAM yields a substantial improvement over the ResNet-50 baseline (RAF-DB: +3.22%; FER2013: +4.62%), exceeding the corresponding gains from CBAM.
  • MCB provides incremental gains, maximized when coupled with DBAM and KD.
  • Knowledge distillation decisively boosts accuracy (RAF: +3.99%; FER: +5.66%).

6. Qualitative Interpretation and Feature Analysis

Attention heatmaps demonstrate that DCAM/DSAM modules in the teacher network systematically attend to regions critical for FER (eyebrows, eye corners, nasolabial folds). The knowledge distillation process enables the student to replicate these attention patterns, even at 14×14 input, enhancing discriminative power in low-resolution settings. Analysis of global feature maps pre- and post-MCB shows improved class separability, with two-branch fusion capturing both global facial structure and localized contrasts, such as mouth corners.

7. Limitations and Future Prospects

GME-Net’s training protocol relies on synthetic LR generation (bicubic downsampling, Gaussian blur), which may not fully encapsulate real-world degradations like motion blur and heavy compression. Scene variations, notably challenging lighting and extreme head poses, reduce FER robustness; suggested future directions include meta-learning and domain adaptation for cross-domain generalization. Model compression and optimization for real-time edge deployment are anticipated to further reduce GFLOPs and memory footprint.

A plausible implication is that hybrid attention and multi-scale extraction methodologies, when jointly optimized under attention-similarity knowledge distillation, constitute an effective solution for low-resolution FER without incurring prohibitive computational cost.
