ARPGNet: Appearance & Relation-aware Graph Fusion
- The paper presents a novel approach that fuses spatial appearance and relational cues with temporal context for improved facial expression recognition.
- It employs a dual-branch network combining an InsightFace ResNet-50 backbone with graph attention networks to capture local symmetry and adjacency relations.
- Benchmark results and ablation studies demonstrate significant performance gains over conventional CNN-based FER systems across multiple datasets.
The Appearance- and Relation-aware Parallel Graph attention fusion Network (ARPGNet) is an architecture for facial expression recognition (FER) that jointly models facial appearance and inter-region relations with explicit cross-modality and temporal fusion. ARPGNet pairs spatial facial appearance representations with a graph-based encoding of local relational cues, fusing the two via graph attention mechanisms that also incorporate local temporal context. This addresses a deficiency of conventional FER systems, which focus exclusively on appearance features extracted by Convolutional Neural Networks (CNNs) and neglect the structured relationships between facial subregions and their temporal interplay (Li et al., 27 Nov 2025).
1. Architecture Overview
ARPGNet consists of three principal modules: (1) a facial region relation graph branch, (2) an appearance representation backbone, and (3) a parallel graph attention fusion module. The overall workflow is as follows:
- Facial Region Relation Graph: Convolutional feature maps are subdivided into a grid, constructing a graph where each node is a facial patch and edges encode local adjacency and bilateral symmetry. Graph Attention Networks (GATs) are then applied to aggregate regional and symmetric context per frame.
- Appearance Backbone: Features are extracted per frame by an InsightFace ResNet-50 CNN pre-trained on MS-Celeb-1M, then fine-tuned on FER datasets; the first fully connected (FC) layer serves as the appearance embedding.
- Parallel Fusion: Temporal sequences from both branches are enhanced and fused in a graph attention scheme that operates on a constructed fusion graph, incorporating a temporal response scope (TRS) constraint to promote local temporal modeling and interaction between appearance and relation features without recurrence.
This design enables ARPGNet to simultaneously learn and integrate complementary intra-frame (appearance, relation) and inter-frame (temporal) cues (Li et al., 27 Nov 2025).
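A minimal PyTorch-style sketch of this workflow is given below; module names, tensor shapes, the number of classes, and the concatenation used to combine the two enhanced streams are illustrative assumptions rather than the authors' reference implementation.

```python
import torch
import torch.nn as nn

class ARPGNetSketch(nn.Module):
    def __init__(self, appearance_backbone: nn.Module, relation_gat: nn.Module,
                 fusion_gat: nn.Module, embed_dim: int = 512, num_classes: int = 7):
        super().__init__()
        self.appearance_backbone = appearance_backbone  # per-frame CNN appearance embedding
        self.relation_gat = relation_gat                # per-frame facial-region relation graph + GAT
        self.fusion_gat = fusion_gat                    # TRS-constrained parallel graph attention fusion
        self.classifier = nn.Sequential(                # MLP classification head (assumed layout)
            nn.Linear(2 * embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, num_classes))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 3, H, W) aligned face crops for a clip of T frames
        B, T = frames.shape[:2]
        x = frames.flatten(0, 1)                            # (B*T, 3, H, W)
        app = self.appearance_backbone(x).view(B, T, -1)    # appearance sequence (B, T, D)
        rel = self.relation_gat(x).view(B, T, -1)           # relation sequence   (B, T, D)
        app_f, rel_f = self.fusion_gat(app, rel)            # enhanced sequences after parallel fusion
        video = torch.cat([app_f, rel_f], dim=-1).mean(dim=1)  # combine (assumed concat) + mean pooling
        return self.classifier(video)                       # expression logits
```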
2. Facial Region Relation Graph Construction
After extracting convolutional feature maps for a frame, adaptive average pooling divides the response into a regular grid of patches, each of which becomes a node of the per-frame graph. The graph's fixed adjacency matrix is defined by:
- Self-connection,
- 4-neighborhood (spatial adjacency),
- Mirror symmetry about the vertical facial midline.
Formally,
$a_{ij} = \begin{cases} 1, & i = j, \\ 1, & i, j \text{ are spatial grid neighbors}, \\ 1, & i, j \text{ are symmetric about the midline}, \\ 0, & \text{otherwise}. \end{cases}$
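A minimal sketch of this adjacency construction for an n × n patch grid; the grid size and the dense-matrix representation are illustrative assumptions.

```python
import torch

def region_adjacency(n: int) -> torch.Tensor:
    """Fixed (n*n) x (n*n) adjacency with self-connections, 4-neighborhood edges,
    and mirror-symmetry edges about the vertical facial midline."""
    idx = lambda r, c: r * n + c
    A = torch.eye(n * n)                                       # self-connections (i = j)
    for r in range(n):
        for c in range(n):
            i = idx(r, c)
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):  # 4-neighborhood
                rr, cc = r + dr, c + dc
                if 0 <= rr < n and 0 <= cc < n:
                    A[i, idx(rr, cc)] = 1.0
            A[i, idx(r, n - 1 - c)] = 1.0                      # mirror symmetry about the midline
    return A
```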
Three stacked GAT layers with multi-head attention and LeakyReLU activations update the node features. Global average pooling over nodes then yields a per-frame relation embedding, and the embeddings across frames form the relation branch's temporal sequence (Li et al., 27 Nov 2025).
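For reference, the multi-head GAT update referred to here is the standard formulation below; the symbols ($\mathbf{h}_i$ for node features, $\mathbf{W}^{k}$ and $\mathbf{a}^{k}$ for the learnable parameters of head $k$, $\mathcal{N}(i)$ for the neighborhood induced by $a_{ij}$) are generic rather than the paper's notation:

$\alpha_{ij}^{k} = \dfrac{\exp\big(\mathrm{LeakyReLU}\big(\mathbf{a}^{k\top}[\mathbf{W}^{k}\mathbf{h}_i \,\|\, \mathbf{W}^{k}\mathbf{h}_j]\big)\big)}{\sum_{l \in \mathcal{N}(i)} \exp\big(\mathrm{LeakyReLU}\big(\mathbf{a}^{k\top}[\mathbf{W}^{k}\mathbf{h}_i \,\|\, \mathbf{W}^{k}\mathbf{h}_l]\big)\big)}, \qquad \mathbf{h}_i' = \big\Vert_{k=1}^{K} \, \sigma\Big(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{k}\, \mathbf{W}^{k}\mathbf{h}_j\Big).$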
3. Appearance Feature Extraction Backbone
The appearance branch uses the InsightFace ResNet-50 backbone, structured as conv1, residual blocks Res2–Res5, and two fully connected (FC) layers. The first FC layer's output is used as the per-frame appearance feature, and the network is fine-tuned on each target emotion dataset. Collecting the per-frame features across a clip yields the appearance branch's temporal sequence (Li et al., 27 Nov 2025).
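A minimal sketch of the appearance branch, using torchvision's ResNet-50 as a stand-in for the InsightFace backbone; the embedding dimension, the module layout, and the absence of face-recognition pretrained weights are assumptions for illustration only.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class AppearanceBranch(nn.Module):
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        cnn = resnet50(weights=None)  # stand-in; the paper uses InsightFace weights (MS-Celeb-1M)
        self.features = nn.Sequential(*list(cnn.children())[:-1])  # conv1 ... Res5 + global pooling
        self.fc1 = nn.Linear(2048, embed_dim)       # first FC layer -> per-frame appearance feature
        self.fc2 = nn.Linear(embed_dim, embed_dim)  # second FC layer (used during backbone fine-tuning)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, 3, H, W) aligned face crops; returns (N, embed_dim) appearance embeddings
        f = self.features(x).flatten(1)             # (N, 2048)
        return self.fc1(f)
```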
4. Parallel Graph Attention Fusion with Temporal Response Scope
To enable effective fusion and temporal modeling:
- Positional Encoding: Standard sine/cosine positional encodings are added to the appearance and relation sequences before fusion (see the sketch after this list).
- Fusion Graph: The fusion graph contains one appearance node and one relation node per frame, with edges determined by the temporal response scope (TRS) hyperparameter: two nodes are connected, within or across modalities, if their frame indices differ by at most the TRS value. This allows inter- and intra-modality attention within a local time window.
- GAT Fusion: The same multi-head GAT formulation as in the relation branch is applied across this graph, producing enhanced appearance and relation sequences.
- Aggregation and Classification: At each timestep the enhanced appearance and relation features are combined, and the per-frame results are temporally pooled (mean) to form a video-level representation for MLP-based expression classification (Li et al., 27 Nov 2025).
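A minimal sketch of two ingredients from the list above under illustrative assumptions: the standard sinusoidal positional encoding applied to both per-frame sequences, and the 2T-node fusion-graph adjacency (T appearance nodes plus T relation nodes) restricted by the TRS hyperparameter. Function names, shapes, and the dense-matrix representation are assumptions, not the authors' implementation.

```python
import math
import torch

def sinusoidal_pe(T: int, d: int) -> torch.Tensor:
    """Standard sine/cosine positional encoding of shape (T, d); d is assumed even."""
    assert d % 2 == 0, "even feature dimension assumed"
    pos = torch.arange(T, dtype=torch.float32).unsqueeze(1)                                 # (T, 1)
    div = torch.exp(torch.arange(0, d, 2, dtype=torch.float32) * (-math.log(10000.0) / d))  # (d/2,)
    pe = torch.zeros(T, d)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

def fusion_adjacency(T: int, trs: int) -> torch.Tensor:
    """2T x 2T fusion-graph adjacency: node k corresponds to frame (k % T) of the
    appearance stream (k < T) or the relation stream (k >= T); two nodes are connected,
    within and across modalities, iff their frame indices differ by at most `trs`."""
    t = torch.arange(T)
    frame = torch.cat([t, t])                               # frame index of each of the 2T nodes
    dist = (frame.unsqueeze(0) - frame.unsqueeze(1)).abs()  # pairwise temporal distance
    return (dist <= trs).float()
```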
5. Temporal Sequence Modeling
ARPGNet integrates temporal modeling directly within the fusion graph by leveraging the TRS constraint. This enforces attention locality in time, allowing the network to model local spatiotemporal dynamics and fine-grained cross-modality effects without introducing explicit sequence models like RNNs or global Transformers. This parallel, locality-aware attention mechanism improves cross-modal temporal interaction and reduces modeling complexity (Li et al., 27 Nov 2025).
6. Training Protocols, Data, and Implementation
- Data Preprocessing: Face detection, alignment, and cropping are performed with OpenFace 2.0. For RML and AFEW, a fixed number of frames per video is sparsely sampled (randomly for training, uniformly for evaluation); Aff-wild2 employs dilated sequences of 8 frames (dilation = 3).
- Optimization: The Adam optimizer is used, with separate learning rates for the pretrained ResNet blocks and the remaining layers. Dropout (0.25) follows all FC layers. Kaiming initialization is used; PReLU and LeakyReLU activations are adopted in the ResNet and GAT components, respectively.
- Losses: Standard multi-class cross-entropy is used for the balanced datasets (RML, AFEW):
$\mathcal{L}_{\mathrm{CE}} = -\sum_{c=1}^{C} y_c \log \hat{y}_c,$
where $y_c$ is the one-hot label and $\hat{y}_c$ the predicted probability for class $c$, and focal loss with focusing parameter $\gamma$ is used for the class-imbalanced Aff-wild2:
$\mathcal{L}_{\mathrm{FL}} = -\sum_{c=1}^{C} y_c \,(1 - \hat{y}_c)^{\gamma} \log \hat{y}_c$
(a sketch of the focal loss appears after this list). No additional regularization beyond dropout is used (Li et al., 27 Nov 2025).
- Implementation: PyTorch, running on NVIDIA TITAN RTX GPUs. Batch size and number of training epochs are set per dataset.
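A minimal PyTorch sketch of the multi-class focal loss described above; the focusing-parameter default of 2.0 is a common choice and an assumption here, not the paper's reported value. Setting gamma to 0 recovers the standard cross-entropy used for RML and AFEW.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """Multi-class focal loss. logits: (N, C) raw scores; targets: (N,) integer labels.
    gamma = 2.0 is an illustrative default, not the value used in the paper."""
    log_probs = F.log_softmax(logits, dim=-1)                      # (N, C) log-probabilities
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p_t for the true class
    pt = log_pt.exp()                                              # p_t
    return (-(1.0 - pt) ** gamma * log_pt).mean()                  # reduces to CE when gamma = 0
```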
7. Benchmark Results and Ablation Analysis
ARPGNet demonstrates consistently strong or superior performance across major FER datasets:
| Dataset | State-of-the-art baselines | ARPGNet | ARPGNet w/ pretraining |
|---|---|---|---|
| RML | MulT 73.3%, C3D+audio 73.9% | 76.53±3.05% | — |
| AFEW | HSE-NN 59.3%, MulT 55.87% | 57.70% | 60.05% |
| Aff-wild2 | MulT 0.536, HSE-NN 0.521 | 0.547 | 0.628 |
Ablation studies on AFEW reveal that:
- Single-stream models using appearance or relation features alone underperform (50.65% and 47.52%, respectively),
- Simple late fusion reaches 51.96%,
- GAT fusion without TRS yields 55.35%,
- Full model with TRS achieves 57.70%.
Grid-based relation graphs (adjacency + symmetry) surpass prior AU/landmark-based strategies: 57.70% (grid), 52.22% (RA-UWML), 54.31% (DDRGCN), 55.61% (Chang et al.). The chosen patch-grid granularity also outperforms the alternative grid configurations evaluated.
These results indicate both the complementary nature of appearance and relational cues, and the effectiveness of TRS-constrained graph attention fusion for spatiotemporal expression analysis (Li et al., 27 Nov 2025).