ARPGNet: Appearance & Relation-aware Graph Fusion

Updated 2 December 2025
  • The paper presents a novel approach that fuses spatial appearance and relational cues with temporal context for improved facial expression recognition.
  • It employs a dual-branch network combining an InsightFace ResNet-50 backbone with graph attention networks to capture local symmetry and adjacency relations.
  • Benchmark results and ablation studies demonstrate significant performance gains over conventional CNN-based FER systems across multiple datasets.

The Appearance- and Relation-aware Parallel Graph attention fusion Network (ARPGNet) is an architecture developed for facial expression recognition (FER) that jointly models facial appearance and inter-region relations with explicit cross-modality and temporal fusion. ARPGNet combines spatial facial appearance representations with a graph-based encoding of local relational cues, fusing the two through graph attention mechanisms that also incorporate local temporal context. This design addresses a deficiency of conventional FER systems, which focus exclusively on appearance features extracted by Convolutional Neural Networks (CNNs) and neglect both the structured relationships between facial subregions and their temporal interplay (Li et al., 27 Nov 2025).

1. Architecture Overview

ARPGNet consists of three principal modules: (1) a facial region relation graph branch, (2) an appearance representation backbone, and (3) a parallel graph attention fusion module. The overall workflow is as follows:

  • Facial Region Relation Graph: Convolutional feature maps are subdivided into a $P \times P$ grid, constructing a graph where each node is a facial patch and edges encode local adjacency and bilateral symmetry. Graph Attention Networks (GATs) are then applied to aggregate regional and symmetric context per frame.
  • Appearance Backbone: Features are extracted per frame by an InsightFace ResNet-50 CNN pre-trained on MS-Celeb-1M, then fine-tuned on FER datasets; the first fully connected (FC) layer serves as the appearance embedding.
  • Parallel Fusion: Temporal sequences from both branches are enhanced and fused in a graph attention scheme that operates on a constructed fusion graph, incorporating a temporal response scope (TRS) constraint to promote local temporal modeling and interaction between appearance and relation features without recurrence.

This design enables ARPGNet to simultaneously learn and integrate complementary intra-frame (appearance, relation) and inter-frame (temporal) cues (Li et al., 27 Nov 2025).
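To make this dataflow concrete, the following is a minimal structural sketch of how the three modules could compose in PyTorch. All module names and interfaces here are hypothetical illustrations, not the paper's actual code.

```python
# Hypothetical composition of ARPGNet's three modules (a sketch, not the
# authors' implementation; submodule interfaces are assumed).
import torch.nn as nn

class ARPGNetSketch(nn.Module):
    def __init__(self, appearance_backbone, relation_branch, fusion_module, classifier):
        super().__init__()
        self.appearance = appearance_backbone  # InsightFace ResNet-50 branch (Sec. 3)
        self.relation = relation_branch        # region relation graph + GATs (Sec. 2)
        self.fusion = fusion_module            # TRS-constrained graph attention (Sec. 4)
        self.classifier = classifier           # MLP over the pooled video feature

    def forward(self, frames):                 # frames: (B, T, 3, H, W)
        B, T = frames.shape[:2]
        flat = frames.flatten(0, 1)            # fold time into the batch dimension
        app = self.appearance(flat).view(B, T, -1)  # per-frame appearance embeddings
        rel = self.relation(flat).view(B, T, -1)    # per-frame relation embeddings
        fused = self.fusion(app, rel)          # cross-modal, TRS-local fusion: (B, T, C)
        video = fused.mean(dim=1)              # temporal mean pooling
        return self.classifier(video)          # expression logits
```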

2. Facial Region Relation Graph Construction

After extracting convolutional feature maps (of size $H = W = 12$), adaptive average pooling divides the response into a $P \times P$ grid of patches, yielding $V = \{1, 2, \ldots, P^2\}$ nodes per frame. The graph's fixed adjacency matrix $A \in \{0,1\}^{P^2 \times P^2}$ is defined by:

  • Self-connection,
  • 4-neighborhood (spatial adjacency),
  • Mirror symmetry about the vertical facial midline.

Formally,

$$a_{ij} = \begin{cases} 1, & i = j, \\ 1, & i, j \text{ are spatial grid neighbors}, \\ 1, & i, j \text{ are symmetric about the midline}, \\ 0, & \text{otherwise}. \end{cases}$$
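A minimal sketch of this fixed adjacency construction follows; it transcribes the three rules above directly, assuming row-major grid indexing and a vertical midline.

```python
# Build the fixed P^2 x P^2 adjacency with self-loops, 4-neighborhood, and
# vertical mirror symmetry, per the case analysis above (row-major indexing).
import numpy as np

def build_region_adjacency(P: int) -> np.ndarray:
    A = np.zeros((P * P, P * P), dtype=np.uint8)
    for r in range(P):
        for c in range(P):
            i = r * P + c
            A[i, i] = 1                                        # self-connection
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):  # 4-neighborhood
                rr, cc = r + dr, c + dc
                if 0 <= rr < P and 0 <= cc < P:
                    A[i, rr * P + cc] = 1
            A[i, r * P + (P - 1 - c)] = 1                      # mirror about the vertical midline
    return A
```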

Three stacked GAT layers (multi-head attention, LeakyReLU activations) update node features:

$$e_{ij} = \mathrm{LeakyReLU}\left( \mathbf{a}^\top [h_i \parallel h_j] \right), \quad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in N_i} \exp(e_{ik})}, \quad x'_i = \sigma\left( \sum_{j \in N_i} \alpha_{ij} h_j \right),$$

with multi-head aggregation. Global average pooling yields per-frame relation embeddings $r_t \in \mathbb{R}^{C'}$ for the temporal sequence $R = [r_1; \ldots; r_T]$ (Li et al., 27 Nov 2025).
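A minimal single-head version of this update rule is sketched below for reference; the paper stacks three layers with multiple heads, and taking $\sigma$ to be ELU here is an assumption.

```python
# Single-head GAT layer implementing e_ij, alpha_ij, and x'_i from the
# equations above (a sketch; multi-head stacking omitted for brevity).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGATLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)         # shared linear projection
        self.a = nn.Parameter(torch.randn(2 * out_dim) * 0.1)   # attention vector a

    def forward(self, h, adj):
        # h: (N, in_dim) node features; adj: (N, N) binary adjacency
        Wh = self.W(h)                              # projected node features
        N = Wh.size(0)
        pair = torch.cat([Wh.unsqueeze(1).expand(N, N, -1),
                          Wh.unsqueeze(0).expand(N, N, -1)], dim=-1)
        e = F.leaky_relu(pair @ self.a)             # e_ij = LeakyReLU(a^T [h_i || h_j])
        e = e.masked_fill(adj == 0, float('-inf'))  # attend only over neighbors N_i
        alpha = torch.softmax(e, dim=-1)            # alpha_ij
        return F.elu(alpha @ Wh)                    # x'_i = sigma(sum_j alpha_ij h_j)
```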

3. Appearance Feature Extraction Backbone

The appearance branch uses the InsightFace ResNet-50 backbone, structured as conv1 $\rightarrow$ Res2–Res5 $\rightarrow$ two fully connected (FC) layers. The output of the first FC layer is used as the per-frame appearance feature. The network is fine-tuned on each target emotion dataset, and the per-frame features are stacked into a temporal appearance sequence analogous to $R$ (Li et al., 27 Nov 2025).
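As an illustrative sketch, the branch can be organized as below. A torchvision ResNet-50 stands in for the InsightFace backbone (which ships its own weights), and the 512-dimensional embedding and 7-class head are assumptions, not values from the paper.

```python
# Appearance branch sketch: backbone conv features -> FC1 (embedding) -> FC2
# (classification head used during fine-tuning). torchvision's ResNet-50 is a
# stand-in for the InsightFace backbone; embed_dim / num_classes are assumed.
import torch.nn as nn
from torchvision.models import resnet50

class AppearanceBranch(nn.Module):
    def __init__(self, embed_dim: int = 512, num_classes: int = 7):
        super().__init__()
        backbone = resnet50()            # load InsightFace MS-Celeb-1M weights in practice
        backbone.fc = nn.Identity()      # keep the 2048-d pooled conv features
        self.backbone = backbone
        self.fc1 = nn.Linear(2048, embed_dim)         # first FC: the appearance embedding
        self.fc2 = nn.Linear(embed_dim, num_classes)  # second FC: fine-tuning head

    def forward(self, frames):           # frames: (B*T, 3, H, W)
        return self.fc1(self.backbone(frames))  # per-frame appearance feature
```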

4. Parallel Graph Attention Fusion with Temporal Response Scope

To enable effective fusion and temporal modeling:

  • Positional Encoding: Standard sine/cosine positional encodings are added to the appearance and relation sequences before fusion, preserving frame order without recurrence.
  • Fusion Graph: The fusion graph contains $2T$ nodes—$T$ appearance nodes and $T$ relation nodes—with edges determined by the TRS hyperparameter: nodes of either modality are connected when their frame indices fall within the TRS window (see the sketch after this list). This allows inter- and intra-modality attention within a local time window.
  • GAT Fusion: The same multi-head GAT formulation as in the relation branch is applied across this fusion graph, producing enhanced appearance and relation sequences.
  • Aggregation and Classification: At each timestep, the enhanced appearance and relation features are combined, then temporally pooled (mean) into a video representation for MLP-based expression classification (Li et al., 27 Nov 2025).
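The sketch below builds the TRS-constrained fusion adjacency described above. The exact edge rule (connect any two nodes whose frame indices differ by at most the TRS value, within and across modalities) is reconstructed from the description, not taken from the paper's formula.

```python
# Fusion-graph adjacency over 2T nodes (T appearance + T relation): nodes are
# connected when their frame indices differ by at most `trs`, within and
# across modalities (reconstructed edge rule; see lead-in).
import torch

def build_fusion_adjacency(T: int, trs: int) -> torch.Tensor:
    idx = torch.arange(T)
    within = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs() <= trs  # (T, T) time window
    return within.repeat(2, 2).float()  # tile over the 2x2 modality blocks -> (2T, 2T)

# Usage with the SimpleGATLayer sketch from Section 2:
#   nodes = torch.cat([app_seq, rel_seq], dim=0)  # (2T, C) per video
#   fused = SimpleGATLayer(C, C)(nodes, build_fusion_adjacency(T, trs))
```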

5. Temporal Sequence Modeling

ARPGNet integrates temporal modeling directly within the fusion graph by leveraging the TRS constraint. This enforces attention locality in time, allowing the network to model local spatiotemporal dynamics and fine-grained cross-modality effects without introducing explicit sequence models like RNNs or global Transformers. This parallel, locality-aware attention mechanism improves cross-modal temporal interaction and reduces modeling complexity (Li et al., 27 Nov 2025).

6. Training Protocols, Data, and Implementation

  • Data Preprocessing: Face detection, alignment, and cropping to a fixed input resolution are performed via OpenFace 2.0. For RML and AFEW, a fixed number of frames per video is sparsely sampled (randomly for training, uniformly for evaluation); Aff-wild2 employs dilated sequences of 8 frames (dilation = 3).
  • Optimization: Adam optimizer, with separate learning rates for the pretrained ResNet blocks and the remaining layers. Dropout (0.25) follows all FC layers. Kaiming initialization is used; PReLU and LeakyReLU activations are adopted in the ResNet and GAT components, respectively.
  • Losses: Standard multi-class cross-entropy for the balanced datasets (RML, AFEW),

$$\mathcal{L}_{\mathrm{CE}} = -\sum_{c=1}^{C} y_c \log \hat{y}_c,$$

and focal loss (with focusing parameter $\gamma$) for the class-imbalanced Aff-wild2,

$$\mathcal{L}_{\mathrm{FL}} = -\sum_{c=1}^{C} y_c \, (1 - \hat{y}_c)^{\gamma} \log \hat{y}_c,$$

where $y_c$ and $\hat{y}_c$ denote the one-hot label and predicted probability for class $c$; a minimal focal-loss sketch appears at the end of this section.

No additional regularization beyond dropout is used (Li et al., 27 Nov 2025).

  • Implementation: PyTorch on NVIDIA TITAN RTX GPUs; batch size and number of training epochs are set per dataset.
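For reference, a minimal multi-class focal loss consistent with the description above (the default $\gamma = 2$ here is a conventional choice, not a value reported in this summary):

```python
# Minimal multi-class focal loss; reduces to cross-entropy when gamma = 0.
# gamma = 2.0 is a common default, assumed rather than taken from the paper.
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, target: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    # logits: (B, C) raw scores; target: (B,) integer class labels
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, target.unsqueeze(1)).squeeze(1)  # log-prob of the true class
    pt = log_pt.exp()
    return (-(1.0 - pt) ** gamma * log_pt).mean()
```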

7. Benchmark Results and Ablation Analysis

ARPGNet demonstrates consistently strong or superior performance across major FER datasets:

| Dataset | State-of-the-art baselines | ARPGNet | ARPGNet w/ pretraining |
|---|---|---|---|
| RML | MulT 73.3%, C3D+audio 73.9% | 76.53 ± 3.05% | — |
| AFEW | HSE-NN 59.3%, MulT 55.87% | 57.70% | 60.05% |
| Aff-wild2 | MulT 0.536 (M), HSE-NN 0.521 | 0.547 | 0.628 |

Ablation studies on AFEW reveal that:

  • Single-stream appearance and relation features alone underperform (50.65% and 47.52%, respectively),
  • Simple late fusion reaches 51.96%,
  • GAT fusion without TRS yields 55.35%,
  • Full model with TRS achieves 57.70%.

Grid-based relation graphs (adjacency + symmetry) surpass prior AU/landmark-based strategies: 57.70% (grid) versus 52.22% (RA-UWML), 54.31% (DDRGCN), and 55.61% (Chang et al.). The chosen grid patch size also outperforms both coarser and finer grid configurations.

These results indicate both the complementary nature of appearance and relational cues, and the effectiveness of TRS-constrained graph attention fusion for spatiotemporal expression analysis (Li et al., 27 Nov 2025).
