ARPGNet: Appearance & Relation-aware Graph Fusion
- The paper presents a novel approach that fuses spatial appearance and relational cues with temporal context for improved facial expression recognition.
- It employs a dual-branch network combining an InsightFace ResNet-50 backbone with graph attention networks to capture local symmetry and adjacency relations.
- Benchmark results and ablation studies demonstrate significant performance gains over conventional CNN-based FER systems across multiple datasets.
The Appearance- and Relation-aware Parallel Graph attention fusion Network (ARPGNet) is an architecture for facial expression recognition (FER) that jointly models facial appearance and inter-region relations with explicit cross-modality and temporal fusion. ARPGNet pairs spatial facial appearance representations with a graph-based encoding of local relational cues, fusing the two via graph attention mechanisms that also incorporate local temporal context. This addresses a deficiency of conventional FER systems, which focus exclusively on appearance features extracted by Convolutional Neural Networks (CNNs) and neglect the structured relationships between facial subregions and their temporal interplay (Li et al., 27 Nov 2025).
1. Architecture Overview
ARPGNet consists of three principal modules: (1) a facial region relation graph branch, (2) an appearance representation backbone, and (3) a parallel graph attention fusion module. The overall workflow is as follows:
- Facial Region Relation Graph: Convolutional feature maps are subdivided into a grid, constructing a graph where each node is a facial patch and edges encode local adjacency and bilateral symmetry. Graph Attention Networks (GATs) are then applied to aggregate regional and symmetric context per frame.
- Appearance Backbone: Features are extracted per frame by an InsightFace ResNet-50 CNN pre-trained on MS-Celeb-1M, then fine-tuned on FER datasets; the first fully connected (FC) layer serves as the appearance embedding.
- Parallel Fusion: Temporal sequences from both branches are enhanced and fused in a graph attention scheme that operates on a constructed fusion graph, incorporating a temporal response scope (TRS) constraint to promote local temporal modeling and interaction between appearance and relation features without recurrence.
This design enables ARPGNet to simultaneously learn and integrate complementary intra-frame (appearance, relation) and inter-frame (temporal) cues (Li et al., 27 Nov 2025).
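A minimal PyTorch-style sketch of this workflow is given below; module names, tensor shapes, the number of classes, and the concatenation used to combine the two enhanced streams are illustrative assumptions rather than the authors' reference implementation.

```python
import torch
import torch.nn as nn

class ARPGNetSketch(nn.Module):
    def __init__(self, appearance_backbone: nn.Module, relation_gat: nn.Module,
                 fusion_gat: nn.Module, embed_dim: int = 512, num_classes: int = 7):
        super().__init__()
        self.appearance_backbone = appearance_backbone  # per-frame CNN appearance embedding
        self.relation_gat = relation_gat                # per-frame facial-region relation graph + GAT
        self.fusion_gat = fusion_gat                    # TRS-constrained parallel graph attention fusion
        self.classifier = nn.Sequential(                # MLP classification head (assumed layout)
            nn.Linear(2 * embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, num_classes))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 3, H, W) aligned face crops for a clip of T frames
        B, T = frames.shape[:2]
        x = frames.flatten(0, 1)                            # (B*T, 3, H, W)
        app = self.appearance_backbone(x).view(B, T, -1)    # appearance sequence (B, T, D)
        rel = self.relation_gat(x).view(B, T, -1)           # relation sequence   (B, T, D)
        app_f, rel_f = self.fusion_gat(app, rel)            # enhanced sequences after parallel fusion
        video = torch.cat([app_f, rel_f], dim=-1).mean(dim=1)  # combine (assumed concat) + mean pooling
        return self.classifier(video)                       # expression logits
```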
2. Facial Region Relation Graph Construction
After extracting convolutional feature maps for a frame, adaptive average pooling divides the response into a regular grid of patches, each of which becomes a node of the per-frame graph. The graph's fixed adjacency matrix is defined by:
- Self-connection,
- 4-neighborhood (spatial adjacency),
- Mirror symmetry about the vertical facial midline.
Formally,
$a_{ij} = \begin{cases} 1, & i = j, \\ 1, & i, j \text{ are spatial grid neighbors}, \\ 1, & i, j \text{ are symmetric about the midline}, \\ 0, & \text{otherwise}. \end{cases}$
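A minimal sketch of this adjacency construction for an n × n patch grid; the grid size and the dense-matrix representation are illustrative assumptions.

```python
import torch

def region_adjacency(n: int) -> torch.Tensor:
    """Fixed (n*n) x (n*n) adjacency with self-connections, 4-neighborhood edges,
    and mirror-symmetry edges about the vertical facial midline."""
    idx = lambda r, c: r * n + c
    A = torch.eye(n * n)                                       # self-connections (i = j)
    for r in range(n):
        for c in range(n):
            i = idx(r, c)
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):  # 4-neighborhood
                rr, cc = r + dr, c + dc
                if 0 <= rr < n and 0 <= cc < n:
                    A[i, idx(rr, cc)] = 1.0
            A[i, idx(r, n - 1 - c)] = 1.0                      # mirror symmetry about the midline
    return A
```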
Three stacked GAT layers with multi-head attention and LeakyReLU activations update the node features. Global average pooling over nodes then yields a per-frame relation embedding, and the embeddings across frames form the relation branch's temporal sequence (Li et al., 27 Nov 2025).
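For reference, the multi-head GAT update referred to here is the standard formulation below; the symbols ($\mathbf{h}_i$ for node features, $\mathbf{W}^{k}$ and $\mathbf{a}^{k}$ for the learnable parameters of head $k$, $\mathcal{N}(i)$ for the neighborhood induced by $a_{ij}$) are generic rather than the paper's notation:

$\alpha_{ij}^{k} = \dfrac{\exp\big(\mathrm{LeakyReLU}\big(\mathbf{a}^{k\top}[\mathbf{W}^{k}\mathbf{h}_i \,\|\, \mathbf{W}^{k}\mathbf{h}_j]\big)\big)}{\sum_{l \in \mathcal{N}(i)} \exp\big(\mathrm{LeakyReLU}\big(\mathbf{a}^{k\top}[\mathbf{W}^{k}\mathbf{h}_i \,\|\, \mathbf{W}^{k}\mathbf{h}_l]\big)\big)}, \qquad \mathbf{h}_i' = \big\Vert_{k=1}^{K} \, \sigma\Big(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{k}\, \mathbf{W}^{k}\mathbf{h}_j\Big).$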
3. Appearance Feature Extraction Backbone
The appearance branch uses the InsightFace ResNet-50 backbone, structured as conv1, residual blocks Res2–Res5, and two fully connected (FC) layers. The first FC layer's output is used as the per-frame appearance feature, and the network is fine-tuned on each target emotion dataset. Collecting the per-frame features across a clip yields the appearance branch's temporal sequence (Li et al., 27 Nov 2025).
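A minimal sketch of the appearance branch, using torchvision's ResNet-50 as a stand-in for the InsightFace backbone; the embedding dimension, the module layout, and the absence of face-recognition pretrained weights are assumptions for illustration only.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class AppearanceBranch(nn.Module):
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        cnn = resnet50(weights=None)  # stand-in; the paper uses InsightFace weights (MS-Celeb-1M)
        self.features = nn.Sequential(*list(cnn.children())[:-1])  # conv1 ... Res5 + global pooling
        self.fc1 = nn.Linear(2048, embed_dim)       # first FC layer -> per-frame appearance feature
        self.fc2 = nn.Linear(embed_dim, embed_dim)  # second FC layer (used during backbone fine-tuning)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, 3, H, W) aligned face crops; returns (N, embed_dim) appearance embeddings
        f = self.features(x).flatten(1)             # (N, 2048)
        return self.fc1(f)
```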
4. Parallel Graph Attention Fusion with Temporal Response Scope
To enable effective fusion and temporal modeling:
- Positional Encoding: Standard sine/cosine positional encodings are added to the appearance and relation sequences before fusion (see the sketch after this list).
- Fusion Graph: The fusion graph contains one appearance node and one relation node per frame, with edges determined by the temporal response scope (TRS) hyperparameter: two nodes are connected, within or across modalities, if their frame indices differ by at most the TRS value. This allows inter- and intra-modality attention within a local time window.
- GAT Fusion: The same multi-head GAT formulation as in the relation branch is applied across this graph, producing enhanced appearance and relation sequences.
- Aggregation and Classification: At each timestep the enhanced appearance and relation features are combined, and the per-frame results are temporally pooled (mean) to form a video-level representation for MLP-based expression classification (Li et al., 27 Nov 2025).
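A minimal sketch of two ingredients from the list above under illustrative assumptions: the standard sinusoidal positional encoding applied to both per-frame sequences, and the 2T-node fusion-graph adjacency (T appearance nodes plus T relation nodes) restricted by the TRS hyperparameter. Function names, shapes, and the dense-matrix representation are assumptions, not the authors' implementation.

```python
import math
import torch

def sinusoidal_pe(T: int, d: int) -> torch.Tensor:
    """Standard sine/cosine positional encoding of shape (T, d); d is assumed even."""
    assert d % 2 == 0, "even feature dimension assumed"
    pos = torch.arange(T, dtype=torch.float32).unsqueeze(1)                                 # (T, 1)
    div = torch.exp(torch.arange(0, d, 2, dtype=torch.float32) * (-math.log(10000.0) / d))  # (d/2,)
    pe = torch.zeros(T, d)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

def fusion_adjacency(T: int, trs: int) -> torch.Tensor:
    """2T x 2T fusion-graph adjacency: node k corresponds to frame (k % T) of the
    appearance stream (k < T) or the relation stream (k >= T); two nodes are connected,
    within and across modalities, iff their frame indices differ by at most `trs`."""
    t = torch.arange(T)
    frame = torch.cat([t, t])                               # frame index of each of the 2T nodes
    dist = (frame.unsqueeze(0) - frame.unsqueeze(1)).abs()  # pairwise temporal distance
    return (dist <= trs).float()
```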
5. Temporal Sequence Modeling
ARPGNet integrates temporal modeling directly within the fusion graph by leveraging the TRS constraint. This enforces attention locality in time, allowing the network to model local spatiotemporal dynamics and fine-grained cross-modality effects without introducing explicit sequence models like RNNs or global Transformers. This parallel, locality-aware attention mechanism improves cross-modal temporal interaction and reduces modeling complexity (Li et al., 27 Nov 2025).
6. Training Protocols, Data, and Implementation
- Data Preprocessing: Face detection, alignment, and cropping are performed with OpenFace 2.0. For RML and AFEW, a fixed number of frames per video is sparsely sampled (randomly for training, uniformly for evaluation); Aff-wild2 employs dilated sequences of 8 frames (dilation = 3).
- Optimization: The Adam optimizer is used, with separate learning rates for the pretrained ResNet blocks and the remaining layers. Dropout (0.25) follows all FC layers. Kaiming initialization is used; PReLU and LeakyReLU activations are adopted in the ResNet and GAT components, respectively.
- Losses: Standard multi-class cross-entropy is used for the balanced datasets (RML, AFEW):
$\mathcal{L}_{\mathrm{CE}} = -\sum_{c=1}^{C} y_c \log \hat{y}_c,$
where $y_c$ is the one-hot label and $\hat{y}_c$ the predicted probability for class $c$, and focal loss with focusing parameter $\gamma$ is used for the class-imbalanced Aff-wild2:
$\mathcal{L}_{\mathrm{FL}} = -\sum_{c=1}^{C} y_c \,(1 - \hat{y}_c)^{\gamma} \log \hat{y}_c$
(a sketch of the focal loss appears after this list). No additional regularization beyond dropout is used (Li et al., 27 Nov 2025).
- Implementation: PyTorch, running on NVIDIA TITAN RTX GPUs. Batch size and number of training epochs are set per dataset.
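A minimal PyTorch sketch of the multi-class focal loss described above; the focusing-parameter default of 2.0 is a common choice and an assumption here, not the paper's reported value. Setting gamma to 0 recovers the standard cross-entropy used for RML and AFEW.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """Multi-class focal loss. logits: (N, C) raw scores; targets: (N,) integer labels.
    gamma = 2.0 is an illustrative default, not the value used in the paper."""
    log_probs = F.log_softmax(logits, dim=-1)                      # (N, C) log-probabilities
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p_t for the true class
    pt = log_pt.exp()                                              # p_t
    return (-(1.0 - pt) ** gamma * log_pt).mean()                  # reduces to CE when gamma = 0
```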
7. Benchmark Results and Ablation Analysis
ARPGNet demonstrates consistently strong or superior performance across major FER datasets:
| Dataset | State-of-the-art baselines | ARPGNet | ARPGNet w/ pretraining |
|---|---|---|---|
| RML | MulT 73.3%, C3D+audio 73.9% | 76.53±3.05% | — |
| AFEW | HSE-NN 59.3%, MulT 55.87% | 57.70% | 60.05% |
| Aff-wild2 | MulT 0.536, HSE-NN 0.521 | 0.547 | 0.628 |
Ablation studies on AFEW reveal that:
- Single-stream models using appearance or relation features alone underperform (50.65% and 47.52%, respectively),
- Simple late fusion reaches 51.96%,
- GAT fusion without TRS yields 55.35%,
- Full model with TRS achieves 57.70%.
Grid-based relation graphs (adjacency + symmetry) surpass prior AU/landmark-based strategies: 57.70% (grid), 52.22% (RA-UWML), 54.31% (DDRGCN), 55.61% (Chang et al.). The chosen patch-grid granularity also outperforms the alternative grid configurations evaluated.
These results indicate both the complementary nature of appearance and relational cues, and the effectiveness of TRS-constrained graph attention fusion for spatiotemporal expression analysis (Li et al., 27 Nov 2025).