ConxGNN: Dual-Approach Graph Neural Networks

Updated 12 April 2026

ConxGNN is a pair of distinct GNN architectures, one using dual-graph context modeling for multimodal emotion recognition and the other leveraging convection–diffusion dynamics for heterophilic node classification.
The emotion recognition variant employs a multi-scale heterogeneous inception module and a hypergraph module with cross-modal attention to fuse text, audio, and visual cues.
The convection–diffusion variant integrates learnable velocity terms into graph diffusion, enhancing feature propagation on heterophilic graphs and achieving substantial accuracy improvements.

ConxGNN refers to two unrelated Graph Neural Network (GNN) architectures: (1) a context modeling framework for multimodal emotion recognition in conversational dialogues, and (2) a convection–diffusion GNN for node classification on heterophilic graphs. The two share a name but differ in method and application domain.

1. Overview and Definitions

The first ConxGNN, presented in "Effective Context Modeling Framework for Emotion Recognition in Conversations" (Van et al., 2024), is designed to address emotion recognition in conversation (ERC). It leverages GNNs to capture multi-scale contextual and multimodal dependencies using a dual-graph approach: a multi-scale heterogeneous inception GNN for temporal and pairwise context, and a hypergraph module for modeling high-order multimodal interactions.

The second ConxGNN, introduced in "Graph Neural Convection–Diffusion with Heterophily" (Zhao et al., 2023), is a GNN architecture tailored for graphs where connected nodes often belong to different classes (heterophily). It is motivated by the physical convection–diffusion equation (CDE): it augments diffusion-based message passing with a learnable convection term to handle non-smooth feature propagation, significantly improving performance on heterophilic benchmarks.

2. ConxGNN for Emotion Recognition in Conversations

2.1 Overall Architecture

The ConxGNN architecture for ERC comprises five components. The IGM and HM operate in parallel on the encoder outputs, and the remaining stages run in sequence:

Unimodal Encoders and Speaker Embedding: Each utterance is encoded independently in text, audio, and video modalities, mapped to a latent space, and augmented with a learned speaker embedding.
Inception Graph Module (IGM): Multiple parallel heterogeneous GNN branches, each with distinct temporal windows, are responsible for capturing short- and long-range multimodal, temporal, and speaker-dependent dependencies.
Hypergraph Module (HM): A hypergraph GNN that models high-order relationships among modalities and utterances (nodes), where hyperedges represent both modality-wide and intra-utterance structures.
Fusion Module with Cross-Modal Attention: Outputs from IGM and HM are concatenated, projected, and fused via a cross-modal attention mechanism, primarily aligning audio and visual modalities to textual representations.
Classifier with Class-Balanced Loss: A multilayer perceptron (MLP) classifier, trained with a loss function that employs class-based reweighting and focal contrastive components to address label imbalance and semantic proximity in emotions.

2.2 Multi-Scale Heterogeneous Graph Module

Each utterance-modality pair forms a node, with edge sets capturing both inter-modal (within-utterance) and temporal intra-modal (across utterances) connections. Multiple inception branches differ in their time window [p,f] parameters, enabling context aggregation at various scales. Edge weights are computed via angular similarity of node features.

Layer updates use a relation-aware k-dimensional GNN, mixing information from all relation types, followed by a Graph Transformer block (multi-head attention) for further refinement. Individual branch outputs are averaged to produce the final context vector per modality per utterance.
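The edge construction for a single inception branch can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes angular similarity is defined as 1 - arccos(cosine) / pi, that feature vectors are nonzero, and that the window [p, f] means p past and f future utterances. The names `angular_similarity` and `build_branch_edges` are hypothetical.

```python
import math

def angular_similarity(a, b):
    """Angular similarity in [0, 1], assumed here to be 1 - arccos(cosine) / pi."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return 1.0 - math.acos(cos) / math.pi

def build_branch_edges(features, past, future):
    """Build weighted edges for one branch with time window [past, future].

    features[m][i] is the feature vector of utterance i in modality m.
    Returns a dict mapping directed edges ((m, i), (n, j)) to weights.
    """
    modalities = list(features)
    num_utts = len(features[modalities[0]])
    edges = {}
    for i in range(num_utts):
        for m in modalities:
            # Temporal intra-modal edges: same modality, neighbours inside the window.
            for j in range(max(0, i - past), min(num_utts, i + future + 1)):
                if j != i:
                    edges[((m, i), (m, j))] = angular_similarity(
                        features[m][i], features[m][j])
            # Inter-modal edges: same utterance, different modality.
            for n in modalities:
                if n != m:
                    edges[((m, i), (n, i))] = angular_similarity(
                        features[m][i], features[n][i])
    return edges
```

Calling `build_branch_edges` once per branch with a different `(past, future)` pair gives the multi-scale edge sets; the per-branch GNN outputs would then be averaged as described above.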

2.3 Hypergraph Module

The hypergraph operates on the same set of nodes as IGM. Modality-wide hyperedges connect all utterances for each modality; intra-utterance hyperedges connect all modalities within a single utterance. Each hyperedge has a learnable scalar weight. Hypergraph convolution proceeds with standard incidence-based normalization, propagating high-order contextual information.
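To make the hyperedge layout and the incidence-based normalization concrete, here is a small sketch. It assumes the common propagation rule X' = Dv^(-1/2) H W De^(-1) H^T Dv^(-1/2) X and omits the learned feature projection and nonlinearity; the source does not spell out the exact rule, and the function names are hypothetical.

```python
import math

def build_hyperedges(num_utts, modalities):
    """Nodes are (modality, utterance) pairs; each hyperedge is a list of nodes."""
    # One hyperedge per modality, spanning all utterances.
    modality_wide = [[(m, i) for i in range(num_utts)] for m in modalities]
    # One hyperedge per utterance, spanning all modalities.
    intra_utterance = [[(m, i) for m in modalities] for i in range(num_utts)]
    return modality_wide + intra_utterance

def hypergraph_conv(x, hyperedges, weights):
    """One propagation step with symmetric degree normalization (assumed form).

    x maps each node to its feature vector; weights[e] is the scalar
    weight of hyperedges[e] (learnable in the model, fixed here).
    """
    # Weighted node degree: sum of the weights of incident hyperedges.
    deg = {v: 0.0 for v in x}
    for edge, w in zip(hyperedges, weights):
        for v in edge:
            deg[v] += w
    dim = len(next(iter(x.values())))
    out = {v: [0.0] * dim for v in x}
    for edge, w in zip(hyperedges, weights):
        # Gather: mean of degree-normalized member features.
        msg = [0.0] * dim
        for v in edge:
            scale = 1.0 / math.sqrt(deg[v])
            for k in range(dim):
                msg[k] += scale * x[v][k] / len(edge)
        # Scatter: send the weighted hyperedge message back to every member.
        for v in edge:
            scale = w / math.sqrt(deg[v])
            for k in range(dim):
                out[v][k] += scale * msg[k]
    return out
```

With T utterances and M modalities this yields M modality-wide and T intra-utterance hyperedges, so every node belongs to exactly two hyperedges.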

2.4 Fusion and Classification

Feature vectors from IGM and HM are concatenated and projected. Cross-modal attention then aligns the non-textual modalities (audio, visual) to the textual modality. The final per-utterance feature combines the attended text, audio, and visual embeddings and passes through a ReLU projection before classification by an MLP with a softmax output.
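One way to picture the alignment step is plain scaled dot-product attention with text features as queries and a non-textual modality as keys and values. The sketch below makes that assumption explicitly; the source does not state the attention variant, head count, or projections, so `cross_modal_attention` is a hypothetical helper, not the published mechanism.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_modal_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key/value rows.

    queries: per-utterance text features (assumed role).
    keys, values: per-utterance audio or visual features (assumed role).
    """
    scale = math.sqrt(len(keys[0]))
    out_dim = len(values[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        attn = softmax(scores)
        attended.append(
            [sum(w * v[d] for w, v in zip(attn, values)) for d in range(out_dim)])
    return attended
```

Running it once with audio and once with visual features as keys and values gives two text-aligned representations that can be combined with the text features before the classifier.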

2.5 Class-Balanced Loss and Optimization

To account for class imbalance and semantic proximity between emotion categories, ConxGNN employs a reweighting scheme:

Class-balanced Cross-Entropy (CBCE) uses effective number weighting per class, with

$w_c = \frac{1 - \beta}{1 - \beta^{n_c}}$

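where $n_c$ is the number of training samples in class $c$ and $\beta \in [0, 1)$ is a hyperparameter controlling how strongly rare classes are up-weighted ($\beta = 0$ gives uniform weights).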

Class-Balanced Focal Contrastive (CBFC) assigns higher loss to hard positive/negative pairs in the embedding space, scaled by class weights.

The final loss is a weighted sum: $L = L_{CBCE} + \mu L_{CBFC}$ .
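The effective-number weights are easy to compute directly. The sketch below evaluates $w_c$ for each class and applies the weights in a cross-entropy average; dividing by the batch size rather than by the sum of weights is an assumption, and both function names are illustrative.

```python
import math

def class_balanced_weights(class_counts, beta=0.999):
    """w_c = (1 - beta) / (1 - beta ** n_c) for each class count n_c."""
    return [(1.0 - beta) / (1.0 - beta ** n) for n in class_counts]

def cbce_loss(probs, labels, weights):
    """Class-weighted cross-entropy over a batch.

    probs[i] is the predicted class distribution for sample i and
    labels[i] its true class index. Averaging over the batch size is
    an assumed normalization.
    """
    total = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    return total / len(labels)
```

A class with a single sample gets weight 1, and the weight shrinks toward $1 - \beta$ as the class grows, so rare emotions contribute more per sample to the loss.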

3. ConxGNN: Convection–Diffusion GNN for Heterophily

3.1 Mathematical Formulation

The model is grounded in the continuous convection–diffusion equation,

$\frac{\partial x}{\partial t}(u, t) = \nabla \cdot (D(u,t)\nabla x(u, t)) - \nabla \cdot (v(u, t)x(u, t)),$

where $D$ is the diffusion coefficient and $v$ is a learned, feature-dependent velocity field.

On graphs, the discrete convection–diffusion equation becomes

$\frac{\partial X}{\partial t}(t) = \mathrm{div}(D(X, t)\odot\nabla X(t)) + \mathrm{div}(V(t)\circ X(t)),$

with $D$ a learnable edge-wise diffusion map, $V$ an edge-wise velocity, and the two divergence terms representing graph diffusion (homophily) and convection (heterophily), respectively.

3.2 Message Passing and Computation

The time evolution is discretized (Euler or RK4 integration), yielding per-step updates:

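$x_i(t+\tau) = x_i(t) + \tau \Big[ \sum_{j \in \mathcal{N}(i)} a_{ij}(t)\,\big(x_j(t) - x_i(t)\big) + \sum_{j \in \mathcal{N}(i)} V_{ij}(t) \odot x_j(t) \Big],$

where $\tau$ is the Euler step size, $a_{ij}(t)$ is a diffusion weight, and $V_{ij}(t)$ is the convection velocity parameterized as

$V_{ij}(t) = \sigma\big(W (x_j(t) - x_i(t))\big),$

with $W$ learnable and $\sigma$ a nonlinearity.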

This framework generalizes diffusion-only GNNs (such as GRAND, GraphBel) and is implemented via neural-ODE solvers supporting efficient backpropagation.
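A single explicit Euler step of the discrete equation from Section 3.1 can be sketched in plain Python. The sketch assumes the node-wise form dx_i/dt = sum_j a_ij (x_j - x_i) + sum_j V_ij * x_j, with the velocity V_ij = tanh(W (x_j - x_i)) computed from feature differences. The tanh nonlinearity, the parameter name `velocity_matrix`, and the fixed scalar diffusion weights are illustrative choices, not details confirmed by the source.

```python
import math

def cde_euler_step(x, edges, diffusion, velocity_matrix, tau):
    """Advance node features by one Euler step of size tau.

    x: list of node feature vectors.
    edges: directed pairs (i, j), meaning node j sends a message to node i.
    diffusion: dict mapping each edge to a scalar weight a_ij (learned in a
        real model, fixed here).
    velocity_matrix: square matrix W (hypothetical parameter) used in
        V_ij = tanh(W (x_j - x_i)).
    """
    dim = len(x[0])
    dxdt = [[0.0] * dim for _ in x]
    for (i, j) in edges:
        diff = [x[j][k] - x[i][k] for k in range(dim)]
        # Convection velocity from the feature difference along edge (i, j).
        vel = [
            math.tanh(sum(velocity_matrix[r][c] * diff[c] for c in range(dim)))
            for r in range(dim)
        ]
        for k in range(dim):
            # Diffusion term smooths; convection term transports x_j along V_ij.
            dxdt[i][k] += diffusion[(i, j)] * diff[k] + vel[k] * x[j][k]
    return [[x[i][k] + tau * dxdt[i][k] for k in range(dim)] for i in range(len(x))]
```

Setting `velocity_matrix` to all zeros reduces the step to pure graph diffusion, which is the sense in which the convection term extends diffusion-only models. A practical implementation would hand the right-hand side to an ODE solver instead of stepping manually.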

3.3 Training and Loss

The model is trained with standard cross-entropy loss on labeled nodes, plus a regularization term. Dropout on edges or activations is optionally used. The neural-ODE adjoint method keeps backpropagation memory-efficient.

4. Empirical Results and Comparative Performance

4.1 Multimodal ERC (ConxGNN-ERC)

On IEMOCAP (6 emotions), ConxGNN improves on the previous state-of-the-art (CORECT) in both accuracy and weighted-F1. On MELD (7 emotions), it is likewise evaluated on accuracy and weighted-F1 and surpasses MM-DFN and M3Net.

Ablation studies on IEMOCAP show that each component contributes: removing the IGM, the HM, cross-modal fusion, or class reweighting lowers accuracy relative to the full model.

4.2 Heterophilic Node Classification (ConxGNN-CDE)

On the benchmarks with the lowest adjusted homophily (Roman, Wiki, Minesweeper), ConxGNN outperforms GCN, H2GCN, GRAND, and related baselines. Ablations confirm that including the convection term consistently lifts accuracy across all backbone architectures. Performance saturates once the integration time is sufficiently large.

5. Theoretical Insights and Architectural Contributions

5.1 Multimodal Context Modeling

The parallel dual-graph structure of ConxGNN-ERC separates local pairwise context (via IGM, with multi-scale temporal receptive fields) from high-order context (via HM, capturing multivariate, modality-spanning dependencies). This yields richer context representations than fixed-window GNNs or simple feature concatenation. Cross-modal attention in the fusion stage anchors on the most sentiment-relevant modality (usually text) while still allowing the auxiliary modalities (audio, video) to contribute.

5.2 Convection–Diffusion Formalism and Heterophily

The CDE-based ConxGNN departs from classical diffusive smoothing by enabling adaptive, directed transfer of feature information along feature-difference axes. This is essential in heterophilic domains, where diffusion-only smoothing washes out label signals. The learnable velocity field, parameterized on feature differences, steers propagation dynamics toward preservation of inter-class distinctions, a theoretical and empirical advance over prior PDE-GNNs.

6. Limitations and Future Directions

Limitations across the two ConxGNN variants include computational cost (multiple GNN branches and hypergraph layers in the ERC model; RK4 integration in the CDE model), hyperparameter sensitivity (time-window selection in the ERC model; integration time in the CDE model), and, for ERC, reliance on engineered unimodal encoders without pretraining (Van et al., 2024; Zhao et al., 2023). Open challenges identified include:

Implicit/learned aggregation ranges to replace fixed windows or time horizons.
Incorporation of more expressive unimodal embeddings (e.g., transformer-based pretrained models).
Extension to additional tasks: dynamic graphs, robust link prediction, adversarial settings, and explicit speaker-graph modeling.
Stability and scalability analyses for integration solvers in CDE-GNNs.
More explicit speaker propagation mechanisms or per-speaker subgraph modeling in contextual ERC.
Approaches for modeling edge features, higher-order convection, and adaptive integration step sizes.

7. Summary Table: ConxGNN Variants

Variant	Core Mechanism	Primary Task
ConxGNN for ERC (Van et al., 2024)	Dual-graph (IGM+HM), multimodal, cross-modal attention, class-balanced loss	Emotion recognition in conversation
ConxGNN-CDE for Heterophily (Zhao et al., 2023)	Convection-diffusion ODE, learnable velocity field	Node classification on heterophilic graphs

References

"Effective Context Modeling Framework for Emotion Recognition in Conversations" (Van et al., 2024)
"Graph Neural Convection-Diffusion with Heterophily" (Zhao et al., 2023)
