ConxGNN: Dual-Approach Graph Neural Networks
- ConxGNN is a pair of distinct GNN architectures, one using dual-graph context modeling for multimodal emotion recognition and the other leveraging convection–diffusion dynamics for heterophilic node classification.
- The emotion recognition variant employs a multi-scale heterogeneous inception module and a hypergraph module with cross-modal attention to fuse text, audio, and visual cues.
- The convection–diffusion variant integrates learnable velocity terms into graph diffusion, enhancing feature propagation on heterophilic graphs and achieving substantial accuracy improvements.
ConxGNN refers to two unrelated Graph Neural Network (GNN) architectures: (1) a context modeling framework for multimodal emotion recognition in conversational dialogues, and (2) a convection–diffusion GNN for node classification on heterophilic graphs. Both present substantial methodological advances and have distinct application domains.
1. Overview and Definitions
The first ConxGNN, presented in "Effective Context Modeling Framework for Emotion Recognition in Conversations" (Van et al., 2024), is designed to address emotion recognition in conversation (ERC). It leverages GNNs to capture multi-scale contextual and multimodal dependencies using a dual-graph approach: a multi-scale heterogeneous inception GNN for temporal and pairwise context, and a hypergraph module for modeling high-order multimodal interactions.
The second ConxGNN, introduced in "Graph Neural Convection–Diffusion with Heterophily" (Zhao et al., 2023), is a GNN architecture motivated by the physical convection–diffusion equation (CDE) and tailored to graphs where connected nodes often belong to different classes (heterophily). It augments diffusion-based message passing with a learnable convection term to handle non-smooth feature propagation, significantly improving performance on heterophilic benchmarks.
2. ConxGNN for Emotion Recognition in Conversations
2.1 Overall Architecture
The ConxGNN architecture for ERC comprises five sequential processing stages:
- Unimodal Encoders and Speaker Embedding: Each utterance is encoded independently in text, audio, and video modalities, mapped to a latent space, and augmented with a learned speaker embedding.
- Inception Graph Module (IGM): Multiple parallel heterogeneous GNN branches, each with a distinct temporal window, capture short- and long-range multimodal, temporal, and speaker dependencies.
- Hypergraph Module (HM): A hypergraph GNN that models high-order relationships among modalities and utterances (nodes), where hyperedges represent both modality-wide and intra-utterance structures.
- Fusion Module with Cross-Modal Attention: Outputs from IGM and HM are concatenated, projected, and fused via a cross-modal attention mechanism, primarily aligning audio and visual modalities to textual representations.
- Classifier with Class-Balanced Loss: A multilayer perceptron (MLP) classifier, trained with a loss function that employs class-based reweighting and focal contrastive components to address label imbalance and semantic proximity in emotions.
2.2 Multi-Scale Heterogeneous Graph Module
Each utterance-modality pair forms a node, with edge sets capturing both inter-modal (within-utterance) and temporal intra-modal (across-utterance) connections. The inception branches differ in their time window $[p, f]$ (the numbers of past and future utterances each node connects to), enabling context aggregation at several scales. Edge weights are computed via the angular similarity of node features.
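The windowed edge construction can be illustrated with a minimal sketch for a single modality. It assumes that angular similarity takes the common form 1 - arccos(cosine) / pi and that self-loops are omitted; the function names `angular_similarity` and `temporal_edges` are hypothetical and not taken from the paper.

```python
import math

def angular_similarity(u, v):
    """Angular similarity in [0, 1], assumed here to be 1 - arccos(cosine) / pi."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return 1.0 - math.acos(cos) / math.pi

def temporal_edges(features, past, future):
    """Connect utterance i to utterances i-past .. i+future (self-loops omitted)."""
    n = len(features)
    edges = {}
    for i in range(n):
        for j in range(max(0, i - past), min(n, i + future + 1)):
            if i != j:
                edges[(i, j)] = angular_similarity(features[i], features[j])
    return edges

# Toy features for four utterances of one modality.
feats = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]]
short_branch = temporal_edges(feats, past=1, future=1)  # narrow window
long_branch = temporal_edges(feats, past=3, future=3)   # wide window
```

Each inception branch would run its GNN layers on its own edge set, so a narrow window yields a sparse local graph and a wide window a denser long-range one.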
Layer updates use a relation-aware k-dimensional GNN, mixing information from all relation types, followed by a Graph Transformer block (multi-head attention) for further refinement. Individual branch outputs are averaged to produce the final context vector per modality per utterance.
2.3 Hypergraph Module
The hypergraph operates on the same set of nodes as IGM. Modality-wide hyperedges connect all utterances for each modality; intra-utterance hyperedges connect all modalities within a single utterance. Each hyperedge has a learnable scalar weight. Hypergraph convolution proceeds with standard incidence-based normalization, propagating high-order contextual information.
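As a rough illustration of this structure, the sketch below builds the incidence matrix for the two hyperedge families and applies one convolution layer. It assumes the symmetric normalization commonly used in hypergraph neural networks (the paper's exact normalization may differ), positive hyperedge weights, and a node ordering of utterance-major, modality-minor; `build_incidence` and `hypergraph_conv` are hypothetical names.

```python
import numpy as np

def build_incidence(num_utts, num_mods=3):
    """Incidence matrix H: rows are (utterance, modality) nodes, columns are hyperedges."""
    H = np.zeros((num_utts * num_mods, num_mods + num_utts))
    for u in range(num_utts):
        for m in range(num_mods):
            node = u * num_mods + m
            H[node, m] = 1.0             # modality-wide hyperedge for modality m
            H[node, num_mods + u] = 1.0  # intra-utterance hyperedge for utterance u
    return H

def hypergraph_conv(X, H, edge_w, theta):
    """One layer: D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2} X Theta (assumed form)."""
    d_v = H @ edge_w       # weighted node degrees (edge_w must be positive)
    d_e = H.sum(axis=0)    # number of nodes in each hyperedge
    Dv = np.diag(d_v ** -0.5)
    A = Dv @ H @ np.diag(edge_w / d_e) @ H.T @ Dv
    return A @ X @ theta

num_utts, dim = 4, 5
H = build_incidence(num_utts)
edge_w = np.ones(H.shape[1])       # learnable scalars in the model; fixed here
X = np.ones((num_utts * 3, dim))
out = hypergraph_conv(X, H, edge_w, np.eye(dim))
```

Every node belongs to exactly two hyperedges (its modality and its utterance), so with unit weights the propagation matrix is row-stochastic and constant features pass through unchanged.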
2.4 Cross-Modal Attention Fusion
Feature vectors from IGM and HM are concatenated and projected. Cross-modal attention aligns non-textual modalities (audio, visual) to the textual modality, with the resulting attended features combined. The final per-utterance feature comprises the attended text, audio, and visual embeddings, followed by a ReLU projection and classification through a softmax MLP.
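The fusion step can be sketched with single-head scaled dot-product attention. This is a simplified illustration under stated assumptions: text supplies the queries while audio or visual features supply keys and values, the projection matrices are random stand-ins shared across modalities, and `cross_modal_attention` is a hypothetical name.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(text, other, Wq, Wk, Wv):
    """Text supplies queries; the other modality supplies keys and values (assumed)."""
    Q, K, V = text @ Wq, other @ Wk, other @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    attn = softmax(scores, axis=-1)
    return attn @ V, attn

rng = np.random.default_rng(0)
n_utt, d = 5, 8
text, audio, visual = (rng.normal(size=(n_utt, d)) for _ in range(3))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

att_audio, attn_a = cross_modal_attention(text, audio, Wq, Wk, Wv)
att_visual, attn_v = cross_modal_attention(text, visual, Wq, Wk, Wv)

# Per-utterance feature: text plus the attended audio and visual embeddings.
fused = np.concatenate([text, att_audio, att_visual], axis=-1)
W_out = rng.normal(size=(3 * d, d)) * 0.1
hidden = np.maximum(fused @ W_out, 0.0)  # ReLU projection before the classifier
```

In a trained model each modality pair would typically have its own projection weights and several attention heads; sharing them here only keeps the sketch short.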
2.5 Class-Balanced Loss and Optimization
To account for class imbalance and semantic proximity between emotion categories, ConxGNN employs a reweighting scheme:
- Class-balanced Cross-Entropy (CBCE) weights each class by its effective number of samples, with $w_c = (1-\beta)/(1-\beta^{n_c})$, where $n_c$ is the class count and $\beta$ a smoothing hyperparameter.
- Class-Balanced Focal Contrastive (CBFC) assigns higher loss to hard positive/negative pairs in the embedding space, scaled by class weights.
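The final loss is a weighted sum: $\mathcal{L} = \mathcal{L}_{\text{CBCE}} + \mu\,\mathcal{L}_{\text{CBFC}}$, with $\mu$ a trade-off coefficient.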
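To make the reweighting concrete, here is a small sketch of effective-number class weights and a weighted cross-entropy. It assumes the commonly used form w_c = (1 - beta) / (1 - beta ** n_c); the focal contrastive term is omitted, and the function names are hypothetical.

```python
import math

def class_balanced_weights(class_counts, beta):
    """Effective-number weights, assumed form: (1 - beta) / (1 - beta ** n_c)."""
    return [(1.0 - beta) / (1.0 - beta ** n) for n in class_counts]

def cb_cross_entropy(probs, labels, weights):
    """Mean over samples of -w_y * log p_y, where y is the true class."""
    losses = [-weights[y] * math.log(p[y]) for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)

# A frequent class (1000 samples) and a rare class (10 samples).
weights = class_balanced_weights([1000, 10], beta=0.99)

# Two predictions with equal confidence in the true class; the rare-class
# sample contributes more to the loss because its weight is larger.
probs = [[0.7, 0.3], [0.3, 0.7]]
loss = cb_cross_entropy(probs, labels=[0, 1], weights=weights)
```

As beta approaches 1 the weights approach inverse class frequency, while beta = 0 gives every class the same weight.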
3. ConxGNN: Convection–Diffusion GNN for Heterophily
3.1 Mathematical Formulation
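The model is grounded in the continuous convection–diffusion equation,
$$\frac{\partial x}{\partial t} = \nabla\cdot\left(D\,\nabla x\right) + \nabla\cdot\left(\mathbf{v}\,x\right),$$
where $D$ is the diffusion coefficient and $\mathbf{v}$ is a learned, feature-dependent velocity field (the sign of the convection term is absorbed into the learned field).
On graphs, the discrete convection–diffusion equation becomes
$$\frac{\partial \mathbf{X}(t)}{\partial t} = \operatorname{div}\big(\mathbf{D}(t)\odot\nabla\mathbf{X}(t)\big) + \operatorname{div}\big(\mathbf{V}(t)\circ\mathbf{X}(t)\big),$$
with $\mathbf{D}$ a learnable edge-wise diffusion map, $\mathbf{V}$ an edge-wise velocity, and the two divergence terms representing graph diffusion (homophily) and convection (heterophily), respectively.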
3.2 Message Passing and Computation
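The time evolution is discretized (Euler or RK4 integration), yielding per-step updates, shown here for an explicit Euler step of size $\tau$:
$$\mathbf{x}_i(t+\tau) = \mathbf{x}_i(t) + \tau\Big[\sum_{j\in\mathcal{N}(i)} a_{ij}(t)\big(\mathbf{x}_j(t)-\mathbf{x}_i(t)\big) + \sum_{j\in\mathcal{N}(i)} \mathbf{V}_{ij}(t)\odot\mathbf{x}_j(t)\Big],$$
where $a_{ij}$ is a diffusion weight, and $\mathbf{V}_{ij}$ is the convection velocity parameterized as
$$\mathbf{V}_{ij}(t) = \sigma\big(\mathbf{M}\,(\mathbf{x}_j(t)-\mathbf{x}_i(t))\big),$$
with $\mathbf{M}$ learnable and $\sigma$ a nonlinearity.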
This framework generalizes diffusion-only GNNs (such as GRAND, GraphBel) and is implemented via neural-ODE solvers supporting efficient backpropagation.
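A single explicit Euler step of such a scheme can be sketched in NumPy as follows. The sketch assumes the node-wise form dx_i/dt = sum_j a_ij (x_j - x_i) + sum_j V_ij * x_j with V_ij = tanh(M (x_j - x_i)); the choice of tanh, the fixed row-normalized weights, and the dense O(n^2 d) layout are illustrative assumptions, and `cde_euler_step` is a hypothetical name. A practical implementation would use sparse edges and a neural-ODE solver.

```python
import numpy as np

def cde_euler_step(X, A, M, tau):
    """One Euler step of dx_i/dt = sum_j a_ij (x_j - x_i) + sum_j V_ij * x_j."""
    diff = X[None, :, :] - X[:, None, :]            # diff[i, j] = x_j - x_i
    diffusion = (A[:, :, None] * diff).sum(axis=1)  # smoothing toward neighbors
    V = np.tanh(diff @ M.T)                         # V_ij = tanh(M (x_j - x_i))
    mask = (A > 0)[:, :, None]                      # convect only along existing edges
    convection = (mask * V * X[None, :, :]).sum(axis=1)
    return X + tau * (diffusion + convection)

# Path graph on three nodes with row-normalized diffusion weights.
adj = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]])
A = adj / adj.sum(axis=1, keepdims=True)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
M = 0.5 * np.eye(2)  # stands in for the learnable matrix

X_next = cde_euler_step(X, A, M, tau=0.1)
```

Setting `M` to zero removes the convection term, and with a unit step the update reduces to plain neighbor averaging `A @ X`, which is the diffusion-only special case.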
3.3 Training and Loss
The model is trained with standard cross-entropy loss on labeled nodes, plus a regularization term. Dropout on edges or activations is optionally used. The neural-ODE adjoint method keeps backpropagation memory-efficient.
4. Empirical Results and Comparative Performance
4.1 Multimodal ERC (ConxGNN-ERC)
On IEMOCAP (6 emotions), ConxGNN improves on the previous state of the art (CORECT) in both accuracy and weighted-F1. On MELD (7 emotions), it surpasses MM-DFN and M$^3$Net.
Ablation studies demonstrate the impact of each component: removing the IGM, the HM, cross-modal fusion, or class reweighting each lowers IEMOCAP accuracy.
4.2 Heterophilic Node Classification (ConxGNN-CDE)
On the benchmarks with the lowest adjusted homophily (Roman, Wiki, Minesweeper), ConxGNN outperforms GCN, H2GCN, GRAND, and related baselines. Ablations confirm that including the convection term consistently lifts accuracy across all backbone architectures. Performance saturates beyond a certain integration time.
5. Theoretical Insights and Architectural Contributions
5.1 Multimodal Context Modeling
The parallel dual-graph structure of ConxGNN-ERC separates local pairwise context (via the IGM, with multi-scale temporal receptive fields) from high-order context (via the HM, which captures multivariate, modality-spanning dependencies). This yields richer context representations than fixed-window or simple-concatenation GNNs. Cross-modal attention in the fusion stage aligns the auxiliary modalities (audio, video) to the most sentiment-relevant one (usually text), so that all three still contribute to the final representation.
5.2 Convection–Diffusion Formalism and Heterophily
The CDE-based ConxGNN departs from classical diffusive smoothing by enabling adaptive, directed transfer of feature information along feature-difference axes. This is essential in heterophilic domains, where diffusion-only smoothing washes out label signals. The learnable velocity field, parameterized on feature differences, steers propagation dynamics toward preservation of inter-class distinctions, a theoretical and empirical advance over prior PDE-GNNs.
6. Limitations and Future Directions
Limitations across the two ConxGNN variants include computational cost (multiple GNN branches, hypergraph layers, or multi-stage RK4 integration), hyperparameter sensitivity (time window selection, integration time), and reliance on engineered unimodal encoders without pretraining (Van et al., 2024, Zhao et al., 2023). Open challenges identified include:
- Implicit/learned aggregation ranges to replace fixed windows or time horizons.
- Incorporation of more expressive unimodal embeddings (e.g., transformer-based pretrained models).
- Extension to additional tasks: dynamic graphs, robust link prediction, adversarial settings, and explicit speaker-graph modeling.
- Stability and scalability analyses for integration solvers in CDE-GNNs.
- More explicit speaker propagation mechanisms or per-speaker subgraph modeling in contextual ERC.
- Approaches for modeling edge features, higher-order convection, and adaptive integration step sizes.
7. Summary Table: ConxGNN Variants
| Variant | Core Mechanism | Primary Task |
|---|---|---|
| ConxGNN for ERC (Van et al., 2024) | Dual-graph (IGM+HM), multimodal, cross-modal attention, class-balanced loss | Emotion recognition in conversation |
| ConxGNN-CDE for Heterophily (Zhao et al., 2023) | Convection-diffusion ODE, learnable velocity field | Node classification on heterophilic graphs |
References
- "Effective Context Modeling Framework for Emotion Recognition in Conversations" (Van et al., 2024)
- "Graph Neural Convection-Diffusion with Heterophily" (Zhao et al., 2023)