Multi-Relational Dialogue Attention Network
- The paper introduces MR-DAN which captures multi-speaker dialogue dependencies by modeling temporal, speaker-continuity, semantic, and self-loop relations.
- It employs multi-head, edge-type-specific graph attention combined with audio and speaker embeddings to build dense, context-aware utterance representations.
- Integration with multimodal foundation models and LoRA adapters enables robust, end-to-end acoustic-to-intent inference with minimal supervision.
The Multi-Relational Dialogue Attention Network (MR-DAN) is a neural architecture designed to capture the heterogeneous structural dependencies within multi-speaker audio dialogues, facilitating effective acoustic-to-intent inference in scenarios where utterances exhibit complex inter-utterance and speaker dependencies. Introduced as the core graph module in the DialogGraph-LLM framework, MR-DAN fuses relation-aware graph-based attention with multimodal foundation models, enabling end-to-end audio dialogue intent recognition with minimal supervision (Liu et al., 14 Nov 2025).
1. Purpose and Core Architecture
MR-DAN's primary objective is to encode the intricate internal structure of spoken dialogues—including speaker turns, temporal sequencing, and semantic interactions between both proximate and distant utterances—into dense representations suitable for intent classification. The high-level pipeline comprises:
- Audio encoder (Qwen2.5-Omni-audio): Transforms each utterance $u_i$ into a dense embedding $\mathbf{a}_i \in \mathbb{R}^{d_a}$.
- Speaker embedding lookup: Each speaker $s_i$ is mapped to a learned embedding $\mathbf{e}_{s_i} \in \mathbb{R}^{d_s}$.
- Initial node features: Utterance and speaker embeddings are concatenated and linearly projected: $\mathbf{h}_i^{(0)} = W_0\,[\mathbf{a}_i \,\Vert\, \mathbf{e}_{s_i}] + \mathbf{b}_0$.
- Multi-relational dialogue graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$: Nodes represent utterances; edges belong to four typed, directed relations.
- MR-DAN layers: Employ multi-head, edge-type-specific graph attention mechanisms.
- Graph pooling and LLM integration: Node representations are pooled and combined, via prompt engineering, with a global audio embedding for downstream intent classification.
This modular decomposition facilitates the capture of relational priors critical for modeling speaker- and discourse-level context in conversational AI.
2. Dialogue Graph Construction
2.1 Node Representations
The dialogue audio is segmented into utterances $u_1, \dots, u_N$, each tagged with a speaker $s_i$. The pre-trained audio encoder generates utterance embeddings $\mathbf{a}_i \in \mathbb{R}^{d_a}$. Distinct speaker embeddings $\mathbf{e}_{s_i} \in \mathbb{R}^{d_s}$ are learned and concatenated with $\mathbf{a}_i$. The resulting vectors are projected to form initial node features:

$$\mathbf{h}_i^{(0)} = W_0\,[\mathbf{a}_i \,\Vert\, \mathbf{e}_{s_i}] + \mathbf{b}_0$$
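As a concrete illustration of this step, the node-feature construction reduces to a concatenation followed by a learned linear projection. The sketch below assumes PyTorch; all dimensions, tensor names, and module names are illustrative rather than taken from the paper.

```python
import torch
import torch.nn as nn

# Illustrative sizes (not from the paper): audio, speaker, and node dimensions.
d_audio, d_spk, d_model, num_speakers = 1024, 64, 512, 8

audio_emb = torch.randn(6, d_audio)              # one encoder embedding per utterance
speaker_ids = torch.tensor([0, 1, 0, 2, 1, 0])   # speaker tag for each utterance

speaker_table = nn.Embedding(num_speakers, d_spk)  # learned speaker embeddings
node_proj = nn.Linear(d_audio + d_spk, d_model)    # plays the role of W_0 and its bias

# h_i^(0) = W_0 [a_i || e_{s_i}] + b_0, computed for all utterances at once
h0 = node_proj(torch.cat([audio_emb, speaker_table(speaker_ids)], dim=-1))
print(h0.shape)  # torch.Size([6, 512])
```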
2.2 Edge Types
MR-DAN defines four relation types, establishing a heterogeneous directed graph. For node $i$, let $\mathcal{N}_r(i)$ denote the set of nodes with an incoming edge of type $r$ into $i$:
- Temporal edges ($\mathcal{E}_{\mathrm{temp}}$): $(i-1) \to i$ for $i > 1$, encoding sequential turn-taking.
- Speaker-continuity edges ($\mathcal{E}_{\mathrm{spk}}$): Edges from the most recent utterances by the same speaker ($s_j = s_i$, $j < i$), capturing per-speaker dialogue threads.
- Cross-utterance (semantic) edges ($\mathcal{E}_{\mathrm{sem}}$): Connect node $j$ to node $i$ when the semantic similarity of their utterance embeddings exceeds a learnable threshold $\tau$, representing long-range semantic dependencies and topical jumps.
- Self-loops ($\mathcal{E}_{\mathrm{self}}$): $(i, i)$ for every node, preserving each node's own context.
By encoding these relational structures, MR-DAN enables explicit modeling of multifaceted conversational context beyond linear utterance ordering.
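A minimal sketch of the graph construction, assuming the four neighborhoods are materialized as per-node index lists. The speaker-history window `k_spk` and the fixed cosine-similarity threshold `tau` are illustrative assumptions; in the paper the semantic threshold is learnable rather than fixed.

```python
import torch
import torch.nn.functional as F

def build_relation_neighbors(audio_emb, speaker_ids, k_spk=2, tau=0.7):
    """Return per-node incoming-neighbor index lists for each of the four relations."""
    n = audio_emb.size(0)
    spk = [int(s) for s in speaker_ids]
    # Pairwise cosine similarity between utterance embeddings, shape (n, n).
    sim = F.cosine_similarity(audio_emb.unsqueeze(1), audio_emb.unsqueeze(0), dim=-1)
    nbrs = {"temporal": [], "speaker": [], "semantic": [], "self": []}
    for i in range(n):
        nbrs["temporal"].append([i - 1] if i > 0 else [])          # previous turn
        same = [j for j in range(i) if spk[j] == spk[i]]
        nbrs["speaker"].append(same[-k_spk:])                      # same-speaker history
        nbrs["semantic"].append(                                   # long-range, similar turns
            [j for j in range(n) if j != i and sim[i, j] >= tau])
        nbrs["self"].append([i])                                   # self-loop
    return nbrs
```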
3. Multi-Relational Graph Attention Mechanism
Let $R$ denote the number of edge types and $H$ the total number of attention heads, partitioned by type: $H = \sum_{r=1}^{R} H_r$. Each attention head $k \in \{1, \dots, H_r\}$ for relation type $r$ uses separate learned parameters $(W_r^{k,Q}, W_r^{k,K}, W_r^{k,V})$.
3.1 Attention Score Computation
For node $i$, relation $r$, and head $k$, attention scores are computed only over $\mathcal{N}_r(i)$:

$$e_{ij}^{r,k} = \frac{\big(W_r^{k,Q}\,\mathbf{h}_i^{(l)}\big)^{\top}\big(W_r^{k,K}\,\mathbf{h}_j^{(l)}\big)}{\sqrt{d_{\mathrm{head}}}}, \qquad j \in \mathcal{N}_r(i),$$

where $d_{\mathrm{head}}$ is the per-head feature dimension.
3.2 Attention Weights and Value Aggregation
Scores are softmax-normalized over $\mathcal{N}_r(i)$:

$$\alpha_{ij}^{r,k} = \frac{\exp\big(e_{ij}^{r,k}\big)}{\sum_{j' \in \mathcal{N}_r(i)} \exp\big(e_{ij'}^{r,k}\big)}$$

Each head aggregates neighbor features:

$$\mathbf{z}_i^{r,k} = \sum_{j \in \mathcal{N}_r(i)} \alpha_{ij}^{r,k}\, W_r^{k,V}\,\mathbf{h}_j^{(l)}$$
3.3 Relation-Specific and Final Node Update
Heads within each relation are concatenated: $\mathbf{z}_i^{r} = \big\Vert_{k=1}^{H_r}\, \mathbf{z}_i^{r,k}$. The per-relation outputs are then concatenated, linearly projected, and merged with a residual connection and layer norm:

$$\mathbf{h}_i^{(l+1)} = \mathrm{LayerNorm}\Big(\mathbf{h}_i^{(l)} + W_O\,\big[\mathbf{z}_i^{1} \,\Vert\, \cdots \,\Vert\, \mathbf{z}_i^{R}\big]\Big),$$

where $W_O$ is a learned output projection.
Multiple MR-DAN layers are stacked (typically 2–4) to increase representational depth.
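A minimal sketch of a single MR-DAN layer under the notation above, assuming scaled dot-product attention with per-relation, per-head query/key/value projections; the head counts, dimensions, and class name are illustrative rather than the paper's exact implementation.

```python
import math
import torch
import torch.nn as nn

class MRDANLayer(nn.Module):
    """One multi-relational attention layer (illustrative reconstruction)."""

    def __init__(self, d_model=512, relations=("temporal", "speaker", "semantic", "self"),
                 heads_per_relation=2):
        super().__init__()
        self.relations, self.H_r = relations, heads_per_relation
        self.d_head = d_model // heads_per_relation
        # Separate Q/K/V projections per relation type (heads folded into the output dim).
        self.qkv = nn.ModuleDict({
            r: nn.ModuleDict({m: nn.Linear(d_model, d_model) for m in ("q", "k", "v")})
            for r in relations})
        self.out = nn.Linear(d_model * len(relations), d_model)  # W_O
        self.norm = nn.LayerNorm(d_model)

    def forward(self, h, nbrs):
        n, per_relation = h.size(0), []
        for r in self.relations:
            q = self.qkv[r]["q"](h).view(n, self.H_r, self.d_head)
            k = self.qkv[r]["k"](h).view(n, self.H_r, self.d_head)
            v = self.qkv[r]["v"](h).view(n, self.H_r, self.d_head)
            z = torch.zeros_like(q)
            for i in range(n):
                idx = nbrs[r][i]
                if not idx:
                    continue  # no incoming edges of this type for node i
                # Scores over N_r(i), shape (H_r, |N_r(i)|), then softmax and aggregation.
                scores = torch.einsum("hd,jhd->hj", q[i], k[idx]) / math.sqrt(self.d_head)
                alpha = scores.softmax(dim=-1)
                z[i] = torch.einsum("hj,jhd->hd", alpha, v[idx])
            per_relation.append(z.reshape(n, -1))           # concat heads within a relation
        fused = self.out(torch.cat(per_relation, dim=-1))   # concat relations, project
        return self.norm(h + fused)                         # residual + LayerNorm
```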
4. Handling Relation Types and Multi-Head Parameterization
Partitioning attention heads by relation type, with each head set and parameter triplet , grants MR-DAN the capacity to model each edge type with a distinct aggregation function. Temporal, speaker-continuity, and semantic edges are processed independently, while their outputs are fused at each layer. This architecture supports heterogeneous information multiplexing and preserves the representational flexibility inherent to multi-head attention schemes.
5. Integration with Multimodal Foundation Models
Following the MR-DAN layers, mean pooling yields the graph-level embedding:

$$\mathbf{g} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{h}_i^{(L)}$$

In parallel, the entire audio dialogue is encoded as a global audio embedding $\mathbf{a}_{\mathrm{glob}}$. Both representations are injected as continuous multimodal tokens into a prompt for the LLM (Qwen2.5-Omni-7B), where $\mathbf{g}$ and $\mathbf{a}_{\mathrm{glob}}$ appear among the first tokens fed to the LLM's transformer stack. The final output head, fine-tuned via parameter-efficient LoRA adapters, emits a softmax distribution over intent classes.
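A sketch of this integration step is shown below; `graph_proj`, `audio_proj`, and the pre-embedded instruction tokens are hypothetical placeholders rather than the paper's exact prompt template or projection interface.

```python
import torch

def build_llm_inputs(h_final, global_audio_emb, text_prompt_emb, graph_proj, audio_proj):
    """Assemble continuous multimodal tokens for the LLM (illustrative only).

    h_final:          (N, d_model) node states after the last MR-DAN layer
    global_audio_emb: (d_audio,)   whole-dialogue audio embedding
    text_prompt_emb:  (T, d_llm)   embedded instruction/query tokens
    graph_proj/audio_proj: hypothetical linear maps into the LLM width d_llm
    """
    g = h_final.mean(dim=0)                       # graph-level embedding via mean pooling
    graph_token = graph_proj(g).unsqueeze(0)      # (1, d_llm)
    audio_token = audio_proj(global_audio_emb).unsqueeze(0)
    # Prepend the two continuous tokens so they sit among the first LLM inputs.
    return torch.cat([graph_token, audio_token, text_prompt_emb], dim=0)
```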
6. Forward Pass Implementation Summary
The stepwise MR-DAN forward computation can be summarized as follows:
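A minimal end-to-end sketch, assuming PyTorch and reusing the hypothetical helpers from the earlier snippets (`build_relation_neighbors`, `MRDANLayer`, `build_llm_inputs`), with the audio encoder and the LoRA-adapted LLM treated as opaque callables:

```python
import torch

def mrdan_forward(utterance_audio, speaker_ids, audio_encoder, speaker_table, node_proj,
                  mrdan_layers, global_audio_emb, text_prompt_emb,
                  graph_proj, audio_proj, llm):
    # 1. Encode each utterance and build initial node features h^(0).
    audio_emb = audio_encoder(utterance_audio)                          # (N, d_audio)
    h = node_proj(torch.cat([audio_emb, speaker_table(speaker_ids)], dim=-1))
    # 2. Construct the four typed neighborhoods (temporal, speaker, semantic, self).
    nbrs = build_relation_neighbors(audio_emb, speaker_ids)
    # 3. Apply the stacked MR-DAN layers.
    for layer in mrdan_layers:
        h = layer(h, nbrs)
    # 4. Pool node states, inject continuous tokens into the LLM prompt, classify intents.
    llm_inputs = build_llm_inputs(h, global_audio_emb, text_prompt_emb,
                                  graph_proj, audio_proj)
    return llm(llm_inputs).softmax(dim=-1)                              # intent distribution
```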
7. Training and Optimization Objectives
Supervised learning is conducted via standard cross-entropy loss over labeled data:

$$\mathcal{L}_{\mathrm{sup}} = -\frac{1}{|\mathcal{D}_L|} \sum_{(x_j,\, y_j) \in \mathcal{D}_L} \log p_\theta\big(y_j \mid x_j\big)$$
Unlabeled data is incorporated using an adaptive semi-supervised strategy: pseudo-labels are generated with class-wise and global confidence dual-thresholding, margin-based filtering, and entropy-based sample prioritization, and only high-confidence pseudo-labels are mixed into the loss. Regularization is provided via AdamW weight decay, LoRA dropout on the adapter matrices, and LayerNorm within MR-DAN, ensuring stable optimization.
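A rough sketch of the pseudo-label selection logic described above; the specific threshold values, margin, and selection budget are illustrative hyperparameters rather than the paper's exact procedure.

```python
import torch

def select_pseudo_labels(probs, class_thresh, global_thresh=0.9, margin=0.2, budget=256):
    """probs: (M, C) softmax outputs on unlabeled clips; class_thresh: (C,) per-class thresholds.

    Returns the indices of accepted clips and their pseudo-labels.
    """
    conf, labels = probs.max(dim=-1)
    top2 = probs.topk(2, dim=-1).values
    keep = (conf >= global_thresh) \
         & (conf >= class_thresh[labels]) \
         & ((top2[:, 0] - top2[:, 1]) >= margin)            # dual threshold + margin filter
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    idx = keep.nonzero(as_tuple=True)[0]
    idx = idx[entropy[idx].argsort()][:budget]              # prioritize low-entropy samples
    return idx, labels[idx]
```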
All graph parameters ($W_0$, the per-head projections $W_r^{k,Q}, W_r^{k,K}, W_r^{k,V}$, and the output projections $W_O$), the speaker embeddings, and the LoRA adapter weights of the downstream LLM are updated jointly via end-to-end backpropagation.
MR-DAN, as presented in DialogGraph-LLM, operationalizes multi-relational graph attention mechanisms for audio dialogue understanding, providing an explicit and highly expressive framework for encoding the complex relational structure of multi-party spoken interactions (Liu et al., 14 Nov 2025). Its careful architectural integration with large multimodal foundation models renders it suitable for intent recognition in resource-scarce, audio-rich domains.