
Multi-Relational Dialogue Attention Network

Updated 17 November 2025
  • The paper introduces MR-DAN, which captures multi-speaker dialogue dependencies by modeling temporal, speaker-continuity, semantic, and self-loop relations.
  • It employs multi-head, edge-type-specific graph attention combined with audio and speaker embeddings to build dense, context-aware utterance representations.
  • Integration with multimodal foundation models and LoRA adapters enables robust, end-to-end acoustic-to-intent inference with minimal supervision.

The Multi-Relational Dialogue Attention Network (MR-DAN) is a neural architecture designed to capture the heterogeneous structural dependencies within multi-speaker audio dialogues, facilitating effective acoustic-to-intent inference in scenarios where utterances exhibit complex inter-utterance and speaker dependencies. Introduced as the core graph module in the DialogGraph-LLM framework, MR-DAN fuses relation-aware graph-based attention with multimodal foundation models, enabling end-to-end audio dialogue intent recognition with minimal supervision (Liu et al., 14 Nov 2025).

1. Purpose and Core Architecture

MR-DAN's primary objective is to encode the intricate internal structure of spoken dialogues—including speaker turns, temporal sequencing, and semantic interactions between both proximate and distant utterances—into dense representations suitable for intent classification. The high-level pipeline comprises:

  • Audio encoder (Qwen2.5-Omni-audio): Transforms each utterance into a dense embedding $h_i$.
  • Speaker embedding lookup: Each speaker $s_i$ is mapped to an embedding $e_{s_i}$.
  • Initial node features: Utterance and speaker embeddings are concatenated and linearly projected: $x_i = W_p\,[h_i; e_{s_i}]$.
  • Multi-relational dialogue graph $G = (V, E)$: Nodes represent utterances; edges fall into four typed, directed relations.
  • MR-DAN layers: Employ multi-head, edge-type-specific graph attention mechanisms.
  • Graph pooling and LLM integration: Node representations are pooled and combined, via prompt engineering, with a global audio embedding for downstream intent classification.

This modular decomposition facilitates the capture of relational priors critical for modeling speaker- and discourse-level context in conversational AI.

2. Dialogue Graph Construction

2.1 Node Representations

The dialogue audio $A$ is segmented into $M$ utterances $\{a_1, \dots, a_M\}$, each tagged by a speaker $s_i$. The pre-trained audio encoder $\Phi$ generates utterance embeddings $h_i = \Phi(a_i) \in \mathbb{R}^{d_h}$. Distinct speaker embeddings $e_{s_i} \in \mathbb{R}^{d_s}$ are learned and concatenated with $h_i$. The resulting vectors are projected to form initial node features:

$$x_i = W_p\,[h_i; e_{s_i}] \in \mathbb{R}^{d_{\mathrm{model}}}, \quad W_p \in \mathbb{R}^{(d_h + d_s) \times d_{\mathrm{model}}}$$
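
A minimal PyTorch sketch of this node-feature construction (dimensions, module names, and random stand-in inputs are illustrative assumptions, not values from the paper):

```python
import torch
import torch.nn as nn

d_h, d_s, d_model = 1024, 64, 512               # illustrative dimensions
num_speakers, M = 8, 10                         # illustrative speaker count and dialogue length

speaker_emb = nn.Embedding(num_speakers, d_s)   # learned speaker embeddings e_{s_i}
W_p = nn.Linear(d_h + d_s, d_model, bias=False) # projection W_p

h = torch.randn(M, d_h)                         # stand-in for encoder outputs h_i = Phi(a_i)
speakers = torch.randint(0, num_speakers, (M,)) # speaker id s_i per utterance

# x_i = W_p [h_i ; e_{s_i}]
x = W_p(torch.cat([h, speaker_emb(speakers)], dim=-1))
print(x.shape)  # torch.Size([10, 512])
```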

2.2 Edge Types

MR-DAN defines four relation types, establishing a heterogeneous directed graph. For node $i$, let $\mathcal{N}_t(i)$ denote the set of nodes $j$ with a type-$t$ edge $j \to i$:

  1. Temporal edges ($t = 1$): $(i-1) \to i$, encoding sequential turn-taking.
  2. Speaker-continuity edges ($t = 2$): Edges from the $k$ most recent utterances $j < i$ by the same speaker ($s_j = s_i$), capturing per-speaker dialogue threads.
  3. Cross-utterance (semantic) edges ($t = 3$): Connect node $i$ to $j < i$ if either $j = i-1$ or $\cos(x_j, x_i) > \theta$, with $\theta$ a learnable threshold, representing long-range semantic dependencies and topical jumps.
  4. Self-loops ($t = 4$): $i \to i$, supporting positional context.

By encoding these relational structures, MR-DAN enables explicit modeling of multifaceted conversational context beyond linear utterance ordering; a construction sketch follows below.
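
The following sketch shows one plausible construction of the four edge sets as Python index lists, assuming PyTorch tensors. The hyperparameters `k` and `theta` are placeholders (the paper treats $\theta$ as learnable):

```python
import torch
import torch.nn.functional as F

def build_edges(x, speakers, k=2, theta=0.8):
    """Construct the four typed edge sets as lists of (src, dst) pairs.

    x: (M, d_model) initial node features; speakers: (M,) speaker ids.
    k and theta are illustrative; the paper learns theta.
    """
    M = x.size(0)
    edges = {1: [], 2: [], 3: [], 4: []}
    x_norm = F.normalize(x, dim=-1)            # for cosine similarity
    for i in range(M):
        if i > 0:
            edges[1].append((i - 1, i))        # temporal: (i-1) -> i
        same = [j for j in range(i) if speakers[j] == speakers[i]]
        for j in same[-k:]:
            edges[2].append((j, i))            # speaker continuity: last k same-speaker turns
        for j in range(i):
            if j == i - 1 or (x_norm[j] @ x_norm[i]) > theta:
                edges[3].append((j, i))        # semantic: adjacent or cos-similar
        edges[4].append((i, i))                # self-loop
    return edges
```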

3. Multi-Relational Graph Attention Mechanism

Let $T = 4$ denote the number of edge types, and $H$ the total number of attention heads, partitioned by type into $H_1, \dots, H_T$. Each attention head $h \in H_t$ for relation type $t$ uses separate learned parameters $(W_Q^h, W_K^h, W_V^h)$.

3.1 Attention Score Computation

For node $i$, relation $t$, and head $h \in H_t$, attention scores are computed only over $\mathcal{N}_t(i)$:

$$e_{j,i}^h = \frac{(W_Q^h\,x_i)^\top (W_K^h\,x_j)}{\sqrt{d_k}}$$

where $d_k = d_{\mathrm{model}} / H$.

3.2 Attention Weights and Value Aggregation

Scores are softmax-normalized over $\mathcal{N}_t(i)$:

$$\alpha_{j,i}^h = \frac{\exp(e_{j,i}^h)}{\sum_{k \in \mathcal{N}_t(i)} \exp(e_{k,i}^h)}$$

Each head aggregates neighbor features:

$$z_{i,t,h} = \sum_{j \in \mathcal{N}_t(i)} \alpha_{j,i}^h\, W_V^h\, x_j$$
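
A per-node PyTorch sketch of the two steps above (scaled dot-product scores restricted to $\mathcal{N}_t(i)$, softmax normalization, value aggregation), vectorized over the neighborhood; the function name and signature are illustrative:

```python
import math
import torch

def head_aggregate(x, i, neighbors, W_Q, W_K, W_V):
    """Attention of node i over its typed neighborhood N_t(i) for one head.

    x: (M, d_model) node features; neighbors: index list for N_t(i).
    W_Q, W_K, W_V: (d_model, d_k) per-head projections.
    Returns z_{i,t,h} of shape (d_k,).
    """
    d_k = W_Q.size(1)
    q = x[i] @ W_Q                        # query for target node i
    K = x[neighbors] @ W_K                # keys of neighbors j in N_t(i)
    V = x[neighbors] @ W_V                # values of neighbors
    scores = (K @ q) / math.sqrt(d_k)     # e^h_{j,i}
    alpha = torch.softmax(scores, dim=0)  # normalized only over N_t(i)
    return alpha @ V                      # sum_j alpha^h_{j,i} W_V^h x_j
```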

3.3 Relation-Specific and Final Node Update

Heads within each relation $t$ are concatenated: $z_{i,t} = \Vert_{h \in H_t}\, z_{i,t,h}$. The per-relation outputs are then concatenated, linearly projected, and merged with a residual connection and layer norm:

$$x_i^{(\ell+1)} = \mathrm{LN}\Bigl( x_i^{(\ell)} + W_O\,[z_{i,1} \,\Vert\, z_{i,2} \,\Vert\, z_{i,3} \,\Vert\, z_{i,4}] \Bigr)$$

where $W_O \in \mathbb{R}^{(T\,|H_t|\,d_k) \times d_{\mathrm{model}}}$.

Multiple MR-DAN layers are stacked (typically 2–4) to increase representational depth.
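
A simplified layer-level sketch under the same assumptions, consuming the `edges` dictionary built earlier. Per-node Python loops are used for readability; a practical implementation would batch this with sparse or segmented operations:

```python
import math
import torch
import torch.nn as nn

class MRDANLayer(nn.Module):
    """Illustrative MR-DAN layer: T=4 relation types, heads_per_type heads each."""

    def __init__(self, d_model=512, T=4, heads_per_type=2):
        super().__init__()
        self.T, self.Hpt = T, heads_per_type
        H = T * heads_per_type
        self.d_k = d_model // H
        # separate (W_Q, W_K, W_V) per head, grouped by relation type
        self.W_Q = nn.Parameter(torch.randn(H, d_model, self.d_k) * 0.02)
        self.W_K = nn.Parameter(torch.randn(H, d_model, self.d_k) * 0.02)
        self.W_V = nn.Parameter(torch.randn(H, d_model, self.d_k) * 0.02)
        self.W_O = nn.Linear(H * self.d_k, d_model, bias=False)
        self.ln = nn.LayerNorm(d_model)

    def forward(self, x, edges):
        M = x.size(0)
        out = torch.zeros(M, self.T * self.Hpt * self.d_k)
        for i in range(M):
            parts = []
            for t in range(1, self.T + 1):
                nbrs = [src for (src, dst) in edges[t] if dst == i]
                for h_local in range(self.Hpt):
                    h = (t - 1) * self.Hpt + h_local
                    if nbrs:  # attention restricted to N_t(i)
                        q = x[i] @ self.W_Q[h]
                        K = x[nbrs] @ self.W_K[h]
                        V = x[nbrs] @ self.W_V[h]
                        a = torch.softmax(K @ q / math.sqrt(self.d_k), dim=0)
                        parts.append(a @ V)
                    else:     # empty neighborhood contributes zeros
                        parts.append(torch.zeros(self.d_k))
            out[i] = torch.cat(parts)  # [z_{i,1} || z_{i,2} || z_{i,3} || z_{i,4}]
        return self.ln(x + self.W_O(out))  # residual + LayerNorm
```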

4. Handling Relation Types and Multi-Head Parameterization

Partitioning attention heads by relation type, with head set $H_t$ and parameter triplets $(W_Q^h, W_K^h, W_V^h)$ per head, grants MR-DAN the capacity to model each edge type with a distinct aggregation function. Temporal, speaker-continuity, and semantic edges are processed independently, and their outputs are fused at each layer. This design supports heterogeneous information multiplexing while preserving the representational flexibility of multi-head attention.

5. Integration with Multimodal Foundation Models

Following $L$ MR-DAN layers, mean pooling yields the graph-level embedding:

$$g = \mathrm{MeanPool}\bigl(\{x_i^{(L)}\}_{i=1}^{M}\bigr)$$

In parallel, the entire audio dialogue $A$ is encoded as $G = \Phi(A)$. These representations are injected as continuous multimodal tokens into a prompt for the LLM (Qwen2.5-Omni-7B):

[Instruction] + <GRAPH> + <AUDIO>

where $\text{<GRAPH>} \mapsto g$ and $\text{<AUDIO>} \mapsto G$ appear among the first tokens of the LLM's transformer stack. The final output head, fine-tuned via parameter-efficient LoRA adapters, emits a softmax distribution over intent classes.
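
A hedged sketch of this injection step, assuming learned projections into the LLM embedding space and the `inputs_embeds`-style interface common to Hugging Face models; dimensions and module names are illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn

d_model, d_audio, llm_d = 512, 1024, 3584      # llm_d is an assumed hidden size
proj_g = nn.Linear(d_model, llm_d)             # maps g into the LLM embedding space
proj_a = nn.Linear(d_audio, llm_d)             # maps G (global audio) likewise

x_final = torch.randn(10, d_model)             # node states after L layers
g = x_final.mean(dim=0)                        # g = MeanPool({x_i^{(L)}})
G = torch.randn(d_audio)                       # stand-in for Phi(A)

instr_emb = torch.randn(12, llm_d)             # embedded "[Instruction]" tokens
inputs_embeds = torch.cat([instr_emb,
                           proj_g(g).unsqueeze(0),   # fills the <GRAPH> slot
                           proj_a(G).unsqueeze(0)],  # fills the <AUDIO> slot
                          dim=0)
# inputs_embeds would then be passed to the LLM in place of token ids.
print(inputs_embeds.shape)  # torch.Size([14, 3584])
```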

6. Forward Pass Implementation Summary

The stepwise MR-DAN forward computation can be summarized as follows:

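A compact sketch composing the pieces above (an illustrative reconstruction, not the authors' reference implementation; `build_edges` and `MRDANLayer` are the assumed helpers from earlier sections):

```python
import torch

def mr_dan_forward(h, speakers, W_p_mod, speaker_emb, layers, edges_fn):
    """h: (M, d_h) utterance embeddings from the audio encoder Phi.
    speakers: (M,) speaker ids; layers: list of MRDANLayer instances.
    Returns the graph-level embedding g."""
    # 1. Initial node features x_i = W_p [h_i ; e_{s_i}]
    x = W_p_mod(torch.cat([h, speaker_emb(speakers)], dim=-1))
    # 2. Build the four typed edge sets of the dialogue graph
    edges = edges_fn(x, speakers)
    # 3. Stack L MR-DAN layers (typically 2-4)
    for layer in layers:
        x = layer(x, edges)
    # 4. Mean-pool node states to the graph-level embedding g
    return x.mean(dim=0)
```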

7. Training and Optimization Objectives

Supervised learning is conducted via standard cross-entropy loss over labeled data:

$$\mathcal{L}_{\mathrm{sup}} = -\sum_{i=1}^{N_L} \sum_{c=1}^{K} \mathbf{1}\{y_i = c\}\, \log \hat{p}_{i,c}$$

Unlabeled data is incorporated using an adaptive semi-supervised strategy: pseudo-labels are generated with class- and global-confidence dual-thresholding, $\Delta$-margin filtering, and entropy-based sample prioritization. High-confidence pseudo-labels are mixed into the loss. Regularization is provided via AdamW weight decay, LoRA dropout on adapter matrices, and LayerNorm within MR-DAN to ensure stable optimization.
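
A sketch of how the supervised loss and dual-threshold pseudo-labeling might be combined. The thresholds, margin $\Delta$, and mixing weight are assumed hyperparameters, and entropy-based prioritization is omitted for brevity:

```python
import torch
import torch.nn.functional as F

def pseudo_label_mask(logits, class_thresh, global_thresh=0.9, delta=0.2):
    """Dual-threshold + margin filter over an unlabeled batch.

    class_thresh: (K,) per-class confidence thresholds;
    global_thresh and delta are illustrative values.
    """
    probs = F.softmax(logits, dim=-1)
    top2 = probs.topk(2, dim=-1).values          # top-1 and top-2 confidences
    conf, margin = top2[:, 0], top2[:, 0] - top2[:, 1]
    labels = probs.argmax(dim=-1)
    keep = ((conf > global_thresh)               # global confidence gate
            & (conf > class_thresh[labels])      # per-class confidence gate
            & (margin > delta))                  # Delta-margin filter
    return labels, keep

def total_loss(logits_l, y_l, logits_u, class_thresh, lam=1.0):
    """Supervised CE plus CE on retained high-confidence pseudo-labels."""
    loss_sup = F.cross_entropy(logits_l, y_l)
    y_u, keep = pseudo_label_mask(logits_u, class_thresh)
    if keep.any():
        loss_unsup = F.cross_entropy(logits_u[keep], y_u[keep])
    else:
        loss_unsup = logits_u.new_zeros(())      # no samples passed the filters
    return loss_sup + lam * loss_unsup
```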

All graph parameters ($W_Q^h$, $W_K^h$, $W_V^h$, $W_O$), speaker embeddings, and the LoRA adapter weights of the downstream LLM are updated jointly via end-to-end backpropagation.


MR-DAN, as presented in DialogGraph-LLM, operationalizes multi-relational graph attention mechanisms for audio dialogue understanding, providing an explicit and highly expressive framework for encoding the complex relational structure of multi-party spoken interactions (Liu et al., 14 Nov 2025). Its careful architectural integration with large multimodal foundation models renders it suitable for intent recognition in resource-scarce, audio-rich domains.
