Multi-Relational Dialogue Attention Network
- The paper introduces MR-DAN which captures multi-speaker dialogue dependencies by modeling temporal, speaker-continuity, semantic, and self-loop relations.
- It employs multi-head, edge-type-specific graph attention combined with audio and speaker embeddings to build dense, context-aware utterance representations.
- Integration with multimodal foundation models and LoRA adapters enables robust, end-to-end acoustic-to-intent inference with minimal supervision.
The Multi-Relational Dialogue Attention Network (MR-DAN) is a neural architecture designed to capture the heterogeneous structural dependencies within multi-speaker audio dialogues, facilitating effective acoustic-to-intent inference in scenarios where utterances exhibit complex inter-utterance and speaker dependencies. Introduced as the core graph module in the DialogGraph-LLM framework, MR-DAN fuses relation-aware graph-based attention with multimodal foundation models, enabling end-to-end audio dialogue intent recognition with minimal supervision (Liu et al., 14 Nov 2025).
1. Purpose and Core Architecture
MR-DAN's primary objective is to encode the intricate internal structure of spoken dialogues—including speaker turns, temporal sequencing, and semantic interactions between both proximate and distant utterances—into dense representations suitable for intent classification. The high-level pipeline comprises:
- Audio encoder (Qwen2.5-Omni-audio): Transforms each utterance $u_i$ into a dense embedding $\mathbf{a}_i \in \mathbb{R}^{d_a}$.
- Speaker embedding lookup: Each speaker $s_i$ is mapped to a learned embedding $\mathbf{e}_{s_i} \in \mathbb{R}^{d_s}$.
- Initial node features: Utterance and speaker embeddings are concatenated and linearly projected: $\mathbf{h}_i^{(0)} = W_0\,[\mathbf{a}_i \,\Vert\, \mathbf{e}_{s_i}] + \mathbf{b}_0$.
- Multi-relational dialogue graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$: Nodes represent utterances; edges belong to four typed, directed relations.
- MR-DAN layers: Employ multi-head, edge-type-specific graph attention mechanisms.
- Graph pooling and LLM integration: Node representations are pooled and combined, via prompt engineering, with a global audio embedding for downstream intent classification.
This modular decomposition facilitates the capture of relational priors critical for modeling speaker- and discourse-level context in conversational AI.
2. Dialogue Graph Construction
2.1 Node Representations
The dialogue audio is segmented into utterances $u_1, \dots, u_N$, each tagged with a speaker $s_i$. The pre-trained audio encoder generates utterance embeddings $\mathbf{a}_i \in \mathbb{R}^{d_a}$. Distinct speaker embeddings $\mathbf{e}_{s_i} \in \mathbb{R}^{d_s}$ are learned and concatenated with $\mathbf{a}_i$. The resulting vectors are projected to form initial node features:

$$\mathbf{h}_i^{(0)} = W_0\,[\mathbf{a}_i \,\Vert\, \mathbf{e}_{s_i}] + \mathbf{b}_0$$
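As a concrete illustration of this step, the node-feature construction reduces to a concatenation followed by a learned linear projection. The sketch below assumes PyTorch; all dimensions, tensor names, and module names are illustrative rather than taken from the paper.

```python
import torch
import torch.nn as nn

# Illustrative sizes (not from the paper): audio, speaker, and node dimensions.
d_audio, d_spk, d_model, num_speakers = 1024, 64, 512, 8

audio_emb = torch.randn(6, d_audio)              # one encoder embedding per utterance
speaker_ids = torch.tensor([0, 1, 0, 2, 1, 0])   # speaker tag for each utterance

speaker_table = nn.Embedding(num_speakers, d_spk)  # learned speaker embeddings
node_proj = nn.Linear(d_audio + d_spk, d_model)    # plays the role of W_0 and its bias

# h_i^(0) = W_0 [a_i || e_{s_i}] + b_0, computed for all utterances at once
h0 = node_proj(torch.cat([audio_emb, speaker_table(speaker_ids)], dim=-1))
print(h0.shape)  # torch.Size([6, 512])
```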
2.2 Edge Types
MR-DAN defines four relation types, establishing a heterogeneous directed graph. For node $i$, let $\mathcal{N}_r(i)$ denote the set of nodes with an incoming edge of type $r$ into $i$:
- Temporal edges ($\mathcal{E}_{\mathrm{temp}}$): $(i-1) \to i$ for $i > 1$, encoding sequential turn-taking.
- Speaker-continuity edges ($\mathcal{E}_{\mathrm{spk}}$): Edges from the most recent utterances by the same speaker ($s_j = s_i$, $j < i$), capturing per-speaker dialogue threads.
- Cross-utterance (semantic) edges ($\mathcal{E}_{\mathrm{sem}}$): Connect node $j$ to node $i$ when the semantic similarity of their utterance embeddings exceeds a learnable threshold $\tau$, representing long-range semantic dependencies and topical jumps.
- Self-loops ($\mathcal{E}_{\mathrm{self}}$): $(i, i)$ for every node, preserving each node's own context.
By encoding these relational structures, MR-DAN enables explicit modeling of multifaceted conversational context beyond linear utterance ordering.
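A minimal sketch of the graph construction, assuming the four neighborhoods are materialized as per-node index lists. The speaker-history window `k_spk` and the fixed cosine-similarity threshold `tau` are illustrative assumptions; in the paper the semantic threshold is learnable rather than fixed.

```python
import torch
import torch.nn.functional as F

def build_relation_neighbors(audio_emb, speaker_ids, k_spk=2, tau=0.7):
    """Return per-node incoming-neighbor index lists for each of the four relations."""
    n = audio_emb.size(0)
    spk = [int(s) for s in speaker_ids]
    # Pairwise cosine similarity between utterance embeddings, shape (n, n).
    sim = F.cosine_similarity(audio_emb.unsqueeze(1), audio_emb.unsqueeze(0), dim=-1)
    nbrs = {"temporal": [], "speaker": [], "semantic": [], "self": []}
    for i in range(n):
        nbrs["temporal"].append([i - 1] if i > 0 else [])          # previous turn
        same = [j for j in range(i) if spk[j] == spk[i]]
        nbrs["speaker"].append(same[-k_spk:])                      # same-speaker history
        nbrs["semantic"].append(                                   # long-range, similar turns
            [j for j in range(n) if j != i and sim[i, j] >= tau])
        nbrs["self"].append([i])                                   # self-loop
    return nbrs
```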
3. Multi-Relational Graph Attention Mechanism
Let $R$ denote the number of edge types and $H$ the total number of attention heads, partitioned by type: $H = \sum_{r=1}^{R} H_r$. Each attention head $k \in \{1, \dots, H_r\}$ for relation type $r$ uses separate learned parameters $(W_r^{k,Q}, W_r^{k,K}, W_r^{k,V})$.
3.1 Attention Score Computation
For node $i$, relation $r$, and head $k$, attention scores are computed only over $\mathcal{N}_r(i)$:

$$e_{ij}^{r,k} = \frac{\big(W_r^{k,Q}\,\mathbf{h}_i^{(l)}\big)^{\top}\big(W_r^{k,K}\,\mathbf{h}_j^{(l)}\big)}{\sqrt{d_{\mathrm{head}}}}, \qquad j \in \mathcal{N}_r(i),$$

where $d_{\mathrm{head}}$ is the per-head feature dimension.
3.2 Attention Weights and Value Aggregation
Scores are softmax-normalized over $\mathcal{N}_r(i)$:

$$\alpha_{ij}^{r,k} = \frac{\exp\big(e_{ij}^{r,k}\big)}{\sum_{j' \in \mathcal{N}_r(i)} \exp\big(e_{ij'}^{r,k}\big)}$$

Each head aggregates neighbor features:

$$\mathbf{z}_i^{r,k} = \sum_{j \in \mathcal{N}_r(i)} \alpha_{ij}^{r,k}\, W_r^{k,V}\,\mathbf{h}_j^{(l)}$$
3.3 Relation-Specific and Final Node Update
Heads within each relation are concatenated: $\mathbf{z}_i^{r} = \big\Vert_{k=1}^{H_r}\, \mathbf{z}_i^{r,k}$. The per-relation outputs are then concatenated, linearly projected, and merged with a residual connection and layer norm:

$$\mathbf{h}_i^{(l+1)} = \mathrm{LayerNorm}\Big(\mathbf{h}_i^{(l)} + W_O\,\big[\mathbf{z}_i^{1} \,\Vert\, \cdots \,\Vert\, \mathbf{z}_i^{R}\big]\Big),$$

where $W_O$ is a learned output projection.
Multiple MR-DAN layers are stacked (typically 2–4) to increase representational depth.
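A minimal sketch of a single MR-DAN layer under the notation above, assuming scaled dot-product attention with per-relation, per-head query/key/value projections; the head counts, dimensions, and class name are illustrative rather than the paper's exact implementation.

```python
import math
import torch
import torch.nn as nn

class MRDANLayer(nn.Module):
    """One multi-relational attention layer (illustrative reconstruction)."""

    def __init__(self, d_model=512, relations=("temporal", "speaker", "semantic", "self"),
                 heads_per_relation=2):
        super().__init__()
        self.relations, self.H_r = relations, heads_per_relation
        self.d_head = d_model // heads_per_relation
        # Separate Q/K/V projections per relation type (heads folded into the output dim).
        self.qkv = nn.ModuleDict({
            r: nn.ModuleDict({m: nn.Linear(d_model, d_model) for m in ("q", "k", "v")})
            for r in relations})
        self.out = nn.Linear(d_model * len(relations), d_model)  # W_O
        self.norm = nn.LayerNorm(d_model)

    def forward(self, h, nbrs):
        n, per_relation = h.size(0), []
        for r in self.relations:
            q = self.qkv[r]["q"](h).view(n, self.H_r, self.d_head)
            k = self.qkv[r]["k"](h).view(n, self.H_r, self.d_head)
            v = self.qkv[r]["v"](h).view(n, self.H_r, self.d_head)
            z = torch.zeros_like(q)
            for i in range(n):
                idx = nbrs[r][i]
                if not idx:
                    continue  # no incoming edges of this type for node i
                # Scores over N_r(i), shape (H_r, |N_r(i)|), then softmax and aggregation.
                scores = torch.einsum("hd,jhd->hj", q[i], k[idx]) / math.sqrt(self.d_head)
                alpha = scores.softmax(dim=-1)
                z[i] = torch.einsum("hj,jhd->hd", alpha, v[idx])
            per_relation.append(z.reshape(n, -1))           # concat heads within a relation
        fused = self.out(torch.cat(per_relation, dim=-1))   # concat relations, project
        return self.norm(h + fused)                         # residual + LayerNorm
```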
4. Handling Relation Types and Multi-Head Parameterization
Partitioning attention heads by relation type, with each head set and parameter triplet , grants MR-DAN the capacity to model each edge type with a distinct aggregation function. Temporal, speaker-continuity, and semantic edges are processed independently, while their outputs are fused at each layer. This architecture supports heterogeneous information multiplexing and preserves the representational flexibility inherent to multi-head attention schemes.
5. Integration with Multimodal Foundation Models
Following the MR-DAN layers, mean pooling yields the graph-level embedding:

$$\mathbf{g} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{h}_i^{(L)}$$

In parallel, the entire audio dialogue is encoded as a global audio embedding $\mathbf{a}_{\mathrm{glob}}$. Both representations are injected as continuous multimodal tokens into a prompt for the LLM (Qwen2.5-Omni-7B), where $\mathbf{g}$ and $\mathbf{a}_{\mathrm{glob}}$ appear among the first tokens fed to the LLM's transformer stack. The final output head, fine-tuned via parameter-efficient LoRA adapters, emits a softmax distribution over intent classes.
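A sketch of this integration step is shown below; `graph_proj`, `audio_proj`, and the pre-embedded instruction tokens are hypothetical placeholders rather than the paper's exact prompt template or projection interface.

```python
import torch

def build_llm_inputs(h_final, global_audio_emb, text_prompt_emb, graph_proj, audio_proj):
    """Assemble continuous multimodal tokens for the LLM (illustrative only).

    h_final:          (N, d_model) node states after the last MR-DAN layer
    global_audio_emb: (d_audio,)   whole-dialogue audio embedding
    text_prompt_emb:  (T, d_llm)   embedded instruction/query tokens
    graph_proj/audio_proj: hypothetical linear maps into the LLM width d_llm
    """
    g = h_final.mean(dim=0)                       # graph-level embedding via mean pooling
    graph_token = graph_proj(g).unsqueeze(0)      # (1, d_llm)
    audio_token = audio_proj(global_audio_emb).unsqueeze(0)
    # Prepend the two continuous tokens so they sit among the first LLM inputs.
    return torch.cat([graph_token, audio_token, text_prompt_emb], dim=0)
```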
6. Forward Pass Implementation Summary
The stepwise MR-DAN forward computation can be summarized as follows:
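A minimal end-to-end sketch, assuming PyTorch and reusing the hypothetical helpers from the earlier snippets (`build_relation_neighbors`, `MRDANLayer`, `build_llm_inputs`), with the audio encoder and the LoRA-adapted LLM treated as opaque callables:

```python
import torch

def mrdan_forward(utterance_audio, speaker_ids, audio_encoder, speaker_table, node_proj,
                  mrdan_layers, global_audio_emb, text_prompt_emb,
                  graph_proj, audio_proj, llm):
    # 1. Encode each utterance and build initial node features h^(0).
    audio_emb = audio_encoder(utterance_audio)                          # (N, d_audio)
    h = node_proj(torch.cat([audio_emb, speaker_table(speaker_ids)], dim=-1))
    # 2. Construct the four typed neighborhoods (temporal, speaker, semantic, self).
    nbrs = build_relation_neighbors(audio_emb, speaker_ids)
    # 3. Apply the stacked MR-DAN layers.
    for layer in mrdan_layers:
        h = layer(h, nbrs)
    # 4. Pool node states, inject continuous tokens into the LLM prompt, classify intents.
    llm_inputs = build_llm_inputs(h, global_audio_emb, text_prompt_emb,
                                  graph_proj, audio_proj)
    return llm(llm_inputs).softmax(dim=-1)                              # intent distribution
```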
7. Training and Optimization Objectives
Supervised learning is conducted via standard cross-entropy loss over labeled data:

$$\mathcal{L}_{\mathrm{sup}} = -\frac{1}{|\mathcal{D}_L|} \sum_{(x_j,\, y_j) \in \mathcal{D}_L} \log p_\theta\big(y_j \mid x_j\big)$$
Unlabeled data is incorporated using an adaptive semi-supervised strategy: pseudo-labels are generated with class-wise and global confidence dual-thresholding, margin-based filtering, and entropy-based sample prioritization, and only high-confidence pseudo-labels are mixed into the loss. Regularization is provided via AdamW weight decay, LoRA dropout on the adapter matrices, and LayerNorm within MR-DAN, ensuring stable optimization.
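A rough sketch of the pseudo-label selection logic described above; the specific threshold values, margin, and selection budget are illustrative hyperparameters rather than the paper's exact procedure.

```python
import torch

def select_pseudo_labels(probs, class_thresh, global_thresh=0.9, margin=0.2, budget=256):
    """probs: (M, C) softmax outputs on unlabeled clips; class_thresh: (C,) per-class thresholds.

    Returns the indices of accepted clips and their pseudo-labels.
    """
    conf, labels = probs.max(dim=-1)
    top2 = probs.topk(2, dim=-1).values
    keep = (conf >= global_thresh) \
         & (conf >= class_thresh[labels]) \
         & ((top2[:, 0] - top2[:, 1]) >= margin)            # dual threshold + margin filter
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    idx = keep.nonzero(as_tuple=True)[0]
    idx = idx[entropy[idx].argsort()][:budget]              # prioritize low-entropy samples
    return idx, labels[idx]
```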
All graph parameters ($W_0$, the per-head projections $W_r^{k,Q}, W_r^{k,K}, W_r^{k,V}$, and the output projections $W_O$), the speaker embeddings, and the LoRA adapter weights of the downstream LLM are updated jointly via end-to-end backpropagation.
MR-DAN, as presented in DialogGraph-LLM, operationalizes multi-relational graph attention mechanisms for audio dialogue understanding, providing an explicit and highly expressive framework for encoding the complex relational structure of multi-party spoken interactions (Liu et al., 14 Nov 2025). Its careful architectural integration with large multimodal foundation models renders it suitable for intent recognition in resource-scarce, audio-rich domains.