DialogGraph-LLM: Audio Dialogue Intent Framework
- DialogGraph-LLM is an end-to-end system that fuses multi-relational graph neural architectures with multimodal large language models to accurately capture speaker intents from audio dialogues.
- Its MR-DAN component models temporal, speaker-specific, and semantic relations, yielding accuracy gains of up to 13.7 percentage points over strong audio- and text-based baselines.
- The framework employs adaptive semi-supervised learning with dynamic thresholds, ensuring robust performance and scalable inference in real-world applications.
DialogGraph-LLM is an end-to-end framework for audio dialogue intent recognition that integrates multi-relational graph neural architectures with multimodal foundation LLMs, designed to capture the complex inter-dependencies of multi-speaker audio dialogues and to perform well under limited supervision. The central innovation is the combination of a novel Multi-Relational Dialogue Attention Network (MR-DAN) with an adaptive semi-supervised learning mechanism, yielding robust, scalable inference directly from audio to speaker-intent classification. DialogGraph-LLM demonstrates competitive performance in real-world and benchmark scenarios, outperforming prominent audio- and text-based baselines.
1. Pipeline Architecture and Preprocessing
DialogGraph-LLM’s pipeline commences with raw audio input from multi-speaker dialogues. Speaker diarization decomposes the audio into a sequence of utterance segments, each assigned a speaker ID. The Qwen2.5-Omni-7B backbone’s audio encoder then generates:
- Utterance-level representations for each segment.
- A dialogue-level representation capturing the global acoustic context.
These embeddings serve as node and global features for subsequent graph-based relational modeling.
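As a concrete picture of what this stage produces, the sketch below models the diarized segments and encoder outputs as plain containers; all names (`UtteranceSegment`, `DialogueFeatures`, and their fields) are illustrative and not taken from the paper.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class UtteranceSegment:
    """One diarized segment of the input dialogue."""
    speaker_id: int        # speaker label assigned by diarization
    start_s: float         # segment start time (seconds)
    end_s: float           # segment end time (seconds)
    embedding: np.ndarray  # utterance-level acoustic embedding from the audio encoder

@dataclass
class DialogueFeatures:
    """Audio-encoder outputs consumed by the graph module and the LLM."""
    utterances: List[UtteranceSegment]  # per-segment node features
    dialogue_embedding: np.ndarray      # global acoustic embedding of the whole dialogue
```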
2. Multi-Relational Dialogue Attention Network (MR-DAN)
MR-DAN constructs a graph over the utterances, with each node initialized by combining a learnable speaker embedding with the utterance’s acoustic embedding.
The dialogue graph uses four hand-designed edge types:
- Temporal adjacency (sequential utterances)
- Speaker-history adjacency (utterance by same speaker)
- Cross-utterance semantic adjacency (cross-turn links whose semantic similarity exceeds a threshold)
- Self-loops.
An adjacency matrix is constructed for each edge type, determining the permitted attention neighborhood for that relation.
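A minimal sketch of how such per-relation adjacency masks could be built, assuming symmetric edges and a cosine-similarity rule for the semantic relation; the relation names, the unwindowed speaker-history rule, and the default threshold are illustrative assumptions.

```python
import numpy as np

def build_relation_adjacency(speaker_ids, utt_emb, sim_threshold=0.7):
    """Build one boolean adjacency matrix per hand-designed relation (sketch)."""
    n = len(speaker_ids)
    adj = {r: np.zeros((n, n), dtype=bool)
           for r in ("temporal", "speaker", "semantic", "self")}

    # Temporal adjacency: consecutive utterances.
    for i in range(n - 1):
        adj["temporal"][i, i + 1] = adj["temporal"][i + 1, i] = True

    # Speaker-history adjacency: utterances by the same speaker
    # (the paper may additionally restrict this to a history window).
    for i in range(n):
        for j in range(i):
            if speaker_ids[i] == speaker_ids[j]:
                adj["speaker"][i, j] = adj["speaker"][j, i] = True

    # Cross-utterance semantic adjacency: cosine similarity above a threshold.
    norm = utt_emb / (np.linalg.norm(utt_emb, axis=1, keepdims=True) + 1e-12)
    sim = norm @ norm.T
    adj["semantic"] = sim > sim_threshold
    np.fill_diagonal(adj["semantic"], False)

    # Self-loops.
    np.fill_diagonal(adj["self"], True)
    return adj
```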
Relation-aware multi-head attention is employed:
- Attention heads are partitioned into groups, one group per relation type.
- Heads assigned to a relation attend only to the neighbors permitted by that relation’s adjacency matrix.
- Each head computes masked attention weights over its permitted neighborhood and aggregates the corresponding value vectors.
Head outputs are concatenated per relation, followed by a relation-informed node update; an alternative update form aggregates the per-relation outputs with learnable relation bias matrices.
After several message-passing iterations, the graph-level embedding is obtained by mean pooling over the node embeddings.
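The head-partitioning and masking idea can be sketched as follows in PyTorch, assuming the adjacency masks are `torch.bool` tensors (e.g., converted from the arrays above). The even split of heads across relations and the residual update are simplifying assumptions, not the paper's exact MR-DAN equations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationAwareAttention(nn.Module):
    """Relation-partitioned multi-head attention over a dialogue graph (sketch)."""

    def __init__(self, dim, n_heads, relations):
        super().__init__()
        assert dim % n_heads == 0 and n_heads % len(relations) == 0
        self.relations = relations
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x: (n_nodes, dim); adj: dict of (n_nodes, n_nodes) boolean masks per relation
        n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(n, self.n_heads, self.head_dim).transpose(0, 1)  # (heads, n, d)
        k = k.view(n, self.n_heads, self.head_dim).transpose(0, 1)
        v = v.view(n, self.n_heads, self.head_dim).transpose(0, 1)

        heads_per_rel = self.n_heads // len(self.relations)
        outputs = []
        for g, rel in enumerate(self.relations):
            sl = slice(g * heads_per_rel, (g + 1) * heads_per_rel)
            scores = q[sl] @ k[sl].transpose(-1, -2) / self.head_dim ** 0.5
            scores = scores.masked_fill(~adj[rel], float("-inf"))
            attn = torch.nan_to_num(F.softmax(scores, dim=-1))  # nodes with no neighbors
            outputs.append(attn @ v[sl])

        out = torch.cat(outputs, dim=0).transpose(0, 1).reshape(n, -1)
        return x + self.out(out)  # residual node update
```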
3. LLM Integration and Input Fusion
DialogGraph-LLM leverages multimodal LLMs via customized input fusion:
- Two lightweight adapters project the global audio embedding and the graph-level embedding into the LLM input space.
- A prompt template (e.g., “Intent?”) is tokenized, and its dedicated placeholder tokens are replaced with the adapted audio and graph embeddings.
- These three input streams are concatenated at layer zero and processed through the LLM, which produces a probability distribution over intent labels.
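A hedged sketch of the fusion step, assuming one placeholder position per modality and two-layer MLP adapters; the placeholder handling and adapter architecture are illustrative choices rather than the paper's specification.

```python
import torch
import torch.nn as nn

class InputFusion(nn.Module):
    """Project audio/graph embeddings into the LLM token space and splice them
    into the embedded prompt at placeholder positions (sketch)."""

    def __init__(self, audio_dim, graph_dim, llm_dim):
        super().__init__()
        self.audio_adapter = nn.Sequential(
            nn.Linear(audio_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))
        self.graph_adapter = nn.Sequential(
            nn.Linear(graph_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, prompt_embeds, audio_emb, graph_emb, audio_pos, graph_pos):
        # prompt_embeds: (seq_len, llm_dim) token embeddings of the tokenized prompt
        fused = prompt_embeds.clone()
        fused[audio_pos] = self.audio_adapter(audio_emb)  # replace audio placeholder
        fused[graph_pos] = self.graph_adapter(graph_emb)  # replace graph placeholder
        return fused  # passed to the LLM as its input embeddings for classification
```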
4. Adaptive Semi-Supervised Learning and Pseudo-Labeling
Addressing limited supervision, DialogGraph-LLM incorporates an adaptive semi-supervised learning (SSL) strategy comprising dual-threshold filtering and entropy-based sample selection:
- For each unlabeled instance, obtain the model’s posterior distribution over intent classes.
- Maintain a global confidence threshold, updated by an exponential moving average (EMA) of model confidence.
- Estimate class marginals by EMA and derive per-class thresholds from them.
- Filter instances: accept a pseudo-label only when its confidence clears both the global threshold and the class-specific threshold of its predicted class.
- Compute the prediction entropy of each accepted instance.
- Rank eligible samples by entropy, augmenting the training set with high-information pseudo-labels.
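Putting the selection rules together, the sketch below follows common adaptive-threshold SSL practice (EMA of confidence, marginal-scaled per-class thresholds, entropy ranking); the exact update rules and the descending entropy ordering are assumptions, not the paper's formulas.

```python
import numpy as np

def select_pseudo_labels(probs, tau_global, class_ema, momentum=0.999):
    """Dual-threshold filtering with entropy ranking over unlabeled predictions (sketch).

    probs: (batch, n_classes) softmax posteriors for unlabeled dialogues.
    tau_global: running global confidence threshold (scalar).
    class_ema: running EMA estimate of the class marginals, shape (n_classes,).
    """
    conf = probs.max(axis=1)      # per-sample confidence
    preds = probs.argmax(axis=1)  # hard pseudo-labels

    # EMA updates of the global threshold and the class marginals.
    tau_global = momentum * tau_global + (1 - momentum) * conf.mean()
    class_ema = momentum * class_ema + (1 - momentum) * probs.mean(axis=0)

    # Per-class thresholds scaled from the global one by normalized class marginals.
    tau_class = tau_global * class_ema / class_ema.max()

    # Dual-threshold filter: confidence must clear both thresholds.
    accepted = (conf >= tau_global) & (conf >= tau_class[preds])

    # Rank accepted samples by prediction entropy (most informative first).
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    order = np.argsort(-entropy[accepted])
    accepted_idx = np.flatnonzero(accepted)[order]
    return accepted_idx, preds, tau_global, class_ema
```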
5. Optimization Objectives and Loss Functions
Training optimizes a unified objective over the labeled set and the selected pseudo-labeled samples:
- Supervised cross-entropy loss on labeled instances.
- Unsupervised loss over accepted pseudo-labels.
- Regularized joint objective combining both terms,
where a weighting coefficient scales the unsupervised contribution and a regularization coefficient controls the penalty term.
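One plausible instantiation of this objective, writing $\mathcal{D}_L$ for the labeled set, $\mathcal{D}_U$ for the accepted pseudo-labels, $\lambda_u$ for the unsupervised weight, and $\lambda_r$ for the regularization coefficient (the notation and the $\ell_2$ form of the regularizer are assumptions, not the paper's exact formula):

$$
\mathcal{L} \;=\; \frac{1}{|\mathcal{D}_L|}\sum_{(x_i,\,y_i)\in\mathcal{D}_L} \mathrm{CE}\big(y_i,\, p_\theta(x_i)\big)
\;+\; \lambda_u\,\frac{1}{|\mathcal{D}_U|}\sum_{(x_j,\,\hat{y}_j)\in\mathcal{D}_U} \mathrm{CE}\big(\hat{y}_j,\, p_\theta(x_j)\big)
\;+\; \lambda_r\,\lVert\theta\rVert_2^2
$$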
6. Empirical Evaluation and Results
DialogGraph-LLM is evaluated on two datasets:
- MarketCalls: 8,770 real Mandarin sales calls, annotated across four hierarchical intent levels (A–D), with diarized speaker turns.
- MIntRec 2.0: public multimodal benchmark for intent recognition (in-scope/out-of-scope) in audio/text dialogues.
Metrics include accuracy, macro-F1, and per-class F1. Baselines comprise Llama3.1-8B, GLM-4-9B, Gemini1.5-Pro, Qwen2.5-Omni, as well as multimodal methods MAG-BERT, MulT, TCL-MAP, and A-MESS.
Performance Summary
| Dataset | Best Baseline (Acc / Macro-F1) | DialogGraph-LLM (Acc / Macro-F1) | Accuracy Gain |
|---|---|---|---|
| MarketCalls | 63.6% / 63.1% (Qwen2.5-Omni) | 77.3% / 76.8% | +13.7 pp |
| MIntRec 2.0 | 56.8% / 49.3% (A-MESS) | 64.3% / 58.1% | +7.5 pp |
Ablation studies show that omitting MR-DAN results in sharp performance drops and that fixed-threshold SSL is suboptimal (73.6% accuracy vs. 77.3% with adaptive thresholds). Tuning MR-DAN hyperparameters (8 attention heads, together with the speaker-history window and cross-turn similarity threshold) produces consistent empirical peaks.
7. Contributions, Limitations, and Prospects
DialogGraph-LLM delivers:
- An audio-to-intent pipeline integrating raw audio, relational graph structure, and LLMs.
- MR-DAN, a multi-relational attention GNN modeling temporal, speaker, and semantic dialogue relations with fixed edge types.
- Adaptive semi-supervised learning with dynamic thresholds and entropy-based instance selection.
Limitations include exclusive evaluation on Qwen2.5-Omni-7B, manual edge-type specification, and residual pseudo-label noise. Suggested future work involves broadening foundation-model backbones (e.g., GPT-4o, AudioPaLM), end-to-end edge-type learning, advanced SSL schemes (consistency regularization, co-training), and extension to streaming/long-context inference.
A plausible implication is that integrating explicit dialogue graph structure with foundation LLMs under limited supervision yields substantial gains in intent recognition accuracy and data efficiency for audio-rich domains.