DialogGraph-LLM: Audio Dialogue Intent Framework

Updated 17 November 2025
  • DialogGraph-LLM is an end-to-end system that fuses multi-relational graph neural architectures with multimodal large language models to accurately capture speaker intents from audio dialogues.
  • Its innovative MR-DAN component leverages temporal, speaker-specific, and semantic relations, yielding up to +13.7% accuracy gains over strong audio- and text-based baselines.
  • The framework employs adaptive semi-supervised learning with dynamic thresholds, ensuring robust performance and scalable inference in real-world applications.

DialogGraph-LLM is an end-to-end framework for audio dialogue intent recognition that integrates multi-relational graph neural architectures with multimodal foundation LLMs, specifically designed to address complex inter-dependencies in multi-speaker audio dialogues and to excel under limited supervision. The central innovation is the combination of a novel Multi-Relational Dialogue Attention Network (MR-DAN) with adaptive semi-supervised learning mechanisms, yielding robust, scalable inference directly from audio to speaker intent classification. DialogGraph-LLM demonstrates strong performance in real-world and benchmark scenarios, outperforming prominent audio- and text-based baselines.

1. Pipeline Architecture and Preprocessing

DialogGraph-LLM’s pipeline commences with raw audio input from multi-speaker dialogues. Speaker diarization decomposes the audio $A$ into a sequence of utterance segments $\{a_j\}_{j=1}^L$, each assigned a speaker ID $s_j$. The Qwen2.5-Omni-7B backbone’s audio encoder $\Phi$ generates:

  • Utterance representations: $h_j = \Phi(a_j)$ for each segment.
  • Dialogue-level representation: $G = \Phi(A)$, incorporating the global acoustic overview.

These embeddings serve as node and global features for subsequent graph-based relational modeling.
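
As a concrete illustration, the sketch below mirrors this preprocessing stage under stated assumptions: `AudioEncoder` is a hypothetical stand-in for the backbone's audio encoder $\Phi$, and the diarization output is assumed to arrive as (start, end, speaker) segments; neither name comes from the source.

```python
# Minimal preprocessing sketch (PyTorch). AudioEncoder and the segment format are
# hypothetical stand-ins for the real diarizer and the backbone's audio encoder.
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Placeholder for Phi: maps a (frames, feat_dim) feature matrix to one embedding."""
    def __init__(self, feat_dim: int = 80, d_model: int = 1024):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats).mean(dim=0)          # mean-pool frames -> utterance vector

def preprocess(dialogue_feats, segments, encoder):
    """segments: list of (start_frame, end_frame, speaker_id) from diarization."""
    h = torch.stack([encoder(dialogue_feats[s:e]) for s, e, _ in segments])  # h_j
    speakers = [spk for _, _, spk in segments]                               # s_j
    G = encoder(dialogue_feats)                                              # dialogue-level G
    return h, speakers, G
```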

2. Multi-Relational Dialogue Attention Network (MR-DAN)

Fundamentally, MR-DAN constructs a graph over utterances, with each node $v_j$ initialized as $x_j^{(0)} = W_p\bigl[h_j;\, e_{s_j}\bigr]$, where $e_{s_j}$ is a learnable speaker embedding and $h_j$ encodes the acoustic content.

The dialogue graph uses four hand-designed edge types ($T = 4$):

  • Temporal adjacency (sequential utterances)
  • Speaker-history adjacency (utterance by same speaker)
  • Cross-utterance semantic adjacency (cross-turn relations exceeding a semantic similarity threshold $\theta$)
  • Self-loops

Adjacency matrices $R^{(k)} \in \{0,1\}^{L \times L}$ are constructed for each edge type $k$, determining the permitted attention neighborhoods.
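
A minimal sketch of the node initialization and the four adjacency matrices follows; the temporal window, the use of cosine similarity for the semantic relation, and integer-encoded speaker IDs are illustrative assumptions rather than details taken from the source.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NodeInit(nn.Module):
    """x_j^(0) = W_p [h_j ; e_{s_j}]: concatenate acoustic and speaker embeddings, project."""
    def __init__(self, d_audio: int, n_speakers: int, d_spk: int, d_model: int):
        super().__init__()
        self.spk_emb = nn.Embedding(n_speakers, d_spk)
        self.W_p = nn.Linear(d_audio + d_spk, d_model)

    def forward(self, h, speaker_ids):
        return self.W_p(torch.cat([h, self.spk_emb(speaker_ids)], dim=-1))

def build_relations(h, speaker_ids, window: int = 3, theta: float = 0.8):
    """Return the T=4 adjacency matrices R^(k) in {0,1}^{L x L} (assumed definitions)."""
    L = h.size(0)
    idx = torch.arange(L)
    dist = (idx[:, None] - idx[None, :]).abs()
    spk = torch.as_tensor(speaker_ids)                                   # integer speaker IDs

    R_temporal = ((dist > 0) & (dist <= window)).float()                 # sequential turns
    R_speaker = ((spk[:, None] == spk[None, :]) & (dist > 0)).float()    # same-speaker history
    sim = F.cosine_similarity(h[:, None, :], h[None, :, :], dim=-1)
    R_semantic = ((sim >= theta) & (dist > 1)).float()                   # cross-turn semantics
    R_self = torch.eye(L)                                                # self-loops
    return torch.stack([R_temporal, R_speaker, R_semantic, R_self])      # (4, L, L)
```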

Relation-aware multi-head attention is employed:

  • $H$ attention heads are partitioned into $T$ groups, each dedicated to a relation type $k$.
  • Attention heads for relation $k$ only attend to neighbors in $\mathcal N_k(i)$ according to $R^{(k)}$.
  • For each head:

$$e_{j,i}^h = \frac{(W_Q^h\, x_i)^\top (W_K^h\, x_j)}{\sqrt{d_k}}$$

$$\alpha_{j,i}^h = \mathrm{softmax}_{j \in \mathcal N_k(i)}\bigl(e_{j,i}^h\bigr)$$

$$z_{i,k}^h = \sum_{j \in \mathcal N_k(i)} \alpha_{j,i}^h\, W_V^h\, x_j$$

Heads are concatenated per relation ($z_{i,k}$), followed by a relation-informed update:

$$x_i^{(\ell+1)} = \mathrm{LayerNorm}\Bigl(x_i^{(\ell)} + W_O\bigl[z_{i,1} \,\|\, z_{i,2} \,\|\, z_{i,3} \,\|\, z_{i,4}\bigr]\Bigr)$$
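
The relation-partitioned attention can be sketched as below; the per-relation head grouping and masking follow the equations above, while the dimensions and the handling of empty neighborhoods are implementation assumptions.

```python
import torch
import torch.nn as nn

class RelationAwareAttention(nn.Module):
    """One MR-DAN layer sketch: H heads split into T groups, each group masked to its
    relation's neighborhood N_k(i); outputs are concatenated and residual-normalized."""
    def __init__(self, d: int, heads: int = 8, relations: int = 4):
        super().__init__()
        assert heads % relations == 0 and d % heads == 0
        self.heads, self.h_per_rel, self.dk = heads, heads // relations, d // heads
        self.q, self.k, self.v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.out = nn.Linear(d, d)
        self.norm = nn.LayerNorm(d)

    def forward(self, x, R):                                         # x: (L, d), R: (T, L, L)
        L = x.size(0)
        Q = self.q(x).view(L, self.heads, self.dk).transpose(0, 1)   # (H, L, dk)
        K = self.k(x).view(L, self.heads, self.dk).transpose(0, 1)
        V = self.v(x).view(L, self.heads, self.dk).transpose(0, 1)
        scores = Q @ K.transpose(-1, -2) / self.dk ** 0.5            # e^h_{j,i}
        mask = R.repeat_interleave(self.h_per_rel, dim=0)            # one relation per head group
        scores = scores.masked_fill(mask == 0, float("-inf"))
        alpha = scores.softmax(dim=-1).nan_to_num(0.0)               # empty N_k(i) -> zero weights
        z = (alpha @ V).transpose(0, 1).reshape(L, -1)               # concat z_{i,k} over heads
        return self.norm(x + self.out(z))                            # residual + LayerNorm
```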

An alternative update form aggregates learnable relation bias matrices $W_k$:

$$A = \mathrm{softmax}\Bigl(QK^\top / \sqrt{d} + \sum_{k=1}^{T} W_k R^{(k)}\Bigr)$$

$$X' = \mathrm{LayerNorm}(X + AV)$$
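
A compact sketch of this bias-based variant is given below; for simplicity each relation contributes a single learnable scalar bias, which is an assumption — the $W_k$ in the text may be full matrices.

```python
import torch
import torch.nn as nn

class BiasedRelationAttention(nn.Module):
    """Alternative MR-DAN update: relation adjacency enters as an additive attention bias."""
    def __init__(self, d: int, relations: int = 4):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.rel_bias = nn.Parameter(torch.zeros(relations))    # simplified W_k: one scalar per relation
        self.norm = nn.LayerNorm(d)

    def forward(self, x, R):                                    # x: (L, d), R: (T, L, L)
        d = x.size(-1)
        Q, K, V = self.q(x), self.k(x), self.v(x)
        bias = (self.rel_bias[:, None, None] * R).sum(dim=0)    # approximates sum_k W_k R^(k)
        A = torch.softmax(Q @ K.T / d ** 0.5 + bias, dim=-1)
        return self.norm(x + A @ V)                             # X' = LayerNorm(X + A V)
```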

After $L$ iterations, the graph-level embedding $g$ is acquired via mean pooling.

3. LLM Integration and Input Fusion

DialogGraph-LLM leverages multimodal LLMs via customized input fusion:

  • Two lightweight adapters map $G$ (global audio embedding) and $g$ (graph embedding) to the LLM input space:

$$f_{\mathrm{audio}}(G) = W_a G + b_a, \qquad f_{\mathrm{graph}}(g) = W_g g + b_g$$

  • A prompt template (e.g., “Intent?”) is tokenized, and the $\langle\text{graph}\rangle$ and $\langle\text{audio}\rangle$ placeholder tokens are replaced with their respective adapted embeddings.
  • These three input streams are concatenated at layer zero and processed through the LLM, producing intent label probabilities:

$$\hat p = \mathcal{M}\bigl(\mathrm{Prompt},\, f_{\mathrm{graph}}(g),\, f_{\mathrm{audio}}(G)\bigr)$$

$$\hat y = \arg\max \hat p$$
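
The adapter-based fusion can be sketched as follows; `llm_embed`, the placeholder token IDs, and the calling convention are hypothetical stand-ins for the multimodal LLM's actual interface.

```python
import torch
import torch.nn as nn

class FusionAdapters(nn.Module):
    """Linear adapters f_graph and f_audio plus prompt fusion at the embedding layer."""
    def __init__(self, d_graph: int, d_audio: int, d_llm: int):
        super().__init__()
        self.f_graph = nn.Linear(d_graph, d_llm)   # f_graph(g) = W_g g + b_g
        self.f_audio = nn.Linear(d_audio, d_llm)   # f_audio(G) = W_a G + b_a

    def build_inputs(self, prompt_ids, graph_tok, audio_tok, g, G, llm_embed):
        """Replace <graph>/<audio> placeholder tokens with the adapted embeddings."""
        embeds = llm_embed(prompt_ids).clone()                # (seq_len, d_llm)
        embeds[prompt_ids == graph_tok] = self.f_graph(g)     # swap <graph> slot
        embeds[prompt_ids == audio_tok] = self.f_audio(G)     # swap <audio> slot
        return embeds   # passed to the LLM as input embeddings at layer zero
```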

4. Adaptive Semi-Supervised Learning and Pseudo-Labeling

Addressing limited supervision, DialogGraph-LLM incorporates an adaptive semi-supervised learning (SSL) strategy comprising dual-threshold filtering and entropy-based sample selection:

  • For each unlabeled instance $x \in D_U$, obtain the posterior $p(x)$ over intent classes.
  • Maintain a global confidence threshold $\tau_g$ via exponential moving average (EMA):

$$\tau_g^{(t)} \leftarrow \lambda\, \tau_g^{(t-1)} + (1-\lambda)\, \mathbb E_{x \in \mathcal B_U}\bigl[\max_i p_i(x)\bigr]$$

  • Estimate class marginals $\tilde p_c$ by EMA, forming per-class thresholds:

$$\tau_c = \tau_g\, \frac{\tilde p_c}{\max_k \tilde p_k} + \delta$$

  • Filter instances: accept $x$ iff $\max_i p_i(x) \ge \tau_g$ and $p_{\hat y}(x) \ge \tau_{\hat y}$, where $\hat y = \arg\max_i p_i(x)$.
  • Compute entropy:

$$\mathcal H(p(x)) = -\sum_{i=1}^{K} p_i(x)\, \log p_i(x)$$

  • Rank eligible samples by entropy, augmenting the training set with high-information pseudo-labels.
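
A sketch of the threshold maintenance and sample selection is given below; the EMA decay, the margin $\delta$, the fraction of samples kept, and the choice to rank high-entropy samples first are assumptions made for illustration.

```python
import torch

def update_tau_g(tau_g, probs, lam=0.99):
    """EMA update of the global confidence threshold tau_g over an unlabeled batch."""
    return lam * tau_g + (1 - lam) * probs.max(dim=1).values.mean()

def select_pseudo_labels(probs, tau_g, class_marginals, delta=0.05, keep_frac=0.5):
    """Dual-threshold filtering followed by entropy-based ranking.
    probs: (N, K) posteriors p(x) for unlabeled instances."""
    conf, y_hat = probs.max(dim=1)
    tau_c = tau_g * class_marginals / class_marginals.max() + delta   # per-class thresholds
    keep = (conf >= tau_g) & (conf >= tau_c[y_hat])                   # dual-threshold filter
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)      # H(p(x))
    idx = keep.nonzero(as_tuple=True)[0]
    idx = idx[entropy[idx].argsort(descending=True)]                  # most informative first
    idx = idx[: int(keep_frac * idx.numel())]
    return idx, y_hat[idx]                                            # pseudo-labeled additions
```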

5. Optimization Objectives and Loss Functions

Training optimizes a unified objective over the labeled set ($D_L$) and selected pseudo-labeled samples ($D_{PL}$):

  • Supervised cross-entropy loss:

$$\mathcal L_{\mathrm{sup}} = -\frac{1}{|D_L|} \sum_{(x,y) \in D_L} \sum_{i=1}^{K} \mathbf 1_{i=y}\, \log p_i(x)$$

  • Unsupervised loss over pseudo-labels:

$$\mathcal L_{\mathrm{unsup}} = -\frac{1}{|D_{PL}|} \sum_{(x,\hat y) \in D_{PL}} \log p_{\hat y}(x)$$

  • Regularized joint objective:

$$\mathcal L = \mathcal L_{\mathrm{sup}} + \lambda_u\, \mathcal L_{\mathrm{unsup}} + \lambda_r\, \|\theta\|_2^2$$

where $\lambda_u$ scales the unsupervised contribution and $\lambda_r$ is the $L_2$ regularization coefficient.
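
As a sketch, the joint objective can be assembled as below; the $\lambda_u$ and $\lambda_r$ values are placeholders, and in practice the $L_2$ term is often handled by the optimizer's weight decay instead.

```python
import torch
import torch.nn.functional as F

def joint_loss(logits_l, y_l, logits_pl, y_pl, params, lambda_u=1.0, lambda_r=1e-4):
    """L = L_sup + lambda_u * L_unsup + lambda_r * ||theta||_2^2 (illustrative coefficients)."""
    l_sup = F.cross_entropy(logits_l, y_l)                              # supervised CE
    l_unsup = (F.cross_entropy(logits_pl, y_pl)
               if y_pl.numel() > 0 else logits_l.new_zeros(()))         # pseudo-label CE
    l_reg = sum((p ** 2).sum() for p in params)                         # L2 penalty
    return l_sup + lambda_u * l_unsup + lambda_r * l_reg
```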

6. Empirical Evaluation and Results

DialogGraph-LLM is evaluated on two datasets:

  • MarketCalls: 8,770 real Mandarin sales calls, annotated across four hierarchical intent levels (A–D), with diarized speaker turns.
  • MIntRec 2.0: public multimodal benchmark for intent recognition (in-scope/out-of-scope) in audio/text dialogues.

Metrics include accuracy, macro-F1, and per-class F1. Baselines comprise Llama3.1-8B, GLM-4-9B, Gemini1.5-Pro, Qwen2.5-Omni, as well as multimodal methods MAG-BERT, MulT, TCL-MAP, and A-MESS.

Performance Summary

| Dataset | Best Baseline (Accuracy / Macro-F1) | DialogGraph-LLM (Accuracy / Macro-F1) | Accuracy Gain |
|---|---|---|---|
| MarketCalls | 63.6% / 63.1% (Qwen2.5-Omni) | 77.3% / 76.8% | +13.7% |
| MIntRec 2.0 | 56.8% / 49.3% (A-MESS) | 64.3% / 58.1% | +7.5% |

Ablation studies show that omitting MR-DAN results in sharp performance drops and that fixed-threshold SSL is suboptimal (73.6% accuracy vs. 77.3%). The best-performing MR-DAN configuration uses 8 attention heads, a temporal window of $k = 3$, and a cross-turn similarity threshold of $\theta = 0.8$.

7. Contributions, Limitations, and Prospects

DialogGraph-LLM delivers:

  • An audio-to-intent pipeline integrating raw audio, relational graph structure, and LLMs.
  • MR-DAN, a multi-relational attention GNN modeling temporal, speaker, and semantic dialogue relations with fixed edge types.
  • Adaptive semi-supervised learning with dynamic thresholds and entropy-based instance selection.

Limitations include exclusive evaluation on Qwen2.5-Omni-7B, manual edge type specification, and residual pseudo-label noise. Suggested future work involves broadening foundation model backbones (e.g., GPT-4o, AudioPalm), end-to-end edge-type learning, advanced SSL schemes (consistency regularization, co-training), and extension to streaming/long-context inference.

A plausible implication is that integrating explicit dialogue graph structure with foundation LLMs under limited supervision yields substantial gains in intent recognition accuracy and data efficiency for audio-rich domains.
