DialogGraph-LLM: Audio Dialogue Intent Framework
- DialogGraph-LLM is an end-to-end system that fuses multi-relational graph neural architectures with multimodal large language models to accurately capture speaker intents from audio dialogues.
- Its MR-DAN component models temporal, speaker-specific, and semantic relations, yielding accuracy gains of up to 13.7 percentage points over strong audio- and text-based baselines.
- The framework employs adaptive semi-supervised learning with dynamic thresholds, ensuring robust performance and scalable inference in real-world applications.
DialogGraph-LLM is an end-to-end framework for audio dialogue intent recognition that integrates multi-relational graph neural architectures with multimodal foundation LLMs, designed to capture the complex inter-dependencies of multi-speaker audio dialogues and to perform well under limited supervision. The central innovation is the combination of a novel Multi-Relational Dialogue Attention Network (MR-DAN) with an adaptive semi-supervised learning mechanism, yielding robust, scalable inference directly from audio to speaker-intent classification. DialogGraph-LLM demonstrates competitive performance in real-world and benchmark scenarios, outperforming prominent audio- and text-based baselines.
1. Pipeline Architecture and Preprocessing
DialogGraph-LLM’s pipeline commences with raw audio input from multi-speaker dialogues. Speaker diarization decomposes the audio into a sequence of utterance segments, each assigned a speaker ID. The Qwen2.5-Omni-7B backbone’s audio encoder then generates:
- Utterance-level representations for each segment.
- A dialogue-level representation capturing the global acoustic context.
These embeddings serve as node and global features for subsequent graph-based relational modeling.
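As a concrete picture of what this stage produces, the sketch below models the diarized segments and encoder outputs as plain containers; all names (`UtteranceSegment`, `DialogueFeatures`, and their fields) are illustrative and not taken from the paper.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class UtteranceSegment:
    """One diarized segment of the input dialogue."""
    speaker_id: int        # speaker label assigned by diarization
    start_s: float         # segment start time (seconds)
    end_s: float           # segment end time (seconds)
    embedding: np.ndarray  # utterance-level acoustic embedding from the audio encoder

@dataclass
class DialogueFeatures:
    """Audio-encoder outputs consumed by the graph module and the LLM."""
    utterances: List[UtteranceSegment]  # per-segment node features
    dialogue_embedding: np.ndarray      # global acoustic embedding of the whole dialogue
```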
2. Multi-Relational Dialogue Attention Network (MR-DAN)
MR-DAN constructs a graph over the utterances, with each node initialized by combining a learnable speaker embedding with the utterance’s acoustic embedding.
The dialogue graph uses four hand-designed edge types:
- Temporal adjacency (sequential utterances)
- Speaker-history adjacency (utterance by same speaker)
- Cross-utterance semantic adjacency (cross-turn links whose semantic similarity exceeds a threshold)
- Self-loops.
An adjacency matrix is constructed for each edge type, determining the permitted attention neighborhood for that relation.
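A minimal sketch of how such per-relation adjacency masks could be built, assuming symmetric edges and a cosine-similarity rule for the semantic relation; the relation names, the unwindowed speaker-history rule, and the default threshold are illustrative assumptions.

```python
import numpy as np

def build_relation_adjacency(speaker_ids, utt_emb, sim_threshold=0.7):
    """Build one boolean adjacency matrix per hand-designed relation (sketch)."""
    n = len(speaker_ids)
    adj = {r: np.zeros((n, n), dtype=bool)
           for r in ("temporal", "speaker", "semantic", "self")}

    # Temporal adjacency: consecutive utterances.
    for i in range(n - 1):
        adj["temporal"][i, i + 1] = adj["temporal"][i + 1, i] = True

    # Speaker-history adjacency: utterances by the same speaker
    # (the paper may additionally restrict this to a history window).
    for i in range(n):
        for j in range(i):
            if speaker_ids[i] == speaker_ids[j]:
                adj["speaker"][i, j] = adj["speaker"][j, i] = True

    # Cross-utterance semantic adjacency: cosine similarity above a threshold.
    norm = utt_emb / (np.linalg.norm(utt_emb, axis=1, keepdims=True) + 1e-12)
    sim = norm @ norm.T
    adj["semantic"] = sim > sim_threshold
    np.fill_diagonal(adj["semantic"], False)

    # Self-loops.
    np.fill_diagonal(adj["self"], True)
    return adj
```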
Relation-aware multi-head attention is employed:
- Attention heads are partitioned into groups, one group per relation type.
- Heads assigned to a relation attend only to the neighbors permitted by that relation’s adjacency matrix.
- Each head computes masked attention weights over its permitted neighborhood and aggregates the corresponding value vectors.
Head outputs are concatenated per relation, followed by a relation-informed node update; an alternative update form aggregates the per-relation outputs with learnable relation bias matrices.
After several message-passing iterations, the graph-level embedding is obtained by mean pooling over the node embeddings.
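The head-partitioning and masking idea can be sketched as follows in PyTorch, assuming the adjacency masks are `torch.bool` tensors (e.g., converted from the arrays above). The even split of heads across relations and the residual update are simplifying assumptions, not the paper's exact MR-DAN equations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationAwareAttention(nn.Module):
    """Relation-partitioned multi-head attention over a dialogue graph (sketch)."""

    def __init__(self, dim, n_heads, relations):
        super().__init__()
        assert dim % n_heads == 0 and n_heads % len(relations) == 0
        self.relations = relations
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x: (n_nodes, dim); adj: dict of (n_nodes, n_nodes) boolean masks per relation
        n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(n, self.n_heads, self.head_dim).transpose(0, 1)  # (heads, n, d)
        k = k.view(n, self.n_heads, self.head_dim).transpose(0, 1)
        v = v.view(n, self.n_heads, self.head_dim).transpose(0, 1)

        heads_per_rel = self.n_heads // len(self.relations)
        outputs = []
        for g, rel in enumerate(self.relations):
            sl = slice(g * heads_per_rel, (g + 1) * heads_per_rel)
            scores = q[sl] @ k[sl].transpose(-1, -2) / self.head_dim ** 0.5
            scores = scores.masked_fill(~adj[rel], float("-inf"))
            attn = torch.nan_to_num(F.softmax(scores, dim=-1))  # nodes with no neighbors
            outputs.append(attn @ v[sl])

        out = torch.cat(outputs, dim=0).transpose(0, 1).reshape(n, -1)
        return x + self.out(out)  # residual node update
```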
3. LLM Integration and Input Fusion
DialogGraph-LLM leverages multimodal LLMs via customized input fusion:
- Two lightweight adapters project the global audio embedding and the graph-level embedding into the LLM input space.
- A prompt template (e.g., “Intent?”) is tokenized, and its dedicated placeholder tokens are replaced with the adapted audio and graph embeddings.
- These three input streams are concatenated at layer zero and processed through the LLM, which produces a probability distribution over intent labels.
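A hedged sketch of the fusion step, assuming one placeholder position per modality and two-layer MLP adapters; the placeholder handling and adapter architecture are illustrative choices rather than the paper's specification.

```python
import torch
import torch.nn as nn

class InputFusion(nn.Module):
    """Project audio/graph embeddings into the LLM token space and splice them
    into the embedded prompt at placeholder positions (sketch)."""

    def __init__(self, audio_dim, graph_dim, llm_dim):
        super().__init__()
        self.audio_adapter = nn.Sequential(
            nn.Linear(audio_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))
        self.graph_adapter = nn.Sequential(
            nn.Linear(graph_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, prompt_embeds, audio_emb, graph_emb, audio_pos, graph_pos):
        # prompt_embeds: (seq_len, llm_dim) token embeddings of the tokenized prompt
        fused = prompt_embeds.clone()
        fused[audio_pos] = self.audio_adapter(audio_emb)  # replace audio placeholder
        fused[graph_pos] = self.graph_adapter(graph_emb)  # replace graph placeholder
        return fused  # passed to the LLM as its input embeddings for classification
```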
4. Adaptive Semi-Supervised Learning and Pseudo-Labeling
Addressing limited supervision, DialogGraph-LLM incorporates an adaptive semi-supervised learning (SSL) strategy comprising dual-threshold filtering and entropy-based sample selection:
- For each unlabeled instance, obtain the model’s posterior distribution over intent classes.
- Maintain a global confidence threshold, updated by an exponential moving average (EMA) of model confidence.
- Estimate class marginals by EMA and derive per-class thresholds from them.
- Filter instances: accept a pseudo-label only when its confidence clears both the global threshold and the class-specific threshold of its predicted class.
- Compute the prediction entropy of each accepted instance.
- Rank eligible samples by entropy, augmenting the training set with high-information pseudo-labels.
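Putting the selection rules together, the sketch below follows common adaptive-threshold SSL practice (EMA of confidence, marginal-scaled per-class thresholds, entropy ranking); the exact update rules and the descending entropy ordering are assumptions, not the paper's formulas.

```python
import numpy as np

def select_pseudo_labels(probs, tau_global, class_ema, momentum=0.999):
    """Dual-threshold filtering with entropy ranking over unlabeled predictions (sketch).

    probs: (batch, n_classes) softmax posteriors for unlabeled dialogues.
    tau_global: running global confidence threshold (scalar).
    class_ema: running EMA estimate of the class marginals, shape (n_classes,).
    """
    conf = probs.max(axis=1)      # per-sample confidence
    preds = probs.argmax(axis=1)  # hard pseudo-labels

    # EMA updates of the global threshold and the class marginals.
    tau_global = momentum * tau_global + (1 - momentum) * conf.mean()
    class_ema = momentum * class_ema + (1 - momentum) * probs.mean(axis=0)

    # Per-class thresholds scaled from the global one by normalized class marginals.
    tau_class = tau_global * class_ema / class_ema.max()

    # Dual-threshold filter: confidence must clear both thresholds.
    accepted = (conf >= tau_global) & (conf >= tau_class[preds])

    # Rank accepted samples by prediction entropy (most informative first).
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    order = np.argsort(-entropy[accepted])
    accepted_idx = np.flatnonzero(accepted)[order]
    return accepted_idx, preds, tau_global, class_ema
```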
5. Optimization Objectives and Loss Functions
Training optimizes a unified objective over the labeled set and the selected pseudo-labeled samples:
- Supervised cross-entropy loss on labeled instances.
- Unsupervised loss over accepted pseudo-labels.
- Regularized joint objective combining both terms,
where a weighting coefficient scales the unsupervised contribution and a regularization coefficient controls the penalty term.
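One plausible instantiation of this objective, writing $\mathcal{D}_L$ for the labeled set, $\mathcal{D}_U$ for the accepted pseudo-labels, $\lambda_u$ for the unsupervised weight, and $\lambda_r$ for the regularization coefficient (the notation and the $\ell_2$ form of the regularizer are assumptions, not the paper's exact formula):

$$
\mathcal{L} \;=\; \frac{1}{|\mathcal{D}_L|}\sum_{(x_i,\,y_i)\in\mathcal{D}_L} \mathrm{CE}\big(y_i,\, p_\theta(x_i)\big)
\;+\; \lambda_u\,\frac{1}{|\mathcal{D}_U|}\sum_{(x_j,\,\hat{y}_j)\in\mathcal{D}_U} \mathrm{CE}\big(\hat{y}_j,\, p_\theta(x_j)\big)
\;+\; \lambda_r\,\lVert\theta\rVert_2^2
$$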
6. Empirical Evaluation and Results
DialogGraph-LLM is evaluated on two datasets:
- MarketCalls: 8,770 real Mandarin sales calls, annotated across four hierarchical intent levels (A–D), with diarized speaker turns.
- MIntRec 2.0: public multimodal benchmark for intent recognition (in-scope/out-of-scope) in audio/text dialogues.
Metrics include accuracy, macro-F1, and per-class F1. Baselines comprise Llama3.1-8B, GLM-4-9B, Gemini1.5-Pro, Qwen2.5-Omni, as well as multimodal methods MAG-BERT, MulT, TCL-MAP, and A-MESS.
Performance Summary
| Dataset | Best Baseline (Acc / Macro-F1) | DialogGraph-LLM (Acc / Macro-F1) | Accuracy Gain |
|---|---|---|---|
| MarketCalls | 63.6% / 63.1% (Qwen2.5-Omni) | 77.3% / 76.8% | +13.7 pp |
| MIntRec 2.0 | 56.8% / 49.3% (A-MESS) | 64.3% / 58.1% | +7.5 pp |
Ablation studies show that omitting MR-DAN results in sharp performance drops and that fixed-threshold SSL is suboptimal (73.6% accuracy vs. 77.3% with adaptive thresholds). Tuning MR-DAN hyperparameters (8 attention heads, together with the speaker-history window and cross-turn similarity threshold) produces consistent empirical peaks.
7. Contributions, Limitations, and Prospects
DialogGraph-LLM delivers:
- An audio-to-intent pipeline integrating raw audio, relational graph structure, and LLMs.
- MR-DAN, a multi-relational attention GNN modeling temporal, speaker, and semantic dialogue relations with fixed edge types.
- Adaptive semi-supervised learning with dynamic thresholds and entropy-based instance selection.
Limitations include exclusive evaluation on Qwen2.5-Omni-7B, manual edge-type specification, and residual pseudo-label noise. Suggested future work involves broadening foundation-model backbones (e.g., GPT-4o, AudioPaLM), end-to-end edge-type learning, advanced SSL schemes (consistency regularization, co-training), and extension to streaming/long-context inference.
A plausible implication is that integrating explicit dialogue graph structure with foundation LLMs under limited supervision yields substantial gains in intent recognition accuracy and data efficiency for audio-rich domains.