HumDial: Benchmark for Emotional Dialogue

Updated 4 February 2026
  • HumDial Dataset is a benchmark corpus for evaluating human-like spoken dialogue systems, focusing on emotional intelligence and full-duplex interaction.
  • It offers large-scale, high-fidelity multi-turn conversations with annotations for emotional trajectories, prosodic cues, and interruption events.
  • The dataset establishes evaluation protocols and baselines for Audio-LLMs, highlighting challenges in empathy generation and real-time dialogue management.

The HumDial Dataset is a benchmark corpus and evaluation platform designed to advance research on human-like spoken dialogue systems, with particular emphasis on emotional intelligence and full-duplex interaction. Introduced through the ICASSP 2026 HumDial Challenge, the dataset provides large-scale, high-fidelity multi-turn conversations that capture long-term emotional trajectories, real-time conversational phenomena such as barge-in events, and empathy generation. It establishes evaluation standards and task protocols for next-generation Audio-LLMs and omni-modal systems operating at the intersection of affective computing and interactive dialogue management (Zhao et al., 9 Jan 2026).

1. Composition, Scale, and Structure

The HumDial Dataset is partitioned into two principal tracks, each with distinct annotation conventions, splits, and target use cases.

Track I: Emotional Intelligence

This track covers three core tasks: Emotional Trajectory Detection, Emotional Reasoning, and Empathy Assessment. The training set comprises 1,600 dialogues of 3, 4, and 5 turns, shared by Tasks 1 and 2; the development and test splits maintain balanced distributions, and Task 3 samples are extracted as segments from the Task 1/2 dialogues. Utterance counts: 38,400 (train), 2,400 (dev), 2,332 (test).

Track II: Full-Duplex Interaction

Dialogue segments focus on authentic interruption and rejection events. Utterance counts: 9,418 (train), 1,800 (dev), 5,000 (test).

Table 1: Data Statistics by Split and Track

Split   Track I #Utterances   Track II #Utterances
Train   38,400                9,418
Dev     2,400                 1,800
Test    2,332                 5,000

Audio is single-channel, 16-bit PCM WAV sampled at 16 kHz (implied from the challenge's stated standards). Preprocessing includes silence trimming, loudness normalization to –23 LUFS, and optional spectral noise reduction.
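
A minimal preprocessing sketch consistent with these specifications is shown below, assuming the librosa, soundfile, and pyloudnorm packages; the 30 dB trim threshold is an illustrative choice, not a documented corpus setting.

```python
# Illustrative preprocessing sketch (not the official HumDial pipeline):
# silence trimming, loudness normalization to -23 LUFS, 16-bit PCM output.
import librosa
import soundfile as sf
import pyloudnorm as pyln

def preprocess(in_path: str, out_path: str, target_lufs: float = -23.0) -> None:
    # Load as mono 16 kHz float audio (matches the stated corpus format).
    audio, sr = librosa.load(in_path, sr=16000, mono=True)

    # Trim leading/trailing silence; the 30 dB threshold is an assumption.
    audio, _ = librosa.effects.trim(audio, top_db=30)

    # Integrated-loudness normalization to the -23 LUFS target.
    meter = pyln.Meter(sr)
    loudness = meter.integrated_loudness(audio)
    audio = pyln.normalize.loudness(audio, loudness, target_lufs)

    # Optional spectral noise reduction could be inserted here.
    # Write back as 16-bit PCM WAV.
    sf.write(out_path, audio, sr, subtype="PCM_16")
```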

Transcripts are turn-segmented plain text accompanied by structured JSON metadata, including professional actor IDs, gender, simulated age brackets (20–30 / 30–40), and language (English/Chinese). All splits are speaker- and topic-balanced, with no actor or scenario overlap between splits.
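
The metadata fields described above could be organized per dialogue roughly as follows; the exact schema and field names are not specified by the source, so this record is purely illustrative.

```python
# Hypothetical per-dialogue metadata record; field names are illustrative,
# not the official HumDial schema.
example_metadata = {
    "dialogue_id": "trackI_train_000123",
    "language": "English",              # English or Chinese
    "speakers": [
        {"actor_id": "A017", "gender": "female", "age_bracket": "20-30"},
        {"actor_id": "A042", "gender": "male",   "age_bracket": "30-40"},
    ],
    "turns": [
        {"turn_id": 1, "speaker": "A017", "text": "...", "emotion": "neutral"},
        {"turn_id": 2, "speaker": "A042", "text": "...", "emotion": "joy"},
    ],
}
```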

2. Data Construction and Annotation Methodology

Dialogue scripts are generated by LLMs: Gemini 2.5-pro (Track I) for emotional scenarios and DeepSeek (Track II) for full-duplex settings. Dialogue domains span everyday topics—travel, health, work stress, social planning—to elicit dynamic emotional and interactional responses.

Recordings are conducted in sound-treated booths (ambient noise <30 dB) using professional cardioid condenser microphones and high-quality audio interfaces (models not specified). Overlapping speech is performed live by the actors to ensure authentic prosodic synchronization rather than synthetic mixing.

Annotations are performed by 20 expert annotators (balanced by language), each with a minimum of 6 months of experience and at least a Bachelor’s degree. ELAN and a custom web platform are used for timestamp and emotion-tag alignment. Inter-annotator agreement (IAA) is measured as Krippendorff’s α = 0.82 (emotion labels) and Cohen’s κ = 0.79 (interruption/rejection tags).
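
The reported agreement statistics correspond to standard reliability measures that can be reproduced from raw annotation matrices; the following sketch assumes the krippendorff and scikit-learn packages, with toy label arrays.

```python
# Sketch of the inter-annotator agreement computation; toy labels only.
import krippendorff
from sklearn.metrics import cohen_kappa_score

# Emotion labels as numeric codes: rows = annotators, columns = utterances.
emotion_codes = [
    [0, 1, 1, 3, 2, 0],   # annotator 1
    [0, 1, 1, 3, 2, 5],   # annotator 2
]
alpha = krippendorff.alpha(reliability_data=emotion_codes,
                           level_of_measurement="nominal")

# Interruption/rejection tags from two annotators on the same segments.
tags_a = ["follow_up", "negation", "topic_switch", "silence"]
tags_b = ["follow_up", "negation", "repetition", "silence"]
kappa = cohen_kappa_score(tags_a, tags_b)

print(f"Krippendorff alpha = {alpha:.2f}, Cohen kappa = {kappa:.2f}")
```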

Emotion categories are balanced across six primary states: joy, sadness, anger, fear, surprise, and neutral. For Track II, additional interruption tags (Follow-up Question, Negation/Dissatisfaction, Repetition Request, Topic Switch, Silence/Termination) and rejection tags (User Real-time Backchannel, Pause Handling, Third-Party Speech, Speech Directed to Others) are applied; these label sets are summarized in the sketch below.
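
For reference, the label sets described in this section can be collected as plain constants; the identifier strings below are paraphrased and not necessarily the exact labels used in the corpus files.

```python
# Paraphrased label sets for the HumDial annotation taxonomy.
EMOTIONS = ["joy", "sadness", "anger", "fear", "surprise", "neutral"]

INTERRUPTION_TAGS = [
    "follow_up_question", "negation_dissatisfaction",
    "repetition_request", "topic_switch", "silence_termination",
]

REJECTION_TAGS = [
    "user_realtime_backchannel", "pause_handling",
    "third_party_speech", "speech_directed_to_others",
]
```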

3. Task Definition and Track Protocols

Track I evaluates models on three tasks:

  • Task 1: Emotional Trajectory Detection—predict the per-turn emotion label sequence and provide a trajectory summary.
  • Task 2: Emotional Reasoning—generate natural-language explanations for each emotional shift.
  • Task 3: Empathy Assessment—generate empathetic replies in both text and synthesized audio, conditioned on the final user turn and preceding context (a hypothetical prediction record is sketched after this list).
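
A hypothetical per-dialogue prediction record covering the three Track I tasks might look as follows; the actual submission format is defined by the challenge organizers and may differ.

```python
# Hypothetical Track I prediction record; illustrative only, not the
# official submission format.
track1_prediction = {
    "dialogue_id": "trackI_test_000045",
    "task1_emotion_sequence": ["neutral", "sadness", "sadness", "anger"],
    "task1_trajectory_summary": "The user drifts from calm to frustration.",
    "task2_shift_explanations": [
        {"turn": 2, "reason": "The flight cancellation triggers sadness."},
        {"turn": 4, "reason": "Repeated rebooking failures escalate to anger."},
    ],
    "task3_empathetic_reply": {
        "text": "I'm sorry this keeps happening; let's sort it out together.",
        "audio_path": "replies/trackI_test_000045.wav",
    },
}
```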

Track II is focused on real-time, full-duplex dialogue management:

  • Scenario A: Interruption Handling—systems detect and respond to user barge-ins without context loss.
  • Scenario B: Rejection Handling—systems must suppress output given non-instructional speech events.
  • Decisions are made at each 50 ms frame, choosing among {“speak”, “listen”, “hold”}; a minimal control-loop sketch follows.
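
The frame-level protocol implies a simple control loop of the kind sketched below; the features, thresholds, and decision rule are invented placeholders rather than the challenge baseline policy.

```python
# Illustrative 50 ms frame-level full-duplex controller; thresholds and the
# decision rule are placeholders, not the official baseline policy.
from dataclasses import dataclass

FRAME_MS = 50  # decision interval stated in the protocol

@dataclass
class FrameState:
    user_speech_prob: float    # e.g., from a VAD model (assumed input)
    addressed_to_system: bool  # rejection cue: is the speech directed at us?
    system_is_speaking: bool

def decide(frame: FrameState) -> str:
    """Return one of {"speak", "listen", "hold"} for the current frame."""
    if frame.user_speech_prob > 0.5:
        if frame.addressed_to_system:
            # Genuine barge-in: yield the floor and attend to the user.
            return "listen"
        # Non-instructional speech (backchannel, third-party talk): reject it,
        # i.e. keep talking if mid-utterance, otherwise stay idle.
        return "speak" if frame.system_is_speaking else "hold"
    # No user speech: continue any ongoing response, otherwise wait.
    return "speak" if frame.system_is_speaking else "listen"
```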

4. Evaluation Criteria, Baselines, and Performance

Metrics are tracked at both the utterance and overall system levels (a toy computation follows the list):

  • Emotion classification accuracy:

\text{Acc} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}(\hat{y}_i = y_i)

  • Precision, Recall, F₁:

P = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}, \quad R = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}, \quad F_1 = \frac{2\,P\,R}{P+R}

  • Empathy text metrics: BLEU-n and METEOR computed as per standard definitions.
  • Latency:

L = t_{\text{resp\_start}} - t_{\text{utt\_end}}

  • Throughput:

\mathrm{TPH} = \frac{\#\,\text{output chars}}{t_{\text{resp\_end}} - t_{\text{resp\_start}}}
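
The utterance-level metrics above can be computed directly; the following toy example uses scikit-learn for the classification metrics and invented timestamps for latency and throughput (macro averaging is an assumption, as the challenge may specify a different scheme).

```python
# Toy computation of the metrics defined above; labels and timestamps
# are invented for illustration.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["joy", "sadness", "anger", "neutral", "sadness"]
y_pred = ["joy", "sadness", "neutral", "neutral", "anger"]

acc = accuracy_score(y_true, y_pred)
# Macro averaging is an assumption; the challenge may define another scheme.
p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)

# Latency and throughput from per-response timestamps (seconds).
t_utt_end, t_resp_start, t_resp_end = 3.20, 4.46, 9.80
n_output_chars = 212

latency = t_resp_start - t_utt_end
throughput = n_output_chars / (t_resp_end - t_resp_start)

print(f"Acc={acc:.2f}  P={p:.2f}  R={r:.2f}  F1={f1:.2f}")
print(f"Latency={latency:.2f} s  Throughput={throughput:.1f} chars/s")
```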

Overall scores aggregate per-task or per-scenario subscores using fixed linear weightings (transcribed in the sketch after the formulas):

  • Track I:

\text{Score}_{\mathrm{I}} = 0.2\,S_{T1} + 0.2\,S_{T2} + 0.1\,S_{\text{text}} + 0.25\,S_{\text{emo}} + 0.25\,S_{\text{nat}}

  • Track II:

\text{Score}_{\mathrm{II}} = 0.4\,S_{\mathrm{Int}} + 0.4\,S_{\mathrm{Rej}} + 0.2\,S_{\mathrm{Delay}}
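
The weighted aggregations are direct linear combinations and can be transcribed verbatim; the example subscores below are invented, and the two tracks use different subscore scales.

```python
# Direct transcription of the weighted aggregation formulas above.
def track1_score(s_t1, s_t2, s_text, s_emo, s_nat):
    return 0.2 * s_t1 + 0.2 * s_t2 + 0.1 * s_text + 0.25 * s_emo + 0.25 * s_nat

def track2_score(s_int, s_rej, s_delay):
    return 0.4 * s_int + 0.4 * s_rej + 0.2 * s_delay

# Example calls with invented subscores.
print(track1_score(4.9, 5.0, 3.5, 3.8, 4.0))   # Track I style subscores
print(track2_score(79.3, 72.2, 60.0))          # Track II style subscores
```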

Baselines:

  • Track I baseline (Audio-LLM pipelined): Final Score = 2.82
  • Track II baseline (cascade ASR+LLM+TTS): Final Score = 56.4
  • Statistical significance is assessed using paired t-tests (p < 0.05) on per-utterance metrics; a minimal example follows.
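
A minimal paired t-test over matched per-utterance scores of two systems, with invented score arrays, is shown below.

```python
# Paired t-test on matched per-utterance scores; arrays are invented.
from scipy.stats import ttest_rel

system_a = [0.82, 0.75, 0.91, 0.64, 0.88, 0.79, 0.70, 0.85]
system_b = [0.78, 0.70, 0.85, 0.66, 0.80, 0.74, 0.69, 0.81]

t_stat, p_value = ttest_rel(system_a, system_b)
print(f"t={t_stat:.3f}, p={p_value:.4f}, significant={p_value < 0.05}")
```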

Table 2: Excerpt of Track I Final Scores

Team            Task1_Avg   Task2_Avg   Task3_Avg   Final Score   Rank
TeleAI          4.97        4.98        3.81        4.27          1
NJU-TencentHY   4.90        5.00        3.84        4.24          2
Baseline        2.62        2.73        2.88        2.82          8

Table 3: Excerpt of Track II Final Scores

Team         Int. %   Rej. %   Latency (s)   Final Score   Rank
Cookie_asr   79.3     72.2     1.260         76.6          1
Badcat       89.7     57.8     1.632         73.5          2
Baseline     75.9     35.2     2.531         56.4          6

5. Observed System Behaviors and Analysis

Top-performing systems demonstrate near-ceiling performance for emotion detection and reasoning, but empathy generation remains a challenge, with noticeable deficits in prosodic matching relative to emotional context. Real-time rejection proves difficult, as many systems confuse low-energy or background speech with valid input, leading to false starts. Latency remains a limiting factor under overlapping speech.

A plausible implication is that future system design must more tightly integrate prosodic modeling, fine-grained VAD, and robust context tracking, especially for empathetic and duplex interaction under real-world acoustic conditions.

6. Recommendations and Future Directions

Extending the emotional taxonomy to cover a broader spectrum of affective states—including frustration and embarrassment—as well as continuous arousal/valence dimensions, is recommended. The inclusion of multi-party and multi-language scenarios is necessary to further challenge and generalize turn-taking models. Incorporation of real-world background noise (e.g., café, street) and device diversity (headset, phone) will enhance ecological validity. Benchmarking end-to-end Audio-LLMs equipped with integrated VAD and noise-robust preprocessing is specifically advised.

Detailed per-utterance error analyses and richer IAA statistics should be published to facilitate annotation and model refinements. By providing densely annotated multi-turn emotional trajectories in authentic full-duplex conversations, the HumDial Dataset establishes a unified standard for human-like spoken dialogue systems at the intersection of affective and interactive intelligence (Zhao et al., 9 Jan 2026).
