MoodBench 1.0: Benchmark for Emotional Dialogue
- The paper presents MoodBench 1.0, a theory-driven benchmark that formalizes Emotional Companionship Dialogue Systems and their evaluation criteria.
- It introduces a four-layer framework—ability, task, data, and method—that rigorously measures system performance using metrics like accuracy, BLEU, and Recall@K.
- Experimental results highlight strong performance in values and safety while exposing significant bottlenecks in emotional companionship and personalization capabilities.
MoodBench 1.0 is an evaluation benchmark designed for Emotional Companionship Dialogue Systems (ECDs), a distinct subclass of dialogue systems whose primary function is to provide personalized emotional support to users. Addressing the lack of precise definitions and standardized evaluations for ECDs, MoodBench 1.0 formalizes the ECD concept, operationalizes a multi-layered evaluation scheme, and provides the first large-scale, theory-driven benchmarking platform specifically targeting the measurement of emotional companionship capabilities in dialogue models (Jing et al., 24 Nov 2025).
1. Formal Definition and Hierarchical Framework
Within the general hierarchy of dialogue system types, ECDs are situated as a strict subset of broader dialogue-system classes; the paper makes this precise with an inclusion chain over the sets of systems belonging to each class, with ECDs as the innermost set. An ECD is formally defined as an intelligent interactive system focused on supporting users’ emotional needs, specifically making users feel “seen, understood, supported.”
A dialogue session up to turn $t$ is represented as the tuple $D_t = (U_{1:t}, R_{1:t-1}, M, E_{1:t}, S_{1:t-1}, P_{\mathrm{sys}}, P_{\mathrm{user}}, K)$, where:
- $U_{1:t}$: user utterances
- $R_{1:t-1}$: system replies
- $M$: long-term session memories
- $E_{1:t}$: emotional states observed (based on Plutchik’s wheel)
- $S_{1:t-1}$: system reply strategies (e.g., Chit-chat, Soothing, etc.)
- $P_{\mathrm{sys}}$: system persona
- $P_{\mathrm{user}}$: user persona
- $K$: external knowledge base
The system reply is generated by $r_t = f(U_{1:t}, R_{1:t-1}, M, E_{1:t}, S_{1:t-1}, P_{\mathrm{sys}}, P_{\mathrm{user}}, K)$, where $f$ is the generation function. This formalism underpins all task and metric design in MoodBench 1.0.
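To make the formalism concrete, the following minimal sketch represents the session state and the generation interface in Python; the class, field, and function names are illustrative assumptions, not part of the paper or any released code.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class SessionState:
    """State of an emotional-companionship dialogue session up to turn t."""
    user_utterances: List[str]      # U_{1:t}
    system_replies: List[str]       # R_{1:t-1}
    memories: List[str]             # long-term session memories M
    emotions: List[str]             # observed emotional states E (Plutchik's wheel)
    strategies: List[str]           # reply strategies S (e.g., "chit-chat", "soothing")
    system_persona: str             # P_sys
    user_persona: str               # P_user
    knowledge: List[str] = field(default_factory=list)  # external knowledge base K

# The generation function f maps the full session state to the next system reply r_t.
GenerationFn = Callable[[SessionState], str]

def respond(state: SessionState, generate: GenerationFn) -> str:
    """Produce the next reply r_t = f(state) and append it to the session."""
    reply = generate(state)
    state.system_replies.append(reply)
    return reply
```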
2. Layered Evaluation Structure
MoodBench 1.0 adopts a four-layer structural decomposition (“top-down” design, “bottom-up” aggregation):
- Ability Layer: Partitioned into Threshold Abilities (Values & Safety), Foundational Abilities (NLU, NLI, NLG, Commonsense), and Core Abilities (Emotional Ability: emotion recognition, understanding, management, empathy, comprehensive EQ; Companionship Ability: memory & personalization).
- Task Layer: Each (sub)ability is probed by tasks at each of three difficulty levels (Low, Medium, High).
- Data Layer: 60 datasets (35% Chinese, 58% English, 6.7% bilingual), including 4 self-built “MoodBench” sets (MoodBench1–4).
- Method Layer: Benchmark-based (accuracy, F1, EM, MRR, Recall@K, BLEU, ROUGE, BLEURT) for closed tasks, and model-based/human-model judgments for open-ended generation.
Score aggregation follows the path Method → Dataset → Task → Ability → Total, with precise formulas provided for normalization and composition at each layer.
3. Benchmark Construction and Annotation
Each major ability is mapped to a collection of sub-abilities, tasks, and specific datasets:
- Threshold (Values & Safety): Bias Detection, Morality, Content Safety, Information Privacy, High-Risk Behavior—benchmarked on datasets including CrowS-Pairs, StereoSet, BBQ, and SafetyBench1–6.
- Foundational Abilities: Spanning NLU (AG News, THUCNews, etc.), NLI (MNLI, RTE, etc.), NLG (LCSTS, CNewSum), Commonsense (HellaSwag, TruthfulQA, etc.).
- Emotional Ability: Emotion Recognition (IMDb, SST-2, GoEmotions, MoodBench1), Emotion Understanding (MoodBench2, SemEval’18 Affect), Emotion Management (MoodBench3), Empathetic Response (EmoBench3, MoodBench4), Comprehensive EQ (EQ-60, IRI, interpersonal tests).
- Companionship Ability: Memory & Personalization (PersonaFeedback, Book-SORT, LongMemEval).
Annotation is based on Plutchik’s primary emotions (Joy, Trust, Fear, Surprise, Sadness, Disgust, Anger, Anticipation) and neutral, with expert annotation (Cohen’s κ > 0.8) for all self-built sets. Datasets are uniformly subsampled to 300 examples, and scores are normalized to a 0–100 scale before aggregation. No data augmentation is applied in version 1.0.
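As a small illustration of these data-layer conventions (the Plutchik label set plus neutral, uniform subsampling to 300 examples, and 0–100 score normalization), the sketch below uses illustrative function names and a generic raw-metric range; it is not the paper's released tooling.

```python
import random
from typing import List, Sequence, TypeVar

# Annotation label set: Plutchik's eight primary emotions plus "neutral".
PLUTCHIK_LABELS = ["joy", "trust", "fear", "surprise",
                   "sadness", "disgust", "anger", "anticipation", "neutral"]

T = TypeVar("T")

def subsample(examples: Sequence[T], n: int = 300, seed: int = 0) -> List[T]:
    """Uniformly subsample up to n examples per dataset for evaluation."""
    rng = random.Random(seed)
    return list(examples) if len(examples) <= n else rng.sample(list(examples), n)

def normalize(raw: float, lo: float = 0.0, hi: float = 1.0) -> float:
    """Map a raw metric value in [lo, hi] onto the 0-100 scale used before aggregation."""
    return 100.0 * (raw - lo) / (hi - lo)

if __name__ == "__main__":
    data = list(range(1000))
    print(len(subsample(data)), normalize(0.82))  # 300 82.0
```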
4. Evaluation Metrics and Aggregation
Evaluation utilizes standard classification and generation metrics:
- Accuracy, Precision ($P = \mathrm{TP}/(\mathrm{TP}+\mathrm{FP})$), Recall ($R = \mathrm{TP}/(\mathrm{TP}+\mathrm{FN})$), and F1 score ($F_1 = 2PR/(P+R)$) for classification tasks.
- EM, Recall@K, and MRR ($\mathrm{MRR} = \frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{\mathrm{rank}_i}$) for extraction and retrieval-style tasks (see the sketch after this list).
- BLEU, ROUGE, BLEURT for open-response.
- Human-rated completeness/conciseness for summaries.
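As a minimal illustration of the ranking metrics used for the closed retrieval-style tasks, the following sketch computes Recall@K and MRR from gold answers and ranked predictions; the function names and data layout are illustrative, not taken from the MoodBench codebase.

```python
from typing import List, Sequence

def recall_at_k(gold: str, ranked: Sequence[str], k: int) -> float:
    """1.0 if the gold item appears in the top-k ranked predictions, else 0.0."""
    return 1.0 if gold in ranked[:k] else 0.0

def reciprocal_rank(gold: str, ranked: Sequence[str]) -> float:
    """1 / rank of the gold item (1-indexed), or 0.0 if it is absent."""
    for i, item in enumerate(ranked, start=1):
        if item == gold:
            return 1.0 / i
    return 0.0

def evaluate(golds: List[str], rankings: List[Sequence[str]], k: int = 5) -> dict:
    """Macro-average Recall@K and MRR over a set of queries."""
    n = len(golds)
    r_at_k = sum(recall_at_k(g, r, k) for g, r in zip(golds, rankings)) / n
    mrr = sum(reciprocal_rank(g, r) for g, r in zip(golds, rankings)) / n
    return {"recall@k": r_at_k, "mrr": mrr}

if __name__ == "__main__":
    golds = ["a", "b"]
    rankings = [["a", "c", "d"], ["c", "d", "b"]]
    print(evaluate(golds, rankings, k=2))  # {'recall@k': 0.5, 'mrr': 0.666...}
```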
The aggregation pipeline is mathematically defined:
- Dataset composite: the normalized metric scores of each dataset are combined into a single dataset score $s_d$.
- Subtask score: an aggregate of the scores $s_d$ of all datasets probing that subtask.
- Difficulty-level score: subtask scores are aggregated within each difficulty level, yielding $S_L$, $S_M$, $S_H$.
- Sub-ability synthesis: $A = w_L S_L + w_M S_M + w_H S_H$,
with difficulty-level weights $w_L$, $w_M$, $w_H$.
The composite final score is a weighted sum across abilities, with a “one-vote veto” applied if Values & Safety falls below 60.
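A compact sketch of this bottom-up aggregation and the “one-vote veto” is given below, under the assumption of simple averaging at the dataset/subtask level and with illustrative (not published) difficulty and ability weights.

```python
from typing import Dict, List

# Illustrative weights only; the actual values are specified in the MoodBench 1.0 paper.
DIFFICULTY_WEIGHTS = {"low": 0.2, "medium": 0.3, "high": 0.5}
SAFETY_THRESHOLD = 60.0  # "one-vote veto" cutoff on Values & Safety

def subtask_score(dataset_scores: List[float]) -> float:
    """Aggregate normalized (0-100) dataset scores into a subtask score (mean assumed)."""
    return sum(dataset_scores) / len(dataset_scores)

def sub_ability_score(level_scores: Dict[str, float]) -> float:
    """Weighted synthesis of Low/Medium/High difficulty-level scores."""
    return sum(DIFFICULTY_WEIGHTS[lvl] * s for lvl, s in level_scores.items())

def total_score(ability_scores: Dict[str, float],
                ability_weights: Dict[str, float]) -> float:
    """Weighted sum across abilities, vetoed if Values & Safety falls below threshold."""
    if ability_scores["values_safety"] < SAFETY_THRESHOLD:
        return 0.0  # one-vote veto: model is excluded from the ranking
    return sum(ability_weights[a] * s for a, s in ability_scores.items())

if __name__ == "__main__":
    # Hypothetical ability scores and weights for illustration.
    abilities = {"values_safety": 93.0, "foundational": 78.0,
                 "emotional": 68.0, "companionship": 57.0}
    weights = {"values_safety": 0.1, "foundational": 0.3,
               "emotional": 0.35, "companionship": 0.25}
    print(round(total_score(abilities, weights), 2))  # 70.75
```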
5. Experimental Results and Key Findings
Evaluating 30 mainstream dialogue models, MoodBench 1.0 establishes a detailed leaderboard, with the top five as follows:
| Rank | Model | Score |
|---|---|---|
| 1 | gpt-5-mini | 70.09 |
| 2 | doubao-pro-32k | 69.44 |
| 3 | gpt-5 | 69.34 |
| 4 | gpt-4o | 69.09 |
| 5 | gpt-4.1 | 68.57 |
Models failing the Values & Safety threshold (score < 60) are excluded from the rankings. Closed-source models outperform open-source ones on average, and larger model variants are consistently superior within families (Doubao, GPT, Qwen). Foundational and Core ability scores are positively correlated, but high foundational ability does not guarantee high emotional companionship; targeted development remains necessary.
Dimension-specific performance means (Fig 2 of the source) indicate:
- Values & Safety: ≈93
- Foundational: ≈78
- Emotional Ability: ≈68
- Companionship Ability: ≈57 (lowest across all models)
These results align with user survey data (85% desire emotional support; 77% report dissatisfaction), highlighting deficiencies in memory and personalization features.
6. Limitations, Identified Gaps, and Prospective Directions
Chief shortcomings identified in current ECD models include:
- Companionship Ability bottleneck, especially regarding long-term memory (LongMemEval) and personalization.
- Poor performance on advanced emotional tasks (e.g., recognition of mixed emotions, irony, or humor).
- Absence of multimodal understanding (audio, vision) and limited coverage of cross-cultural empathy.
Recommendations for future benchmarks and model design are:
- Multimodal & Cross-Lingual Expansion: Enrich with tasks in additional languages (e.g., Spanish, Japanese) and modalities (audio, video).
- High-Order Emotional Tasks: Develop datasets for complex affective phenomena—mixed emotions, irony, sarcasm.
- Dynamic Memory Mechanisms: Integrate advanced memory management modules to improve long-term recall and personalization.
- Novel Evaluation Paradigms: Complement static benchmarks with dynamic, model-based “self-chat” and human-in-the-loop evaluations to potentially mitigate dataset bias.
The multi-layered, task-difficulty-stratified approach permits granular identification of model bottlenecks (e.g., low Recall@K in LongMemEval), thereby guiding the architectural or methodological improvements needed for advancing deep emotional companionship capabilities (Jing et al., 24 Nov 2025).