SoulChatCorpus: Chinese Empathy Dialogue Dataset
- SoulChatCorpus is a large-scale, Chinese-language multi-turn empathy dialogue dataset curated for psychological counseling and mental health applications.
- It utilizes a rigorous methodology combining crowdsourced data, ChatGPT-generated turns, and manual proofreading to ensure high quality and nuanced empathy expression.
- Fine-tuning ChatGLM-6B on the dataset yields SoulChat, which shows improved fluency and empathetic response generation in counseling scenarios under both automatic and expert evaluation.
SoulChatCorpus is a large-scale, Chinese-language multi-turn empathy conversation dataset specifically designed for developing LLMs with advanced empathetic capabilities in the mental-health domain. Encompassing 2,300,248 high-quality dialogue samples between simulated users and psychological consultants, SoulChatCorpus represents the first million-scale, Chinese multi-turn empathy corpus explicitly curated for psychological counseling and emotionally attuned dialogue systems (Chen et al., 2023).
1. Dataset Composition and Structure
Each entry in SoulChatCorpus comprises a multi-turn conversation between a "user" and a "psychological consultant." Typical dialogues span 3–6 utterances (2–3 back-and-forth turns), though some extend to 8–10 utterances depending on the elaborateness of the original long-form reply. The corpus adopts a standardized JSON schema:
| Field | Type | Description |
|---|---|---|
| id | string | Unique identifier for the dialogue sample |
| context | list | Sequence of utterances (alternating "user" and "consultant," excluding the target) |
| response | string | The consultant's final empathetic reply, to be generated by the model |
Speaker labels are consistently fixed as "user" and "consultant." The context contains the full dialogue history up to but not including the target consultant response. There is no separate manual annotation for empathy strategy classes; rather, six broad empathy strategies—questioning, comfort, recognition, listening, trust, and emotional support—are implicitly encoded via the data construction process.
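A record in this schema can be illustrated with a small validation sketch. The per-utterance layout (a dict with `speaker` and `text` keys), the `validate` helper, and the sample `id` are assumptions made for illustration; the source only states that `context` is a list of alternating "user" and "consultant" utterances.

```python
import json

# Hypothetical record following the id / context / response schema.
# The {"speaker": ..., "text": ...} layout of each utterance is an assumption.
RAW = """
{
  "id": "demo-0001",
  "context": [
    {"speaker": "user", "text": "Ever since my friends moved away, I've been feeling deeply lonely."},
    {"speaker": "consultant", "text": "Loneliness can be painful. In what moments do you feel it most strongly?"},
    {"speaker": "user", "text": "Mostly in the evenings, when I have no one to call or hang out with."}
  ],
  "response": "I can imagine how quiet and isolating evenings must feel."
}
"""


def validate(sample: dict) -> bool:
    """Return True if the sample matches the assumed schema."""
    if not isinstance(sample.get("id"), str):
        return False
    if not isinstance(sample.get("response"), str):
        return False
    context = sample.get("context")
    if not isinstance(context, list) or not context:
        return False
    expected = "user"  # dialogues are assumed to open with the user
    for turn in context:
        if not isinstance(turn, dict) or turn.get("speaker") != expected:
            return False
        expected = "consultant" if expected == "user" else "user"
    # The target is a consultant reply, so the context must end on a user turn.
    return context[-1]["speaker"] == "user"


sample = json.loads(RAW)
print(validate(sample))
```

Checking that the context ends on a user turn is a cheap way to catch samples where the target consultant reply was accidentally left inside the history.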
2. Data Source, Generation, and Quality Control
The initial data pool was constructed via crowdsourcing, comprising 215,813 user questions and 619,725 consultant answers, drawn from twelve predefined counseling topics (see Figure 1 in the source paper). Data privacy was enforced through rule-based filtering to remove sensitive terms (e.g., "自杀" (suicide) and "跳楼" (jumping off a building)) and further cleanup by professional proofreaders who excised or rewrote any personal or harmful content.
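The rule-based filtering step might look roughly like the sketch below. The blocklist contains only the two terms quoted in the text, and dropping the whole question-answer pair on a match is an assumption; the source does not specify whether matching text was deleted, rewritten, or discarded. The function names are hypothetical.

```python
# Illustrative blocklist: only the two terms quoted in the text.
# The actual filter list used by the authors is not given in the source.
SENSITIVE_TERMS = ("自杀", "跳楼")


def contains_sensitive(text: str, terms=SENSITIVE_TERMS) -> bool:
    """True if any blocked term occurs as a substring of the text."""
    return any(term in text for term in terms)


def filter_pairs(pairs):
    """Keep (question, answer) pairs in which neither side contains a blocked term."""
    return [
        (q, a)
        for q, a in pairs
        if not contains_sensitive(q) and not contains_sensitive(a)
    ]


pairs = [
    ("最近工作压力很大，怎么办？", "可以先试着把任务分解成小步骤。"),
    ("新闻里提到了跳楼这个词。", "我们可以聊聊这让你有什么感受。"),
]
kept = filter_pairs(pairs)
print(len(kept))
```

Substring matching is deliberately blunt; it over-filters harmless mentions, which is one reason a manual proofreading pass after the rules is useful.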
To generate multi-turn dialogues, each single-turn QA pair was processed by ChatGPT (99% gpt-3.5-turbo, 1% gpt-4), using a Chinese-language "empathy constraint prompt." This prompt directed ChatGPT to break the consultant’s answer into an interactive, multi-turn conversational sequence, showcasing explicit empathetic strategies such as questioning, listening, recognition, comfort, trust-building, and emotional support. All model-generated dialogues were subjected to manual proofreading, with 105,134 defective samples discarded. The resulting dataset is therefore of verified high quality, with logical coherence and accurate empathy expression (Chen et al., 2023).
3. Representative Dialogue Samples
Samples from SoulChatCorpus exemplify contextually sensitive, empathetic consultant responses. Two representative translated examples:
Example 1: Workplace Stress
- Context:
  - user: “I’ve been feeling pressured at work ever since my manager started assigning tasks last minute.”
  - consultant: “It sounds as if those sudden deadlines are causing you stress—could you tell me more about how it affects your day?”
  - user: “I find myself anxious and unable to focus; I worry I’ll let the team down.”
Target Response: consultant: “I really understand how overwhelming it must feel when expectations change abruptly. You’ve been doing your best under pressure, and it’s natural to feel anxious. Let’s explore what support or boundaries you might put in place so you can manage these deadlines more comfortably.”
Example 2: Loneliness
- Context:
  - user: “Ever since my friends moved away, I’ve been feeling deeply lonely.”
  - consultant: “Loneliness can be painful. In what moments do you feel it most strongly?”
  - user: “Mostly in the evenings, when I have no one to call or hang out with.”
Target Response: consultant: “I can imagine how quiet and isolating evenings must feel. You deserve connection and warmth. Perhaps we can brainstorm small steps to reach out—maybe a hobby group or even a quick check-in call—that help you feel less alone.”
These samples typify the dataset’s focus on empathetic recognition, reflection, and actionable support.
4. Fine-Tuning Methodology
The SoulChatCorpus was used to fine-tune ChatGLM-6B (6.2B parameters) using the conventional token-level cross-entropy loss:
$$\mathcal{L} = -\sum_{t=1}^{T} \log P\left(y_t \mid c, y_{<t}\right)$$
where $c$ denotes the serialized conversation context, $y_{<t}$ the previously generated tokens, and $y_t$ the target token at time step $t$.
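To make the token-level cross-entropy concrete, here is a minimal sketch in plain Python. It assumes the model has already produced a probability distribution over the vocabulary at each target position (conditioned on the context and the preceding target tokens); the toy vocabulary and probabilities are invented for illustration and do not come from the source.

```python
import math


def token_cross_entropy(step_probs, target_ids):
    """Sum of -log P(y_t | context, y_<t) over all target positions.

    step_probs[t] is the predicted distribution over the vocabulary at step t;
    target_ids[t] is the vocabulary index of the reference token at step t.
    """
    if len(step_probs) != len(target_ids):
        raise ValueError("one distribution is needed per target token")
    return -sum(math.log(probs[y]) for probs, y in zip(step_probs, target_ids))


# Toy setup: vocabulary of size 3, target sequence of two tokens.
step_probs = [
    [0.7, 0.2, 0.1],  # distribution at t = 1
    [0.1, 0.8, 0.1],  # distribution at t = 2
]
target_ids = [0, 1]

loss = token_cross_entropy(step_probs, target_ids)
print(round(loss, 4))  # 0.5798, i.e. -ln(0.7 * 0.8)
```

In practice only the consultant's target response contributes to the sum; the context tokens serve as conditioning input.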
Fine-tuning regimen:
- Batch size: 80
- Training steps: 30,000 (approximately 1 epoch)
- Learning rate: warmup over the first 1,000 steps to the peak rate, then linear decay
- Maximum input length: 1,536 tokens
- Maximum target length: 512 tokens
- Optimizer: AdamW with default $\beta$ values and weight decay
- Decoding configuration: top-p sampling, temperature 0.95
This setup leverages the scale of SoulChatCorpus for robust training and practical inference.
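The warmup-then-decay schedule and the "approximately 1 epoch" figure can be checked with a short sketch. The warmup is assumed to be linear and the decay is assumed to reach zero at the final step; neither detail is stated in the text. The peak learning rate is left as a placeholder of 1.0, so the printed values are fractions of the peak, and `lr_at` is a hypothetical helper name.

```python
def lr_at(step, peak_lr, warmup_steps=1_000, total_steps=30_000):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


PEAK_LR = 1.0  # placeholder: values below are fractions of the (unspecified) peak rate

for step in (0, 500, 1_000, 15_500, 30_000):
    print(step, lr_at(step, PEAK_LR))

# Epoch check: batch size 80 for 30,000 steps against 2,300,248 samples.
samples_seen = 80 * 30_000
print(samples_seen, round(samples_seen / 2_300_248, 2))  # 2400000 1.04
```

With 2.4 million samples processed against a corpus of about 2.3 million, the run covers slightly more than one pass over the data, consistent with the stated figure.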
5. Evaluation Protocols and Results
Models fine-tuned on SoulChatCorpus were evaluated automatically (BLEU-1 to BLEU-4, ROUGE-1/2/L) on a held-out test set of 10,000 samples, and by human experts (100 samples) on the four CEHS metrics: Content naturalness (0–2), Empathy level (0–2), Helpfulness (0–2), and Safety (0–1). CEHS judgments were made by three psychology experts; Fleiss' $\kappa$ values for inter-rater agreement ranged from 0.472 to 1.00.
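For readers unfamiliar with the agreement statistic, the following is a minimal sketch of Fleiss' kappa for a fixed number of raters per item. The rating counts are made up for illustration (three raters, scores 0 to 2) and are not the study's data; returning 1.0 when all ratings fall in a single category is a convention chosen here to avoid division by zero.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa.

    ratings[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])

    # Share of all assignments that went to each category.
    p_cat = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Observed agreement per item, then averaged over items.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)

    if p_expected == 1.0:  # all ratings in one category: treat as perfect agreement
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical counts: 4 responses, 3 raters, categories = scores 0, 1, 2.
toy = [
    [0, 0, 3],
    [0, 1, 2],
    [0, 3, 0],
    [1, 2, 0],
]
print(round(fleiss_kappa(toy), 3))
```

A value near 0.47 is usually read as moderate agreement, while 1.00 means the three experts gave identical scores on every sample for that metric.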
Comparative Metrics Table
| Model | BLEU-1 | BLEU-4 | ROUGE-L | Empathy (0–2) |
|---|---|---|---|---|
| ChatGLM-6B | 22.73 | 4.92 | 18.84 | 1.55 |
| MeChat | 29.43 | 6.71 | 21.12 | 1.70 |
| ChatGPT | 27.98 | 6.23 | 21.92 | 1.62 |
| SoulChat (fine-tuned) | 33.78 | 8.52 | 26.57 | 1.84 |
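As a reading aid, the margins between SoulChat and each baseline can be computed directly from the table. The sketch below only restates the tabulated numbers; the `gains_over` helper is a hypothetical name.

```python
# Columns: BLEU-1, BLEU-4, ROUGE-L, Empathy (0-2), copied from the table.
SCORES = {
    "ChatGLM-6B": (22.73, 4.92, 18.84, 1.55),
    "MeChat": (29.43, 6.71, 21.12, 1.70),
    "ChatGPT": (27.98, 6.23, 21.92, 1.62),
    "SoulChat": (33.78, 8.52, 26.57, 1.84),
}
METRICS = ("BLEU-1", "BLEU-4", "ROUGE-L", "Empathy")


def gains_over(baseline, model="SoulChat"):
    """Absolute difference (model minus baseline) for each metric."""
    return {
        metric: round(a - b, 2)
        for metric, a, b in zip(METRICS, SCORES[model], SCORES[baseline])
    }


for name in ("ChatGLM-6B", "MeChat", "ChatGPT"):
    print(name, gains_over(name))
```

Against its own base model, ChatGLM-6B, the fine-tuned model gains about 11 BLEU-1 points and 0.29 on the expert empathy scale on this test set.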
In zero-shot evaluation on SMILECHAT (355,733 samples), SoulChat yields BLEU-1 = 35.40 and Empathy = 1.90, a 12.5-point BLEU-1 improvement over ChatGLM-6B. These results indicate clear gains in generating fluent, comfort-oriented, and empathetic model outputs.
6. Licensing, Distribution, and Citation
The SoulChatCorpus and associated SoulChat model will be distributed under an academic-research license, with details specified in the final publication and accompanying repository (https://github.com/scutcyr/SoulChat). The dataset download and explicit license file (e.g., CC-BY-NC-4.0 or equivalent) will be accessible via this link. Users are instructed to cite:
Y. Chen, X. Xing, J. Lin, H. Zheng, Z. Wang, Q. Liu, X. Xu. “SoulChat: Improving LLMs’ Empathy, Listening, and Comfort Abilities through Fine-tuning with Multi-turn Empathy Conversations,” Findings of the Association for Computational Linguistics: EMNLP 2023.
In sum, SoulChatCorpus provides a rich, vetted resource for advancing LLMs’ ability to engage in nuanced, empathetic, and supportive multi-turn dialogue within the mental health context, with demonstrated effectiveness across both automatic and expert-centric human evaluation benchmarks (Chen et al., 2023).