AugESC Dataset: Scalable Emotional Support Data
- AugESC is a large-scale, heuristically augmented dataset for Emotional Support Conversation (ESC), roughly 45× the size of the crowdsourced ESConv corpus by utterance count.
- It leverages GPT-J 6B fine-tuning, controlled dialogue generation, and rigorous heuristic filtering to produce coherent and balanced multi-turn dialogues.
- Evaluations indicate that models trained on AugESC generalize robustly to diverse everyday emotional issues while matching crowdsourced dialogue quality.
AugESC is a large-scale, heuristically augmented dataset for the Emotional Support Conversation (ESC) task, designed to overcome the scale and topical limitations of the existing crowdsourced ESConv corpus by leveraging LLMs for data synthesis. Its construction, characteristics, and evaluation establish AugESC as a resource for robust training and generalization of dialogue systems providing emotional support across a broad range of everyday issues (Zheng et al., 2022).
1. Motivation and Objectives
ESC models require extensive, high-quality multi-turn dialogues in which a "supporter" addresses the emotional distress described by a "seeker." The widely used ESConv corpus was crowdsourced at high cost and contains only about 1,300 sessions (≈38,000 utterances) spanning 13 narrowly defined topics (e.g., COVID-19, job loss), which limits the generalization of ESC models to diverse, open-domain problems.
AugESC was developed to:
- Expand the available ESC training data by 45× while matching the qualitative characteristics of ESConv,
- Vastly broaden topical coverage to include a wide range of everyday stressors, and
- Enable superior model generalization to open-domain emotional issues by synthetically constructing full dialogues from real “starter posts” using an LLM.
2. Augmentation Methodology
Dialogue augmentation is formulated as a conditioned completion process, encompassing three key phases: model fine-tuning, dialogue generation, and rigorous heuristic filtering.
2.1 Task Formalization
Let $s$ be a "starter post" describing an emotional concern and $p$ an instruction prefix specifying the ESC context and roles (seeker as “Human:”, supporter as “AI:”). Conditioned on $p$ and $s$, the system generates a sequence:

$$(u_1, u_2, \dots, u_T) \sim P_\theta(\,\cdot \mid p, s)$$

The model outputs alternating seeker and supporter utterances $u_t$ (with $T$ variable), stopping at the end-of-sequence marker.
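The conditioned-completion format can be illustrated with a small sketch that builds a prompt from a prefix and a starter post and then parses a completion back into turns. The prefix wording, the function names, and the use of `<|endoftext|>` as the end-of-sequence string are assumptions for illustration, not the authors' exact setup.

```python
import re

# Hypothetical instruction prefix; the exact wording used for AugESC is not given here.
PREFIX = "The following is a conversation with an AI assistant that offers emotional support."
# Assumed end-of-sequence string (the GPT-2 style end-of-text token).
EOS = "<|endoftext|>"

TURN_RE = re.compile(r"^(Human|AI): (.*)$")


def build_prompt(starter_post, prefix=PREFIX):
    """Prefix + starter post as the first seeker turn, leaving the supporter turn open."""
    return f"{prefix}\n\nHuman: {starter_post.strip()}\nAI:"


def parse_dialogue(text):
    """Split dialogue text into (speaker, utterance) pairs, ignoring anything after EOS."""
    body = text.split(EOS, 1)[0]
    turns = []
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue
        match = TURN_RE.match(line)
        if match is None:
            # A line without a role tag counts as a malformed (non-dialogue) output.
            raise ValueError(f"non-dialogue line: {line!r}")
        turns.append((match.group(1), match.group(2)))
    return turns
```

`parse_dialogue` expects only the dialogue portion (role-tagged lines), so the prefix should be stripped from a full completion before parsing.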
2.2 Model Configuration
- Base Model: GPT-J 6B, an open-source transformer-based LM.
- Fine-tuning: 100 ESConv sessions (balanced topics), formatted as ESC tasks. Fine-tuned for 1 epoch, batch size 2, learning rate , AdamW optimizer, warmup steps 5, max input 1500 tokens, with gradient checkpointing.
- Generation: Nucleus sampling ($p = 0.9$), max output length 1500 tokens, repetition penalty 1.05.
- Starter Posts: 8,950 negative-emotion utterances from EmpatheticDialogues (length 10–60 tokens). Each serves as input for 10 generation attempts, yielding 89,500 raw dialogue candidates.
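As a rough illustration of the starter-post step, the sketch below keeps utterances within the 10–60 token length window and expands each into 10 generation attempts. Whitespace tokenization and the function names are assumptions; the negative-emotion selection is not modeled.

```python
def select_starter_posts(utterances, min_len=10, max_len=60):
    """Keep utterances whose length (whitespace tokens, an assumption) is in [min_len, max_len]."""
    return [u for u in utterances if min_len <= len(u.split()) <= max_len]


def expand_candidates(posts, attempts=10):
    """Pair every starter post with an attempt index, one entry per generation attempt."""
    return [(post_id, attempt, post)
            for post_id, post in enumerate(posts)
            for attempt in range(attempts)]


# Sanity check on the reported numbers: 8,950 posts x 10 attempts = 89,500 candidates.
assert 8950 * 10 == 89500
```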
2.3 Heuristic Postprocessing
Filtering ensures coherent, balanced, and ESConv-compatible dialogues. Major criteria:
- Augmentation failures: Remove outputs with non-dialogue lines, missing end-of-sequence, or prompt leakage (intra-utterance role tags).
- Self-reinforcement avoidance: Exclude dialogues with strongly unbalanced turn counts (one speaker having a large multiple of the other's turns) or in which the same speaker speaks 3 times consecutively.
- Distributional fit: Enforce turns per dialogue; average seeker/supporter turns within length bounds ([6, 40]/[8, 40] tokens); max turn tokens.
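The balance and length criteria can be sketched as a single predicate over parsed turns. This is a minimal sketch, not the authors' implementation: the thresholds `max_ratio`, `min_turns`, and `max_turn_tokens` are not stated in this text, so they are required arguments with no defaults, `min_turns` is assumed to be a lower bound, and lengths are counted in whitespace tokens.

```python
from itertools import groupby


def token_len(utterance):
    """Length in whitespace tokens (an assumption about the tokenizer)."""
    return len(utterance.split())


def passes_filters(turns, *, max_ratio, min_turns, max_turn_tokens):
    """turns: list of (speaker, utterance) with speaker in {"Human", "AI"}."""
    seeker = [u for s, u in turns if s == "Human"]
    supporter = [u for s, u in turns if s == "AI"]
    if not seeker or not supporter:
        return False
    # Unbalanced turn counts between the two speakers.
    counts = (len(seeker), len(supporter))
    if max(counts) / min(counts) > max_ratio:
        return False
    # Same speaker talking 3 or more times in a row.
    if any(sum(1 for _ in run) >= 3 for _, run in groupby(s for s, _ in turns)):
        return False
    # Dialogue-level and turn-level length limits (hypothetical thresholds).
    if len(turns) < min_turns:
        return False
    if any(token_len(u) > max_turn_tokens for _, u in turns):
        return False
    # Average turn length bounds given in the text: seeker [6, 40], supporter [8, 40].
    avg_seeker = sum(map(token_len, seeker)) / len(seeker)
    avg_supporter = sum(map(token_len, supporter)) / len(supporter)
    return 6 <= avg_seeker <= 40 and 8 <= avg_supporter <= 40
```

Making the unknown thresholds keyword-only forces the caller to choose them explicitly rather than inherit a guessed default.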
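After filtering, about 65,000 dialogues (72.7% retention) meet all criteria. The retention ratio is

$$r = \frac{N_{\text{kept}}}{N_{\text{raw}}} \approx \frac{65{,}000}{89{,}500} \approx 0.73$$

Using the rounded session count gives about 72.6%, so the reported 72.7% presumably reflects the unrounded count.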
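For topic analysis, the informative Dirichlet log-odds ratio is calculated for word $w$ between corpora $i$ and $j$. In its standard form with a uniform prior, it reads:

$$\delta_w^{(i-j)} = \log\frac{y_w^{(i)} + \alpha}{n^{(i)} + |V|\alpha - y_w^{(i)} - \alpha} - \log\frac{y_w^{(j)} + \alpha}{n^{(j)} + |V|\alpha - y_w^{(j)} - \alpha}$$

where $y_w^{(i)}$ is the frequency of $w$ in corpus $i$, $n^{(i)}$ is the total number of tokens in corpus $i$, $|V|$ is the vocabulary size, and $\alpha$ is the Dirichlet prior.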
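A log-odds ratio with a uniform Dirichlet prior can be computed from two word-count tables as in the sketch below. The function name, the prior value `alpha=0.01`, and the z-score normalization by the approximate variance are assumptions based on the commonly used form of this statistic, not details confirmed by the text.

```python
import math
from collections import Counter


def log_odds_z(counts_i, counts_j, alpha=0.01):
    """Z-scored log-odds ratio per word; positive values favor corpus i.

    Assumes the shared vocabulary has at least two words, so every
    denominator below stays positive.
    """
    vocab = set(counts_i) | set(counts_j)
    n_i, n_j = sum(counts_i.values()), sum(counts_j.values())
    alpha_0 = alpha * len(vocab)  # total prior mass, |V| * alpha
    scores = {}
    for w in vocab:
        y_i, y_j = counts_i.get(w, 0), counts_j.get(w, 0)
        delta = (math.log((y_i + alpha) / (n_i + alpha_0 - y_i - alpha))
                 - math.log((y_j + alpha) / (n_j + alpha_0 - y_j - alpha)))
        variance = 1.0 / (y_i + alpha) + 1.0 / (y_j + alpha)
        scores[w] = delta / math.sqrt(variance)
    return scores


# Toy corpora (made-up tokens, for illustration only).
corpus_a = Counter("covid covid pandemic job job job".split())
corpus_b = Counter("dog dog car job job job".split())
z = log_odds_z(corpus_a, corpus_b)
top_a = sorted(z, key=z.get, reverse=True)[:2]  # words most characteristic of corpus_a
```

Words shared equally by both corpora (here "job") score zero, which is why the statistic surfaces corpus-specific vocabulary rather than merely frequent words.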
3. Dataset Characteristics
3.1 Scale
| Corpus | #Sessions | #Utterances | Avg. Turns/Session | Avg. Tokens/Turn |
|---|---|---|---|---|
| ESConv | 1,300 | 38,000 | 28.9 | 18.8 |
| AugESC | 65,000 | 1,738,000 | 26.7 | 18.7 |
AugESC is roughly 45 times larger than ESConv by utterance count (1,738,000 / 38,000 ≈ 45.7) and 50 times larger by session count (65,000 / 1,300), with similar dialogue depth and utterance length.
3.2 Topic and Lexical Diversity
Topic coverage is assessed by the 30 top-ranked words of each corpus under the log-odds ratio:
- ESConv: Dominated by COVID-19, health, work loss, and crowdsourcing artefacts (e.g., "pandemic", "covid", "zoom", "mturk").
- AugESC: Displays diverse everyday topics, e.g., "car," "dog," "house," "money," "neighbors," "parents," signifying broader content coverage.
TF-IDF pairwise cosine similarity distributions between dialogues reveal that both ESConv and AugESC have comparably low inter-dialogue similarity, indicating the preservation of topical diversity. PCA visualization of TF-IDF embeddings shows AugESC introduces novel dialogue clusters distinct from ESConv while retaining partial overlap.
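The inter-dialogue similarity analysis can be approximated with a few lines of pure Python. This sketch assumes whitespace tokenization and a smoothed IDF of the form ln((1 + N) / (1 + df)) + 1; the tokenizer and weighting actually used for AugESC are not specified here.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """Sparse TF-IDF vectors (dicts) with smoothed IDF; one document per dialogue."""
    tokenized = [d.lower().split() for d in docs]
    doc_freq = Counter(w for toks in tokenized for w in set(toks))
    n_docs = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({w: c * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1.0)
                        for w, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pairwise_similarities(docs):
    """Similarity for every unordered pair of documents, in combinations() order."""
    vectors = tfidf_vectors(docs)
    return [cosine(a, b) for a, b in combinations(vectors, 2)]
```

A distribution of these pairwise values concentrated near zero indicates that dialogues rarely share distinctive vocabulary, which is the diversity property described above.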
4. Evaluation and Benchmarking
4.1 Human Quality Assessment
Sixty randomly sampled dialogues per method were rated by 3 annotators on a 0–3 Likert scale. The compared sources are:
- ESConv (crowdsourced)
- Simulated chat: BlenderBot-1.4B, GPT-J-6B
- LLM-only (GPT-3, no fine-tuning)
- AugESC (GPT-J-6B + fine-tuning)
Metrics: Informativeness, Understanding, Helpfulness (ESC-specific), Consistency, Coherence, Unsafety (lower is better).
Results:
AugESC is close to ESConv on every metric, trailing by 0.05–0.23 points on the 0–3 scale, with the largest gaps on Consistency and Coherence. It significantly outperforms the simulated-chat and untuned-LLM baselines (Student’s t-test) and exhibits unsafety rates on par with ESConv.
| Metric | ESConv | AugESC (Full) |
|---|---|---|
| Informativeness | 2.52 | 2.41 |
| Understanding | 2.42 | 2.37 |
| Helpfulness | 2.23 | 2.12 |
| Consistency | 2.56 | 2.34 |
| Coherence | 2.42 | 2.19 |
| Unsafety | low | low |
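The per-metric gaps between the two corpora can be read off the table with a short calculation. The scores are copied from the table; the variable names are illustrative.

```python
# (ESConv, AugESC) mean ratings on the 0-3 scale, copied from the table above.
scores = {
    "Informativeness": (2.52, 2.41),
    "Understanding": (2.42, 2.37),
    "Helpfulness": (2.23, 2.12),
    "Consistency": (2.56, 2.34),
    "Coherence": (2.42, 2.19),
}

# Gap = ESConv minus AugESC, rounded to two decimals to absorb float noise.
gaps = {metric: round(esconv - augesc, 2)
        for metric, (esconv, augesc) in scores.items()}
largest = max(gaps, key=gaps.get)
smallest = min(gaps, key=gaps.get)
```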
4.2 Generalization via Interactive Dialogue
Two BlenderBot-1.4B models are compared:
- (A) Fine-tuned only on ESConv
- (B) Further post-trained on AugESC
N=60 participants conducted matched open-domain ESC chats (≥8 turns each) with both systems and chose which was superior on fluency, identification (empathetic understanding), comforting, suggestion, and overall support.
| Dimension | Model B Win (%) | Model A Win (%) |
|---|---|---|
| Fluency | 47 | 13 |
| Identification | 68 | 22 |
| Comforting | 55 | 22 |
| Suggestion | 58 | 15 |
| Overall | 58 | 28 |
All differences are statistically significant (sign test). This indicates that models post-trained on AugESC generalize more robustly to previously unseen support topics.
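A two-sided exact sign test on the pairwise preferences can be sketched as follows. Converting the reported percentages to counts with N = 60 and rounding is an assumption, as is dropping the remaining (non-win) responses as ties, which is the usual convention for a sign test.

```python
from math import comb


def sign_test_p(wins_a, wins_b):
    """Two-sided exact sign test p-value under H0: each system equally likely to win."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = max(wins_a, wins_b)
    # Upper-tail probability P(X >= k) for X ~ Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


N = 60  # number of participants
# (Model B win %, Model A win %) per dimension, copied from the table above.
win_pct = {
    "Fluency": (47, 13),
    "Identification": (68, 22),
    "Comforting": (55, 22),
    "Suggestion": (58, 15),
    "Overall": (58, 28),
}

p_values = {
    dim: sign_test_p(round(b * N / 100), round(a * N / 100))
    for dim, (b, a) in win_pct.items()
}
```

Under these assumptions every dimension falls below the conventional 0.05 level, in line with the significance claim in the text.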
4.3 In-Domain Automatic Metrics
On the ESConv held-out test (200 sessions), ESConv-trained versus AugESC-posttrained models show only negligible changes in:
- Perplexity (PPL): 11.2 → 11.5
- BLEU-2/4: 7.8/2.4 → 7.7/2.4
- ROUGE-L: 16.9 → 16.7
- Distinct-2/3: 23.8/48.0 → 24.3/49.4
These near-identical scores suggest that the open-domain gains do not compromise performance in the original ESConv domain.
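Of the metrics above, Distinct-n is the simplest to reproduce: it is the ratio of unique n-grams to total n-grams in the generated responses. The sketch below computes it at corpus level with whitespace tokenization; both choices are assumptions, since the exact evaluation script is not described here.

```python
def distinct_n(responses, n):
    """Corpus-level Distinct-n: unique n-grams divided by total n-grams."""
    total = 0
    unique = set()
    for response in responses:
        tokens = response.split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0
```

Higher values mean less repetitive output, so the small rise in Distinct-2/3 after post-training points to slightly more varied responses.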
5. Guidelines and Best Practices for AugESC Usage
- Fine-tune downstream dialogue models (e.g., BlenderBot-1.4B) for 2–3 epochs on ESConv, then post-train for 1 epoch on AugESC (65 K sessions).
- Recommended hyperparameters: learning rate ≈ , batch size ≈ 16–32, AdamW optimizer, warmup ≈ 5% steps, nucleus sampling ≈ 0.9, repetition penalty ≈ 1.05.
- Heuristic filtering was employed; domain-specific toxicity or bias verification is advised prior to deployment.
- AugESC can be combined with in-domain or task-specific corpora via standard fine-tuning pipelines such as Hugging Face Transformers.
6. Impact and Research Implications
AugESC establishes a scalable and rigorously filtered augmentation pipeline for ESC data synthesis, enabling a 45-fold increase in training material and dramatically wider topical breadth versus legacy ESC datasets. Its empirical validation (human and interactive evaluation) demonstrates that datasets synthesized via fine-tuned LLMs can match crowdsourced benchmarks on dialogue quality, while conferring substantial generalization benefits in downstream ESC models. A plausible implication is that LLM-augmented data generation may become a standard strategy in low-resource, high-quality dialogue domains where scale, diversity, and label richness are limiting factors (Zheng et al., 2022).