AugESC Dataset: Scalable Emotional Support Data

Updated 11 March 2026
  • AugESC is a large-scale, heuristically augmented dataset for ESC, offering roughly 45× as much dialogue data as the crowdsourced ESConv corpus.
  • It leverages GPT-J 6B fine-tuning, controlled dialogue generation, and rigorous heuristic filtering to produce coherent and balanced multi-turn dialogues.
  • Evaluations indicate that models trained on AugESC generalize robustly to diverse everyday emotional issues while matching crowdsourced dialogue quality.

AugESC is a large-scale, heuristically augmented dataset for the Emotional Support Conversation (ESC) task, designed to overcome the scale and topical limitations of the existing crowdsourced ESConv corpus by leveraging LLMs for data synthesis. Its construction, characteristics, and evaluation establish AugESC as a resource for robust training and generalization of dialogue systems providing emotional support across a broad range of everyday issues (Zheng et al., 2022).

1. Motivation and Objectives

ESC models require extensive, high-quality multi-turn dialogues in which a "supporter" addresses the emotional distress described by a "seeker." The widely used ESConv corpus was crowdsourced at high cost and covers only about 1,300 sessions (≈38,000 utterances) spanning 13 narrowly defined topics (e.g., COVID-19, job loss), which limits how well ESC models generalize to diverse, open-domain problems.

AugESC was developed to:

  • Expand the available ESC training data by 45× while matching the qualitative characteristics of ESConv,
  • Vastly broaden topical coverage to include a wide range of everyday stressors, and
  • Enable better model generalization to open-domain emotional issues by synthetically constructing full dialogues from real “starter posts” using an LLM.

2. Augmentation Methodology

Dialogue augmentation is formulated as a conditioned completion process, encompassing three key phases: model fine-tuning, dialogue generation, and rigorous heuristic filtering.

2.1 Task Formalization

Let $U_0$ be a "starter post" describing an emotional concern and $I$ an instruction prefix specifying ESC context and roles (seeker as “Human:”, supporter as “AI:”). The system generates a sequence:

$$I,\ \text{Human: } U_0,\ \text{AI:}\ ?$$

The model outputs alternating utterances $U_1, \ldots, U_T$ (with $T$ variable), stopping at the end-of-sequence marker.
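The prompt layout described above can be sketched in a few lines of Python. This is only an illustration: the instruction wording, the function names `build_prompt` and `parse_turns`, and the one-turn-per-line assumption are hypothetical and are not taken from the source.

```python
INSTRUCTION = (  # hypothetical wording; the source does not quote the actual prefix
    "The following is a conversation in which the AI gives emotional support "
    "to a human who describes a personal problem."
)


def build_prompt(starter_post: str, instruction: str = INSTRUCTION) -> str:
    """Assemble I, 'Human: U_0', and an open 'AI:' slot for the model to complete."""
    return f"{instruction.strip()}\n\nHuman: {starter_post.strip()}\nAI:"


def parse_turns(dialogue_text: str) -> list[tuple[str, str]]:
    """Split role-tagged lines into (role, utterance) pairs; other lines are skipped."""
    turns = []
    for line in dialogue_text.splitlines():
        role, sep, utterance = line.strip().partition(":")
        if sep and role in ("Human", "AI"):
            turns.append((role, utterance.strip()))
    return turns


prompt = build_prompt("I failed my driving test for the third time.")
# A completion continues after the trailing "AI:", so prepend that tag before parsing.
completion = " I'm sorry to hear that. How are you feeling?\nHuman: Pretty discouraged."
turns = parse_turns("AI:" + completion)
```

Keeping the role tags identical in the prompt and in the parser makes it straightforward to detect malformed generations later in the pipeline.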

2.2 Model Configuration

  • Base Model: GPT-J 6B, an open-source transformer-based LM.
  • Fine-tuning: 100 ESConv sessions (balanced topics), formatted as ESC tasks. Fine-tuned for 1 epoch, batch size 2, learning rate $5\times10^{-6}$, AdamW optimizer, warmup steps 5, max input 1500 tokens, with gradient checkpointing.
  • Generation: Nucleus sampling ($p = 0.9$), max output length 1500 tokens, repetition penalty 1.05.
  • Starter Posts: 8,950 negative-emotion utterances from EmpatheticDialogues (length 10–60 tokens). Each serves as input for 10 generation attempts, yielding 89,500 raw dialogue candidates.
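The settings listed above can be collected into a single configuration object. The sketch below is illustrative only: the class name `AugmentationConfig` is hypothetical, and mapping "max output length" to a `max_new_tokens` argument in the style of Hugging Face `generate` is an assumption, since the source does not say how the limit was applied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AugmentationConfig:
    # Fine-tuning settings as listed above.
    base_model: str = "GPT-J 6B"
    finetune_sessions: int = 100
    epochs: int = 1
    batch_size: int = 2
    learning_rate: float = 5e-6
    warmup_steps: int = 5
    max_input_tokens: int = 1500
    gradient_checkpointing: bool = True
    # Generation settings as listed above.
    top_p: float = 0.9
    max_output_tokens: int = 1500
    repetition_penalty: float = 1.05
    # Starter posts and sampling budget.
    num_starter_posts: int = 8950
    attempts_per_post: int = 10

    def generation_kwargs(self) -> dict:
        """Keyword arguments in the style of Hugging Face `generate` (assumed mapping)."""
        return {
            "do_sample": True,
            "top_p": self.top_p,
            "repetition_penalty": self.repetition_penalty,
            "max_new_tokens": self.max_output_tokens,  # assumption: could be max_length
        }

    @property
    def raw_candidates(self) -> int:
        """Number of raw dialogue candidates: starter posts times attempts per post."""
        return self.num_starter_posts * self.attempts_per_post


config = AugmentationConfig()
print(config.raw_candidates)  # 8,950 posts x 10 attempts
```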

2.3 Heuristic Postprocessing

Filtering ensures coherent, balanced, and ESConv-compatible dialogues. Major criteria:

  • Augmentation failures: Remove outputs with non-dialogue lines, missing end-of-sequence, or prompt leakage (intra-utterance role tags).
  • Self-reinforcement avoidance: Exclude a dialogue if one speaker has more than $2.5\times$ as many turns as the other or if the same speaker speaks more than 3 times consecutively.
  • Distributional fit: Enforce $\geq 10$ turns per dialogue; average seeker/supporter turn lengths within [6, 40]/[8, 40] tokens; maximum turn length $\leq 80$ tokens.
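A minimal sketch of how these criteria might be applied to a parsed dialogue follows. It assumes dialogues arrive as lists of (role, utterance) pairs, counts tokens by whitespace splitting, and reads the length bounds as limits on the average tokens per turn for each role. These are assumptions, as is the function name `keep_dialogue`; the check for a missing end-of-sequence marker is left out because it depends on the raw model output.

```python
import re
from itertools import groupby

ROLE_TAG = re.compile(r"\b(?:Human|AI):")  # role tag leaking into an utterance
LENGTH_BOUNDS = {"Human": (6, 40), "AI": (8, 40)}  # average tokens per turn


def n_tokens(text: str) -> int:
    return len(text.split())  # assumption: whitespace tokens, not model tokens


def keep_dialogue(turns: list[tuple[str, str]]) -> bool:
    """Return True if a list of (role, utterance) pairs passes every heuristic."""
    roles = [role for role, _ in turns]
    # Augmentation failure: prompt leakage inside an utterance.
    if any(ROLE_TAG.search(utt) for _, utt in turns):
        return False
    # Distributional fit: at least 10 turns.
    if len(turns) < 10:
        return False
    # Self-reinforcement: unbalanced turn counts or long same-speaker runs.
    counts = [roles.count("Human"), roles.count("AI")]
    if min(counts) == 0 or max(counts) > 2.5 * min(counts):
        return False
    if any(len(list(run)) > 3 for _, run in groupby(roles)):
        return False
    # Distributional fit: per-role average length and global maximum length.
    for role, (low, high) in LENGTH_BOUNDS.items():
        lengths = [n_tokens(utt) for r, utt in turns if r == role]
        if not low <= sum(lengths) / len(lengths) <= high:
            return False
    return max(n_tokens(utt) for _, utt in turns) <= 80
```

Ordering the cheap structural checks first means most malformed candidates are rejected before any length statistics are computed.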

After filtering, roughly 65,000 dialogues (about 73% retention) meet all criteria. The retention ratio is

$$R = \frac{\lvert \text{retained} \rvert}{\lvert \text{raw} \rvert} \approx \frac{65{,}000}{89{,}500} \approx 0.73$$

For topic analysis, the informative Dirichlet log-odds ratio $\delta_w$ is calculated for word $w$ between corpora $D$ and $C$:

$$\delta_w = \log\frac{f_{w,D}+\alpha}{N_D+\alpha V-f_{w,D}-\alpha} - \log\frac{f_{w,C}+\alpha}{N_C+\alpha V-f_{w,C}-\alpha}$$

where $f$ is frequency, $N$ is total tokens, $V$ is vocabulary size, and $\alpha$ is the Dirichlet prior.
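The formula can be computed directly from token counts. The sketch below implements the expression exactly as written, with a single scalar prior; the default `alpha=0.01`, the use of the joint vocabulary of both corpora for $V$, and the function name are assumptions made for illustration.

```python
import math
from collections import Counter


def log_odds_ratio(tokens_d: list[str], tokens_c: list[str], alpha: float = 0.01) -> dict:
    """Return delta_w for every word in the joint vocabulary of corpora D and C.

    Assumes the joint vocabulary has at least two word types, so that every
    denominator stays strictly positive.
    """
    f_d, f_c = Counter(tokens_d), Counter(tokens_c)
    vocab = set(f_d) | set(f_c)
    n_d, n_c, v = len(tokens_d), len(tokens_c), len(vocab)

    def log_odds(f: int, n: int) -> float:
        # log of (f + alpha) / (N + alpha * V - f - alpha), as in the formula above
        return math.log((f + alpha) / (n + alpha * v - f - alpha))

    return {w: log_odds(f_d[w], n_d) - log_odds(f_c[w], n_c) for w in vocab}


delta = log_odds_ratio(
    "covid covid pandemic job".split(),
    "dog car job job".split(),
)
# Positive values mark words over-represented in D, negative values in C.
top_d = sorted(delta, key=delta.get, reverse=True)[:2]
```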

3. Dataset Characteristics

3.1 Scale

| Corpus | Sessions | Utterances | Avg. turns/session | Avg. tokens/turn |
|---|---|---|---|---|
| ESConv | 1,300 | 38,000 | 28.9 | 18.8 |
| AugESC | 65,000 | 1,738,000 | 26.7 | 18.7 |

By these rounded figures, AugESC is about 50 times larger than ESConv in sessions (65,000 vs. 1,300) and about 45 times larger in utterances (1,738,000 vs. 38,000), with similar dialogue depth and utterance length.

3.2 Topic and Lexical Diversity

Topic coverage is assessed by the top-30 words with $|\delta| > 2$:

  • ESConv: Dominated by COVID-19, health, work loss, and crowdsourcing artefacts (e.g., "pandemic", "covid", "zoom", "mturk").
  • AugESC: Displays diverse everyday topics, e.g., "car," "dog," "house," "money," "neighbors," "parents," signifying broader content coverage.

TF-IDF pairwise cosine similarity distributions between dialogues reveal that both ESConv and AugESC have comparably low inter-dialogue similarity, indicating the preservation of topical diversity. PCA visualization of TF-IDF embeddings shows AugESC introduces novel dialogue clusters distinct from ESConv while retaining partial overlap.
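As a rough illustration of the similarity analysis, the following sketch computes TF-IDF vectors and all pairwise cosine similarities in plain Python. Whitespace tokenization, lowercasing, and the smoothed IDF variant are assumptions; the source does not state which TF-IDF weighting or tooling was used.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs: list[str]) -> list[dict]:
    """Sparse TF-IDF vectors (word -> weight) using a smoothed IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}
    return [{w: tf * idf[w] for w, tf in Counter(toks).items()} for toks in tokenized]


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(w, 0.0) for w, weight in a.items())
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pairwise_similarities(docs: list[str]) -> list[float]:
    """Cosine similarity for every unordered pair of documents."""
    vectors = tfidf_vectors(docs)
    return [cosine(a, b) for a, b in combinations(vectors, 2)]


sims = pairwise_similarities(
    ["my dog is sick", "my dog is sick", "the car broke down"]
)
```

A histogram of the returned values for each corpus gives the kind of similarity distribution that the comparison above refers to; for a corpus of 65,000 dialogues, sampling pairs would be more practical than enumerating all of them.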

4. Evaluation and Benchmarking

4.1 Human Quality Assessment

Randomly sampled dialogue subsets (60 per method; 3 annotators; 0–3 Likert scale) compare:

  • ESConv (crowdsourced)
  • Simulated chat: BlenderBot-1.4B, GPT-J-6B
  • LLM-only (GPT-3, no fine-tuning)
  • AugESC (GPT-J-6B + fine-tuning)

Metrics: Informativeness, Understanding, Helpfulness (ESC-specific), Consistency, Coherence, Unsafety (lower is better).

Results:

AugESC is comparable to ESConv on all five quality metrics, trailing by 0.05 to 0.23 points on the 0–3 scale. It significantly outperforms the simulated-chat and untuned-LLM baselines ($p<0.01$, Student’s t-test) and exhibits unsafety rates on par with ESConv.

| Metric | ESConv | AugESC (Full) |
|---|---|---|
| Informativeness | 2.52 | 2.41 |
| Understanding | 2.42 | 2.37 |
| Helpfulness | 2.23 | 2.12 |
| Consistency | 2.56 | 2.34 |
| Coherence | 2.42 | 2.19 |
| Unsafety | low | low |

4.2 Generalization via Interactive Dialogue

Two 1.4B BlenderBot models:

  • (A) Finetuned only on ESConv
  • (B) Further post-trained on AugESC

N=60 participants conducted matched open-domain ESC chats (≥8 turns each) with both systems and chose which was superior on fluency, identification (empathetic understanding), comforting, suggestion, and overall support.

| Dimension | Model B win (%) | Model A win (%) |
|---|---|---|
| Fluency | 47 | 13 |
| Identification | 68 | 22 |
| Comforting | 55 | 22 |
| Suggestion | 58 | 15 |
| Overall | 58 | 28 |

All differences are statistically significant (sign test, $p<0.05$ or $p<0.01$ depending on the dimension). This indicates that models post-trained on AugESC generalize more robustly to previously unseen support topics with minimal trade-off.
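For readers who want to reproduce this kind of comparison, a two-sided sign test can be computed from win counts alone. The sketch below is an assumption-laden illustration: it converts the reported "Overall" percentages to counts by rounding against 60 participants (58% ≈ 35, 28% ≈ 17) and drops the remaining responses, which may not match the exact counts or tie handling used in the original study.

```python
from math import comb


def sign_test_p(wins_b: int, wins_a: int) -> float:
    """Two-sided exact sign test p-value under H0: each side wins with prob. 0.5."""
    n = wins_a + wins_b  # responses without a preference are excluded
    k = max(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Assumed counts derived from the rounded "Overall" percentages with N = 60.
p_overall = sign_test_p(wins_b=35, wins_a=17)
print(f"Overall: p = {p_overall:.4f}")
```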

4.3 In-Domain Automatic Metrics

On the ESConv held-out test (200 sessions), ESConv-trained versus AugESC-posttrained models show only negligible changes in:

  • Perplexity (PPL): 11.2 → 11.5
  • BLEU-2/4: 7.8/2.4 → 7.7/2.4
  • ROUGE-L: 16.9 → 16.7
  • Distinct-2/3: 23.8/48.0 → 24.3/49.4

This confirms that open-domain gains do not compromise performance in the original ESC domain.
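Of the metrics listed, Distinct-n is simple enough to sketch directly. The version below computes a corpus-level ratio of unique to total n-grams over whitespace tokens; both choices are assumptions, since the source does not specify the tokenizer or whether the scores (apparently reported as percentages) are averaged per response.

```python
def distinct_n(utterances: list[str], n: int) -> float:
    """Ratio of unique n-grams to total n-grams across all utterances."""
    total = 0
    unique = set()
    for utterance in utterances:
        tokens = utterance.split()  # assumption: whitespace tokenization
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


responses = ["i am so sorry to hear that", "i am here for you"]
d2 = 100 * distinct_n(responses, 2)  # scaled to a percentage
d3 = 100 * distinct_n(responses, 3)
```

Higher values indicate less repetitive generations, which is why a small increase after post-training is read as a mild diversity gain rather than a regression.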

5. Guidelines and Best Practices for AugESC Usage

  • Fine-tune downstream dialogue models (e.g., BlenderBot-1.4B) for 2–3 epochs on ESConv, then post-train for 1 epoch on AugESC (65 K sessions).
  • Recommended hyperparameters: learning rate ≈ $5\times10^{-6}$, batch size ≈ 16–32, AdamW optimizer, warmup ≈ 5% of steps, nucleus sampling $p$ ≈ 0.9, repetition penalty ≈ 1.05.
  • Heuristic filtering was employed; domain-specific toxicity or bias verification is advised prior to deployment.
  • AugESC can be combined with in-domain or task-specific corpora via standard fine-tuning pipelines such as HuggingFace/Transformers.
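To make the warmup recommendation concrete, the following sketch converts a dataset size, batch size, and warmup fraction into step counts. It treats each of the 65,000 sessions as one training example, which is a simplifying assumption: pipelines that split sessions into context-response pairs will see many more steps. The function name `schedule` is hypothetical.

```python
import math


def schedule(num_examples: int, batch_size: int, epochs: int = 1,
             warmup_frac: float = 0.05) -> tuple[int, int]:
    """Return (total optimizer steps, warmup steps) for a simple training run."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    return total_steps, int(total_steps * warmup_frac)


# Assumption: one example per AugESC session, single post-training epoch.
for batch_size in (16, 32):
    total, warmup = schedule(65_000, batch_size)
    print(f"batch size {batch_size}: {total} steps, {warmup} warmup steps")
```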

6. Impact and Research Implications

AugESC establishes a scalable and rigorously filtered augmentation pipeline for ESC data synthesis, enabling a 45-fold increase in training material and dramatically wider topical breadth versus legacy ESC datasets. Its empirical validation (human and interactive evaluation) demonstrates that datasets synthesized via fine-tuned LLMs can match crowdsourced benchmarks on dialogue quality, while conferring substantial generalization benefits in downstream ESC models. A plausible implication is that LLM-augmented data generation may become a standard strategy in low-resource, high-quality dialogue domains where scale, diversity, and label richness are limiting factors (Zheng et al., 2022).
