AugESC Dataset: Scalable Emotional Support Data

Updated 11 March 2026

AugESC is a large-scale, heuristically augmented dataset for Emotional Support Conversation (ESC), roughly 45× the size of the crowdsourced ESConv corpus.
It leverages GPT-J 6B fine-tuning, controlled dialogue generation, and rigorous heuristic filtering to produce coherent and balanced multi-turn dialogues.
Evaluations indicate that models trained on AugESC generalize robustly to diverse everyday emotional issues while matching crowdsourced dialogue quality.

AugESC is a large-scale, heuristically augmented dataset for the Emotional Support Conversation (ESC) task, designed to overcome the scale and topical limitations of the existing crowdsourced ESConv corpus by leveraging LLMs for data synthesis. Its construction, characteristics, and evaluation establish AugESC as a resource for robust training and generalization of dialogue systems providing emotional support across a broad range of everyday issues (Zheng et al., 2022).

1. Motivation and Objectives

ESC models require extensive, high-quality multi-turn dialogues in which a "supporter" addresses the emotional distress described by a "seeker." The widely used ESConv corpus was crowdsourced at high cost and covers only 1,300 sessions (≈38,000 utterances) across 13 narrowly defined topics (e.g., COVID-19, job loss), which limits the generalization of ESC models to diverse, open-domain problems.

AugESC was developed to:

Expand the available ESC training data by 45× while matching the qualitative characteristics of ESConv,
Vastly broaden topical coverage to include a wide range of everyday stressors, and
Enable better model generalization to open-domain emotional issues by synthetically constructing full dialogues from real “starter posts” using an LLM.

2. Augmentation Methodology

Dialogue augmentation is formulated as a conditioned completion process, encompassing three key phases: model fine-tuning, dialogue generation, and rigorous heuristic filtering.

2.1 Task Formalization

Let $U_0$ be a "starter post" describing an emotional concern and $I$ an instruction prefix specifying ESC context and roles (seeker as “Human:”, supporter as “AI:”). The system generates a sequence:

$I,\ \text{Human: } U_0,\ \text{AI:}\ ?$

The model outputs alternating utterances $U_1, \ldots, U_T$ (with $T$ variable), stopping at the end-of-sequence marker.
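The conditioning format above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the instruction wording in `INSTRUCTION` and the helper names `build_prompt` and `parse_turns` are assumptions introduced here; only the "Human:"/"AI:" role tags come from the text.

```python
import re

# Hypothetical instruction prefix I; the exact wording used for AugESC is not given here.
INSTRUCTION = (
    "The following is a conversation with an AI assistant. "
    "The assistant provides emotional support to the human."
)


def build_prompt(starter_post: str, instruction: str = INSTRUCTION) -> str:
    """Return the conditioning text: I, 'Human: U_0', then an open 'AI:' turn."""
    return f"{instruction}\n\nHuman: {starter_post.strip()}\nAI:"


def parse_turns(prompt: str, completion: str) -> list[tuple[str, str]]:
    """Split the dialogue part of prompt + completion into (role, utterance) pairs."""
    # Everything after the first blank line is the dialogue (assumes I has no blank line).
    dialogue = (prompt + completion).split("\n\n", 1)[1]
    turns = []
    for line in dialogue.splitlines():
        match = re.match(r"^(Human|AI):\s*(.*)$", line.strip())
        if match:
            turns.append((match.group(1), match.group(2)))
    return turns


prompt = build_prompt("I failed my driving test again.")
completion = " I'm sorry to hear that.\nHuman: Thanks.\nAI: You can retake it."
turns = parse_turns(prompt, completion)
```

Here `completion` stands in for text sampled from the language model; the first parsed turn is always the starter post $U_0$.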

2.2 Model Configuration

Base Model: GPT-J 6B, an open-source transformer-based LM.
Fine-tuning: 100 ESConv sessions (balanced topics), formatted as ESC tasks. Fine-tuned for 1 epoch, batch size 2, learning rate $5\times10^{-6}$, AdamW optimizer, 5 warmup steps, max input 1500 tokens, with gradient checkpointing.
Generation: Nucleus sampling ($p = 0.9$), max output length 1500 tokens, repetition penalty 1.05.
Starter Posts: 8,950 negative-emotion utterances from EmpatheticDialogues (length 10–60 tokens). Each serves as input for 10 generation attempts, yielding 89,500 raw dialogue candidates.
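As a rough sketch, the sampling settings and the starter-post selection can be captured as follows. The class `SamplingSettings`, its field names, and the use of whitespace tokenization for the 10–60 token length check are assumptions for illustration; the source does not say which tokenizer defines post length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingSettings:
    top_p: float = 0.9               # nucleus sampling threshold
    max_output_tokens: int = 1500    # maximum output length
    repetition_penalty: float = 1.05
    samples_per_post: int = 10       # generation attempts per starter post


def select_starter_posts(posts, min_len=10, max_len=60):
    """Keep posts whose length in whitespace tokens lies in [min_len, max_len]."""
    return [p for p in posts if min_len <= len(p.split()) <= max_len]


def num_raw_candidates(num_posts: int, settings: SamplingSettings) -> int:
    """Number of raw dialogue candidates produced before filtering."""
    return num_posts * settings.samples_per_post


settings = SamplingSettings()
total = num_raw_candidates(8950, settings)  # 8,950 posts x 10 attempts = 89,500
```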

2.3 Heuristic Postprocessing

Filtering ensures coherent, balanced, and ESConv-compatible dialogues. Major criteria:

Augmentation failures: Remove outputs with non-dialogue lines, missing end-of-sequence, or prompt leakage (intra-utterance role tags).
Self-reinforcement avoidance: Exclude if one speaker has more than $2.5\times$ as many turns as the other or if the same speaker speaks more than 3 times consecutively.
Distributional fit: Enforce $\geq 10$ turns per dialogue; average seeker/supporter turn lengths within [6, 40]/[8, 40] tokens; and a cap on the maximum turn length.

Post-filtering, about 65,000 dialogues (72.7% retention) meet all standards. The retention ratio is $N_{\text{kept}} / N_{\text{raw}} \approx 65{,}000 / 89{,}500 \approx 0.73$.
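The filtering criteria can be sketched as a single predicate over a parsed dialogue. This is a simplified illustration under stated assumptions: turns are `(role, utterance)` pairs, length is measured in whitespace tokens, and the cap on maximum turn length is omitted because its value is not stated here. The names `keep_dialogue` and `longest_run` are hypothetical.

```python
import re

ROLE_TAG = re.compile(r"\b(Human|AI):")


def longest_run(roles):
    """Length of the longest streak of consecutive turns by one speaker."""
    best = run = 0
    prev = None
    for role in roles:
        run = run + 1 if role == prev else 1
        best = max(best, run)
        prev = role
    return best


def keep_dialogue(turns, min_turns=10, max_ratio=2.5, max_run=3,
                  seeker_bounds=(6, 40), supporter_bounds=(8, 40)):
    """Return True if a dialogue passes the heuristic checks."""
    roles = [r for r, _ in turns]
    if len(turns) < min_turns:
        return False
    # Prompt leakage: a role tag appearing inside an utterance.
    if any(ROLE_TAG.search(u) for _, u in turns):
        return False
    n_seeker, n_supporter = roles.count("Human"), roles.count("AI")
    if min(n_seeker, n_supporter) == 0:
        return False
    # Turn-count imbalance between the two speakers.
    if max(n_seeker, n_supporter) > max_ratio * min(n_seeker, n_supporter):
        return False
    if longest_run(roles) > max_run:
        return False

    def avg_len(role):
        lens = [len(u.split()) for r, u in turns if r == role]
        return sum(lens) / len(lens)

    lo_s, hi_s = seeker_bounds
    lo_p, hi_p = supporter_bounds
    return lo_s <= avg_len("Human") <= hi_s and lo_p <= avg_len("AI") <= hi_p


seeker = "I feel really stressed about my exams"
supporter = "That sounds hard, do you want to talk more"
good = [("Human", seeker), ("AI", supporter)] * 5
```

Checks for non-dialogue lines and a missing end-of-sequence marker would run earlier, on the raw generated text, and are not shown.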

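For topic analysis, the log-odds ratio with an informative Dirichlet prior, $\delta_w^{(A-B)}$, is calculated for word $w$ between corpora $A$ and $B$:

$\delta_w^{(A-B)} = \log\frac{f_w^{A} + \alpha}{N^{A} + \alpha|V| - f_w^{A} - \alpha} - \log\frac{f_w^{B} + \alpha}{N^{B} + \alpha|V| - f_w^{B} - \alpha}$

where $f_w^{A}$ is the frequency of $w$ in corpus $A$, $N^{A}$ is the total number of tokens in $A$, $|V|$ is the vocabulary size, and $\alpha$ the Dirichlet prior.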
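A minimal sketch of this statistic, assuming a uniform prior weight `alpha` per vocabulary word so that the total prior mass is `alpha * |V|`. The function name and the default `alpha=0.01` are illustrative choices, not values from the source.

```python
import math
from collections import Counter


def log_odds_dirichlet(tokens_a, tokens_b, alpha=0.01):
    """Per-word log-odds ratio between two token lists with a uniform Dirichlet prior.

    Assumes the joint vocabulary has at least two word types.
    """
    f_a, f_b = Counter(tokens_a), Counter(tokens_b)
    vocab = set(f_a) | set(f_b)
    n_a, n_b = len(tokens_a), len(tokens_b)
    alpha_0 = alpha * len(vocab)  # total prior mass
    scores = {}
    for w in vocab:
        odds_a = (f_a[w] + alpha) / (n_a + alpha_0 - f_a[w] - alpha)
        odds_b = (f_b[w] + alpha) / (n_b + alpha_0 - f_b[w] - alpha)
        scores[w] = math.log(odds_a) - math.log(odds_b)
    return scores


corpus_a = "covid pandemic covid job".split()
corpus_b = "dog car dog house".split()
scores = log_odds_dirichlet(corpus_a, corpus_b)
top_a = sorted(scores, key=scores.get, reverse=True)[:3]  # words most typical of A
```

Positive scores mark words over-represented in the first corpus, negative scores words over-represented in the second.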

3. Dataset Characteristics

3.1 Scale

Corpus	#Sessions	#Utterances	Avg. Turns/Session	Avg. Tokens/Turn
ESConv	1,300	38,000	28.9	18.8
AugESC	65,000	1,738,000	26.7	18.7

By session count, AugESC is 50 times the size of ESConv (65,000 vs. 1,300); by utterance count the factor is about 45 (1,738,000 vs. 38,000). Dialogue depth and utterance length are similar.

3.2 Topic and Lexical Diversity

Topic coverage is assessed by the top-30 words in each corpus, ranked by the informative Dirichlet log-odds ratio:

ESConv: Dominated by COVID-19, health, work loss, and crowdsourcing artefacts (e.g., "pandemic", "covid", "zoom", "mturk").
AugESC: Displays diverse everyday topics, e.g., "car," "dog," "house," "money," "neighbors," "parents," signifying broader content coverage.

TF-IDF pairwise cosine similarity distributions between dialogues reveal that both ESConv and AugESC have comparably low inter-dialogue similarity, indicating the preservation of topical diversity. PCA visualization of TF-IDF embeddings shows AugESC introduces novel dialogue clusters distinct from ESConv while retaining partial overlap.
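The inter-dialogue similarity measurement can be illustrated with a small pure-Python TF-IDF. This is a sketch under assumptions: each dialogue is treated as one tokenized document, and the smoothed IDF variant used here is one common choice, not necessarily the weighting used for AugESC.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse {term: weight} dict per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}  # smoothed IDF
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def pairwise_similarities(docs):
    """Cosine similarity for every unordered pair of documents."""
    vecs = tfidf_vectors(docs)
    return [cosine(a, b) for a, b in combinations(vecs, 2)]


docs = [
    "my dog is sick".split(),
    "my dog ran away".split(),
    "lost job today".split(),
]
sims = pairwise_similarities(docs)  # pairs in order (0,1), (0,2), (1,2)
```

A histogram of `sims` over a sample of dialogues gives the similarity distribution; a mass concentrated near zero corresponds to low inter-dialogue similarity.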

4. Evaluation and Benchmarking

4.1 Human Quality Assessment

Randomly sampled dialogue subsets (60 per method; 3 annotators; 0–3 Likert scale) compare:

ESConv (crowdsourced)
Simulated chat: BlenderBot-1.4B, GPT-J-6B
LLM-only (GPT-3, no fine-tuning)
AugESC (GPT-J-6B + fine-tuning)

Metrics: Informativeness, Understanding, Helpfulness (ESC-specific), Consistency, Coherence, Unsafety (lower is better).

Results:

AugESC is comparable to ESConv across all metrics, with gaps of 0.05–0.23 points on the 0–3 scale. It significantly outperforms the simulated-chat and un-tuned LLM baselines (Student’s t-test) and exhibits unsafety rates on par with ESConv.

Metric	ESConv	AugESC (Full)
Informativeness	2.52	2.41
Understanding	2.42	2.37
Helpfulness	2.23	2.12
Consistency	2.56	2.34
Coherence	2.42	2.19
Unsafety	low	low

4.2 Generalization via Interactive Dialogue

Two 1.4B BlenderBot models:

(A) Finetuned only on ESConv
(B) Further post-trained on AugESC

N=60 participants conducted matched open-domain ESC chats (≥8 turns each) with both systems and chose which was superior on fluency, identification (empathetic understanding), comforting, suggestion, and overall support.

Dimension	Model B Win (%)	Model A Win (%)
Fluency	47	13
Identification	68	22
Comforting	55	22
Suggestion	58	15
Overall	58	28

All differences are statistically significant (sign test). This indicates that models post-trained on AugESC generalize more robustly to previously unseen support topics with minimal trade-off.
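For readers unfamiliar with the sign test, a minimal exact two-sided version is sketched below. Converting the reported win percentages into counts out of 60 participants by rounding, and dropping ties, are assumptions made for this illustration.

```python
from math import comb


def sign_test_p(wins_a: int, wins_b: int) -> float:
    """Exact two-sided sign test p-value; ties are assumed to be dropped beforehand."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    # Probability of a result at least as lopsided as k under a fair coin, one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# "Overall" row: 58% vs. 28% of 60 participants, rounded to whole counts (assumption).
wins_b = round(0.58 * 60)
wins_a = round(0.28 * 60)
p_overall = sign_test_p(wins_a, wins_b)
```

The null hypothesis is that either model is equally likely to be preferred in any non-tied comparison.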

4.3 In-Domain Automatic Metrics

On the ESConv held-out test (200 sessions), ESConv-trained versus AugESC-posttrained models show only negligible changes in:

Perplexity (PPL): 11.2 → 11.5
BLEU-2/4: 7.8/2.4 → 7.7/2.4
ROUGE-L: 16.9 → 16.7
Distinct-2/3: 23.8/48.0 → 24.3/49.4

This confirms that open-domain gains do not compromise performance in the original ESC domain.
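Of the metrics above, Distinct-n is the simplest to reproduce: the number of unique n-grams divided by the total number of n-grams across generated responses. The sketch below assumes pre-tokenized responses and returns a fraction, whereas the figures above appear to be percentages.

```python
def distinct_n(responses, n):
    """Ratio of unique n-grams to total n-grams over a list of tokenized responses."""
    total = 0
    unique = set()
    for tokens in responses:
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0


responses = [
    "i am here for you".split(),
    "i am so sorry".split(),
]
d2 = distinct_n(responses, 2)  # 6 unique bigrams out of 7
```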

5. Guidelines and Best Practices for AugESC Usage

Fine-tune downstream dialogue models (e.g., BlenderBot-1.4B) for 2–3 epochs on ESConv, then post-train for 1 epoch on AugESC (65 K sessions).
Recommended hyperparameters: learning rate (value not recoverable here), batch size ≈ 16–32, AdamW optimizer, warmup ≈ 5% of steps, nucleus sampling $p$ ≈ 0.9, repetition penalty ≈ 1.05.
Heuristic filtering was employed; domain-specific toxicity or bias verification is advised prior to deployment.
AugESC can be combined with in-domain or task-specific corpora via standard fine-tuning pipelines such as HuggingFace/Transformers.

6. Impact and Research Implications

AugESC establishes a scalable and rigorously filtered augmentation pipeline for ESC data synthesis, enabling a 45-fold increase in training material and dramatically wider topical breadth versus legacy ESC datasets. Its empirical validation (human and interactive evaluation) demonstrates that datasets synthesized via fine-tuned LLMs can match crowdsourced benchmarks on dialogue quality, while conferring substantial generalization benefits in downstream ESC models. A plausible implication is that LLM-augmented data generation may become a standard strategy in low-resource, high-quality dialogue domains where scale, diversity, and label richness are limiting factors (Zheng et al., 2022).

References

Zheng et al. (2022). AugESC: Dialogue Augmentation with Large Language Models for Emotional Support Conversation.
