Emotional Support Conversation Task
- ESC is a specialized dialogue framework where computational supporters use predefined strategies to provide comfort and encourage adaptive coping.
- The task employs a structured methodology combining strategy planning and constrained response generation, bolstered by reinforcement learning with dual-reward optimization.
- Key benchmarks and metrics demonstrate that dual-reward fine-tuning improves strategy proficiency and reduces bias, offering a more balanced support approach.
Emotional Support Conversation (ESC) Task
Emotional Support Conversation (ESC) is a specialized dialogue task wherein a computational “supporter” interacts with a human “seeker” who is experiencing emotional or psychological distress. The principal objective is to attenuate negative affect, provide comfort, and encourage adaptive coping by leveraging strategies typically rooted in helping skills theory and counseling micro-skills. Contemporary ESC system design is informed by a rich taxonomy of support strategies and an explicit focus on achieving both cognitive and affective alignment with distressed users (Zhou et al., 16 Sep 2025).
1. Formal Definition and Task Structure
The ESC task is framed as a sequence of multi-turn exchanges between seeker and supporter, with each supporter turn guided by a specific support strategy. At time t, given dialogue context C_t (all prior utterances and relevant metadata), the model must select an appropriate support strategy s_t and generate a response y_t that fulfills the functional requirements of s_t:
- Strategy Planning: s_t = f(C_t), selecting the next support strategy from the taxonomy.
- Strategy-Constrained Generation: y_t = g(C_t, s_t), producing an utterance that realizes the chosen strategy in context.
Strategies are discrete actions such as “Question” (exploration), “Reflection of Feelings,” “Affirmation and Reassurance” (comforting), “Providing Suggestions,” “Information,” and others, explicitly mapped to stages of the support process: Exploration, Comforting, and Action (Zhou et al., 16 Sep 2025, Liu et al., 2021). ESC models thus address a joint classification–generation problem, requiring both accurate strategic intent and high-quality, contextually appropriate language realization.
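As an illustrative sketch, the taxonomy can be represented as a lookup table from strategy to support stage. The strategy names follow Liu et al. (2021); the exact stage assignments below are a simplifying assumption for illustration, not the paper's authoritative mapping.

```python
# Simplified ESC strategy taxonomy (names from Liu et al., 2021).
# The stage assignments are an illustrative assumption.
STRATEGY_STAGES = {
    "Question": "Exploration",
    "Restatement or Paraphrasing": "Exploration",
    "Reflection of Feelings": "Comforting",
    "Self-disclosure": "Comforting",
    "Affirmation and Reassurance": "Comforting",
    "Providing Suggestions": "Action",
    "Information": "Action",
    "Others": "Action",
}

def stage_of(strategy: str) -> str:
    """Map a support strategy to its coarse support stage."""
    return STRATEGY_STAGES.get(strategy, "Unknown")
```

A planner that emits one of these discrete labels per turn can then be audited stage-by-stage, which is how exploration/action neglect becomes visible.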
2. Strategy Planning and Preference Bias
Extensive studies show that LLMs—though powerful at generic conversation—perform suboptimally when tasked with ESC-specific strategy planning. In practice, LLMs, including proprietary and open-source backbones, exhibit two persistent deficits:
- Low Strategy Accuracy: Models frequently select strategies that diverge from human expert judgments, especially in early (exploration) or advanced (action) stages of support (Zhou et al., 16 Sep 2025, Kang et al., 2024).
- Preference Bias: There is a pronounced bias toward a small subset of “safe” strategies such as generic comfort or reassurance. Empirical distributions often show heavy overuse of “Affirmation & Reassurance” at the expense of exploration and actionable guidance, as quantified by the deviation from an ideal uniform strategy distribution (e.g., Bradley-Terry preference models and JS-distance) (Zhou et al., 16 Sep 2025, Kang et al., 2024).
This bias results in low support diversity and degraded seeker satisfaction, as key stages of the support process become neglected.
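The bias quantification described above can be sketched as the Jensen-Shannon distance between the empirical strategy distribution and the ideal uniform one; this is a minimal self-contained implementation (base-2 JS, so the divergence lies in [0, 1]), not the paper's exact measurement code.

```python
import math
from collections import Counter

def js_distance_from_uniform(strategy_choices, strategy_set):
    """Jensen-Shannon distance (base 2) between the empirical
    distribution of chosen strategies and the uniform distribution."""
    counts = Counter(strategy_choices)
    n = len(strategy_choices)
    p = [counts.get(s, 0) / n for s in strategy_set]      # empirical
    q = [1.0 / len(strategy_set)] * len(strategy_set)     # ideal uniform
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)  # JS divergence in [0, 1]
    return math.sqrt(jsd)                   # JS distance
```

A planner that always outputs "Affirmation and Reassurance" scores far from zero, while a planner matching the uniform ideal scores zero, directly capturing the preference collapse discussed above.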
3. Knowledge Boundaries and Uncertainty Estimation
The origins of strategy preference bias are traced to the uneven “knowledge boundaries” in an LLM’s pretrained representations. For any given ESC context C, the empirical distribution over sampled strategies reveals whether the model can:
- Highly Known: Correctly and confidently select the ground-truth strategy in all samples (empirical accuracy = 1).
- Weakly Known: Partially predict the correct strategy (0 < accuracy < 1).
- Unknown: Never select the correct strategy (accuracy = 0).
These regions are estimated empirically by sampling N candidate responses under stochastic decoding (e.g., temperature sampling), extracting strategy labels, and measuring both empirical accuracy and entropy-based confidence (Zhou et al., 16 Sep 2025). Overconfident errors are observed in unknown regions, and underconfidence emerges where supporting evidence is weak.
This framework systematically partitions the ESC sample space, enabling targeted diagnostics and interventions for both model error and preference collapse.
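The diagnostic above can be sketched as follows: classify a context into a knowledge region from N stochastically decoded strategy labels, and report an entropy-based confidence. This is an illustrative implementation of the accuracy-threshold definitions (acc = 1, 0 < acc < 1, acc = 0); it assumes more than one strategy in the taxonomy.

```python
import math
from collections import Counter

def classify_region(sampled_strategies, gold_strategy, num_strategies):
    """Assign an ESC context to highly_known / weakly_known / unknown
    based on N sampled strategy labels, and compute an entropy-based
    confidence (1.0 = all samples agree, 0.0 = maximally spread)."""
    n = len(sampled_strategies)
    acc = sum(s == gold_strategy for s in sampled_strategies) / n

    counts = Counter(sampled_strategies)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    confidence = 1.0 - entropy / math.log(num_strategies)

    if acc == 1.0:
        region = "highly_known"
    elif acc > 0.0:
        region = "weakly_known"
    else:
        region = "unknown"
    return region, acc, confidence
```

Comparing `acc` against `confidence` surfaces exactly the pathologies noted above: high confidence with zero accuracy flags overconfident errors in unknown regions.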
4. Reinforcement Learning with Dual Reward and Bias Mitigation
To mitigate strategy bias and enhance proficiency, advanced fine-tuning employs reinforcement learning with a dual reward function sensitive to knowledge boundaries:
- Accuracy Reward (r_acc): alignment with the ground-truth strategy, proportional to the fraction of sampled responses whose strategy label matches it.
- Entropy/Confidence Reward (r_ent): Shannon entropy over predicted strategies, normalized by the maximum entropy log |S| over the strategy set S.
A region-aware composite reward is assigned based on region type:
- For highly/weakly known contexts: r = r_acc, exploiting reliable strategy knowledge.
- For unknown contexts: r = r_ent, encouraging exploratory diversity.
The policy update objective combines this composite reward with a consistency KL-penalty that regularizes the fine-tuned policy against the base model.
Training proceeds via Group Relative Policy Optimization (GRPO): batches of 256 dialogues, multiple sampled responses per context, a fixed learning rate and KL-penalty coefficient, over 300 episodes (Zhou et al., 16 Sep 2025).
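GRPO's core operation, standardizing each sampled response's reward against the mean and standard deviation of its own group (the samples drawn for one dialogue context), can be sketched as follows; this is an illustrative fragment, not the paper's training code.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its group,
    i.e., the set of responses sampled for one dialogue context."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Because advantages are relative within a group, no separate value network is needed: above-average responses for a context are reinforced, below-average ones suppressed.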
This dual-reward mechanism explicitly encourages strategy diversity in underexplored regions while preserving exploitation in regions where the LLM is already proficient, thus reducing systematic over-reliance on any single strategy.
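The region-aware reward switch can be sketched in a few lines. This is a simplified reading of the dual-reward mechanism (accuracy where knowledge is at least weakly reliable, normalized entropy where it is unknown); the paper's exact weighting and thresholds may differ.

```python
def dual_reward(region, acc, entropy, max_entropy):
    """Region-aware composite reward.
    acc: fraction of sampled strategies matching the ground truth.
    entropy / max_entropy: normalized Shannon entropy of sampled strategies.
    Illustrative sketch; not the paper's exact reward weighting."""
    if region in ("highly_known", "weakly_known"):
        return acc                    # r_acc: exploit reliable knowledge
    return entropy / max_entropy      # r_ent: reward exploration
```

In known regions the model is paid for getting the strategy right; in unknown regions it is paid for spreading probability mass, which is what counteracts collapse onto a single "safe" strategy.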
5. Benchmarks, Metrics, and Evaluation
Datasets
- ESConv (Liu et al., 2021): ~8 support strategies, high-quality crowdworker annotation; the standard benchmark for strategy-controlled ESC.
- ExTES (Zheng et al., 2024): ~16 strategy types, LLM-generated and human-verified, permitting evaluation of larger-scale, synthetic, and more diverse conversational phenomena.
Metrics
- Strategy Proficiency: Measured by macro-F1 and weighted-F1 on strategy prediction against human annotations.
- Strategy Bias: Divergence of the selected-strategy distribution from uniform (JS-distance or another suitable divergence measure).
- Semantic Response Quality: ROUGE-L over generated utterances.
- Human Evaluation: Acceptance, effectiveness, sensitivity, satisfaction (subjective ratings; inter-rater agreement evaluated via weighted Cohen’s kappa).
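The inter-rater agreement statistic mentioned above can be computed with a linear-weighted Cohen's kappa; a minimal two-rater sketch, assuming ratings are drawn from a shared ordered category list (illustrative, not the paper's evaluation code):

```python
from collections import Counter

def weighted_kappa(r1, r2, categories):
    """Linear-weighted Cohen's kappa for two raters over ordinal categories.
    1.0 = perfect agreement, 0.0 = chance-level agreement."""
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    n = len(r1)
    w = lambda i, j: abs(i - j) / (k - 1)  # linear disagreement weight

    # Observed weighted disagreement.
    obs = sum(w(idx[a], idx[b]) for a, b in zip(r1, r2)) / n
    # Expected weighted disagreement under independent marginals.
    p1 = Counter(idx[a] for a in r1)
    p2 = Counter(idx[b] for b in r2)
    exp = sum(w(i, j) * (p1[i] / n) * (p2[j] / n)
              for i in range(k) for j in range(k))
    return 1.0 - obs / exp if exp else 1.0
```

Linear weights penalize near-miss ratings (e.g., 4 vs. 5) less than distant ones (1 vs. 5), which suits Likert-style subjective dimensions such as satisfaction.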
Results
- SFT alone raises strategy proficiency dramatically on ExTES but leaves preference bias high.
- Generic RL (GRPO) further increases proficiency but only marginally reduces bias.
- The proposed dual-reward RL achieves the highest proficiency on ExTES (Qwen backbone) together with the lowest bias.
- ROUGE-L improves without loss of utterance fluency.
- Human judges consistently prefer dual-reward fine-tuned models, with win-rates of 38–41% across subjective dimensions.
A qualitative shift is observed: post-dual-reward fine-tuning, models distribute support strategies across reflective listening, problem-solving, and self-disclosure, rather than repeatedly defaulting to generic comfort (Zhou et al., 16 Sep 2025).
6. Broader Implications, Limitations, and Future Directions
The identification and exploitation of weakly known contexts drive reductions in overconfidence and improve the diversity of support, suggesting that ESC models benefit from granular knowledge-boundary-aware optimization. Region-aware reward shaping enables models to adaptively balance exploration and exploitation.
Challenges remain:
- Realistic sessions are longer and more dynamic than the current 10–15-turn dialogues; multi-session phenomena and longitudinal effects are not captured.
- Knowledge-region definitions may need adaptation for other dialogic or task-oriented settings.
- The computational cost of stratified sampling (many candidate decodes per context) limits scalability for very large datasets or models.
Ongoing and future research directions include extending these methods to multi-session counseling, generalizing knowledge boundary delineation to negotiation/coaching dialogs, and exploring more efficient, possibly online, uncertainty estimation protocols (Zhou et al., 16 Sep 2025).
7. Comparative Perspective and Relationship to Related Work
Prior approaches to ESC modeling have variously used fine-tuned strategy planners, knowledge-enhanced modules, or new data augmentation techniques. Nevertheless, even with advanced in-context modeling and external knowledge assistance, the core problems of inflexible strategy selection and preference bias—and their roots in pretrained model knowledge boundaries—had not been analytically formalized or addressed via reward shaping until recently.
The presented framework can be juxtaposed with:
- External-Contact Mitigation: Integration of few-shot or knowledge-augmented exemplars helps but does not address region-specific model uncertainty (Kang et al., 2024).
- Self-Contact Approaches: Model-internal refinements either fail to decrease bias or even exacerbate it.
- Explicit Preference Optimization: Other preference-guided learning (e.g., Direct Preference Optimization on turn-level strategy pairs) targets similar deficits but often lacks the fine-grained, knowledge-region-calibrated uncertainty rewards found effective here (Zhao et al., 7 Mar 2025, Zhang et al., 22 May 2025).
Thus, knowledge-boundary-aware dual-reward RL approaches represent a methodological advance in debiasing ESC systems and improving strategic coverage, establishing a blueprint for principled, uncertainty-sensitive alignment in affective dialogue systems (Zhou et al., 16 Sep 2025).