
MultiESC: Multi-turn Emotional Support Framework

Updated 6 February 2026
  • MultiESC is a multi-turn emotional support framework that integrates A*-like lookahead planning, dynamic user-state tracking, and strategy-aware response generation.
  • It employs a modular architecture with dialogue encoding, emotion-cause extraction, strategy planning, and conditioned response decoding to optimize long-term user well-being.
  • Empirical evaluations on the ESConv dataset show MultiESC outperforms baselines, notably increasing CIDEr scores and improving human-assessed dialogue quality.

MultiESC refers to a distinctive framework for multi-turn Emotional Support Conversation (ESC), targeting the goal of maximizing user well-being over extended dialogues. It formalizes multi-step strategy planning for emotional support agents by integrating lookahead planning, user-state tracking, and strategy-conditioned response generation within a unified neural architecture (Cheng et al., 2022).

1. Motivation and Overview

Multi-turn ESC requires agents to engage in sustained, context-aware support beyond single-turn empathy exchange. Key technical challenges are (i) selecting appropriate support strategies over a prolonged dialogue horizon to maximize cumulative user relief, and (ii) dynamically modeling evolving user states—capturing shifts in emotion intensity and identifying the underlying causes of distress. MultiESC addresses these by operationalizing a lookahead strategy planner (A*-like search), a fine-grained emotion-cause user-state encoder, and a strategy-aware response decoder. This architecture is designed to optimize for long-term, not just immediate, conversational outcomes.

2. System Architecture

MultiESC is a modular framework, processing each turn $t$ in four sequential stages:

  • Dialogue Encoder: A Transformer encoder aggregates the $N$ most recent tokens of the conversation history $\mathcal{H}_t$ into hidden states $\mathbf{H}_t \in \mathbb{R}^{N\times d_{\text{emb}}}$.
  • User-State Encoder: Each user utterance $y_i$ is processed to extract emotion-cause spans $c_i$ via an external detector (e.g., RECCON). Tokens in the concatenation $[x_i][y_i][c_i]$ are mapped to the sum of word, positional, and emotion (VAD-quantized) embeddings. A second Transformer produces per-round user-state vectors $\mathbf{u}_i$, collected into the cumulative state matrix $\mathbf{U}_t = [\mathbf{u}_1; \dots; \mathbf{u}_{t-1}]$.
  • Strategy Planning Module: For the current state, MultiESC scores every candidate support strategy $s_t \in \mathcal{S}$ by maximizing a composite function $F(s_t)$ that integrates immediate fit and estimated future utility.
  • Utterance Decoder: A strategy-conditioned Transformer decoder generates the response $x_t$, informed by $\hat{s}_t$, $\mathbf{H}_t$, and $\mathbf{U}_t$.

The architecture enables joint optimization and interoperability between strategy planning, user modeling, and controlled text generation.
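The four stages above can be sketched as a single per-turn function. This is a minimal, illustrative skeleton: every component below is a hypothetical stand-in for the corresponding neural module, not MultiESC's actual implementation.

```python
# Toy sketch of MultiESC's four-stage per-turn pipeline.
# All components are hypothetical stand-ins for the neural modules.

def encode_dialogue(history):
    # Dialogue Encoder: Transformer over recent conversation history.
    return " | ".join(history)

def encode_user_state(utterance):
    # User-State Encoder: emotion-cause extraction + Transformer encoding.
    return {"utterance": utterance, "cause_span": None}

def plan_strategy(H_t, U_t):
    # Strategy Planning Module: A*-like lookahead over candidate strategies.
    return "Question"

def decode_response(strategy, H_t, U_t):
    # Utterance Decoder: Transformer conditioned on the chosen strategy.
    return f"[{strategy}] Could you tell me more about what happened?"

def multiesc_turn(history, user_utterances):
    H_t = encode_dialogue(history)                         # stage 1
    U_t = [encode_user_state(y) for y in user_utterances]  # stage 2
    s_hat = plan_strategy(H_t, U_t)                        # stage 3
    return s_hat, decode_response(s_hat, H_t, U_t)         # stage 4

s, x = multiesc_turn(["Hi, I lost my job."], ["Hi, I lost my job."])
```

The key design point is that the planner's output $\hat{s}_t$ is an explicit input to the decoder, which is what makes the generation strategy-controllable.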

3. Lookahead Strategy Planning

Inspired by A* search, MultiESC's core decision process computes

$F(s_t) = g(s_t) + \lambda\, h(s_t)$

where $g(s_t)$ is the negative log-probability from a Strategy Sequence Generator (SSG) and $h(s_t)$ is a heuristic estimate of the user feedback expected after executing $s_t$. The hyperparameter $\lambda$ (set to 0.7) controls how strongly the lookahead heuristic biases strategy selection.

The ideal $h(s_t)$ would marginalize over all possible future strategy sequences $s_{>t}$:

$h(s_t) = \mathbb{E}\left[f(s_t, s_{>t}, \mathbf{U}_t) \mid s_t, \mathbf{H}_t, \mathbf{U}_t\right]$

In practice, MultiESC restricts the lookahead to $L$ turns (set to 2), considers the top-$k$ most probable continuations obtained by beam search, and computes $h(s_t)$ as a probability-weighted sum over them:

$h(s_t) \approx \sum_{j=1}^{k} P(\hat{s}_{>t}^{(j)} \mid s_t, \mathbf{H}_t, \mathbf{U}_t) \cdot f(s_t, \hat{s}_{>t}^{(j)}, \mathbf{U}_t)$

The SSG is a masked Transformer decoder with multi-source cross-attention. The User-Feedback Predictor (UFP) encodes candidate strategy sequences and user-state histories via a Transformer and an LSTM, aggregates them with an attention mechanism, and produces the feedback scores used in the heuristic.
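The scoring scheme can be illustrated with toy stand-ins for the learned models. In this sketch the SSG and UFP are replaced by hypothetical scoring functions (a uniform strategy distribution and a hand-written feedback rule), so only the $F = g + \lambda h$ combination and the weighted sum over beam continuations mirror the text.

```python
import itertools
import math

# Illustrative sketch of the A*-like score F(s) = g(s) + lambda * h(s).
# SSG and UFP are toy stand-ins, not the paper's trained models.

STRATEGIES = ["Question", "Reflection", "Suggestion", "Self-disclosure"]
LAM, K, LOOKAHEAD = 0.7, 6, 2  # hyperparameters reported in the text

def ssg_score(s_t):
    # Stand-in for the SSG's score of choosing s_t (toy uniform model).
    return math.log(1.0 / len(STRATEGIES))

def ssg_continuations(s_t, k=K, lookahead=LOOKAHEAD):
    # Stand-in for beam search over length-L future strategy sequences;
    # returns (sequence, probability) pairs for the top-k beams.
    seqs = list(itertools.product(STRATEGIES, repeat=lookahead))
    return [(seq, 1.0 / len(seqs)) for seq in seqs[:k]]

def ufp_feedback(s_t, future_seq):
    # Stand-in for the User-Feedback Predictor's trajectory score.
    return 1.0 if s_t == "Question" else 0.5

def lookahead_F(s_t):
    g = ssg_score(s_t)
    # h: probability-weighted sum of predicted feedback over beam continuations.
    h = sum(p * ufp_feedback(s_t, seq) for seq, p in ssg_continuations(s_t))
    return g + LAM * h

best = max(STRATEGIES, key=lookahead_F)  # -> "Question" under these toy scores
```

Under the toy feedback rule, "Question" dominates because its expected future feedback $h$ is highest while $g$ is identical across strategies, which mirrors how the heuristic term can override immediate-fit scores.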

4. Dynamic User-State Modeling

User state is encoded as follows:

  • An emotion-cause extractor detects the text spans $c_i$ that trigger the expressed emotions.
  • Each token is embedded as $\mathbf{w}_i + \mathbf{p}_i + \mathbf{e}_i$, with $\mathbf{e}_i$ assigned by VAD-lexicon binning.
  • A Transformer encodes this representation, and the resulting $[\texttt{CLS}]$ vector $\mathbf{u}_i$ is treated as the turn-level user state.

The cumulative user state matrix supports both immediate context and long-range emotion consistency. Emotion-cause tracking enables the system to distinguish between surface affect and the underlying drivers of user distress.
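The token embedding scheme above can be sketched as follows. This is a toy illustration: the embedding tables are random, the VAD lookup is a hand-written stand-in for a real lexicon, and dimensions are shrunk from the paper's 768.

```python
import random

# Toy sketch of the w + p + e token embedding (all tables are random stand-ins).
random.seed(0)
D_EMB = 8    # toy dimension; the paper uses d_emb = 768
N_BINS = 5   # hypothetical number of VAD quantization bins

def rand_vec():
    return [random.gauss(0, 1) for _ in range(D_EMB)]

word_emb = {w: rand_vec() for w in ["i", "feel", "lost"]}
pos_emb = [rand_vec() for _ in range(32)]        # one vector per position
emo_emb = [rand_vec() for _ in range(N_BINS)]    # one vector per VAD bin

def vad_bin(token):
    # Stand-in for a VAD-lexicon lookup: quantize toy valence into a bin index.
    toy_valence = {"lost": 0.2}.get(token, 0.5)
    return min(int(toy_valence * N_BINS), N_BINS - 1)

def embed_utterance(tokens):
    # Each token is the elementwise sum of word, positional, and emotion embeddings.
    return [
        [w + p + e for w, p, e in zip(word_emb[t], pos_emb[i], emo_emb[vad_bin(t)])]
        for i, t in enumerate(tokens)
    ]

X = embed_utterance(["i", "feel", "lost"])
# X is a 3 x D_EMB matrix; a Transformer encoder then maps it to the
# turn-level user-state vector u_i via the [CLS] position.
```

Summing the three embeddings (rather than concatenating) keeps the input dimension fixed, the same convention BERT-style encoders use for word, position, and segment embeddings.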

5. Response Generation and Training

The response decoder is a strategy-conditioned Transformer that receives the strategy embedding $\mathbf{E}_s(\hat{s}_t)$ as a prepended token vector. Its architecture is identical to that of the SSG to promote information sharing. Training involves:

  • Joint training of the SSG and decoder, optimizing $\mathcal{L}_{\text{joint}} = \mathcal{L}_s + \mathcal{L}_g$.
  • Separate training of the UFP with the mean squared error loss $\mathcal{L}_f = \|\hat{f} - f^*\|^2$.

Key hyperparameters include $L=2$ (lookahead depth), $k=6$ (beam width), $d_{\text{emb}}=768$ (embedding/hidden dimension), batch size 32, and the AdamW optimizer with learning rate $5\times10^{-5}$.
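The two objectives and the reported hyperparameters can be written down directly. The loss functions below simply restate the formulas above; the configuration dictionary collects the stated values (its key names are assumptions for illustration).

```python
# Restatement of the two training objectives from the text.
def joint_loss(L_s, L_g):
    # L_joint = L_s + L_g: strategy-sequence loss plus generation loss,
    # optimized jointly for the SSG and the response decoder.
    return L_s + L_g

def ufp_loss(f_hat, f_star):
    # L_f = ||f_hat - f*||^2: MSE for the separately trained UFP.
    return (f_hat - f_star) ** 2

# Hyperparameters reported in the text (key names are illustrative).
HPARAMS = {"lookahead_L": 2, "beam_k": 6, "d_emb": 768,
           "batch_size": 32, "optimizer": "AdamW", "lr": 5e-5}
```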

6. Evaluation and Comparative Results

MultiESC is evaluated on the ESConv test set with both automatic metrics and human interaction studies.

Automatic Dialogue Metrics

Model              PPL↓   BLEU-4↑   ROUGE-L↑   METEOR↑   CIDEr↑
BlenderBot-Joint   16.8   1.66      17.94      7.54      18.04
GLHG               15.7   2.13      16.37      –         –
MultiESC           15.4   3.09      20.41      8.84      29.98

MultiESC outperforms baselines (notably BlenderBot-Joint) on all major metrics, especially CIDEr (+11.9 over BlenderBot-Joint).

Strategy Planning and Feedback

Model              Acc↑    W-F1↑   Feedback↑
BlenderBot-Joint   29.9%   29.6    3.05
MISC               31.6%   –       –
MultiESC           42.0%   34.0    3.85

MultiESC achieves a 10.4-percentage-point improvement in top-1 strategy accuracy over MISC and a +0.80 gain in predicted feedback over BlenderBot-Joint.

Human Interactive Evaluation

In 128 role-played dialogues, MultiESC demonstrates higher win rates on all dimensions (fluency, empathy, identification, suggestion, overall effectiveness), with an overall win rate of 58.6% vs. BlenderBot-Joint.

A case study shows that lookahead planning can shift strategy choice away from generic empathy or premature advice toward context-seeking behavior (e.g., selecting the "Question" strategy before issuing advice), which aligns with counseling best practices.

7. Implications and Significance

MultiESC establishes a paradigm for multi-turn emotional support systems that integrates explicit lookahead planning and fine-grained user-state modeling within Transformer-based architectures. The explicit incorporation of A*-like planning heuristics enables more effective, contextually grounded strategy selection, yielding improved dialogue coherence and support efficacy. The emotional-cause user-state encoding offers a mechanism for granular, cause-aware empathetic response, advancing the state of the art in emotionally intelligent dialogue systems.

The technical contributions are broadly applicable to domains requiring long-term dialogue objectives and fine-grained user modeling, including counseling, social chatbots, and assistive technology. MultiESC's empirical results substantiate the claim that long-term planning with user feedback estimation can materially enhance both quantitative and qualitative support effectiveness (Cheng et al., 2022).
