
MultiESC: Multi-turn Emotional Support Framework

Updated 6 February 2026
  • MultiESC is a multi-turn emotional support framework that integrates A*-like lookahead planning, dynamic user-state tracking, and strategy-aware response generation.
  • It employs a modular architecture with dialogue encoding, emotion-cause extraction, strategy planning, and conditioned response decoding to optimize long-term user well-being.
  • Empirical evaluations on the ESConv dataset show MultiESC outperforms baselines, notably increasing CIDEr scores and improving human-assessed dialogue quality.

MultiESC refers to a distinctive framework for multi-turn Emotional Support Conversation (ESC), targeting the goal of maximizing user well-being over extended dialogues. It formalizes multi-step strategy planning for emotional support agents by integrating lookahead planning, user-state tracking, and strategy-conditioned response generation within a unified neural architecture (Cheng et al., 2022).

1. Motivation and Overview

Multi-turn ESC requires agents to engage in sustained, context-aware support beyond single-turn empathy exchange. Key technical challenges are (i) selecting appropriate support strategies over a prolonged dialogue horizon to maximize cumulative user relief, and (ii) dynamically modeling evolving user states—capturing shifts in emotion intensity and identifying the underlying causes of distress. MultiESC addresses these by operationalizing a lookahead strategy planner (A*-like search), a fine-grained emotion-cause user-state encoder, and a strategy-aware response decoder. This architecture is designed to optimize for long-term, not just immediate, conversational outcomes.

2. System Architecture

MultiESC is a modular framework, processing each turn $t$ in four sequential stages:

  • Dialogue Encoder: A Transformer encoder aggregates the $N$ most recent tokens of the conversation history $\mathcal{H}_t$ into hidden states $\mathbf{H}_t \in \mathbb{R}^{N\times d_{\text{emb}}}$.
  • User-State Encoder: Each user utterance $y_i$ is processed to extract emotion-cause spans $c_i$ via an external detector (e.g., RECCON). Tokens in the concatenation $[x_i][y_i][c_i]$ are mapped to the sum of word, positional, and emotion (VAD-quantized) embeddings. A second Transformer produces per-round user-state vectors $\mathbf{u}_i$, collected into the cumulative state matrix $\mathbf{U}_t = [\mathbf{u}_1; \dots; \mathbf{u}_{t-1}]$.
  • Strategy Planning Module: For the current state, MultiESC scores every candidate support strategy $s_t \in \mathcal{S}$ by maximizing a composite function $F(s_t)$ that integrates immediate fit and estimated future utility.
  • Utterance Decoder: A strategy-conditioned Transformer decoder generates the response $x_t$, informed by $\hat{s}_t$, $\mathbf{H}_t$, and $\mathbf{U}_t$.

The architecture enables joint optimization and interoperability between strategy planning, user modeling, and controlled text generation.
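The four stages above can be sketched as a single per-turn function. This is a minimal, illustrative skeleton: every component below is a hypothetical stand-in for the corresponding neural module, not MultiESC's actual implementation.

```python
# Toy sketch of MultiESC's four-stage per-turn pipeline.
# All components are hypothetical stand-ins for the neural modules.

def encode_dialogue(history):
    # Dialogue Encoder: Transformer over recent conversation history.
    return " | ".join(history)

def encode_user_state(utterance):
    # User-State Encoder: emotion-cause extraction + Transformer encoding.
    return {"utterance": utterance, "cause_span": None}

def plan_strategy(H_t, U_t):
    # Strategy Planning Module: A*-like lookahead over candidate strategies.
    return "Question"

def decode_response(strategy, H_t, U_t):
    # Utterance Decoder: Transformer conditioned on the chosen strategy.
    return f"[{strategy}] Could you tell me more about what happened?"

def multiesc_turn(history, user_utterances):
    H_t = encode_dialogue(history)                         # stage 1
    U_t = [encode_user_state(y) for y in user_utterances]  # stage 2
    s_hat = plan_strategy(H_t, U_t)                        # stage 3
    return s_hat, decode_response(s_hat, H_t, U_t)         # stage 4

s, x = multiesc_turn(["Hi, I lost my job."], ["Hi, I lost my job."])
```

The key design point is that the planner's output $\hat{s}_t$ is an explicit input to the decoder, which is what makes the generation strategy-controllable.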

3. Lookahead Strategy Planning

Inspired by A* search, MultiESC's core decision process computes

$F(s_t) = g(s_t) + \lambda\, h(s_t)$

where $g(s_t)$ is the negative log-probability from a Strategy Sequence Generator (SSG) and $h(s_t)$ is a heuristic estimate of the user feedback expected after executing $s_t$. The hyperparameter $\lambda$ (set to 0.7) controls how strongly the lookahead heuristic biases strategy selection.

The ideal $h(s_t)$ would marginalize over all possible future strategy sequences $s_{>t}$:

$h(s_t) = \mathbb{E}\left[f(s_t, s_{>t}, \mathbf{U}_t) \mid s_t, \mathbf{H}_t, \mathbf{U}_t\right]$

In practice, MultiESC restricts the lookahead to $L$ turns (set to 2), considers the top-$k$ most probable continuations obtained by beam search, and computes $h(s_t)$ as a probability-weighted sum over them:

$h(s_t) \approx \sum_{j=1}^{k} P(\hat{s}_{>t}^{(j)} \mid s_t, \mathbf{H}_t, \mathbf{U}_t) \cdot f(s_t, \hat{s}_{>t}^{(j)}, \mathbf{U}_t)$

The SSG is a masked Transformer decoder with multi-source cross-attention. The User-Feedback Predictor (UFP) encodes candidate strategy sequences and user-state histories via a Transformer and an LSTM, aggregates them with an attention mechanism, and produces the feedback scores used in the heuristic.
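The scoring scheme can be illustrated with toy stand-ins for the learned models. In this sketch the SSG and UFP are replaced by hypothetical scoring functions (a uniform strategy distribution and a hand-written feedback rule), so only the $F = g + \lambda h$ combination and the weighted sum over beam continuations mirror the text.

```python
import itertools
import math

# Illustrative sketch of the A*-like score F(s) = g(s) + lambda * h(s).
# SSG and UFP are toy stand-ins, not the paper's trained models.

STRATEGIES = ["Question", "Reflection", "Suggestion", "Self-disclosure"]
LAM, K, LOOKAHEAD = 0.7, 6, 2  # hyperparameters reported in the text

def ssg_score(s_t):
    # Stand-in for the SSG's score of choosing s_t (toy uniform model).
    return math.log(1.0 / len(STRATEGIES))

def ssg_continuations(s_t, k=K, lookahead=LOOKAHEAD):
    # Stand-in for beam search over length-L future strategy sequences;
    # returns (sequence, probability) pairs for the top-k beams.
    seqs = list(itertools.product(STRATEGIES, repeat=lookahead))
    return [(seq, 1.0 / len(seqs)) for seq in seqs[:k]]

def ufp_feedback(s_t, future_seq):
    # Stand-in for the User-Feedback Predictor's trajectory score.
    return 1.0 if s_t == "Question" else 0.5

def lookahead_F(s_t):
    g = ssg_score(s_t)
    # h: probability-weighted sum of predicted feedback over beam continuations.
    h = sum(p * ufp_feedback(s_t, seq) for seq, p in ssg_continuations(s_t))
    return g + LAM * h

best = max(STRATEGIES, key=lookahead_F)  # -> "Question" under these toy scores
```

Under the toy feedback rule, "Question" dominates because its expected future feedback $h$ is highest while $g$ is identical across strategies, which mirrors how the heuristic term can override immediate-fit scores.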

4. Dynamic User-State Modeling

User state is encoded as follows:

  • An emotion-cause extractor detects the text spans $c_i$ that trigger the expressed emotions.
  • Each token is embedded as $\mathbf{w}_i + \mathbf{p}_i + \mathbf{e}_i$, with $\mathbf{e}_i$ assigned by VAD-lexicon binning.
  • A Transformer encodes this representation, and the resulting $[\texttt{CLS}]$ vector $\mathbf{u}_i$ is treated as the turn-level user state.

The cumulative user state matrix supports both immediate context and long-range emotion consistency. Emotion-cause tracking enables the system to distinguish between surface affect and the underlying drivers of user distress.
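The token embedding scheme above can be sketched as follows. This is a toy illustration: the embedding tables are random, the VAD lookup is a hand-written stand-in for a real lexicon, and dimensions are shrunk from the paper's 768.

```python
import random

# Toy sketch of the w + p + e token embedding (all tables are random stand-ins).
random.seed(0)
D_EMB = 8    # toy dimension; the paper uses d_emb = 768
N_BINS = 5   # hypothetical number of VAD quantization bins

def rand_vec():
    return [random.gauss(0, 1) for _ in range(D_EMB)]

word_emb = {w: rand_vec() for w in ["i", "feel", "lost"]}
pos_emb = [rand_vec() for _ in range(32)]        # one vector per position
emo_emb = [rand_vec() for _ in range(N_BINS)]    # one vector per VAD bin

def vad_bin(token):
    # Stand-in for a VAD-lexicon lookup: quantize toy valence into a bin index.
    toy_valence = {"lost": 0.2}.get(token, 0.5)
    return min(int(toy_valence * N_BINS), N_BINS - 1)

def embed_utterance(tokens):
    # Each token is the elementwise sum of word, positional, and emotion embeddings.
    return [
        [w + p + e for w, p, e in zip(word_emb[t], pos_emb[i], emo_emb[vad_bin(t)])]
        for i, t in enumerate(tokens)
    ]

X = embed_utterance(["i", "feel", "lost"])
# X is a 3 x D_EMB matrix; a Transformer encoder then maps it to the
# turn-level user-state vector u_i via the [CLS] position.
```

Summing the three embeddings (rather than concatenating) keeps the input dimension fixed, the same convention BERT-style encoders use for word, position, and segment embeddings.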

5. Response Generation and Training

The response decoder is a strategy-conditioned Transformer that receives the strategy embedding $\mathbf{E}_s(\hat{s}_t)$ as a prepended token vector. Its architecture is identical to that of the SSG to promote information sharing. Training involves:

  • Joint training of the SSG and decoder, optimizing $\mathcal{L}_{\text{joint}} = \mathcal{L}_s + \mathcal{L}_g$.
  • Separate training of the UFP with the mean squared error loss $\mathcal{L}_f = \|\hat{f} - f^*\|^2$.

Key hyperparameters include $L=2$ (lookahead depth), $k=6$ (beam width), $d_{\text{emb}}=768$ (embedding/hidden dimension), batch size 32, and the AdamW optimizer with learning rate $5\times10^{-5}$.
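The two objectives and the reported hyperparameters can be written down directly. The loss functions below simply restate the formulas above; the configuration dictionary collects the stated values (its key names are assumptions for illustration).

```python
# Restatement of the two training objectives from the text.
def joint_loss(L_s, L_g):
    # L_joint = L_s + L_g: strategy-sequence loss plus generation loss,
    # optimized jointly for the SSG and the response decoder.
    return L_s + L_g

def ufp_loss(f_hat, f_star):
    # L_f = ||f_hat - f*||^2: MSE for the separately trained UFP.
    return (f_hat - f_star) ** 2

# Hyperparameters reported in the text (key names are illustrative).
HPARAMS = {"lookahead_L": 2, "beam_k": 6, "d_emb": 768,
           "batch_size": 32, "optimizer": "AdamW", "lr": 5e-5}
```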

6. Evaluation and Comparative Results

MultiESC is evaluated on the ESConv test set with both automatic metrics and human interaction studies.

Automatic Dialogue Metrics

Model              PPL↓   BLEU-4↑   ROUGE-L↑   METEOR↑   CIDEr↑
BlenderBot-Joint   16.8   1.66      17.94      7.54      18.04
GLHG               15.7   2.13      16.37      –         –
MultiESC           15.4   3.09      20.41      8.84      29.98

MultiESC outperforms baselines (notably BlenderBot-Joint) on all major metrics, especially CIDEr (+11.9 over BlenderBot-Joint).

Strategy Planning and Feedback

Model              Acc↑    W-F1↑   Feedback↑
BlenderBot-Joint   29.9%   29.6    3.05
MISC               31.6%   –       –
MultiESC           42.0%   34.0    3.85

MultiESC achieves a 10.4-percentage-point improvement in top-1 strategy accuracy over MISC and a +0.80 gain in predicted feedback over BlenderBot-Joint.

Human Interactive Evaluation

In 128 role-played dialogues, MultiESC demonstrates higher win rates on all dimensions (fluency, empathy, identification, suggestion, overall effectiveness), with an overall win rate of 58.6% vs. BlenderBot-Joint.

A case study shows that lookahead planning can shift strategy choice away from generic empathy or premature advice toward context-seeking behavior (e.g., selecting the "Question" strategy before issuing advice), which aligns with counseling best practices.

7. Implications and Significance

MultiESC establishes a paradigm for multi-turn emotional support systems that integrates explicit lookahead planning and fine-grained user-state modeling within Transformer-based architectures. The explicit incorporation of A*-like planning heuristics enables more effective, contextually grounded strategy selection, yielding improved dialogue coherence and support efficacy. The emotional-cause user-state encoding offers a mechanism for granular, cause-aware empathetic response, advancing the state of the art in emotionally intelligent dialogue systems.

The technical contributions are broadly applicable to domains requiring long-term dialogue objectives and fine-grained user modeling, including counseling, social chatbots, and assistive technology. MultiESC's empirical results substantiate the claim that long-term planning with user feedback estimation can materially enhance both quantitative and qualitative support effectiveness (Cheng et al., 2022).
