MultiESC: Multi-turn Emotional Support Framework
- MultiESC is a multi-turn emotional support framework that integrates A*-like lookahead planning, dynamic user-state tracking, and strategy-aware response generation.
- It employs a modular architecture with dialogue encoding, emotion-cause extraction, strategy planning, and conditioned response decoding to optimize long-term user well-being.
- Empirical evaluations on the ESConv dataset show MultiESC outperforms baselines, notably increasing CIDEr scores and improving human-assessed dialogue quality.
MultiESC is a framework for multi-turn Emotional Support Conversation (ESC) that aims to maximize user well-being over extended dialogues. It formalizes multi-step strategy planning for emotional support agents by integrating lookahead planning, user-state tracking, and strategy-conditioned response generation within a unified neural architecture (Cheng et al., 2022).
1. Motivation and Overview
Multi-turn ESC requires agents to engage in sustained, context-aware support beyond single-turn empathy exchange. Key technical challenges are (i) selecting appropriate support strategies over a prolonged dialogue horizon to maximize cumulative user relief, and (ii) dynamically modeling evolving user states—capturing shifts in emotion intensity and identifying the underlying causes of distress. MultiESC addresses these by operationalizing a lookahead strategy planner (A*-like search), a fine-grained emotion-cause user-state encoder, and a strategy-aware response decoder. This architecture is designed to optimize for long-term, not just immediate, conversational outcomes.
2. System Architecture
MultiESC is a modular framework, processing each turn in four sequential stages:
- Dialogue Encoder: A Transformer encoder aggregates the most recent tokens from the conversation history into hidden states $\mathbf{H}$.
- User-State Encoder: Each user utterance $u_i$ is processed to extract emotion-cause spans via an external detector (e.g., RECCON). Each token in $u_i$ is mapped to the sum of its word, positional, and emotion (VAD-quantized) embeddings. A second Transformer produces per-round user-state vectors $\mathbf{s}_i$, collected into the cumulative state matrix $\mathbf{S}_t = [\mathbf{s}_1; \dots; \mathbf{s}_t]$.
- Strategy Planning Module: For the current state, MultiESC scores all candidate support strategies by maximizing a composite function that integrates immediate fit and estimated future utility.
- Utterance Decoder: A strategy-conditioned Transformer decoder generates the response $y_t$, informed by $\mathbf{H}$, $\mathbf{S}_t$, and the selected strategy $a_t$.
The architecture enables joint optimization and interoperability between strategy planning, user modeling, and controlled text generation.
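The four-stage loop above can be sketched with each component reduced to a pluggable callable; all class and parameter names here are illustrative stand-ins, not the paper's API:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class TurnOutput:
    strategy: str   # selected support strategy, e.g. "Question"
    response: str   # generated supporter utterance

class MultiESCPipeline:
    """Per-turn pipeline: encode dialogue, encode user state, plan, decode."""
    def __init__(self, dialogue_encoder: Callable, user_state_encoder: Callable,
                 strategy_planner: Callable, utterance_decoder: Callable):
        self.dialogue_encoder = dialogue_encoder
        self.user_state_encoder = user_state_encoder
        self.strategy_planner = strategy_planner
        self.utterance_decoder = utterance_decoder

    def step(self, history: List[str]) -> TurnOutput:
        h = self.dialogue_encoder(history)        # 1. encode recent dialogue context
        s = self.user_state_encoder(history)      # 2. emotion-cause user-state encoding
        a = self.strategy_planner(h, s)           # 3. lookahead strategy selection
        y = self.utterance_decoder(h, s, a)       # 4. strategy-conditioned generation
        return TurnOutput(strategy=a, response=y)
```

Keeping the stages behind separate interfaces is what allows the planner to be trained and evaluated independently of the decoder.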
3. Lookahead Strategy Planning
Inspired by A* search, MultiESC's core decision process scores each candidate strategy $a_t$ as

$$f(a_t) = \log P(a_t \mid \mathbf{H}, \mathbf{S}_t) + \rho \, h(a_t)$$

where $-\log P(a_t \mid \mathbf{H}, \mathbf{S}_t)$ is the negative log-probability assigned by a Strategy Sequence Generator (SSG), and $h(a_t)$ is a heuristic estimate of the expected user feedback after executing $a_t$. The hyperparameter $\rho$ (set to 0.7) modulates how strongly estimated future utility biases the selected trajectory.

The ideal $h(a_t)$ would marginalize over all possible future strategy sequences. In practice, MultiESC restricts the lookahead to $k$ turns (with $k = 2$), considers the top-$B$ most probable continuations found via beam search, and computes $h(a_t)$ as a probability-weighted sum over these candidates:

$$h(a_t) \approx \sum_{A \in \mathcal{B}} P(A \mid a_t)\, F(A)$$

where $\mathcal{B}$ is the beam of candidate continuations and $F(A)$ the predicted feedback score. The SSG is a masked Transformer decoder with multi-source cross-attention. The User-Feedback Predictor (UFP) encodes candidate strategy sequences and user-state histories via a Transformer and an LSTM, aggregates them with an attention mechanism, and produces the feedback scores used for heuristic estimation.
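A minimal sketch of this lookahead scoring, assuming a score of the form $f(a) = \log P(a) + \rho\,h(a)$ with $h$ averaged over the most probable continuations. The three-strategy inventory, exhaustive enumeration (in place of beam search), and all function names are toy stand-ins:

```python
import math
from itertools import product

STRATEGIES = ["Question", "Reflection", "Suggestion"]   # toy strategy inventory

def lookahead_score(first, ssg_logprob, predict_feedback, k=2, top_b=2, rho=0.7):
    """Score a candidate first strategy as f(a) = log P(a) + rho * h(a),
    where h(a) is a probability-weighted average of predicted user feedback
    over the top-B most probable length-k continuations starting with `a`."""
    # Enumerate all length-k sequences beginning with `first`
    # (the real system finds the top-B with beam search instead).
    seqs = [(first,) + rest for rest in product(STRATEGIES, repeat=k - 1)]
    top = sorted(seqs, key=ssg_logprob, reverse=True)[:top_b]
    weights = [math.exp(ssg_logprob(s)) for s in top]
    z = sum(weights)
    h = sum(w / z * predict_feedback(s) for w, s in zip(weights, top))
    return ssg_logprob((first,)) + rho * h

def plan(ssg_logprob, predict_feedback, **kw):
    """Pick the strategy with the highest lookahead score."""
    return max(STRATEGIES,
               key=lambda a: lookahead_score(a, ssg_logprob, predict_feedback, **kw))
```

With a feedback predictor that favors context-seeking openings, the planner prefers "Question" even when the SSG assigns all strategies equal likelihood, which mirrors the behavior described in the case study.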
4. Dynamic User-State Modeling
User state is encoded as follows:
- An emotion-cause extractor detects the text spans that trigger expressed emotions.
- Each token is embedded as the sum of its word, positional, and emotion embeddings, with the emotion embedding index assigned by VAD-lexicon binning.
- A Transformer encodes this representation, and the resulting vector is treated as the turn-level user state.
The cumulative user state matrix supports both immediate context and long-range emotion consistency. Emotion-cause tracking enables the system to distinguish between surface affect and the underlying drivers of user distress.
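The VAD-quantized token embedding can be illustrated as below; the two-word lexicon, bin count, and tiny embedding tables are toy values standing in for the real VAD lexicon and the paper's dimensions:

```python
# Toy VAD lexicon: (valence, arousal, dominance) scores in [0, 1].
VAD = {"sad": (0.23, 0.36, 0.27), "happy": (0.96, 0.73, 0.69)}

def vad_bin(word, n_bins=5):
    """Quantize a word's VAD scores into discrete bins; unknown words fall in a neutral bin."""
    v, a, d = VAD.get(word.lower(), (0.5, 0.5, 0.5))
    return tuple(min(int(x * n_bins), n_bins - 1) for x in (v, a, d))

def token_embedding(word, position, word_emb, pos_emb, emo_emb):
    """Sum of word, positional, and binned-emotion embeddings (elementwise)."""
    w = word_emb[word]
    p = pos_emb[position]
    e = emo_emb[vad_bin(word)]
    return [wi + pi + ei for wi, pi, ei in zip(w, p, e)]
```

Binning continuous VAD scores lets the encoder learn a small table of emotion embeddings instead of regressing on raw lexicon values.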
5. Response Generation and Training
The response decoder is a strategy-conditioned Transformer, receiving strategy embeddings as prepended token vectors. The architecture is identical to that of the SSG to promote information sharing. Training involves:
- Joint training of the SSG and utterance decoder, optimizing the summed negative log-likelihood of the gold strategy sequence and the reference response.
- Separate training of the UFP with a mean squared error loss between predicted and observed user-feedback scores.
Key hyperparameters include the lookahead depth $k$ (set to 2), the beam width $B$ for candidate continuations, and a batch size of 32, with the embedding/hidden dimensions, optimizer, and learning rate as reported in the original paper.
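The two objectives can be sketched numerically as follows; the function names and equal weighting of the two log-likelihood terms are assumptions, and the real losses operate on model logits rather than ready-made probability lists:

```python
import math

def nll(probs, gold_index):
    """Negative log-likelihood of the gold label under a predicted distribution."""
    return -math.log(probs[gold_index])

def joint_loss(strategy_probs, gold_strategy, token_probs, gold_tokens):
    """Joint objective: strategy NLL (SSG) plus mean token NLL (decoder)."""
    l_ssg = nll(strategy_probs, gold_strategy)
    l_dec = sum(nll(p, t) for p, t in zip(token_probs, gold_tokens)) / len(gold_tokens)
    return l_ssg + l_dec

def ufp_loss(predicted_feedback, observed_feedback):
    """Mean squared error objective for the User-Feedback Predictor (single example)."""
    return (predicted_feedback - observed_feedback) ** 2
```

Training the UFP separately keeps the feedback heuristic independent of the generation losses, so it can be supervised directly on observed user ratings.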
6. Evaluation and Comparative Results
MultiESC is evaluated on the ESConv test set with both automatic metrics and human interaction studies.
Automatic Dialogue Metrics
| Model | PPL↓ | BLEU-4↑ | ROUGE-L↑ | METEOR↑ | CIDEr↑ |
|---|---|---|---|---|---|
| BlenderBot-Joint | 16.8 | 1.66 | 17.94 | 7.54 | 18.04 |
| GLHG | 15.7 | 2.13 | 16.37 | – | – |
| MultiESC | 15.4 | 3.09 | 20.41 | 8.84 | 29.98 |
MultiESC outperforms baselines (notably BlenderBot-Joint) on all major metrics, especially CIDEr (+11.9 over BlenderBot-Joint).
Strategy Planning and Feedback
| Model | Acc↑ | W-F1↑ | Feedback↑ |
|---|---|---|---|
| BlenderBot-Joint | 29.9% | 29.6 | 3.05 |
| MISC | 31.6% | – | – |
| MultiESC | 42.0% | 34.0 | 3.85 |
MultiESC achieves a 10.4-percentage-point improvement in top-1 strategy accuracy over MISC (42.0% vs. 31.6%) and improves predicted feedback by +0.80 over BlenderBot-Joint (3.85 vs. 3.05).
Human Interactive Evaluation
In 128 role-played dialogues, MultiESC achieves higher win rates than BlenderBot-Joint on every dimension (fluency, empathy, identification, suggestion, and overall effectiveness), with an overall win rate of 58.6%.
A case study reveals that lookahead planning can bias strategy choice from generic empathy or premature advice to more context-seeking behaviors (e.g., selecting "Question" strategy before issuing advice), which aligns with counseling best practices.
7. Implications and Significance
MultiESC establishes a paradigm for multi-turn emotional support systems that integrates explicit lookahead planning and fine-grained user-state modeling within Transformer-based architectures. The explicit incorporation of A*-like planning heuristics enables more effective, contextually grounded strategy selection, yielding improved dialogue coherence and support efficacy. The emotional-cause user-state encoding offers a mechanism for granular, cause-aware empathetic response, advancing the state of the art in emotionally intelligent dialogue systems.
The technical contributions are broadly applicable to domains requiring long-term dialogue objectives and fine-grained user modeling, including counseling, social chatbots, and assistive technology. MultiESC's empirical results substantiate the claim that long-term planning with user feedback estimation can materially enhance both quantitative and qualitative support effectiveness (Cheng et al., 2022).