Discourse Diversity in Multi-Turn Empathic Dialogue

Published 13 Apr 2026 in cs.CL and cs.AI | (2604.11742v1)

Abstract: LLMs produce responses rated as highly empathic in single-turn settings (Ayers et al., 2023; Lee et al., 2024), yet they are also known to be formulaic generators that reuse the same lexical patterns, syntactic templates, and discourse structures across tasks (Jiang et al., 2025; Shaib et al., 2024; Namuduri et al., 2025). Less attention has been paid to whether this formulaicity extends to the level of discourse moves, i.e., what a response does for the person it is addressing. This question is especially consequential for empathic dialogue, where effective support demands not just a kind response at one moment but varied strategies as a conversation unfolds (Stiles et al., 1998). Indeed, prior work shows that LLMs reuse the same tactic sequences more than human supporters in single-turn settings (Gueorguieva et al., 2026). We extend this analysis to multi-turn conversations and find that the rigidity compounds: once a tactic appears in a supporter turn, LLMs reuse it in the next at nearly double the rate of humans (0.50-0.56 vs. 0.27). This pattern holds across LLMs serving as supporters in real emotional support conversations, and is invisible to standard similarity metrics. To address this gap, we introduce MINT (Multi-turn Inter-tactic Novelty Training), the first reinforcement learning framework to optimize discourse move diversity across multi-turn empathic dialogue. The best MINT variant combines an empathy quality reward with a cross-turn tactic novelty signal, improving aggregate empathy by 25.3% over vanilla across 1.7B and 4B models while reducing cross-turn discourse move repetition by 26.3% on the 4B model, surpassing all baselines including quality-only and token-level diversity methods on both measures. These results suggest that what current models lack is not empathy itself, but the ability to vary their discourse moves across a conversation.


Authors (6)

Summary

The paper demonstrates that LLMs reuse discourse tactics nearly twice as often as humans, leading to formulaic and less engaging support.
The paper introduces MINT, a reinforcement learning framework that optimizes cross-turn tactic diversity using KL divergence and entropy measures.
Experimental results reveal a 26.3% drop in tactic stickiness and a 25.3% empathy quality boost, closely aligning LLM behavior with human conversational dynamics.

Discourse Diversity in Multi-Turn Empathic Dialogue: Technical Summary

Problem Framing and Motivation

LLM responses are rated as highly empathic in single-turn empathic dialogue settings. Yet LLMs are also formulaic generators: they reuse not only lexical and syntactic constructions but also discourse moves (categories such as advice, validation, questioning) at substantially higher rates than human supporters. Empathic support in multi-turn dialogue, however, demands more than isolated warmth; it requires strategic variation and responsive adaptation to the seeker’s evolving needs over the course of the conversation. Previous work analyzed tactic repetition only in single-turn settings; this study extends the scope to full multi-turn human–LLM dialogues using a psychology-grounded taxonomy of empathy tactics.

Quantitative analysis reveals that, in actual emotional support conversations, LLMs repeat their previous turn’s tactics with a probability of 0.50–0.56—nearly double that of humans (0.27)—a disparity invisible to standard surface-form and semantic similarity metrics. This rigidity undermines quality of interaction and impacts user re-engagement, as confirmed by negative correlations between tactic stickiness and user satisfaction in held-out SENSE-7 data. These findings highlight a critical deficiency: procedural empathy in LLMs lacks cross-turn strategic diversity, which is integral for credible, sustained emotional support.

Figure 1: As the seeker’s needs evolve, vanilla LLMs recycle a narrow tactic set, whereas MINT adapts its discourse moves across turns.

Empathy Tactic Taxonomy and Analysis

The foundation for empirical investigation is a granular, codified taxonomy of ten empathy tactics derived from psychological literature (validation, advice, questioning, information, paraphrasing, reappraisal, empowerment, self-disclosure, emotional expression, assistance). Tactics are annotated at the sentence level by human experts, with inter-annotator F1 averaging 0.80, and then used to fine-tune Llama3.1-8B adapters for large-scale automatic tactic detection.

Applying these taggers to two large public multi-turn support corpora (WildChat, SENSE-7) and a human–human evaluation benchmark (Lend-an-Ear), the study examines both tactic prevalence (use within a turn) and stickiness (reuse from turn to turn). LLMs show marked over-reliance on advice and information (up to 89% and 80% of turns, respectively), underuse questioning (as low as 25% vs. 42% for humans), and display the highest stickiness for the most overused tactics, creating a self-reinforcing loop of repetitive support moves.
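The two measures above can be illustrated with a minimal sketch. It assumes each supporter turn has already been reduced to the set of tactics it contains, and it pools stickiness over all tactics; the paper's exact definition (for example, per-tactic rates or a different pooling) may differ. The function names and the toy conversation are hypothetical.

```python
from collections import defaultdict


def prevalence(turns):
    """Fraction of supporter turns in which each tactic appears at least once."""
    counts = defaultdict(int)
    for tactics in turns:
        for tactic in set(tactics):
            counts[tactic] += 1
    return {tactic: count / len(turns) for tactic, count in counts.items()}


def stickiness(turns):
    """Pooled estimate of P(tactic in turn t+1 | tactic in turn t).

    Assumption: every tactic present in a turn counts as one reuse
    opportunity for the next supporter turn.
    """
    reused = 0
    opportunities = 0
    for prev, curr in zip(turns, turns[1:]):
        prev_set, curr_set = set(prev), set(curr)
        opportunities += len(prev_set)
        reused += len(prev_set & curr_set)
    return reused / opportunities if opportunities else 0.0


# Hypothetical three-turn supporter trajectory (hand-labelled for illustration).
conversation = [
    {"validation", "advice"},
    {"advice", "information"},
    {"advice", "questioning"},
]

advice_prevalence = prevalence(conversation)["advice"]
pooled_stickiness = stickiness(conversation)
print(advice_prevalence, pooled_stickiness)
```

In this toy trajectory, advice appears in every turn and is the only tactic carried over between consecutive turns, which is the kind of overuse-plus-reuse pattern the analysis describes.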

Figure 2: Left: Tactic prevalence per turn—the overuse of advice/information and underuse of questioning by LLMs. Right: Tactic stickiness—LLMs are much more likely than humans to repeat a tactic used in the previous turn.

Lexical overlap (bigram, BLEU-2) and semantic similarity (BERTScore) between consecutive turns do not distinguish LLMs from humans in this dimension, underscoring that tactic-level analysis is required for diagnosing and mitigating formulaicity at the discourse function level.
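To see why surface overlap can miss tactic repetition, consider the following sketch. It uses Jaccard overlap of word bigrams as a stand-in for the lexical measures; the paper's exact bigram, BLEU-2, and BERTScore settings are not given here, so this choice is an assumption, and the two example turns and their tactic labels are invented for illustration.

```python
def bigrams(text):
    """Set of lowercase whitespace-token bigrams in a text."""
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))


def bigram_jaccard(a, b):
    """Jaccard overlap between the bigram sets of two texts."""
    ba, bb = bigrams(a), bigrams(b)
    union = ba | bb
    return len(ba & bb) / len(union) if union else 0.0


# Two hypothetical consecutive supporter turns, both hand-labelled as "advice".
turn_1 = "You could try writing your worries down before bed."
turn_2 = "Maybe schedule a short walk each morning to clear your head."

overlap = bigram_jaccard(turn_1, turn_2)
print(overlap)
```

Both turns perform the same discourse move, yet they share no bigrams, so a lexical metric would score them as unrelated while a tactic-level measure would register a repetition.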

MINT: Multi-turn Inter-tactic Novelty Training

To address excessive cross-turn tactic repetition, the authors propose MINT (Multi-turn Inter-tactic Novelty Training), a reinforcement learning (RL) framework explicitly optimizing for diversity of discourse move trajectories over dialogue history. The MINT reward augments traditional empathy quality signals with a quantitatively defined cross-turn tactic diversity objective. Specifically, for policy $\pi_\theta$:

The tactic profile of each candidate response is scored by the KL divergence $D_{\mathrm{KL}}(Q_t \parallel P_{t-1})$ between the current and previous turn's tactic distributions (cross-turn novelty), complemented optionally with the entropy $H(Q_t)$ of the current tactic distribution (within-turn breadth).
The joint reward is $R = Q + \lambda [\gamma_\mathrm{KL} D_\mathrm{KL} + \gamma_\mathrm{ent} H]$, where $Q$ denotes the empathy quality reward (not to be confused with the tactic distribution $Q_t$), with all terms min-max normalized within rollout.
Policy optimization is performed using Group Relative Policy Optimization (GRPO), with a rigorous constraint on message format and length to avoid gaming the reward via verbosity.

Figure 3: Overview of MINT. Step 1: Tactic-tagged multi-turn context; Step 2: Policy generates a new response, tagger labels sentences; Step 3: KL divergence and entropy form a quality-weighted diversity reward for RL fine-tuning.
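The reward described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the additive smoothing constant `eps`, the reading of "within rollout" as normalization across the candidate group, the default weights, and all function names are assumptions.

```python
import math

TACTICS = [
    "validation", "advice", "questioning", "information", "paraphrasing",
    "reappraisal", "empowerment", "self-disclosure", "emotional expression",
    "assistance",
]


def tactic_distribution(sentence_labels, eps=1e-3):
    """Smoothed distribution over tactics from per-sentence labels.

    eps is an assumed smoothing constant that keeps the KL term finite.
    """
    counts = {t: eps for t in TACTICS}
    for label in sentence_labels:
        counts[label] += 1.0
    total = sum(counts.values())
    return [counts[t] / total for t in TACTICS]


def kl_divergence(q, p):
    """D_KL(q || p) for two strictly positive distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))


def entropy(q):
    """Shannon entropy H(q) in nats."""
    return -sum(qi * math.log(qi) for qi in q)


def min_max(values):
    """Min-max normalize a list to [0, 1]; constant lists map to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def mint_rewards(quality, candidates, previous, lam=1.0, g_kl=1.0, g_ent=0.0):
    """R = Q + lam * (g_kl * D_KL + g_ent * H), each term normalized over the group."""
    p_prev = tactic_distribution(previous)
    dists = [tactic_distribution(c) for c in candidates]
    q_n = min_max(quality)
    kl_n = min_max([kl_divergence(q, p_prev) for q in dists])
    ent_n = min_max([entropy(q) for q in dists])
    return [q + lam * (g_kl * k + g_ent * h) for q, k, h in zip(q_n, kl_n, ent_n)]


# Hypothetical example: the previous supporter turn used advice and information.
previous_turn = ["advice", "information"]
candidate_turns = [
    ["advice", "information"],      # repeats the previous tactics
    ["validation", "questioning"],  # switches to new tactics
]
rewards = mint_rewards([4.0, 4.0], candidate_turns, previous_turn)
print(rewards)
```

With equal quality scores, the candidate that changes tactics receives the higher reward, which is the behavior the cross-turn novelty term is meant to encourage. Setting `g_ent` above zero would additionally favor responses that spread across more tactics within a turn.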

Experimental Results: Efficacy of Cross-Turn Diversity

Rigorous ablation experiments using Qwen3-1.7B and Qwen3-4B LLMs compare MINT against:

Standard prompt enhancements (system prompting with tactic definitions and usage history).
Inference-time diversity methods (Verbalized Sampling).
Quality-only RL (PsychoCounsel reward).
Token-level entropy-based diversity RL (R1-Zero-Div).

Evaluations use two orthogonal axes: (1) multi-dimensional empathy quality assessed turn-wise via Lend-an-Ear (with a gpt-oss-120b judge, $\kappa_w = 0.58$ agreement with the expert median), and (2) tactic stickiness (the rate at which a tactic used in one supporter turn recurs in the next, compared against the human reference).
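For readers unfamiliar with $\kappa_w$, the sketch below computes a weighted Cohen's kappa between judge scores and expert-median scores on an ordinal scale. The source does not state whether linear or quadratic weights were used, so the weighting scheme is an assumption, as are the function name and the toy ratings.

```python
def weighted_kappa(a, b, categories, weights="linear"):
    """Weighted Cohen's kappa for two raters on an ordered category scale.

    Assumes at least two categories, listed in rank order.
    """
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(a)

    # Observed joint proportions of (rater a, rater b) category pairs.
    observed = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[index[x]][index[y]] += 1.0 / n

    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        d = abs(i - j) / (k - 1)
        return d if weights == "linear" else d * d

    num = sum(disagreement(i, j) * observed[i][j]
              for i in range(k) for j in range(k))
    den = sum(disagreement(i, j) * row[i] * col[j]
              for i in range(k) for j in range(k))
    return 1.0 - num / den if den else 1.0


# Hypothetical 1-5 ratings from an LLM judge and the expert median.
scale = [1, 2, 3, 4, 5]
judge = [1, 2, 3, 4, 5]
expert_median = [1, 2, 3, 4, 4]

kappa = weighted_kappa(judge, expert_median, scale)
print(round(kappa, 3))
```

Unlike unweighted kappa, the weighted form penalizes a one-point disagreement less than a four-point one, which suits ordinal empathy ratings.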

Empathic quality improves substantially with MINT: aggregate empathy rises by 25.3% over Vanilla on average across the 1.7B and 4B models, reaching up to 4.67 on a 5-point scale for 4B models. Critically, MINT also reduces tactic stickiness by 26.3% on the 4B model, a result that neither quality-only nor token-level diversity RL achieves. This indicates that diversity must be promoted at the level of discourse function, not just token sequence. MINT yields fewer tactics per turn than token-entropy baselines but much greater cross-turn adaptation, closely matching human tactical dynamics. One illustrative outcome is a decrease of up to 44.8% in unsolicited advice and a near doubling of elicitation tactics, such as encouraging elaboration, compared to prompt baselines.

Qualitative analysis further suggests that MINT-generated supporter turns avoid templated reassurance and hollow advice in favor of contextually novel, targeted moves, guiding conversational flow more adaptively, as human expert supporters do.

Implications and Future Directions

This work establishes that LLMs’ current deficits in multi-turn emotional support stem from repetitive discourse maneuvering, not from a lack of surface empathy. The MINT framework opens the door to integrating measurable, learnable cross-turn discourse diversity objectives in RL-based model tuning.

Theoretically, attention to discourse function diversity moves evaluation and optimization beyond surface-level similarity measures, calling for function-level benchmarking in all dialogue-based LLM assessment protocols. Practically, strategic diversification of discourse tactics enables LLMs to sustain human engagement and credibility in support scenarios, likely generalizing to non-empathic open-domain conversation and other sequential decision-making tasks with long-horizon interactional structure.

Extensions may include joint optimization of diversity with personalized user-adaptive policies, live deployment with real user feedback, and refinement or expansion of tactic taxonomies for richer human–AI conversational interplay. The technical stack (publicly released data, code, and high-fidelity taggers) forms a replicable pipeline for cross-lab benchmarking.

Conclusion

Discourse move diversity is both measurable and optimizable at the level of RL training objectives. The MINT framework demonstrates that anchoring diversity at the tactic function level, and not merely surface form, enables LLMs to more convincingly emulate the adaptive, responsive strategies of skilled human supporters across the arc of a dialogue. These findings suggest that future conversational AI research should treat discourse function diversity as a core metric for effective, sustained interaction.
