
TEA-Dialog: Tool-Enhanced ESC Dataset

Updated 2 February 2026
  • TEA-Dialog is a dataset of tool-enhanced emotional support dialogues that combines affective empathy with fact-grounded instrumental support.
  • It is constructed through rigorous filtering, zero-hallucination criteria, and human validation to ensure high process-level quality.
  • Experimental findings reveal that while in-domain performance improves significantly, tool integration effectiveness is highly dependent on model capacity and struggles with out-of-domain generalization.

TEA-Dialog is a dataset of tool-enhanced emotional support conversation (ESC) dialogues, released in the context of the TEA-Bench framework for evaluating LLMs augmented with external tool use in emotionally supportive, multi-turn dialogues. It is designed to address critical limitations of previous ESC datasets, particularly the absence of instrumental, fact-grounded support and the inability to systematically study model generalization or the capacity-dependence of tool integration (Sui et al., 26 Jan 2026).

1. Motivation and Role within TEA-Bench

TEA-Dialog was introduced to enable rigorous, process-level assessment of LLM-based dialogue agents that provide both affective empathy and instrumental, factually-grounded support using external APIs. Legacy ESC datasets and benchmarks focus nearly exclusively on verbal empathy and text-only settings, which leads to generic, hallucinated, or untrustworthy guidance. TEA-Bench, as the first interactive benchmark in this area, requires multi-turn conversations involving both affective and instrumentally-supported responses, realistic user simulation, and scenario-grounded tool interaction. TEA-Dialog operationalizes this vision by supplying curated, high-quality examples for supervised fine-tuning and evaluation (Sui et al., 26 Jan 2026).

2. Construction Methodology

2.1 Data Selection Pipeline

TEA-Dialog consists of 365 dialogues selected from TEA-Bench episodes—complete, tool-augmented ESC conversations generated in the benchmark's MCP-style (Model Context Protocol) tool environment. The construction process comprises three stages (a condensed code sketch follows the list):

  • Initial filtering for high process-level quality: Only dialogues with TEA Score ≥ 80 (on a 0–100 scale) are retained.
  • Strict factuality criterion: All candidate dialogues must exhibit zero automatic hallucination, as determined by the benchmark’s hallucination detection protocol.
  • Human validation: Authors further filter for coherence and consistency, ensuring both affective and instrumental support are contextually appropriate.
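
A minimal sketch of this selection logic, assuming each candidate episode carries a precomputed TEA Score, a hallucination count from the automatic detector, and a human-validation flag; the field and function names are hypothetical, as the paper does not specify a storage format:

```python
from dataclasses import dataclass

@dataclass
class Episode:
    """Hypothetical record for one TEA-Bench episode (field names assumed)."""
    dialogue_id: str
    tea_score: float               # process-level TEA Score, 0-100 scale
    hallucinated_responses: int    # responses flagged by the hallucination detector
    human_approved: bool = False   # set during the manual validation pass

def select_for_tea_dialog(episodes: list[Episode]) -> list[Episode]:
    """Apply the three filters described above, in order."""
    return [
        ep for ep in episodes
        if ep.tea_score >= 80               # quality filter: TEA Score >= 80
        and ep.hallucinated_responses == 0  # strict factuality: zero hallucination
        and ep.human_approved               # final human validation pass
    ]
```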

2.2 Dialogue and Episode Attributes

  • 365 total dialogues: 320 action-oriented (user seeks practical suggestions), 45 emotion-oriented (user wants emotional validation first).
  • 3,400 total utterances (≈ 9.32 turns per dialogue); average model utterance length of ~37.9 tokens.
  • 423 tool calls in total, an average of ≈ 1.16 tool calls per dialogue (the per-dialogue averages follow from the totals, as checked below).
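
A quick consistency check on the reported corpus totals:

```python
# Reported totals from the dataset statistics above.
n_dialogues, n_utterances, n_tool_calls = 365, 3_400, 423

print(round(n_utterances / n_dialogues, 2))  # 9.32 turns per dialogue
print(round(n_tool_calls / n_dialogues, 2))  # 1.16 tool calls per dialogue
```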

2.3 Scenario and Context Diversity

Dialogues span the 81 realistic scenarios of TEA-Bench, constructed from a four-stage pipeline: (i) filtering ExTES scenarios for information-rich contexts, (ii) extracting latent attributes (user location, local time, place type) via LLM inference, (iii) grounding context using real-world APIs (e.g., OpenStreetMap), and (iv) human validation of plausibility.
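
The paper names OpenStreetMap as one of the grounding APIs in stage (iii); the specific endpoint (Nominatim) and function below are illustrative assumptions, not the authors' implementation:

```python
import requests

def ground_location(place_query: str) -> dict | None:
    """Stage (iii) sketch: resolve a scenario's place description to real
    coordinates via OpenStreetMap's public Nominatim geocoder."""
    resp = requests.get(
        "https://nominatim.openstreetmap.org/search",
        params={"q": place_query, "format": "json", "limit": 1},
        headers={"User-Agent": "tea-dialog-grounding-sketch"},  # Nominatim's policy requires a UA
        timeout=10,
    )
    resp.raise_for_status()
    hits = resp.json()
    if not hits:
        return None  # place cannot be grounded; the scenario would be revised or dropped
    return {"lat": float(hits[0]["lat"]), "lon": float(hits[0]["lon"])}
```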

3. Tool-Augmented Dialogue Format and API Usage

Each TEA-Dialog conversation is embedded in an MCP-style environment. Dialogue agents access 31 tool APIs across seven categories (utils, map, weather, news, Reddit, Wikipedia, music). Tools are invoked via native function calls, enabling retrieval of contextually relevant external information to augment agent responses with concrete, factually-grounded suggestions or knowledge. Agents see the full interaction history and tool outputs; user simulators receive only the agent’s natural-language responses.
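
The paper does not reproduce the tool schemas themselves; the declaration below is a hypothetical example of what one of the 31 APIs might look like in the JSON-schema "function calling" format accepted by most chat-completion endpoints (the tool name and parameters are assumptions):

```python
# Hypothetical declaration for a tool in the "weather" category. The actual
# tool names, categories, and parameters in TEA-Bench are not reproduced here.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Current conditions at the user's grounded location.",
        "parameters": {
            "type": "object",
            "properties": {
                "latitude": {"type": "number"},
                "longitude": {"type": "number"},
            },
            "required": ["latitude", "longitude"],
        },
    },
}
```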

4. Process-Level Metrics and Data Splits

TEA-Dialog is labeled with the same process-level metrics as TEA-Bench, including:

  • Diversity, fluency, human-likeness, information quality, and effectiveness (jointly forming the TEA Score; see the illustrative sketch after this list).
  • Factuality and hallucination rates, computed per response and aggregated at the dialogue level.
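
An illustrative aggregation only: the five sub-metrics jointly form the TEA Score (0–100), but the paper's exact weighting is not reproduced here; an unweighted mean over equally scaled sub-scores is assumed for the sketch:

```python
def tea_score(diversity: float, fluency: float, human_likeness: float,
              information_quality: float, effectiveness: float) -> float:
    """Assumed aggregation: unweighted mean of the five sub-metrics,
    each already on a 0-100 scale."""
    subscores = [diversity, fluency, human_likeness,
                 information_quality, effectiveness]
    return sum(subscores) / len(subscores)
```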

The dataset is split to facilitate rigorous fine-tuning and generalization studies:

  • In-Domain (ID): dialogues sampled from the 60 scenarios seen during training.
  • Out-Of-Domain (OOD): dialogues from the 21 held-out scenarios, never observed during training (a partition sketch follows).
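
A minimal sketch of the ID/OOD partition over TEA-Bench's 81 scenarios; the scenario identifiers are hypothetical, and the paper's actual assignment of scenarios to splits is not shown:

```python
# 81 scenarios total: 60 in-domain, 21 held out for OOD evaluation.
SCENARIOS = [f"scenario_{i:02d}" for i in range(81)]  # hypothetical IDs
ID_SCENARIOS = set(SCENARIOS[:60])    # seen during fine-tuning
OOD_SCENARIOS = set(SCENARIOS[60:])   # 21 held-out scenarios

def split_of(scenario_id: str) -> str:
    return "ID" if scenario_id in ID_SCENARIOS else "OOD"
```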

5. Experimental Usage and Key Findings

TEA-Dialog enables targeted supervised fine-tuning (SFT) of LLMs for tool-augmented emotional support. Experiments reported in (Sui et al., 26 Jan 2026) demonstrate:

  • SFT on TEA-Dialog yields significant in-domain improvements (TEA Score +1.7–3.6 points over the base models) but poor out-of-domain generalization (OOD TEA Score drops by 2–3 points, with increased hallucination rates).
  • Quality gains from tool-enabled support are highly capacity-dependent: strong LLMs use tools selectively for maximal grounding, while weaker models frequently misuse or underuse tools, sometimes harming dialogue quality.
  • Action-oriented user simulations show the strongest information-quality and effectiveness gains from tool augmentation; emotion-oriented cases show mixed or negative effects for low-capacity models, highlighting the need for context-sensitive tool integration.

6. Significance and Research Impact

TEA-Dialog addresses a longstanding gap in ESC research: building dialogue agents that combine empathy with reliable, fact-grounded guidance. It provides a standardized testbed for SFT and algorithmic research on tool use, model generalization, and capacity dependence.

The primary research conclusions are:

  • Instrumental support from external tools is essential to gain user trust in high-stakes emotional support dialogue.
  • Model capacity critically governs the effective integration of tool calls—only sufficiently strong models acquire the selective reasoning needed to deploy tools for high-value, minimal-intrusion support.
  • Naïve fine-tuning on high-quality, tool-augmented dialogues alone will generally improve performance on familiar (in-domain) settings but fails to generalize or may worsen hallucination under domain shift.

7. Prospects and Future Directions

Based on the findings obtained with TEA-Dialog, future research directions include:

  • Development of more robust SFT and RLHF strategies for tool-augmented ESC, targeting distributionally robust generalization.
  • Longitudinal evaluation with real user populations to measure the trust-building impact of instrumental, tool-enabled support over extended conversations.
  • Expansion of scenario diversity and tool suite coverage to better stress-test multi-domain generalization and reduce hallucination rates in OOD settings.

TEA-Dialog thus constitutes a foundational resource for the design, training, and evaluation of next-generation emotional support dialogue agents with robust, trustworthy tool integration (Sui et al., 26 Jan 2026).
