RealCBT: Real-world CBT Dataset Benchmark

Updated 3 July 2026
  • RealCBT is a curated dataset capturing genuine CBT sessions with detailed emotional trajectories and clinical dialogue features.
  • It employs rigorous data collection from public video platforms with transcription, manual correction, and LLM-based annotation for clinical accuracy.
  • Comparative analyses show that RealCBT exhibits higher emotional variability and a more authentic dialogue structure than synthetic LLM-generated sessions.

RealCBT denotes a curated dataset of real-world Cognitive Behavioral Therapy (CBT) counseling dialogues, introduced as an empirical benchmark for the analysis of affective, structural, and processual properties of genuine therapeutic interactions. As reported in the literature, RealCBT enables research on the emotional, strategic, and clinical fidelity contrasts between authentic human-human CBT sessions and synthetic, LLM-generated dialogues, addressing critical gaps in current mental health NLP benchmarks and model evaluation paradigms (Wang et al., 28 Aug 2025).

1. Origin, Motivation, and Collection Methodology

RealCBT was constructed from public video-sharing platforms (YouTube, Vimeo). The search targeted videos explicitly labeled as CBT-based counseling sessions, using queries such as "CBT counseling," "CBT case study," "CBT role play," and "CBT session." Inclusion criteria required exactly two participants (counselor and client), minimal non-dialogic interference, a clear CBT-consistent focus (e.g., depression, anxiety, behavioral change), and a minimum duration of three minutes.
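The inclusion criteria above can be read as a simple screening predicate. The following is a minimal sketch, not the authors' pipeline: the `CandidateVideo` record, its field names, and the example entries are hypothetical, and the CBT-focus and interference judgments are assumed to come from a human screener.

```python
from dataclasses import dataclass


@dataclass
class CandidateVideo:
    """Hypothetical record for one search hit (all field names are assumptions)."""
    title: str
    num_speakers: int
    duration_minutes: float
    cbt_focused: bool         # screener judged the session CBT-consistent
    heavy_interference: bool  # narration, overlays, or other non-dialogue content


def meets_inclusion_criteria(video: CandidateVideo) -> bool:
    """Apply the four criteria stated in the text."""
    return (
        video.num_speakers == 2            # exactly counselor and client
        and not video.heavy_interference   # minimal non-dialogic interference
        and video.cbt_focused              # clear CBT-consistent focus
        and video.duration_minutes >= 3.0  # minimum duration of three minutes
    )


# Made-up candidates, for illustration only.
candidates = [
    CandidateVideo("CBT role play: social anxiety", 2, 14.5, True, False),
    CandidateVideo("CBT lecture with panel", 4, 52.0, True, False),
    CandidateVideo("CBT session clip", 2, 1.8, True, False),
]
selected = [v.title for v in candidates if meets_inclusion_criteria(v)]
print(selected)
```

Only the first made-up candidate passes: the second has more than two speakers and the third is shorter than three minutes.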

Preprocessing involved converting videos to standard formats, excising non-dialogue content, and transcribing speech with Transkriptor, followed by manual transcript correction. Metadata annotation used majority voting over the outputs of three LLMs (ChatGPT-4o Mini, Grok-v3, Gemini 2.0 Flash); manual inspection of 30% of the dialogues found 100% agreement with the automated labels. This protocol yielded a dataset intended to represent authentic clinical interaction. No IRB approval or explicit licensing is reported, and the collection relies solely on the public availability of the source content (Wang et al., 28 Aug 2025).
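The majority-voted annotation step can be illustrated with a short sketch. This is an assumption-laden illustration rather than the published procedure: the function names, the label strings, and the choice to return `None` (for manual review) when no label wins a strict majority are all hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(votes: Sequence[str]) -> Optional[str]:
    """Return the label chosen by a strict majority of annotators, else None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    # With three annotators, a strict majority means at least two agree.
    return label if count > len(votes) / 2 else None


def agreement_rate(automated: Sequence[str], manual: Sequence[str]) -> float:
    """Fraction of manually inspected items whose automated label matches."""
    matches = sum(a == m for a, m in zip(automated, manual))
    return matches / len(manual)


# Hypothetical votes from three LLM annotators for one dialogue.
print(majority_label(["anxiety", "anxiety", "relational"]))
# No strict majority: a case like this would need a human decision.
print(majority_label(["anxiety", "esteem", "relational"]))
```

Under this sketch, the reported 100% agreement would correspond to `agreement_rate` returning 1.0 on the manually inspected 30% sample.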

2. Dataset Characteristics and Composition

The RealCBT dataset comprises 76 full-length real CBT dialogues. Distribution of client presenting problems is heavily imbalanced: anxiety/fear (32.89%), self-esteem/confidence (25.00%), and relational issues (18.42%) predominate; other concerns are less frequently represented. The gender and affective composition are similarly skewed, with female clients comprising 84.2% and positive client attitudes 90.8% of the sample. Average session length for the top problems subset (58 dialogues) is 15.64 minutes (1,872.38 tokens/session), facilitating high-resolution emotional and conversational analyses.

Dataset | # Dialogues | % Female Clients | % Positive Attitude | Main Problem Classes
RealCBT | 76 | 84.2 | 90.8 | Anxiety (32.9%), Esteem (25%), Relational (18.4%)

This demographic and topical bias, arising from public video curation, constrains external validity and interpretability for broader clinical populations (Wang et al., 28 Aug 2025).
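As a consistency check on the composition figures, the reported percentages can be reproduced from whole-number counts out of 76. The counts below are inferred from the percentages and are not stated in the source; the tokens-per-minute figure is likewise derived here, not reported.

```python
TOTAL_DIALOGUES = 76

# Counts inferred from the reported percentages (assumption, not source data).
inferred_problem_counts = {
    "anxiety/fear": 25,
    "self-esteem/confidence": 19,
    "relational issues": 14,
}

problem_shares = {
    name: round(100 * count / TOTAL_DIALOGUES, 2)
    for name, count in inferred_problem_counts.items()
}

# The three inferred counts sum to the size of the reported 58-dialogue subset.
top_subset_size = sum(inferred_problem_counts.values())

# Inferred counts behind the gender and attitude percentages.
female_share = round(100 * 64 / TOTAL_DIALOGUES, 1)
positive_share = round(100 * 69 / TOTAL_DIALOGUES, 1)

# Derived rate from the reported averages: tokens per session / minutes per session.
tokens_per_minute = 1872.38 / 15.64

print(problem_shares, top_subset_size, female_share, positive_share)
print(round(tokens_per_minute, 1))
```

The inferred counts reproduce the reported shares (32.89%, 25.00%, 18.42%, 84.2%, 90.8%), and the reported averages imply roughly 120 tokens per minute of session time.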

3. Benchmark Purpose and Analytical Use Cases

RealCBT was created to provide an empirical reference for contrasting real and synthetic CBT counseling on emotionally and clinically central properties. Its major applications include:

  • Analysis of emotional arc dynamics—valence, arousal, and dominance trajectories within and across sessions.
  • Testing the fidelity of LLM-generated dialogues (e.g., from CACTUS) relative to real therapy in affective and semantic expressivity.
  • Validation of metrics and models meant to capture session-level coherence, variability, and subtle clinical phenomena.

The dataset supports annotations and numerical analyses at the dialogue, counselor, and client levels, enabling granular comparison of emotion-laden language, affective variability, reactivity, regulation, and arc similarity (Wang et al., 28 Aug 2025).
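One common way to obtain an emotional arc from text is to score emotion-bearing words with a lexicon and average the scores over a rolling window. The sketch below illustrates that idea only; the tiny valence lexicon, the window size, and the example utterances are invented for illustration and are not taken from RealCBT or from the paper's configuration.

```python
import re

# Toy valence lexicon on a 0-1 scale (hypothetical values, illustration only).
TOY_VALENCE = {
    "hopeless": 0.1,
    "worried": 0.2,
    "okay": 0.6,
    "calm": 0.8,
    "proud": 0.9,
}


def emotion_arc(utterances, lexicon, window=3):
    """Rolling mean of lexicon scores over the speaker's emotion words, in order."""
    scores = [
        lexicon[word]
        for utterance in utterances
        for word in re.findall(r"[a-z']+", utterance.lower())
        if word in lexicon
    ]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


# Invented client turns, ordered as they would occur in a session.
client_turns = [
    "I feel hopeless and worried",
    "Maybe it is okay",
    "I am calm now, even proud",
]
arc = emotion_arc(client_turns, TOY_VALENCE)
print([round(v, 3) for v in arc])
```

The same function can be run separately on counselor and client turns, and with arousal or dominance lexicons, to obtain per-speaker, per-dimension arcs of the kind the text describes.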

4. Findings from Comparative Analyses

Systematic analysis using the Utterance Emotion Dynamics (UED) framework on RealCBT versus synthetic CACTUS dialogues shows:

  • RealCBT dialogues exhibit significantly greater emotional variability (especially arousal SD: p ≪ 0.01, effect size ≈ 0.84).
  • RealCBT speakers (particularly counselors) employ more emotion words and present more authentic patterns of emotional reactivity and regulation. For instance, real clients and counselors both show longer average displacement in arousal and dominance metrics.
  • Quantitative arc similarity (Spearman correlation) between real and synthetic speakers is negligible (≈0.01–0.06), with the greatest divergence for client affective trajectories.
  • Synthetic sessions, while structurally and fluently coherent, consistently underrepresent the range, richness, and dynamism of authentic therapeutic emotional arcs, and often exhibit muted reactivity and inauthentic regulation patterns.

These findings underscore the inadequacy of evaluating LLM-based counseling agents on metrics sensitive only to surface-level fluency or scenario completion, rather than affective realism and clinical process fidelity (Wang et al., 28 Aug 2025).
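Two of the quantities mentioned above, emotional variability as a standard deviation and arc similarity as a Spearman correlation, can be computed as in the following standard-library sketch. The two arcs are made-up numbers chosen for illustration, not RealCBT or CACTUS data, and the helper names are hypothetical.

```python
from statistics import mean, pstdev


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Illustrative arousal arcs (invented values).
real_arc = [0.30, 0.55, 0.20, 0.70, 0.45, 0.80]
synthetic_arc = [0.50, 0.52, 0.51, 0.53, 0.52, 0.54]

real_variability = pstdev(real_arc)            # population SD of the arc
synthetic_variability = pstdev(synthetic_arc)
arc_similarity = spearman(real_arc, synthetic_arc)
print(real_variability > synthetic_variability, round(arc_similarity, 3))
```

In this toy case the first arc has the larger standard deviation, mirroring the direction of the reported variability result; the similarity value itself is an artifact of the invented numbers and says nothing about the reported 0.01 to 0.06 range.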

5. Significance in Model Evaluation and Development

RealCBT’s empirical grounding positions it as a high-impact resource for the evaluation of therapeutic dialogue systems:

  • It exposes deficiencies of synthetic dialogue corpora, especially regarding emotional trajectory realism and entity density (Qin et al., 3 Jun 2026).
  • It enables the construction and calibration of new metric frameworks, such as resistance-aware evaluation schemes and session-level reward models, by serving as a "gold standard" for contextual, emotionally dynamic interaction.
  • Recent benchmarks (e.g., PRMB (Zhou et al., 12 Mar 2026)) use "RealCBT-style" alignment as a reference for defining long-horizon, process-aware, and harm-detection capabilities, moving beyond one-turn or pairwise judgment paradigms.

In broader therapeutic AI research, RealCBT is also cited as essential for model validation on emotional sensitivity, regulation authenticity, and scenario-specific clinical reasoning (Wang et al., 28 Aug 2025, Qin et al., 3 Jun 2026).

6. Limitations and Future Directions

The central limitations of RealCBT reflect its collection constraints: small corpus size (76 dialogues), a skew toward female clients and positive client attitudes, narrow topical diversity, and a lack of granular demographic or session attribute controls. Availability is currently limited to a stated release, with no licensing details reported.

Future expansions recommended in the literature include:

  • Collecting larger, demographically and clinically diverse real counseling corpora.
  • Ensuring more balanced representation across therapy themes, attitudes, and speaker identities.
  • Developing methodologies for aligning synthetic dialogue systems with the affective and processual distributional properties of RealCBT, especially with respect to session-level emotional variability and clinical grounding (Wang et al., 28 Aug 2025).

7. Role in Benchmarking and Clinical Model Alignment

RealCBT serves as an essential empirical anchor for progressing from artificial, overly compliant, or emotionally flat client models toward strategic, resistance-aware, and emotionally nuanced evaluation paradigms. For instance, recent frameworks such as CARS and STREAMS ground their simulation and evaluation on the contrast between the high entity density, resistance episodes, and affective arcs seen in RealCBT and the often insufficiently challenging synthetic scenarios (Qin et al., 3 Jun 2026). Its use as a benchmark underscores the need for process-oriented, emotion-aware, and harm-sensitive model assessment in computational psychotherapy and therapeutic NLP.