THERAPYJUDGEBENCH: CBT Judge Calibration

Updated 4 July 2026

THERAPYJUDGEBENCH is a clinician-annotated benchmark that evaluates therapy chatbots using complete CBT session dialogues and clinical fidelity measures.
It provides a structured dialogue bank of 116 synthetic CBT sessions with 1,270 expert ratings to support judge calibration and model selection.
The benchmark supports reinforcement learning finetuning by validating LLM-based evaluators against licensed clinician judgments on fidelity and safety before those evaluators are used as reward signals.

THERAPYJUDGEBENCH is a clinician-annotated audit and calibration benchmark for therapy-session judges introduced within the TherapyGym framework for evaluating and improving therapy chatbots along the clinical pillars of fidelity and safety. Its stated role is not to benchmark therapist policies directly, but to validate and calibrate LLM-based evaluators in multi-turn CBT settings against licensed clinician judgment before those evaluators are used as automated rewards or auditing tools (Huang et al., 23 Feb 2026). In that sense, THERAPYJUDGEBENCH is best understood as infrastructure for judge evaluation: it supplies dialogue-level human reference data, supports prompt and model selection for TherapyJudge, and determines which dimensions are reliable enough to optimize downstream (Huang et al., 23 Feb 2026).

1. Position within TherapyGym

Within TherapyGym, THERAPYJUDGEBENCH occupies the left-side “judge benchmark panel” in the workflow that connects human evaluation, automated judging, and RL finetuning (Huang et al., 23 Feb 2026). The framework is organized around three components: THERAPYJUDGEBENCH as the benchmark substrate, TherapyJudge as the LLM evaluator validated against that substrate, and an RL finetuning stage in which a therapist model is optimized using the validated judge (Huang et al., 23 Feb 2026). This positioning makes THERAPYJUDGEBENCH foundational rather than auxiliary: it is the mechanism by which judge bias, unreliability, and miscalibration are audited before automated evaluation is allowed to influence model training (Huang et al., 23 Feb 2026).

The underlying motivation is that therapy is a high-stakes, multi-turn, processual domain, and generic chat metrics or unconstrained preference judgments do not capture clinically relevant properties of psychotherapy (Huang et al., 23 Feb 2026). The benchmark therefore addresses a specific methodological problem: whether an LLM judge can recover clinician rankings on therapy fidelity and therapy-specific safety well enough to be usable as a noisy shaping reward, while remaining subordinate to blinded clinician evaluation for final assessment (Huang et al., 23 Feb 2026).

A useful implication is that THERAPYJUDGEBENCH formalizes judge validation as a prerequisite for scalable therapy-agent development. This differs from settings in which an LLM evaluator is inserted directly into a pipeline without an explicit human-anchored calibration stage. Related work on therapy-oriented judging reinforces the importance of such calibration: ESC-Judge evaluates transcript-level emotional-support quality through pairwise, rubric-anchored comparisons grounded in Clara Hill’s Exploration–Insight–Action model (Madani et al., 18 May 2025), while FAITH-M and CARE operationalize utterance-level therapeutic-principle judgments on expert-supervised ordinal labels (Mazhar et al., 7 Apr 2026). THERAPYJUDGEBENCH is narrower in modality, but more explicit about using human agreement statistics to authorize limited downstream use of the judge (Huang et al., 23 Feb 2026).

2. Dialogue bank, generation process, and annotation targets

The benchmark contains 116 dialogues and, as reported in the abstract, 1,270 expert ratings (Huang et al., 23 Feb 2026). Each dialogue is a complete CBT-style session of 10 turns (5 patient turns and 5 therapist turns). The authors chose this horizon to balance validity and tractability: it is long enough to expose session-level CBT micro-skills while keeping generation, expert annotation, and RL practical (Huang et al., 23 Feb 2026).

The dialogues are synthetic but clinically anchored. They are generated through interaction between a simulated patient built using Patient-Ψ / Patient-Ψ-CM and a therapist LLM sampled from a diverse pool including GPT-o3-mini, Gemini 2.0 Flash, Claude 3.7 Sonnet, DeepSeek R1, PHI 3.5, Llama-4-Scout, and Qwen3-4B-Instruct (Huang et al., 23 Feb 2026). In this implementation, the simulator model is GPT-o3-mini (Huang et al., 23 Feb 2026). The authors report a profile-matching validity check over 40 dialogues with 10 candidate profiles in which a human annotator achieved 100% top-1 accuracy, which they use as evidence that the simulations preserve profile-specific patient characteristics (Huang et al., 23 Feb 2026).

The patient simulator is based on CBT cognitive models with constructs such as core beliefs, automatic thoughts, emotions, and behaviors (Huang et al., 23 Feb 2026). This is important because THERAPYJUDGEBENCH evaluates judges on complete sessions rather than isolated utterances. The judged object is therefore a short therapy trajectory in which fidelity and safety must be inferred from therapist behavior over multiple exchanges (Huang et al., 23 Feb 2026).

The benchmark uses dialogue-level labels rather than turn-level labels because the CTRS is inherently a session-level instrument (Huang et al., 23 Feb 2026). Figure 1 in the source paper is described as showing a 10-turn dialogue paired with dialogue-level annotations from both human and LLM raters, reinforcing that the unit of supervision is the complete session (Huang et al., 23 Feb 2026). This distinguishes THERAPYJUDGEBENCH from turn-level therapy benchmarks such as FAITH-M, which labels individual therapist utterances along six therapeutic principles with ordinal values in $\{-2,-1,0,+1,+2\}$ (Mazhar et al., 7 Apr 2026), and from trust-trajectory benchmarks such as MENTAL-TRUST, where the annotation unit is the patient utterance conditioned on preceding context (Srivastava et al., 6 Jan 2025).

3. Clinical dimensions: CTRS fidelity and therapy-specific safety

THERAPYJUDGEBENCH validates automated evaluation over two clinical pillars: fidelity and safety (Huang et al., 23 Feb 2026). Fidelity is operationalized using the official Cognitive Therapy Rating Scale (CTRS) from the Beck Institute, and each dialogue is scored on 11 CBT skill dimensions on the standard 0–6 scale, where $0$ denotes poor or absent performance, $3$ denotes satisfactory performance, and $6$ denotes excellent or skillful and consistent performance, with odd-numbered values allowed as intermediate scores (Huang et al., 23 Feb 2026).

The 11 CTRS dimensions are Agenda, Feedback, Understanding, Interpersonal Effectiveness, Collaboration, Pacing and Efficient Use of Time, Guided Discovery, Focusing on Key Cognitions or Behaviors, Strategy for Change, Application of CBT Techniques, and Homework (Huang et al., 23 Feb 2026). The benchmark thus treats fidelity as combining both adherence to CBT components and competence in how those components are delivered (Huang et al., 23 Feb 2026).

Safety is captured with four binary session-level labels: provide medical opinion/medication, fail to address crisis and imminent risk, fail to address abuse, and fail to address functional impairment (Huang et al., 23 Feb 2026). These are therapy-specific safety categories derived from the American Mental Health Counselors Association Code of Ethics and prior work on harmful failure modes in LLM mental-health responses, with input from a clinical collaborator who is a CBT specialist (Huang et al., 23 Feb 2026). Safety in THERAPYJUDGEBENCH is therefore not framed as generic toxicity. It is framed as clinically meaningful omission and overreach errors in psychotherapy-like interaction (Huang et al., 23 Feb 2026).

The benchmark’s dimension design can be summarized as follows:

Component	Annotation target	Scale
CTRS fidelity	11 CBT skill dimensions	$0$–$6$
Safety	4 therapy-specific risk categories	binary present/absent
Granularity	complete 10-turn dialogue	dialogue-level
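A minimal sketch of how one dialogue-level annotation could be represented, assuming a plain Python record. The class name `SessionAnnotation`, its field names, and the validation logic are hypothetical and are not taken from the paper's released data format; only the dimension names, the 0 to 6 range, and the four safety categories come from the description above.

```python
from dataclasses import dataclass

# The 11 CTRS dimensions and 4 safety categories named in the text.
CTRS_DIMENSIONS = (
    "Agenda", "Feedback", "Understanding", "Interpersonal Effectiveness",
    "Collaboration", "Pacing and Efficient Use of Time", "Guided Discovery",
    "Focusing on Key Cognitions or Behaviors", "Strategy for Change",
    "Application of CBT Techniques", "Homework",
)
SAFETY_LABELS = (
    "provide medical opinion/medication",
    "fail to address crisis and imminent risk",
    "fail to address abuse",
    "fail to address functional impairment",
)


@dataclass
class SessionAnnotation:
    """One rater's labels for one complete 10-turn dialogue (hypothetical schema)."""
    dialogue_id: str
    rater_id: str
    ctrs: dict    # dimension name -> integer score in 0..6
    safety: dict  # safety label -> True if the violation is present

    def __post_init__(self):
        if set(self.ctrs) != set(CTRS_DIMENSIONS):
            raise ValueError("ctrs must contain exactly the 11 CTRS dimensions")
        if set(self.safety) != set(SAFETY_LABELS):
            raise ValueError("safety must contain exactly the 4 safety labels")
        for name, score in self.ctrs.items():
            if not isinstance(score, int) or not 0 <= score <= 6:
                raise ValueError(f"{name}: CTRS score must be an integer in 0..6")


# Illustrative record with made-up scores.
example = SessionAnnotation(
    dialogue_id="d001",
    rater_id="clinician_1",
    ctrs={name: 3 for name in CTRS_DIMENSIONS},
    safety={label: False for label in SAFETY_LABELS},
)
```

Using the same record type for clinician and LLM raters mirrors the benchmark's design choice of rating both on identical dimensions, which is what makes direct agreement analysis possible.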

This configuration places THERAPYJUDGEBENCH in a distinct part of the therapy-evaluation landscape. It is broader than single-axis safety rubrics such as psychosis-specific binary criteria for stigmatization, delusion validation, embellishment, or referral failure (Reese et al., 20 Mar 2026), but narrower than seven-dimension response judges such as TheraJudge, which scores Guidance, Informativeness, Relevance, Safety, Empathy, Helpfulness, and Understanding on a 1–5 scale (Rahman et al., 29 Jun 2026). Its main specificity is CBT fidelity at session level (Huang et al., 23 Feb 2026).

4. Expert annotation protocol and reliability

All dialogues were annotated by two licensed CBT-trained practitioners using a custom web-based platform (Huang et al., 23 Feb 2026). Beyond the description “licensed CBT-trained practitioners,” the paper does not provide a more detailed credential breakdown such as years of practice, discipline, or board status (Huang et al., 23 Feb 2026). The same CTRS and safety dimensions are used for both human and LLM rating, enabling direct agreement analysis (Huang et al., 23 Feb 2026).

The benchmark’s reliability analysis is based on duplicate annotation of 20% of the dataset (Huang et al., 23 Feb 2026). For CTRS item scores, the reported human–human interrater reliability metrics are Spearman’s $\rho$, Pearson’s $r$, and Krippendorff’s $\alpha$ (ordinal), with explicit emphasis on rank-order consistency because the judge is later used as a noisy shaping reward rather than as an absolute clinical replacement (Huang et al., 23 Feb 2026). Across the 11 CTRS skills, the paper reports an average Krippendorff’s $\alpha$ of 0.52, along with the median and range of $\alpha$ and the average Spearman’s $\rho$ and Pearson’s $r$ (Huang et al., 23 Feb 2026).
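As an illustration of the two correlation metrics named above, the following is a minimal pure-Python sketch of Pearson's $r$ and Spearman's $\rho$ for two raters scoring the same dialogues on one CTRS item. The function names and the toy scores are assumptions for illustration, not the paper's code or data. Krippendorff's $\alpha$ is omitted here because it needs a more involved ordinal distance computation.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes both inputs have nonzero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy scores (made up): rater B is stricter but orders the dialogues identically.
rater_a = [0, 2, 3, 4, 6]
rater_b = [0, 1, 1.5, 2, 5]
print(round(spearman(rater_a, rater_b), 3), round(pearson(rater_a, rater_b), 3))
```

In the toy data the two raters disagree on absolute scores yet rank the dialogues identically, so Spearman's $\rho$ is 1 while Pearson's $r$ is lower. This is the property that makes rank correlation a natural fit when the judge is used as a shaping reward rather than as an absolute score.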

The per-skill reliability profile is heterogeneous. Feedback and Pacing and Efficient Use of Time show comparatively strong agreement, whereas Guided Discovery and Application of CBT Techniques are notably weak (Huang et al., 23 Feb 2026). The paper then excludes the two CTRS dimensions with correlations or agreements below 0.4, namely Guided Discovery and Application of CBT Techniques, to improve reliability and reward learnability (Huang et al., 23 Feb 2026). This is one of the benchmark’s most operationally important design decisions: THERAPYJUDGEBENCH is not merely descriptive, but selective about which human labels are trusted enough to support automated optimization (Huang et al., 23 Feb 2026).
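The exclusion rule described above amounts to a threshold filter over per-dimension agreement. A minimal sketch follows, assuming a single agreement statistic per dimension. The function name `retained_dimensions` and the placeholder values are hypothetical, and the keys are deliberately generic so that the numbers are not mistaken for the paper's per-skill figures.

```python
def retained_dimensions(reliability, threshold=0.4):
    """Split dimensions into those kept for reward learning and those excluded.

    Dimensions with agreement below the threshold are excluded, matching the
    "below 0.4" rule described in the text.
    """
    kept = [name for name, value in reliability.items() if value >= threshold]
    dropped = [name for name, value in reliability.items() if value < threshold]
    return kept, dropped


# Placeholder agreement values, chosen only to exercise the function.
example_reliability = {
    "skill_a": 0.71,
    "skill_b": 0.45,
    "skill_c": 0.31,
    "skill_d": 0.12,
}
kept, dropped = retained_dimensions(example_reliability)
print("kept:", kept)
print("dropped:", dropped)
```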

The paper compares these reliability figures to the original CTRS literature, citing a reliability coefficient of 0.59 for human CTRS ratings in the original study, and argues that the observed reliability is broadly comparable (Huang et al., 23 Feb 2026). A plausible implication is that THERAPYJUDGEBENCH is designed less as a source of immutable gold labels than as a calibrated supervision substrate whose trustworthy dimensions are explicitly filtered.

5. Human–LLM judge alignment and calibration of TherapyJudge

THERAPYJUDGEBENCH is used to evaluate candidate LLM judges against clinician annotations on CTRS item scores (Huang et al., 23 Feb 2026). The benchmark compares three judge models—Claude 3.7, DeepSeek R1, and o3-mini—under two prompt regimes: zero-shot rubric-only and ICL skill usage example, where the latter includes skill definitions and illustrative examples of the CTRS skills (Huang et al., 23 Feb 2026). A few-shot prompt using example dialogues paired with human ratings was also tried, but performed substantially worse and was dropped from the main analysis (Huang et al., 23 Feb 2026).

The primary metric for human–LLM alignment is Spearman’s $\rho$, because the intended use case values preference alignment and rank-order recovery more than exact numeric concordance (Huang et al., 23 Feb 2026). The best reported configuration is Claude 3.7 with ICL, which attains the highest average Spearman’s $\rho$ across the retained CTRS skills (Huang et al., 23 Feb 2026). Claude 3.7 zero-shot scores 0.51, DeepSeek R1 zero-shot 0.48, DeepSeek R1 ICL 0.52, o3-mini zero-shot 0.44, and o3-mini ICL 0.44 (Huang et al., 23 Feb 2026). Few-shot prompting is much worse, with Claude 3.7 at 0.24 and o3-mini at 0.22 (Huang et al., 23 Feb 2026).

Dimension-level alignment is uneven. For Claude 3.7 ICL, per-dimension values include Agenda 0.30, Feedback 0.52, Understanding 0.55, Interpersonal 0.52, Collaboration 0.67, Pacing 0.65, Focus 0.53, Strategy 0.67, and Homework 0.59 (Huang et al., 23 Feb 2026). This unevenness is important because it shows that “average agreement” conceals specific weaknesses, especially on dimensions already known to be difficult for humans (Huang et al., 23 Feb 2026).
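As a quick sanity check on how an average can mask a weak dimension, the sketch below takes the nine per-dimension values listed above for Claude 3.7 ICL and computes their unweighted mean and minimum. The unweighted mean is an assumption about aggregation, and the result is not claimed to reproduce the paper's reported average.

```python
# Per-dimension Spearman values for Claude 3.7 ICL, as listed in the text.
claude_icl_rho = {
    "Agenda": 0.30,
    "Feedback": 0.52,
    "Understanding": 0.55,
    "Interpersonal": 0.52,
    "Collaboration": 0.67,
    "Pacing": 0.65,
    "Focus": 0.53,
    "Strategy": 0.67,
    "Homework": 0.59,
}

# Unweighted mean over the nine retained dimensions (aggregation is assumed).
mean_rho = sum(claude_icl_rho.values()) / len(claude_icl_rho)

# The dimension with the lowest human-LLM rank agreement.
weakest = min(claude_icl_rho, key=claude_icl_rho.get)

print(f"mean = {mean_rho:.3f}, weakest = {weakest} ({claude_icl_rho[weakest]:.2f})")
```

The mean of these nine values is about 0.56, while Agenda sits at 0.30, roughly half of that.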

Safety alignment is summarized more coarsely: TherapyJudge achieves 99% accuracy relative to expert annotations on the four safety labels (Huang et al., 23 Feb 2026). The paper does not provide per-category precision, recall, F1, AUROC, or a confusion matrix for safety, and does not specify whether the 99% statistic is macro-averaged, micro-averaged, or simple overall label accuracy (Huang et al., 23 Feb 2026). This makes the safety validation encouraging but under-specified.
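To see why a single accuracy figure leaves the safety validation under-specified, consider a hedged toy example. The data, the short label keys, and the function names below are invented for illustration and do not reflect the benchmark's actual label distribution. When violations are rare, a judge that never flags anything can still score very high overall accuracy while having zero recall on the category that matters.

```python
# Short keys standing in for the four safety categories (naming is illustrative).
LABELS = ("medical_opinion", "crisis", "abuse", "functional_impairment")


def micro_accuracy(gold, pred):
    """Fraction of all (session, label) decisions on which judge and expert agree."""
    total = correct = 0
    for g, p in zip(gold, pred):
        for label in LABELS:
            total += 1
            correct += int(g[label] == p[label])
    return correct / total


def recall(gold, pred, label):
    """Share of expert-flagged violations the judge also flags; None if none exist."""
    flagged = [p[label] for g, p in zip(gold, pred) if g[label]]
    if not flagged:
        return None
    return sum(flagged) / len(flagged)


# Toy data: 100 sessions, only two of which contain an (expert-labeled) violation.
gold = [{label: False for label in LABELS} for _ in range(100)]
gold[0]["abuse"] = True
gold[1]["abuse"] = True

# A degenerate judge that never reports any violation.
pred = [{label: False for label in LABELS} for _ in range(100)]

print(micro_accuracy(gold, pred))   # 398 of 400 decisions agree
print(recall(gold, pred, "abuse"))  # both true violations are missed
```

Per-category recall, or a confusion matrix, would distinguish this degenerate judge from a useful one, which overall accuracy alone cannot do.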

The benchmark is also used as a prompt audit. The reported conclusion is that ICL with skill definitions and examples improves alignment for Claude and DeepSeek, while few-shot exemplar prompting degrades it, possibly due to prompt dilution or context-length effects (Huang et al., 23 Feb 2026). THERAPYJUDGEBENCH therefore functions not only as a model benchmark, but as a prompt-design benchmark for therapy judges (Huang et al., 23 Feb 2026).

This emphasis on judge calibration aligns with broader concerns in therapy-oriented evaluation. CounselBench shows that off-the-shelf LLM judges overrate counseling responses, miss safety failures, and can invert human rankings in single-turn mental-health counseling (Li et al., 10 Jun 2025). A psychometric analysis of general LLM judges argues more broadly that a judge should be treated as a measurement instrument, with explicit profiling of dark current, positional false preference, stable cross-sensitivity, target sensitivity, and criterion effects before downstream claims are made (Usami et al., 14 Jun 2026). THERAPYJUDGEBENCH does not adopt that metrological framework, but its calibration logic is compatible with it (Huang et al., 23 Feb 2026).

6. Downstream use, reward construction, and limitations

After model and prompt selection, the chosen judge is frozen as TherapyJudge and used as a reward model inside TherapyGym RL training (Huang et al., 23 Feb 2026). For retained skills $k \in \mathcal{K}$, raw session-level CTRS scores $s_k \in \{0, \dots, 6\}$ are normalized as

$$\tilde{s}_k = \frac{s_k}{6}$$

The dialogue reward is then defined as

$$R = \frac{1}{|\mathcal{K}|} \sum_{k \in \mathcal{K}} w_k \, \tilde{s}_k \; - \; \sum_{j=1}^{4} \lambda_j \, v_j$$

where $|\mathcal{K}| = 9$ after excluding two unreliable CTRS dimensions, $w_k$ are optional skill weights, $\lambda_j$ are penalty coefficients for the four safety labels, and $v_j \in \{0, 1\}$ indicates presence of safety violation $j$ (Huang et al., 23 Feb 2026). In RL, this scalar session reward is standardized within groups of trajectories generated for the same patient profile and used within a GRPO objective (Huang et al., 23 Feb 2026).
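The reward construction and group standardization described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: scores are assumed to be normalized by dividing by the scale maximum of 6, the fidelity term is assumed to be a weighted mean over the nine retained skills, the default weights and penalty coefficients of 1.0 are placeholders, and all function names are hypothetical.

```python
import math

# Nine retained CTRS skills (the 11 dimensions minus the two excluded ones).
RETAINED_SKILLS = (
    "Agenda", "Feedback", "Understanding", "Interpersonal Effectiveness",
    "Collaboration", "Pacing and Efficient Use of Time",
    "Focusing on Key Cognitions or Behaviors", "Strategy for Change", "Homework",
)
SAFETY_LABELS = (
    "provide medical opinion/medication",
    "fail to address crisis and imminent risk",
    "fail to address abuse",
    "fail to address functional impairment",
)


def normalize(raw_score, max_score=6):
    """Map a raw 0..6 CTRS score to [0, 1] (assumed normalization)."""
    return raw_score / max_score


def dialogue_reward(ctrs, violations, skill_weights=None, penalties=None):
    """Weighted mean of normalized skill scores minus safety penalties (assumed form)."""
    weights = skill_weights or {k: 1.0 for k in RETAINED_SKILLS}
    lambdas = penalties or {j: 1.0 for j in SAFETY_LABELS}
    fidelity = sum(weights[k] * normalize(ctrs[k]) for k in RETAINED_SKILLS)
    fidelity /= len(RETAINED_SKILLS)
    penalty = sum(lambdas[j] * int(violations[j]) for j in SAFETY_LABELS)
    return fidelity - penalty


def standardize_group(rewards, eps=1e-8):
    """Standardize rewards within one group of trajectories (same patient profile)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Two toy trajectories for the same patient profile: identical fidelity,
# but the second one triggers a single safety violation.
scores = {k: 3 for k in RETAINED_SKILLS}
safe = {j: False for j in SAFETY_LABELS}
unsafe = dict(safe, **{"fail to address abuse": True})

group = [dialogue_reward(scores, safe), dialogue_reward(scores, unsafe)]
print(group, standardize_group(group))
```

Standardizing within a group means the policy is rewarded relative to other rollouts for the same simulated patient, so profile difficulty does not dominate the learning signal.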

The benchmark thereby directly influences what is optimized. Using TherapyJudge as the RL reward, the paper reports substantial downstream gains for Qwen3-4B: under human judgment, average CTRS skill score rises from 0.10 to 0.60 and average safety violation rate falls from 0.38 to 0.20; under LLM judgment, average CTRS rises from 0.16 to 0.59 and safety violation rate falls from 0.38 to 0.13 (Huang et al., 23 Feb 2026). A safety-penalty ablation shows that training without the safety penalty yields CBT score 0.53 and safety violations 0.43, versus CBT 0.59 and safety 0.13 when the penalty is included (Huang et al., 23 Feb 2026). This suggests that the benchmark’s safety labels are not merely diagnostic; they materially shape optimized policy behavior.
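The reported before and after figures can be turned into absolute and relative changes with simple arithmetic. The sketch below uses only the numbers quoted above. The helper name `change` is hypothetical, and treating the ablation figures as directly comparable to the LLM-judged results is an assumption.

```python
def change(before, after):
    """Return (absolute change, relative change) between two reported values."""
    absolute = after - before
    relative = absolute / before
    return absolute, relative


# (before, after) pairs for Qwen3-4B, as quoted in the text.
results = {
    "human-judged CTRS": (0.10, 0.60),
    "human-judged safety violations": (0.38, 0.20),
    "LLM-judged CTRS": (0.16, 0.59),
    "LLM-judged safety violations": (0.38, 0.13),
}

for name, (before, after) in results.items():
    absolute, relative = change(before, after)
    print(f"{name}: {absolute:+.2f} absolute, {relative:+.0%} relative")

# Safety-penalty ablation: violation rate without vs. with the penalty term.
ablation_gap = 0.43 - 0.13
print(f"ablation gap in violation rate: {ablation_gap:.2f}")
```

On these figures the safety violation rate falls by roughly 47% in relative terms under human judgment and by roughly 66% under LLM judgment, and removing the penalty leaves the violation rate 0.30 higher than training with it.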

Several misconceptions are addressed implicitly by the benchmark design. THERAPYJUDGEBENCH is not a leaderboard for therapist policies, not a general psychotherapy benchmark, and not a clinician replacement (Huang et al., 23 Feb 2026). The paper explicitly states that the selected judge should be interpreted as a noisy shaping reward, while blinded clinician ratings remain necessary for final evaluation (Huang et al., 23 Feb 2026). It is also CBT-specific: fidelity is measured only through CTRS, and the conclusion explicitly notes future expansion to approaches such as ACT or DBT (Huang et al., 23 Feb 2026).

Its limitations are substantial and explicitly acknowledged. The dialogues are synthetic rather than real patient–therapist sessions, although the simulator is validated for profile specificity (Huang et al., 23 Feb 2026). The benchmark is relatively small at 116 dialogues (Huang et al., 23 Feb 2026). Clinician agreement is moderate rather than high, and two dimensions are dropped from optimization because of low reliability (Huang et al., 23 Feb 2026). Safety validation is summarized only by a top-line 99% accuracy without detailed error analysis (Huang et al., 23 Feb 2026). The paper does not define a benchmark train/dev/test split for THERAPYJUDGEBENCH itself, specifying only that 20% was double-annotated for interrater reliability and the remaining dialogues were singly annotated (Huang et al., 23 Feb 2026).

A broader implication is that THERAPYJUDGEBENCH is best treated as a judge-auditing benchmark for multi-turn CBT sessions, not as a sufficient basis for claims about real-world therapeutic efficacy. This interpretation is reinforced by related work. PAIR-SAFE shows that clinically grounded judges can improve runtime support quality in motivational interviewing-style dialogue, but also that rubric optimization can over-prioritize linguistic smoothing at the expense of change-talk cultivation (Kim et al., 19 Jan 2026). TheraJudge and TheraAgent show that therapeutic evaluators can achieve strong clinician agreement on seven mental-health response dimensions and drive targeted repair of low-quality outputs, but also reveal the importance of dimension-level validation and threshold-sensitive analysis (Rahman et al., 29 Jun 2026). THERAPYJUDGEBENCH occupies the CBT session-level end of this emerging judge-evaluation spectrum (Huang et al., 23 Feb 2026).