
MentraBench: Evaluating Mental-Health LLMs

Updated 4 July 2026
  • MentraBench is a domain-specific benchmark that assesses mental-health reasoning by evaluating appraisal, diagnosis, intervention, abstraction, and verification in LLMs.
  • It integrates six tasks and 13 datasets, employing metrics like micro-F1, macro-F1, Jaccard similarity, and recall to ensure robust, process-sensitive evaluations.
  • The benchmark supports post-training optimization for models like Mindora by emphasizing concise, coherent, and clinically aligned multi-step reasoning.

MentraBench is a benchmark within the MentraSuite framework for evaluating LLMs on mental-health reasoning and assessment, with emphasis on step-wise, clinically aligned reasoning rather than outcome accuracy alone. It is designed to test whether models can perform appraisal, diagnosis, intervention selection, abstraction, and verification in mental-health settings, while also maintaining concise, coherent, non-hallucinated, and internally consistent reasoning trajectories (Xiao et al., 10 Dec 2025). The benchmark spans six tasks and 13 datasets, combining classification, question answering, summarization, and misinformation detection, and it serves both as an evaluation suite and as a source of training signals for Mindora, the post-trained model introduced alongside it (Xiao et al., 10 Dec 2025).

1. Definition and scope

MentraBench is the benchmark component of MentraSuite, a unified framework introduced to advance reliable mental-health reasoning in LLMs (Xiao et al., 10 Dec 2025). The benchmark is motivated by the claim that existing psychological LLM evaluations often emphasize emotional understanding, knowledge recall, or narrowly defined reasoning problems, while underemphasizing the clinically aligned reasoning chain required for appraisal, diagnosis, intervention planning, abstraction, and verification (Xiao et al., 10 Dec 2025).

Within this formulation, MentraBench is not limited to measuring whether a model returns a correct final answer. It also evaluates whether the model’s reasoning process is concise, logically coherent, free of hallucinated content, aligned with the task instruction, and internally consistent (Xiao et al., 10 Dec 2025). This makes the benchmark explicitly process-sensitive in a way that differs from purely label-based evaluations.

The benchmark is organized around five core reasoning aspects and six tasks. The five aspects named in the source are appraisal, diagnosis, intervention, abstraction, and verification, while the task list also includes a distinct multi-step clinical reasoning component instantiated as psychiatry question answering (Xiao et al., 10 Dec 2025). This suggests that the benchmark’s conceptual structure is clinically grounded, whereas its task structure separates multi-step reasoning as an operational evaluation category.

A potential source of confusion is the similarity between the names MentraBench and metabench. The latter, introduced in a separate work, is a sparse benchmark for reasoning and knowledge in general-purpose LLM evaluation, distilled from ARC, GSM8K, HellaSwag, MMLU, TruthfulQA, and WinoGrande (Kipnis et al., 2024). MentraBench, by contrast, is domain-specific and centered on mental-health reasoning, counseling, psychiatry QA, evidence abstraction, and misinformation detection (Xiao et al., 10 Dec 2025).

2. Task organization and dataset composition

MentraBench includes six tasks aligned with the benchmark’s clinical reasoning goals (Xiao et al., 10 Dec 2025). These are Cognitive Error Identification, Mental Health Condition Detection, Counseling Strategy Formulation, Psychiatry QA, Psychiatry Systematic Review Summarization, and Mental Health Misinformation Identification (Xiao et al., 10 Dec 2025). Each task is associated with one or more datasets drawn from existing sources, processed corpora, or newly curated material.

The benchmark spans 13 datasets in the paper’s accounting, although the detailed enumeration includes 14 numbered entries because MentalMisinfo is described as “counted as one dataset in table” (Xiao et al., 10 Dec 2025). The datasets cover synthetic negative thoughts, therapist-client dialogues, Reddit and Twitter posts, motivational interviewing transcripts, psychiatry subsets of medical QA resources, Cochrane psychiatric systematic review abstracts, and transcripts of mental-health-related videos from video-sharing platforms (Xiao et al., 10 Dec 2025).

| Task block | Datasets named in the paper | Metric |
| --- | --- | --- |
| Appraisal / Cognitive Error Identification | CognitiveReframing, PatternReframe, Therapist QA | Micro-F1 |
| Diagnosis / Mental Health Condition Detection | DepSign, SWMH, T-SID | Micro-F1 |
| Intervention / Counseling Strategy Formulation | PsyDTCorpusM_M, AnnoMIM_M | Jaccard similarity |
| Multi-step Clinical Reasoning / Psychiatry QA | MHQA, MedQAM_M, MedMCQAM_M, PubMedQAM_M | Micro-F1 |
| Abstraction / Psychiatry Systematic Review Summarization | PSRS* | Recall |
| Verification / Mental-Health Misinformation Detection | MentalMisinfo | Macro-F1 |

The appraisal component focuses on cognitive distortion classification. CognitiveReframing is based on synthetic negative thoughts derived from the Thought Records Dataset and web self-reports from Mental Health America’s screening site; PatternReframe uses synthetic statements built from PERSONA-CHAT personas; Therapist QA uses real therapist-client dialogues (Xiao et al., 10 Dec 2025). These datasets differ in realism and label taxonomies, but the prompts are aligned to each dataset’s original definitions (Xiao et al., 10 Dec 2025).

The diagnosis component uses weakly supervised social-media datasets: DepSign from Reddit, SWMH from Reddit, and T-SID from Twitter (Xiao et al., 10 Dec 2025). These tasks require the model to infer the presence or type of mental-health condition from noisy, real-world text, including suicidality in some cases (Xiao et al., 10 Dec 2025).

The intervention component centers on counseling strategy formulation. PsyDTCorpusM_M is based on PsyDTCorpus and is used only for training, not evaluation, because of its partially synthetic nature (Xiao et al., 10 Dec 2025). AnnoMIM_M is derived from real motivational interviewing demonstration videos and is used for evaluation only; an expert counselor with more than 10 years of experience validates the constructed items (Xiao et al., 10 Dec 2025).

The multi-step clinical reasoning component aggregates psychiatry-focused question answering from MHQA and psychiatry subsets of MedQA, MedMCQA, and PubMedQA (Xiao et al., 10 Dec 2025). The abstraction component is PSRS*, a newly constructed dataset from Cochrane Library psychiatric systematic reviews, where expert annotators produce “scoring points” that capture effect direction, certainty, population, and intervention (Xiao et al., 10 Dec 2025). The verification component uses MentalMisinfo, which contains transcripts of mental-health content from YouTube Shorts and BitChute labeled as accurate or misleading (Xiao et al., 10 Dec 2025).

3. Reasoning aspects and evaluation dimensions

MentraBench’s central claim is that mental-health reasoning should be evaluated across clinically meaningful stages. The benchmark defines appraisal as cognitive-pattern reasoning, diagnosis as mental-condition reasoning, intervention as therapeutic-action reasoning, abstraction as evidence-based reasoning, and verification as misinformation detection, while also isolating multi-step clinical reasoning as an end-to-end psychiatry QA task (Xiao et al., 10 Dec 2025).

In appraisal, the model must identify the kind of cognitive error present in a client’s thought, including distinctions among categories such as all-or-nothing thinking, mind reading, catastrophizing, should statements, emotional reasoning, comparing and despairing, blaming, negative feeling or emotion, and discounting the positive, depending on dataset-specific taxonomies (Xiao et al., 10 Dec 2025). In diagnosis, the model must infer mental condition labels from social-media posts while resisting overpathologization and differentiating overlapping symptom profiles (Xiao et al., 10 Dec 2025). In intervention, the model must choose a counseling strategy such as clarification, reflection of feeling, or confrontation, rather than merely producing supportive dialogue (Xiao et al., 10 Dec 2025).

Beyond task categories, MentraBench explicitly evaluates reasoning trajectory quality along five dimensions: reasoning conciseness, logical coherence, hallucination avoidance, task understanding, and internal consistency (Xiao et al., 10 Dec 2025). Conciseness is defined as avoiding unnecessary complexity, repetition, or unproductive backtracking. Logical coherence requires clear, case-specific reasoning rather than headings without substantive elaboration or unsupported assertions. Hallucination avoidance requires that the reasoning reflect only information in the input case. Task understanding requires adherence to the intended task, for example labeling a counseling strategy rather than generating a counselor reply. Internal consistency requires that no later step contradict an earlier interpretation or conclusion (Xiao et al., 10 Dec 2025).

These dimensions are scored manually on sampled reasoning chains using binary labels, with one point assigned when no issue is present and zero otherwise (Xiao et al., 10 Dec 2025). For each dataset and each selected model, annotators sample four cases: two where all models answer correctly and two where all models answer incorrectly (Xiao et al., 10 Dec 2025). The per-model reasoning trajectory score is the average across the five dimensions,

$$R_{\text{avg}} = \frac{1}{5}\sum_{k=1}^{5} R_k,$$

where $R_k \in \{0,1\}$ (Xiao et al., 10 Dec 2025).
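As a minimal sketch of this averaging, the snippet below scores one annotated reasoning chain from its five binary labels. The dimension keys and the `model_score` helper (a plain mean over a model's sampled chains) are illustrative assumptions, not names or aggregation rules taken from the paper.

```python
from statistics import mean

# Hypothetical keys for the five trajectory dimensions named in the text.
DIMENSIONS = (
    "conciseness",
    "logical_coherence",
    "hallucination_avoidance",
    "task_understanding",
    "internal_consistency",
)


def trajectory_score(labels: dict) -> float:
    """Average the five binary labels (1 = no issue, 0 = issue) for one chain."""
    missing = [d for d in DIMENSIONS if d not in labels]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    if any(labels[d] not in (0, 1) for d in DIMENSIONS):
        raise ValueError("each label must be 0 or 1")
    return sum(labels[d] for d in DIMENSIONS) / len(DIMENSIONS)


def model_score(chains: list) -> float:
    """Assumed aggregation: mean trajectory score over a model's sampled chains."""
    return mean(trajectory_score(c) for c in chains)


clean = dict.fromkeys(DIMENSIONS, 1)
verbose = {**clean, "conciseness": 0}  # one dimension flagged by the annotator
print(trajectory_score(verbose))       # 4 of 5 dimensions pass
print(model_score([clean, verbose]))
```

Because each label is binary, a single chain can only score in steps of 0.2, so differences between models come from averaging over the sampled cases.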

The paper reports high inter-annotator agreement for these reasoning metrics, with Gwet AC1 approximately $0.96$–M_M0 and Cohen’s M_M1 approximately M_M2–M_M3, depending on dimension (Xiao et al., 10 Dec 2025). This supports the use of these trajectory-level annotations as a distinct evaluation layer rather than a purely illustrative supplement.
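For readers unfamiliar with the two coefficients, the following sketch computes Cohen's kappa and Gwet's AC1 for two raters assigning binary labels. It uses the standard two-rater formulas; the toy ratings are invented for illustration and are not data from the paper.

```python
def _observed_agreement(a: list, b: list) -> float:
    if len(a) != len(b) or not a:
        raise ValueError("ratings must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a: list, b: list) -> float:
    """Cohen's kappa for two raters with binary labels in {0, 1}."""
    p_obs = _observed_agreement(a, b)
    pa, pb = sum(a) / len(a), sum(b) / len(b)
    p_exp = pa * pb + (1 - pa) * (1 - pb)  # chance agreement from each rater's marginals
    if p_exp == 1.0:
        return float("nan")  # undefined when both raters use a single label
    return (p_obs - p_exp) / (1 - p_exp)


def gwet_ac1(a: list, b: list) -> float:
    """Gwet's AC1 for two raters with binary labels in {0, 1}."""
    p_obs = _observed_agreement(a, b)
    pi = (sum(a) + sum(b)) / (2 * len(a))  # mean prevalence of label 1
    p_exp = 2 * pi * (1 - pi)              # at most 0.5, so the ratio is defined
    return (p_obs - p_exp) / (1 - p_exp)


# Toy ratings: both raters mark almost every chain as "no issue" (1).
rater_a = [1] * 18 + [1, 0]
rater_b = [1] * 18 + [0, 1]
print(round(cohen_kappa(rater_a, rater_b), 3))
print(round(gwet_ac1(rater_a, rater_b), 3))
```

In this toy case the raters agree on 18 of 20 items, yet kappa is slightly negative while AC1 stays high. This is the usual behavior when one label dominates, and it is one reason both coefficients are often reported together for heavily skewed binary annotations.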

4. Metrics, prompting protocol, and scoring procedures

MentraBench uses different metrics depending on task structure (Xiao et al., 10 Dec 2025). Micro-F1 is used for the cognitive distortion datasets, the condition-detection datasets, and the psychiatry QA datasets. Macro-F1 is used for MentalMisinfo to equally weight accurate and misleading classes. Jaccard similarity is used for PsyDTCorpusM_M4 and AnnoMIM_M5, which are treated as multi-label counseling strategy tasks. PSRS uses recall over annotated scoring points (Xiao et al., 10 Dec 2025).
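The difference between micro- and macro-averaging can be made concrete with a short sketch. The function below computes both for single-label classification from scratch; the label names and counts are hypothetical, and the class balance of MentalMisinfo is not assumed here.

```python
from collections import Counter


def f1_scores(gold: list, pred: list) -> tuple:
    """Return (micro_f1, macro_f1) for single-label classification."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative

    def f1(t: int, f_pos: int, f_neg: int) -> float:
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    labels = sorted(set(gold) | set(pred))
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Hypothetical imbalanced split where the model always predicts "accurate".
gold = ["accurate"] * 8 + ["misleading"] * 2
pred = ["accurate"] * 10
micro, macro = f1_scores(gold, pred)
print(round(micro, 3), round(macro, 3))
```

A model that never flags misleading content still reaches a micro-F1 of 0.8 in this toy split, while its macro-F1 falls to about 0.44 because the missed class counts as much as the majority class.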

The Jaccard metric is defined as

$$J(P, G) = \frac{|P \cap G|}{|P \cup G|},$$

where $P$ is the set of predicted strategy labels and $G$ the ground-truth labels (Xiao et al., 10 Dec 2025). For PSRS, if $S$ is the set of gold scoring points and $S_{\text{cov}} \subseteq S$ the subset covered by the model’s summary, recall is

$$\text{Recall} = \frac{|S_{\text{cov}}|}{|S|}.$$

This scoring method is intended to capture whether the summary covers key findings such as effect direction and certainty, rather than only lexical similarity (Xiao et al., 10 Dec 2025).
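Both set-based metrics reduce to a few lines of code. The sketch below assumes strategy labels and scoring points are represented as sets of strings; the empty-set convention and the example labels are illustrative choices, and how a summary is judged to "cover" a scoring point is left outside the code.

```python
def jaccard(pred: set, gold: set) -> float:
    """Overlap between predicted and gold strategy label sets."""
    union = pred | gold
    if not union:
        return 1.0  # assumption: two empty label sets count as a perfect match
    return len(pred & gold) / len(union)


def scoring_point_recall(gold_points: set, covered: set) -> float:
    """Fraction of gold scoring points that the summary was judged to cover."""
    if not gold_points:
        raise ValueError("at least one gold scoring point is required")
    return len(covered & gold_points) / len(gold_points)


# Hypothetical multi-label counseling strategy prediction.
pred = {"reflection of feeling", "clarification"}
gold = {"reflection of feeling", "confrontation"}
print(round(jaccard(pred, gold), 3))  # 1 shared label out of 3 distinct labels

# Hypothetical scoring points for one systematic review abstract.
points = {"effect direction", "certainty", "population", "intervention"}
covered = {"effect direction", "population", "intervention"}
print(scoring_point_recall(points, covered))
```

Jaccard penalizes both missing and spurious strategy labels, whereas the recall measure ignores extra content in a summary and only checks that the annotated findings are present.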

The evaluation protocol standardizes model outputs into a two-phase structure. A reasoning phase is enclosed within > ..., and an answer phase is enclosed within <answer> ... </answer> and ends with Answer: [option/result] (Xiao et al., 10 Dec 2025). This format is imposed even on black-box API models so that reasoning structures are directly comparable across model families (Xiao et al., 10 Dec 2025).
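A scorer for this format only needs the final answer line. The sketch below is one possible parser, assuming the answer phase appears exactly as `<answer> ... </answer>` and that its last non-empty line starts the result with `Answer:`. The function name and the decision to return `None` on malformed output are hypothetical choices, not part of the benchmark's published tooling.

```python
import re
from typing import Optional

ANSWER_BLOCK = re.compile(r"<answer>(.*?)</answer>", re.DOTALL | re.IGNORECASE)
ANSWER_LINE = re.compile(r"Answer:\s*(.+?)\s*$")


def extract_answer(output: str) -> Optional[str]:
    """Return the text after 'Answer:' on the last line of the answer block."""
    block = ANSWER_BLOCK.search(output)
    if block is None:
        return None  # no answer phase found
    body = block.group(1).strip()
    if not body:
        return None
    match = ANSWER_LINE.search(body.splitlines()[-1])
    return match.group(1) if match else None


sample = (
    "...reasoning phase...\n"
    "<answer>\n"
    "The utterance mirrors the client's emotion.\n"
    "Answer: reflection of feeling\n"
    "</answer>"
)
print(extract_answer(sample))
print(extract_answer("free-form reply with no tags"))
```

Treating malformed output as `None` lets a scorer count format violations as incorrect without raising an error in the middle of an evaluation run.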

Models are evaluated in what appears to be a zero-shot setting with standardized system prompts (Xiao et al., 10 Dec 2025). Closed-form classification and multiple-choice tasks are scored automatically against gold labels. Open-ended summarization in PSRS is scored by coverage of annotated scoring points, a procedure described only at a high level as manual or rule-based checking (Xiao et al., 10 Dec 2025). This suggests that MentraBench combines automated and annotation-dependent evaluation, depending on whether a task can be verified through discrete outputs.

5. Data construction, annotation, and relation to MentraSuite training

MentraBench integrates existing datasets, processed datasets, and one newly constructed dataset (Xiao et al., 10 Dec 2025). The processed resources include PsyDTCorpusM_M2, AnnoMIM_M3, MedQAM_M4, MedMCQAM_M5, and PubMedQAM_M6, while PSRS* is newly curated from Cochrane psychiatric systematic reviews (Xiao et al., 10 Dec 2025).

For counseling strategy tasks, GPT-4o is used to compress client utterances into concise case summaries and strategy labels are extracted from counselor utterances, after which all items are manually reviewed and corrected by an expert counselor (Xiao et al., 10 Dec 2025). For PSRS, expert annotators identify and list scoring points for each abstract (Xiao et al., 10 Dec 2025). Cognitive distortion datasets rely on annotations by mental-health professionals or multiple raters, sometimes up to 15, whereas social-media condition datasets use weak supervision and are acknowledged to be noisier (Xiao et al., 10 Dec 2025).

The benchmark is tightly connected to MentraSuite’s training pipeline. Training splits of MentraBench datasets are used in a difficulty-filtering step in which Llama-3-8B-Instruct is run in zero-shot mode and only cases answered incorrectly are retained (Xiao et al., 10 Dec 2025). These retained cases are then processed through an iterative optimal reasoning path search with GPT-4o, involving backtracking, exploration of new paths, verification, and correction, with up to M_M7 iterations and up to M_M8 restarts (Xiao et al., 10 Dec 2025). Accepted trajectories are rewritten into a strict two-phase structure with headings such as ### Symptom Analysis, ### Differential Diagnosis, ### Risk Assessment, and ### Final Conclusion (Xiao et al., 10 Dec 2025).
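The difficulty-filtering step can be sketched as a simple predicate over training cases. In the snippet below, `predict` stands in for a zero-shot call to the base model, and the case fields `input` and `label` are hypothetical; the subsequent reasoning path search is not modeled.

```python
from typing import Callable


def filter_hard_cases(cases: list, predict: Callable[[str], str]) -> list:
    """Keep only the cases whose zero-shot prediction differs from the gold label."""
    return [c for c in cases if predict(c["input"]) != c["label"]]


def toy_predict(text: str) -> str:
    # Stand-in for the base model: always predicts the same condition.
    return "depression"


cases = [
    {"input": "post A", "label": "depression"},
    {"input": "post B", "label": "anxiety"},
    {"input": "post C", "label": "depression"},
]
hard = filter_hard_cases(cases, toy_predict)
print([c["input"] for c in hard])
```

Only the case the stand-in model gets wrong survives, which mirrors the idea of spending the more expensive search on examples the base model cannot already solve.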

These trajectories are used in supervised fine-tuning and reinforcement learning for Mindora. The reward function includes indicators for valid format, valid length, and consistency, multiplied by a task-specific quality term (Xiao et al., 10 Dec 2025):

M_M9

The consistency signal is checked by an auxiliary model M_M0 identified as Qwen3-32B, and the valid reasoning length range is M_M1 with M_M2 and M_M3 (Xiao et al., 10 Dec 2025).
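One way to read this description is as a gated reward: the task-specific quality term counts only when all three checks pass. The sketch below makes that reading explicit. The multiplicative combination of the three indicators, the function signature, and the `min_len` and `max_len` parameters are assumptions for illustration; in particular, the length bounds are left as arguments rather than fixed to any value.

```python
def gated_reward(
    format_ok: bool,
    consistent: bool,
    reasoning_len: int,
    quality: float,
    min_len: int,
    max_len: int,
) -> float:
    """Quality term multiplied by three 0/1 indicators (assumed to combine as a product)."""
    length_ok = min_len <= reasoning_len <= max_len
    gate = int(format_ok) * int(length_ok) * int(consistent)
    return gate * quality


# Hypothetical bounds and quality scores, chosen only to exercise the gate.
print(gated_reward(True, True, 300, 0.7, min_len=100, max_len=1000))
print(gated_reward(True, False, 300, 0.7, min_len=100, max_len=1000))
print(gated_reward(True, True, 20, 0.7, min_len=100, max_len=1000))
```

Under this reading, a well-formed but internally inconsistent trajectory earns nothing even when its final answer is correct, which matches the stated emphasis on consistency during reinforcement learning.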

This design means that MentraBench functions as both an evaluation artifact and a training-aligned specification. A plausible implication is that the benchmark was constructed not only to rank models, but also to formalize which properties of mental-health reasoning should be optimized during post-training.

6. Empirical results, comparisons, and limitations

MentraBench is used to evaluate 20 LLMs, including closed-source models such as GPT-o4-mini, GPT-4o, DeepSeek-R1, DeepSeek-V3, Qwen-plus, and QwQ-plus; open-source models such as LLaMA-4, LLaMA-3.3-70B, Qwen2.5-72B, Qwen3-32B, QwQ-32B, Qwen3-14B, LLaMA 3.1-8B, and Qwen3-8B; specialized systems such as EmoLLM and Psyche-R1; and the MentraSuite models MindoraM_M4, MindoraM_M5, and MindoraM_M6 (Xiao et al., 10 Dec 2025).

The main aggregate result reported is that MindoraM_M7 achieves the highest overall average score, M_M8, across all 13 datasets (Xiao et al., 10 Dec 2025). The same section reports comparison points including GPT-o4-mini at approximately M_M9, DeepSeek-R1 at approximately M_M0, GPT-4o at approximately M_M1, Qwen-plus at approximately M_M2, Psyche-R1 at approximately M_M3, and the Qwen3-8B backbone at approximately M_M4 (Xiao et al., 10 Dec 2025). MindoraM_M5 also improves substantially over Qwen3-8B, from M_M6 to M_M7, which the paper interprets as evidence for the utility of reinforcement learning with a consistency-oriented reward (Xiao et al., 10 Dec 2025).

On reasoning-trajectory evaluation, the paper states that MindoraM_M8 achieves M_M9, M_M0, M_M1, M_M2, M_M3, and M_M4 (Xiao et al., 10 Dec 2025). The same section compares these numbers with GPT-o4-mini at M_M5, Qwen-plus at M_M6, LLaMA-4 at M_M7, Psyche-R1 at M_M8, and Qwen3-8B at M_M9, while noting that DeepSeek-R1 reaches M_M0 and is described as “very high,” though Mindora is said to be more balanced on certain dimensions (Xiao et al., 10 Dec 2025).

The benchmark is compared with Psyche-R1, Psy-Interpreter, and PsychCounsel-Bench. According to the comparison table described in the source, Psyche-R1 covers 3 tasks and 4 datasets, Psy-Interpreter covers 3 tasks and 6 datasets, PsychCounsel-Bench covers 1 task and 1 dataset, whereas MentraBench covers 6 tasks and 13 datasets (Xiao et al., 10 Dec 2025). It is also described as the only benchmark among those compared that explicitly covers appraisal, diagnosis, intervention, abstraction, verification, and multi-step reasoning while additionally evaluating reasoning chains along five qualitative dimensions (Xiao et al., 10 Dec 2025).

The limitations discussed are domain and data related. Many datasets are English-only and tied to specific platforms such as Reddit, Twitter, and YouTube; weak supervision introduces label noise in condition-detection tasks; the benchmark is text-only and excludes multimodal cues such as voice tone and facial expression; and high benchmark performance is not treated as guaranteeing safety in actual therapy, since models lack ethical accountability and emotional intelligence (Xiao et al., 10 Dec 2025). The paper also notes that misuse of LLMs in mental-health roles can amplify anxiety or misinformation if reasoning is incomplete or ungrounded (Xiao et al., 10 Dec 2025).

These limitations delimit the benchmark’s intended interpretation. MentraBench measures performance and reasoning quality on a broad set of mental-health tasks, but it is not presented as a certification of safe clinical deployment. This suggests that its primary role is methodological: to provide a more comprehensive and clinically structured basis for evaluating and post-training mental-health LLMs than prior benchmarks offered (Xiao et al., 10 Dec 2025).
