MentraBench: Evaluating Mental-Health LLMs
- MentraBench is a domain-specific benchmark that assesses mental-health reasoning by evaluating appraisal, diagnosis, intervention, abstraction, and verification in LLMs.
- It integrates six tasks and 13 datasets, employing metrics like micro-F1, macro-F1, Jaccard similarity, and recall to ensure robust, process-sensitive evaluations.
- The benchmark supports post-training optimization for models like Mindora by emphasizing concise, coherent, and clinically aligned multi-step reasoning.
MentraBench is a benchmark within the MentraSuite framework for evaluating LLMs on mental-health reasoning and assessment, with emphasis on step-wise, clinically aligned reasoning rather than outcome accuracy alone. It is designed to test whether models can perform appraisal, diagnosis, intervention selection, abstraction, and verification in mental-health settings, while also maintaining concise, coherent, non-hallucinated, and internally consistent reasoning trajectories (Xiao et al., 10 Dec 2025). The benchmark spans six tasks and 13 datasets, combining classification, question answering, summarization, and misinformation detection, and it serves both as an evaluation suite and as a source of training signals for Mindora, the post-trained model introduced alongside it (Xiao et al., 10 Dec 2025).
1. Definition and scope
MentraBench is the benchmark component of MentraSuite, a unified framework introduced to advance reliable mental-health reasoning in LLMs (Xiao et al., 10 Dec 2025). The benchmark is motivated by the claim that existing psychological LLM evaluations often emphasize emotional understanding, knowledge recall, or narrowly defined reasoning problems, while underemphasizing the clinically aligned reasoning chain required for appraisal, diagnosis, intervention planning, abstraction, and verification (Xiao et al., 10 Dec 2025).
Within this formulation, MentraBench is not limited to measuring whether a model returns a correct final answer. It also evaluates whether the model’s reasoning process is concise, logically coherent, free of hallucinated content, aligned with the task instruction, and internally consistent (Xiao et al., 10 Dec 2025). This makes the benchmark explicitly process-sensitive in a way that differs from purely label-based evaluations.
The benchmark is organized around five core reasoning aspects and six tasks. The five aspects named in the source are appraisal, diagnosis, intervention, abstraction, and verification, while the task list also includes a distinct multi-step clinical reasoning component instantiated as psychiatry question answering (Xiao et al., 10 Dec 2025). This suggests that the benchmark’s conceptual structure is clinically grounded, whereas its task structure separates multi-step reasoning as an operational evaluation category.
A potential source of confusion is the similarity between MentraBench and metabench. The latter, introduced in a separate work, is a sparse benchmark for reasoning and knowledge in general-purpose LLM evaluation across ARC, GSM8K, HellaSwag, MMLU, TruthfulQA, and WinoGrande (Kipnis et al., 2024). MentraBench, by contrast, is domain-specific and centered on mental-health reasoning, counseling, psychiatry QA, evidence abstraction, and misinformation detection (Xiao et al., 10 Dec 2025).
2. Task organization and dataset composition
MentraBench includes six tasks aligned with the benchmark’s clinical reasoning goals (Xiao et al., 10 Dec 2025). These are Cognitive Error Identification, Mental Health Condition Detection, Counseling Strategy Formulation, Psychiatry QA, Psychiatry Systematic Review Summarization, and Mental Health Misinformation Identification (Xiao et al., 10 Dec 2025). Each task is associated with one or more datasets drawn from existing sources, processed corpora, or newly curated material.
The benchmark spans 13 datasets in the paper’s accounting, although the detailed enumeration includes 14 numbered entries because MentalMisinfo is described as “counted as one dataset in table” (Xiao et al., 10 Dec 2025). The datasets cover synthetic negative thoughts, therapist-client dialogues, Reddit and Twitter posts, motivational interviewing transcripts, psychiatry subsets of medical QA resources, Cochrane psychiatric systematic review abstracts, and transcripts of mental-health-related videos from video-sharing platforms (Xiao et al., 10 Dec 2025).
| Task block | Datasets named in the paper | Metric |
|---|---|---|
| Appraisal / Cognitive Error Identification | CognitiveReframing, PatternReframe, Therapist QA | Micro-F1 |
| Diagnosis / Mental Health Condition Detection | DepSign, SWMH, T-SID | Micro-F1 |
| Intervention / Counseling Strategy Formulation | PsyDTCorpus, AnnoMI | Jaccard similarity |
| Multi-step Clinical Reasoning / Psychiatry QA | MHQA, MedQA, MedMCQA, PubMedQA | Micro-F1 |
| Abstraction / Psychiatry Systematic Review Summarization | PSRS* | Recall |
| Verification / Mental-Health Misinformation Detection | MentalMisinfo | Macro-F1 |
The appraisal component focuses on cognitive distortion classification. CognitiveReframing is based on synthetic negative thoughts derived from the Thought Records Dataset and web self-reports from Mental Health America’s screening site; PatternReframe uses synthetic statements built from PERSONA-CHAT personas; Therapist QA uses real therapist-client dialogues (Xiao et al., 10 Dec 2025). These datasets differ in realism and label taxonomies, but the prompts are aligned to each dataset’s original definitions (Xiao et al., 10 Dec 2025).
The diagnosis component uses weakly supervised social-media datasets: DepSign from Reddit, SWMH from Reddit, and T-SID from Twitter (Xiao et al., 10 Dec 2025). These tasks require the model to infer the presence or type of mental-health condition from noisy, real-world text, including suicidality in some cases (Xiao et al., 10 Dec 2025).
The intervention component centers on counseling strategy formulation. The benchmark’s PsyDTCorpus data is a processed version of the original PsyDTCorpus and is used only for training, not evaluation, because of its partially synthetic nature (Xiao et al., 10 Dec 2025). AnnoMI is derived from real motivational interviewing demonstration videos and is used for evaluation only; an expert counselor with 10+ years of experience validates the constructed items (Xiao et al., 10 Dec 2025).
The multi-step clinical reasoning component aggregates psychiatry-focused question answering from MHQA and psychiatry subsets of MedQA, MedMCQA, and PubMedQA (Xiao et al., 10 Dec 2025). The abstraction component is PSRS*, a newly constructed dataset from Cochrane Library psychiatric systematic reviews, where expert annotators produce “scoring points” that capture effect direction, certainty, population, and intervention (Xiao et al., 10 Dec 2025). The verification component uses MentalMisinfo, which contains transcripts of mental-health content from YouTube Shorts and BitChute labeled as accurate or misleading (Xiao et al., 10 Dec 2025).
3. Reasoning aspects and evaluation dimensions
MentraBench’s central claim is that mental-health reasoning should be evaluated across clinically meaningful stages. The benchmark defines appraisal as cognitive-pattern reasoning, diagnosis as mental-condition reasoning, intervention as therapeutic-action reasoning, abstraction as evidence-based reasoning, and verification as misinformation detection, while also isolating multi-step clinical reasoning as an end-to-end psychiatry QA task (Xiao et al., 10 Dec 2025).
In appraisal, the model must identify the kind of cognitive error present in a client’s thought, including distinctions among categories such as all-or-nothing thinking, mind reading, catastrophizing, should statements, emotional reasoning, comparing and despairing, blaming, negative feeling or emotion, and discounting the positive, depending on dataset-specific taxonomies (Xiao et al., 10 Dec 2025). In diagnosis, the model must infer mental condition labels from social-media posts while resisting overpathologization and differentiating overlapping symptom profiles (Xiao et al., 10 Dec 2025). In intervention, the model must choose a counseling strategy such as clarification, reflection of feeling, or confrontation, rather than merely producing supportive dialogue (Xiao et al., 10 Dec 2025).
Beyond task categories, MentraBench explicitly evaluates reasoning trajectory quality along five dimensions: reasoning conciseness, logical coherence, hallucination avoidance, task understanding, and internal consistency (Xiao et al., 10 Dec 2025). Conciseness is defined as avoiding unnecessary complexity, repetition, or unproductive backtracking. Logical coherence requires clear, case-specific reasoning rather than headings without substantive elaboration or unsupported assertions. Hallucination avoidance requires that the reasoning reflect only information in the input case. Task understanding requires adherence to the intended task, for example labeling a counseling strategy rather than generating a counselor reply. Internal consistency requires that no later step contradict an earlier interpretation or conclusion (Xiao et al., 10 Dec 2025).
These dimensions are scored manually on sampled reasoning chains using binary labels, with one point assigned when no issue is present and zero otherwise (Xiao et al., 10 Dec 2025). For each dataset and each selected model, annotators sample four cases: two where all models answer correctly and two where all models answer incorrectly (Xiao et al., 10 Dec 2025). The per-model reasoning trajectory score is the average across the five dimensions,
$$S = \frac{1}{5}\sum_{d=1}^{5} s_d,$$
where $s_d \in \{0, 1\}$ is the binary label assigned on dimension $d$ (Xiao et al., 10 Dec 2025).
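The averaging over binary dimension labels can be illustrated with a minimal Python sketch. The dimension keys, the function names, and the choice to average chain-level scores over a model's sampled chains are assumptions made for illustration; the source only states that the score is the average across the five dimensions.

```python
# Hypothetical dimension keys; the five dimensions themselves come from the text.
DIMENSIONS = (
    "conciseness",
    "logical_coherence",
    "hallucination_avoidance",
    "task_understanding",
    "internal_consistency",
)


def chain_score(labels: dict) -> float:
    """Average the five binary labels (1 = no issue, 0 = issue) for one chain."""
    missing = [d for d in DIMENSIONS if d not in labels]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    if any(labels[d] not in (0, 1) for d in DIMENSIONS):
        raise ValueError("labels must be binary (0 or 1)")
    return sum(labels[d] for d in DIMENSIONS) / len(DIMENSIONS)


def model_score(chains: list) -> float:
    """Assumed aggregation: mean of chain-level scores over the sampled chains."""
    if not chains:
        raise ValueError("at least one annotated chain is required")
    return sum(chain_score(c) for c in chains) / len(chains)


# Two invented annotations for one model.
clean_chain = {d: 1 for d in DIMENSIONS}
flawed_chain = dict(clean_chain, hallucination_avoidance=0)
print(chain_score(flawed_chain))               # one of five dimensions flagged
print(model_score([clean_chain, flawed_chain]))
```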
The paper reports high inter-annotator agreement for these reasoning metrics, with Gwet’s AC1 in a range beginning at approximately $0.96$ and Cohen’s $\kappa$ values that vary by dimension (Xiao et al., 10 Dec 2025). This supports the use of these trajectory-level annotations as a distinct evaluation layer rather than a purely illustrative supplement.
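Both agreement statistics are worth reporting for binary labels because they handle skewed label distributions differently. The sketch below uses the standard two-rater, two-category formulas for Cohen's kappa and Gwet's AC1. The ratings are invented, and the paper's exact setup (number of raters, any weighting) is not specified here, so this is only an illustration of how the two statistics can diverge.

```python
def observed_agreement(a: list, b: list) -> float:
    """Fraction of items on which the two raters assign the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("rating lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a: list, b: list) -> float:
    """Cohen's kappa for two raters and binary labels (0/1).

    Undefined (division by zero) when chance agreement equals 1.
    """
    po = observed_agreement(a, b)
    pa, pb = sum(a) / len(a), sum(b) / len(b)
    pe = pa * pb + (1 - pa) * (1 - pb)   # chance agreement from rater marginals
    return (po - pe) / (1 - pe)


def gwet_ac1(a: list, b: list) -> float:
    """Gwet's AC1 for two raters and binary labels (0/1)."""
    po = observed_agreement(a, b)
    pi = (sum(a) / len(a) + sum(b) / len(b)) / 2   # mean prevalence of label 1
    pe = 2 * pi * (1 - pi)                         # AC1 chance-agreement term
    return (po - pe) / (1 - pe)


# Invented skewed ratings: both raters mark almost every chain as issue-free.
rater_a = [1] * 18 + [1, 0]
rater_b = [1] * 18 + [0, 1]
print(round(cohen_kappa(rater_a, rater_b), 3))
print(round(gwet_ac1(rater_a, rater_b), 3))
```

With heavily skewed labels, kappa can fall near or below zero even when raw agreement is high, while AC1 stays close to the observed agreement.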
4. Metrics, prompting protocol, and scoring procedures
MentraBench uses different metrics depending on task structure (Xiao et al., 10 Dec 2025). Micro-F1 is used for the cognitive distortion datasets, the condition-detection datasets, and the psychiatry QA datasets. Macro-F1 is used for MentalMisinfo to equally weight accurate and misleading classes. Jaccard similarity is used for PsyDTCorpus and AnnoMI, which are treated as multi-label counseling strategy tasks. PSRS uses recall over annotated scoring points (Xiao et al., 10 Dec 2025).
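The difference between micro- and macro-averaging can be made concrete with a small from-scratch sketch for single-label classification. The function name and the toy labels are assumptions for illustration; they are not data or code from the benchmark.

```python
from collections import Counter


def micro_macro_f1(gold: list, pred: list) -> tuple:
    """Return (micro-F1, macro-F1) for single-label classification."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1   # predicted label was wrong
            fn[g] += 1   # gold label was missed

    def f1(t: int, f_pos: int, f_neg: int) -> float:
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    labels = sorted(set(gold) | set(pred))
    # Micro: pool counts over all classes, then compute a single F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Invented imbalanced example: the classifier never predicts "misleading".
gold = ["accurate"] * 8 + ["misleading"] * 2
pred = ["accurate"] * 10
micro, macro = micro_macro_f1(gold, pred)
print(round(micro, 3), round(macro, 3))
```

In this toy case the pooled micro score remains high while the macro score drops sharply, because the missed minority class contributes an F1 of zero with equal weight.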
The Jaccard metric is defined as
$$J(P, G) = \frac{|P \cap G|}{|P \cup G|},$$
where $P$ is the set of predicted strategy labels and $G$ the ground-truth labels (Xiao et al., 10 Dec 2025). For PSRS, if $S$ is the set of gold scoring points and $C \subseteq S$ the subset covered by the model’s summary, recall is
$$\text{Recall} = \frac{|C|}{|S|}.$$
This scoring method is intended to capture whether the summary covers key findings such as effect direction and certainty, rather than only lexical similarity (Xiao et al., 10 Dec 2025).
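The two set-based metrics can be sketched in a few lines of Python. The function names, the handling of empty sets, and the example labels and scoring points are assumptions for illustration. How coverage of a scoring point is decided (manually or by rule) is outside the scope of this sketch.

```python
def jaccard(predicted: set, gold: set) -> float:
    """Jaccard similarity between predicted and gold strategy label sets."""
    union = predicted | gold
    if not union:
        return 1.0   # assumption: two empty label sets count as a perfect match
    return len(predicted & gold) / len(union)


def scoring_point_recall(gold_points: set, covered_points: set) -> float:
    """Fraction of gold scoring points that the summary covers."""
    if not gold_points:
        raise ValueError("at least one gold scoring point is required")
    return len(covered_points & gold_points) / len(gold_points)


# Invented multi-label counseling example.
predicted = {"reflection of feeling", "clarification"}
gold = {"reflection of feeling", "confrontation"}
print(jaccard(predicted, gold))   # one shared label out of three distinct labels

# Invented summary covering three of four annotated scoring points.
points = {"effect direction", "certainty", "population", "intervention"}
covered = {"effect direction", "population", "intervention"}
print(scoring_point_recall(points, covered))
```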
The evaluation protocol standardizes model outputs into a two-phase structure. A reasoning phase is enclosed within a pair of opening and closing reasoning tags, and an answer phase is enclosed within <answer> ... </answer> and ends with Answer: [option/result] (Xiao et al., 10 Dec 2025). This format is imposed even on black-box API models so that reasoning structures are directly comparable across model families (Xiao et al., 10 Dec 2025).
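A parser for the answer phase might look like the following sketch. It relies only on the `<answer> ... </answer>` tags and the trailing `Answer:` line described above. The function name, the regular expressions, and the decision to return `None` for malformed outputs are assumptions, not the paper's evaluation code.

```python
import re
from typing import Optional

# Non-greedy match so only the first <answer> block is captured.
ANSWER_BLOCK = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
# The final line of the block is expected to read "Answer: <option/result>".
FINAL_LINE = re.compile(r"Answer:\s*(.+?)\s*$")


def extract_answer(output: str) -> Optional[str]:
    """Return the text after the closing 'Answer:' marker, or None if malformed."""
    block = ANSWER_BLOCK.search(output)
    if block is None:
        return None
    final = FINAL_LINE.search(block.group(1).strip())
    return final.group(1) if final else None


# Invented model output; the reasoning phase is omitted for brevity.
raw_output = (
    "<answer>\n"
    "The thought predicts the worst possible outcome.\n"
    "Answer: catastrophizing\n"
    "</answer>"
)
print(extract_answer(raw_output))
```

Returning `None` instead of raising keeps format failures countable as incorrect predictions, which is one plausible way to score non-conforming outputs.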
The benchmark is evaluated in what appears to be a zero-shot setting with standardized system prompts (Xiao et al., 10 Dec 2025). Closed-form classification and multiple-choice tasks are scored automatically against gold labels. Open-ended summarization in PSRS is scored through coverage of annotated scoring points, described at a high level as manual or rule-based checking (Xiao et al., 10 Dec 2025). This suggests that MentraBench combines automated and annotation-dependent evaluation, depending on whether the task is verifiable through discrete outputs.
5. Data construction, annotation, and relation to MentraSuite training
MentraBench integrates existing datasets, processed datasets, and one newly constructed dataset (Xiao et al., 10 Dec 2025). The processed resources include PsyDTCorpus, AnnoMI, MedQA, MedMCQA, and PubMedQA, while PSRS* is newly curated from Cochrane psychiatric systematic reviews (Xiao et al., 10 Dec 2025).
For counseling strategy tasks, GPT-4o is used to compress client utterances into concise case summaries and strategy labels are extracted from counselor utterances, after which all items are manually reviewed and corrected by an expert counselor (Xiao et al., 10 Dec 2025). For PSRS, expert annotators identify and list scoring points for each abstract (Xiao et al., 10 Dec 2025). Cognitive distortion datasets rely on annotations by mental-health professionals or multiple raters, sometimes up to 15, whereas social-media condition datasets use weak supervision and are acknowledged to be noisier (Xiao et al., 10 Dec 2025).
The benchmark is tightly connected to MentraSuite’s training pipeline. Training splits of MentraBench datasets are used in a difficulty-filtering step in which Llama-3-8B-Instruct is run in zero-shot mode and only cases answered incorrectly are retained (Xiao et al., 10 Dec 2025). These retained cases are then processed through an iterative optimal reasoning path search with GPT-4o, involving backtracking, exploration of new paths, verification, and correction, with capped numbers of iterations and restarts (Xiao et al., 10 Dec 2025). Accepted trajectories are rewritten into a strict two-phase structure with headings such as ### Symptom Analysis, ### Differential Diagnosis, ### Risk Assessment, and ### Final Conclusion (Xiao et al., 10 Dec 2025).
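The difficulty-filtering step can be sketched as a simple predicate over training cases. The case schema (`input`, `label`), the `predict` callable, and the toy baseline below are hypothetical stand-ins. In the described pipeline the predictor would be a zero-shot call to the base model, which is not reproduced here.

```python
from typing import Callable


def filter_hard_cases(cases: list, predict: Callable[[str], str]) -> list:
    """Keep only the cases that the baseline predictor answers incorrectly."""
    return [case for case in cases if predict(case["input"]) != case["label"]]


def toy_baseline(text: str) -> str:
    """Hypothetical stand-in for a zero-shot model: always predicts one label."""
    return "depression"


# Invented training cases.
train_cases = [
    {"input": "I can't get out of bed lately.", "label": "depression"},
    {"input": "My heart races before every meeting.", "label": "anxiety"},
    {"input": "Sleeping fine, just checking in.", "label": "control"},
]
hard_cases = filter_hard_cases(train_cases, toy_baseline)
print([case["label"] for case in hard_cases])
```

Only the retained cases would then be passed on to the more expensive reasoning path search.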
These trajectories are used in supervised fine-tuning and reinforcement learning for Mindora. The reward function includes indicators for valid format, valid length, and consistency, multiplied by a task-specific quality term (Xiao et al., 10 Dec 2025):
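$$R = \mathbb{1}_{\text{format}} \cdot \mathbb{1}_{\text{length}} \cdot \mathbb{1}_{\text{consist}} \cdot R_{\text{task}}$$
Here $\mathbb{1}_{\text{format}}$, $\mathbb{1}_{\text{length}}$, and $\mathbb{1}_{\text{consist}}$ are the indicators for valid format, valid length, and consistency, and $R_{\text{task}}$ is the task-specific quality term. The consistency signal is checked by an auxiliary model identified as Qwen3-32B, and the valid reasoning length is restricted to a bounded range $[L_{\min}, L_{\max}]$ (Xiao et al., 10 Dec 2025).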
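One reading of this gated reward is a product of three binary checks and a quality score, so that any failed check zeroes the reward. The sketch below follows that reading. The function signature, the token-count notion of length, and the bounds used in the example are assumptions; the actual bounds and the consistency judgment (made by an auxiliary model in the paper) are not reproduced here.

```python
def gated_reward(
    format_ok: bool,
    reasoning_len: int,
    consistent: bool,
    task_quality: float,
    min_len: int,
    max_len: int,
) -> float:
    """Multiply three binary gates by a task-specific quality term."""
    length_ok = min_len <= reasoning_len <= max_len
    gate = int(format_ok) * int(length_ok) * int(consistent)
    return gate * task_quality


# Placeholder bounds chosen for illustration only.
MIN_LEN, MAX_LEN = 50, 1000

print(gated_reward(True, 200, True, 0.75, MIN_LEN, MAX_LEN))   # all gates pass
print(gated_reward(True, 10, True, 0.75, MIN_LEN, MAX_LEN))    # too short
print(gated_reward(True, 200, False, 0.75, MIN_LEN, MAX_LEN))  # inconsistent
```

The multiplicative form means a well-scored answer earns nothing if its reasoning is malformed, out of range, or inconsistent with the final answer.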
This design means that MentraBench functions as both an evaluation artifact and a training-aligned specification. A plausible implication is that the benchmark was constructed not only to rank models, but also to formalize which properties of mental-health reasoning should be optimized during post-training.
6. Empirical results, comparisons, and limitations
MentraBench is used to evaluate 20 LLMs, including closed-source models such as GPT-o4-mini, GPT-4o, DeepSeek-R1, DeepSeek-V3, Qwen-plus, and QwQ-plus; open-source models such as LLaMA-4, LLaMA-3.3-70B, Qwen2.5-72B, Qwen3-32B, QwQ-32B, Qwen3-14B, LLaMA 3.1-8B, and Qwen3-8B; specialized systems such as EmoLLM and Psyche-R1; and three MentraSuite Mindora variants (Xiao et al., 10 Dec 2025).
The main aggregate result reported is that a Mindora variant achieves the highest overall average score across all 13 datasets (Xiao et al., 10 Dec 2025). The same section reports comparison points for GPT-o4-mini, DeepSeek-R1, GPT-4o, Qwen-plus, Psyche-R1, and the Qwen3-8B backbone (Xiao et al., 10 Dec 2025). Mindora also improves substantially over Qwen3-8B, which the paper interprets as evidence for the utility of reinforcement learning with a consistency-oriented reward (Xiao et al., 10 Dec 2025).
On reasoning-trajectory evaluation, the paper reports scores for Mindora on the trajectory-quality dimensions and compares them with those of GPT-o4-mini, Qwen-plus, LLaMA-4, Psyche-R1, and Qwen3-8B, while noting that DeepSeek-R1’s score is described as “very high,” though Mindora is said to be more balanced on certain dimensions (Xiao et al., 10 Dec 2025).
The benchmark is compared with Psyche-R1, Psy-Interpreter, and PsychCounsel-Bench. According to the comparison table described in the source, Psyche-R1 covers 3 tasks and 4 datasets, Psy-Interpreter covers 3 tasks and 6 datasets, PsychCounsel-Bench covers 1 task and 1 dataset, whereas MentraBench covers 6 tasks and 13 datasets (Xiao et al., 10 Dec 2025). It is also described as the only benchmark among those compared that explicitly covers appraisal, diagnosis, intervention, abstraction, verification, and multi-step reasoning while additionally evaluating reasoning chains along five qualitative dimensions (Xiao et al., 10 Dec 2025).
The limitations discussed are domain and data related. Many datasets are English-only and tied to specific platforms such as Reddit, Twitter, and YouTube; weak supervision introduces label noise in condition-detection tasks; the benchmark is text-only and excludes multimodal cues such as voice tone and facial expression; and high benchmark performance is not treated as guaranteeing safety in actual therapy, since models lack ethical accountability and emotional intelligence (Xiao et al., 10 Dec 2025). The paper also notes that misuse of LLMs in mental-health roles can amplify anxiety or misinformation if reasoning is incomplete or ungrounded (Xiao et al., 10 Dec 2025).
These limitations delimit the benchmark’s intended interpretation. MentraBench measures performance and reasoning quality on a broad set of mental-health tasks, but it is not presented as a certification of safe clinical deployment. This suggests that its primary role is methodological: to provide a more comprehensive and clinically structured basis for evaluating and post-training mental-health LLMs than prior benchmarks offered (Xiao et al., 10 Dec 2025).