
PsychoBench: MCQ Benchmark for Psychological Evaluation

Updated 19 November 2025
  • PsychoBench is a multiple-choice question benchmark modeled on the U.S. National Counselor Certification Exam, featuring 2,252 exam-grade items.
  • It covers key areas including counseling techniques, abnormal and developmental psychology, ethics, and research methods for comprehensive assessment.
  • The dataset is generated through scraping, GPT-based paraphrasing, and expert review, with evaluation metrics aligned to a 70% pass threshold.

PsychoBench is a rigorously constructed multiple-choice question (MCQ) benchmark designed to evaluate whether LLMs possess the psychological knowledge necessary to perform at or above the minimum professional threshold required for psychological counseling licensure in the United States. Directly modeled on the U.S. National Counselor Certification Exam (NCE)—which requires approximately 70% accuracy for a passing score—PsychoBench comprises 2,252 single-correct-answer items. By mirroring the content structure and difficulty of the NCE, PsychoBench provides a standardized, exam-grade testbed for profiling the counseling and psychological reasoning proficiency of advanced neural models (Zeng, 2 Oct 2025).

1. Dataset Structure and Scope

Each PsychoBench item consists of a question stem (mean length 23.6 words) followed by 2–5 answer options (50.7% with four options, 48.9% with five, 0.36% with two), with exactly one option marked as correct. The items are systematically curated to encompass the principal sub-disciplines probed by the NCE:

  • Counseling Methods and Techniques: Core approaches (person-centered, cognitive-behavioral, systems theory) and intervention strategies.
  • Abnormal Psychology: Diagnostic criteria, symptomatology, and treatment of psychiatric disorders (mood, anxiety, psychotic disorders).
  • Developmental Psychology: Milestones, lifespan frameworks (e.g., Piaget, Erikson), and family dynamics.
  • Ethics and Professional Issues: Codes of ethics, confidentiality, dual relationships, mandated reporting, and related dilemmas.
  • Research Methods and Assessment: Items woven throughout, covering basic statistics, validity/reliability, and psychometric interpretation.

This coverage forces LLMs to demonstrate both factual expertise (e.g., diagnostic categories) and practical judgment (e.g., case conceptualization, ethical reasoning).

2. Data Generation, Curation, and Quality Control

The dataset was assembled through a three-phase pipeline:

a. Collection: Raw NCE-style MCQs were scraped from public exam-preparation repositories and expanded by subject-matter experts to fill topical gaps.

b. Paraphrasing and Refinement: Linguistic diversity and de-duplication were achieved using GPT-based paraphrase models, which also standardized terminology and removed awkward or legacy phraseology.

c. Manual Expert Review: All items were manually audited by licensed counselors and academic psychologists for accuracy, NCE-alignment, unambiguous phrasing, and clarity. Questions were anonymized and deduplicated; each item was verified to contain one—and only one—explicitly correct answer.

This multi-step process yields a professionally vetted question bank free of ambiguous phrasing, duplicates, and identifying or context-dependent details.
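The de-duplication step of the pipeline can be sketched as follows. This is an illustrative reconstruction, not the authors' actual code: `normalize` and `deduplicate` are hypothetical helpers showing one common way to drop near-duplicate stems by hashing a normalized form of the question text.

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace to build a near-duplicate key."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def deduplicate(items: list[dict]) -> list[dict]:
    """Keep only the first item seen for each normalized question stem."""
    seen, unique = set(), []
    for item in items:
        key = hashlib.sha256(normalize(item["question"]).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique
```

In practice, paraphrase-level duplicates (same content, different wording) would additionally require semantic similarity checks; the hash-based pass above catches only surface-level repeats.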

3. Content Representation and File Formats

Distributed under a CC-BY-NC-ND license on GitHub (https://github.com/cloversjtu/PsychoBench), the dataset is available in JSON and CSV formats. The schema for each question includes:

| Field     | Type   | Description                                                  |
|-----------|--------|--------------------------------------------------------------|
| id        | string | Unique item identifier                                       |
| question  | string | The MCQ stem                                                 |
| options   | dict   | Answer choices indexed by label (e.g., {'A': ..., 'B': ...}) |
| answer    | string | The correct option label                                     |
| subfield  | string | (Optional) Domain: Counseling, Abnormal Psych, etc.          |
| rationale | string | (Optional) Brief rationale for the correct answer            |

Example:

{
  "id": "psy_000123",
  "question": "According to Eysenck’s theory, a highly sociable and relaxed person would be classified as:",
  "options": {
    "A": "Extraverted & stable",
    "B": "Introverted & unstable",
    "C": "Passive‐aggressive",
    "D": "Cyclothymic & dysthymic"
  },
  "answer": "A",
  "subfield": "Abnormal Psychology",
  "rationale": "High sociability = extraversion; calmness = emotional stability."
}
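Given the schema above, loading and validating the JSON distribution is straightforward. The sketch below is illustrative (the repository does not necessarily ship a loader named `load_items`); it checks the required fields and that each item's answer label actually indexes into its options, reflecting the one-correct-answer guarantee.

```python
import json

REQUIRED_FIELDS = {"id", "question", "options", "answer"}

def load_items(path: str) -> list[dict]:
    """Load PsychoBench items from a JSON file and sanity-check the schema."""
    with open(path, encoding="utf-8") as f:
        items = json.load(f)
    for item in items:
        missing = REQUIRED_FIELDS - item.keys()
        assert not missing, f"{item.get('id')}: missing fields {missing}"
        # Exactly one correct answer: the label must be one of the option keys.
        assert item["answer"] in item["options"], f"{item['id']}: bad answer label"
    return items
```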

4. Evaluation Protocols and Metrics

Evaluation uses a zero-shot multiple-choice classification setup, with the entire 2,252-item dataset serving as the testbed (no train/dev/test splits; exam-simulation framing). The scoring protocols are directly aligned with the NCE standard.

  • Primary Metric: Top-1 accuracy

$$\mathrm{accuracy} = \frac{\#\text{ correct predictions}}{\text{total }\#\text{ questions}}$$

A score of 0.70 (70%) is the minimum professional threshold.

  • Secondary Metrics:
    • Top-2 accuracy: Fraction where the correct answer is within the top two model choices.
    • Weighted precision, recall, F1: Adjust for minor option imbalances across items.

No training or adaptation set is provided: the aim is to quantify “exam-readiness” via direct out-of-the-box capability.
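The scoring protocol can be sketched as below. This is a minimal reconstruction, assuming each model produces a ranked list of option labels per item (e.g., from option log-probabilities); `score` and `PASS_THRESHOLD` are illustrative names, not part of the released benchmark code.

```python
PASS_THRESHOLD = 0.70  # NCE-aligned minimum professional threshold

def score(predictions: list[list[str]], answers: list[str]) -> dict:
    """Compute top-1 and top-2 accuracy.

    predictions[i] is the model's option labels for item i, ranked
    most- to least-likely; answers[i] is the gold label.
    """
    n = len(answers)
    top1 = sum(p[0] == a for p, a in zip(predictions, answers)) / n
    top2 = sum(a in p[:2] for p, a in zip(predictions, answers)) / n
    return {"top1": top1, "top2": top2, "passes_nce": top1 >= PASS_THRESHOLD}
```

Weighted precision, recall, and F1 over the option labels can be layered on top with standard tooling (e.g., scikit-learn's `precision_recall_fscore_support` with `average="weighted"`).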

5. Example Items and Interpretive Use

Sample entries demonstrate both factual recall and context-based reasoning:

  1. Theoretical Taxonomy: “Juanita is well-liked by peers due to her sociable, relaxed, and energetic nature. Per Eysenck’s dimensions, she is:”
    • A) Extraverted & stable (correct)
    • B) Passive-aggressive
    • C) Intrinsically motivated
    • D) Introverted & unstable
  2. Clinical Application: “A counselor uses role-play to help a client rehearse coping strategies. This technique is most directly an application of:”
    • A) Gestalt therapy
    • B) Cognitive-behavioral therapy (correct)
    • C) Psychoanalytic free association
    • D) Adlerian birth-order theory
  3. Ethics and Legal Standards: “Breaking confidentiality is ethically permissible when:”
    • A) The client threatens self-harm
    • B) A court subpoenas records (correct)
    • C) The client discontinues therapy
    • D) Insurance requests billing details
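In a zero-shot setup, items like those above are rendered into a plain MCQ prompt. The exact template used by the authors is not specified; the sketch below shows one plausible rendering, with `build_prompt` as a hypothetical helper.

```python
def build_prompt(item: dict) -> str:
    """Render a PsychoBench item as a zero-shot MCQ prompt (illustrative template)."""
    lines = [item["question"], ""]
    for label in sorted(item["options"]):  # present options in label order: A, B, ...
        lines.append(f"{label}) {item['options'][label]}")
    lines.append("")
    lines.append("Answer with the letter of the single best option.")
    return "\n".join(lines)
```

Prompt templating choices (option ordering, answer-format instructions) can shift MCQ accuracy by several points, so they should be held fixed across all models being compared.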

6. Benchmarking, Application Domains, and Extensibility

PsychoBench serves four primary research applications:

  • LLM Benchmarking: Determining which models meet or exceed the 70% exam standard.
  • Fine-Tuning: Using items or held-out subsets to domain-adapt open-source models for psychological counseling.
  • Error Profiling: Analyzing performance by subfield to diagnose weaknesses (e.g., ethicolegal errors vs. diagnostic recall).
  • Curricular Integration: Supporting AI-in-medicine pedagogy by comparing human and machine performance on real licensing exam material.
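The error-profiling use case above reduces to grouping top-1 accuracy by the optional `subfield` field. A minimal sketch (the function name is illustrative):

```python
from collections import defaultdict

def accuracy_by_subfield(items: list[dict], predictions: list[str]) -> dict:
    """Group top-1 accuracy by each item's optional 'subfield' annotation."""
    correct, total = defaultdict(int), defaultdict(int)
    for item, pred in zip(items, predictions):
        sub = item.get("subfield", "Unlabeled")
        total[sub] += 1
        correct[sub] += (pred == item["answer"])
    return {sub: correct[sub] / total[sub] for sub in total}
```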

Frontier foundation models—such as GPT-4o, Llama3.3-70B, and Gemma3-27B—achieve well above the NCE threshold, while contemporary smaller-capacity open-source models (e.g., Qwen2.5-7B, Mistral-7B) fall short. These results indicate that only the most advanced LLMs are presently capable of meeting professional psychological knowledge standards without additional targeted training (Zeng, 2 Oct 2025).

7. Position within the LLM Evaluation Landscape

PsychoBench differs substantively from psychometric portrayal datasets (e.g., AIPsychoBench (Xie et al., 20 Sep 2025, Huang et al., 2023)) and psychiatric clinical evaluation benchmarks (e.g., PsychBench (Liu et al., 28 Feb 2025), Psychiatry-Bench (Fouda et al., 7 Sep 2025)). Unlike self-report or scale-mapping inventories, PsychoBench directly tests for knowledge and applied reasoning as mandated by U.S. licensing frameworks, making it uniquely suitable for certification-aligned assessment and development of psychology-enabled LLMs. Its methodological rigor, professional audit, and public distribution provide a foundational resource for both AI-driven counseling research and psycholegal model validation.


References:

  • "PsychoBench: Evaluating the Psychology Intelligence of LLMs" (Zeng, 2 Oct 2025)