PsyProbe: Proactive Counseling Dialogue System
- PsyProbe is a proactive LLM-driven psychological dialogue system that integrates structured clinical profiling, cognitive error detection, and MI-based dialogue planning.
- It employs a four-stage cascade architecture, including state tracking, memory construction, strategy planning, and iterative response generation for targeted exploration.
- Empirical evaluations reveal substantial improvements in therapeutic quality, proactivity, and domain interpretability compared to standard reactive LLM baselines.
PsyProbe is an LLM-driven dialogue system specifically engineered for the exploration phase of counseling, characterized by systematic modeling of client psychological state, explicit detection of cognitive errors, and proactive, interpretable dialogue management. The system leverages a cascade architecture integrating fine-grained state tracking with Motivational Interviewing (MI) code planning to generate contextually appropriate, evidence-based counselor utterances. In empirical evaluation within Korean counseling scenarios, PsyProbe demonstrates substantial improvements in proactivity, domain interpretability, and therapeutic quality relative to standard LLM baselines (Park et al., 27 Jan 2026).
1. System Architecture and Pipeline
PsyProbe operates through a four-stage cascade of interpretable, LLM-driven modules: State Builder, Memory Construction, Strategy Planner, and Response Generator. Each component addresses a distinct functional requirement in simulating exploratory, evidence-oriented counseling dialogue:
- State Builder: Transforms each user turn into a structured psychological state profile, leveraging explicit slot filling based on clinical formulation.
- Memory Construction: Maintains a dual-layer memory of conversation history and a dynamic overall summary, supporting information-gap identification across core psychological dimensions.
- Strategy Planner: Selects MI behavioral codes (Simple/Complex Reflection, Open/Closed Question, Affirm, Give Information, Advise, General) and plans speech acts, therapeutic goals, and style hints.
- Response Generator: Executes a three-step process—Question Ideation (targeting highest-priority information gaps), Draft Generation (assembling reflections and questions), and Critic/Revision (iterative quality control)—yielding a final, context-sensitive utterance.
This multi-stage architecture enables a shift from reactive response generation to proactive clinical interviewing, supported by interpretable state modeling at each stage (Park et al., 27 Jan 2026).
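The staged control flow described above can be sketched minimally as a cascade of calls; all class, function, and field names here are illustrative stand-ins, with trivial stubs in place of the LLM-driven modules:

```python
from dataclasses import dataclass, field

@dataclass
class DialogueState:
    profile: dict = field(default_factory=dict)   # PPPPPI-style slots
    memory: list = field(default_factory=list)    # turn-history buffer
    plan: list = field(default_factory=list)      # planned MI codes

# Trivial stubs standing in for the four LLM-driven modules.
def build_state(utt, state):
    return {"Presenting": [utt]}                  # State Builder: slot-fill sketch

def update_memory(utt, state):
    return state.memory + [utt]                   # Memory Construction: append turn

def plan_strategy(state):
    return ["Complex Reflection", "Open Question"]  # Strategy Planner: MI codes

def generate_response(state):
    reflection = f"It sounds like: {state.profile['Presenting'][0]}."
    question = "Can you tell me more about when this started?"
    return f"{reflection} {question}"             # Response Generator

def run_turn(utt, state):
    """One pass through the four-stage cascade for a single user turn."""
    state.profile = build_state(utt, state)
    state.memory = update_memory(utt, state)
    state.plan = plan_strategy(state)
    return generate_response(state)
```

The point of the sketch is the data flow: each stage reads and updates a shared, inspectable state object rather than passing opaque prompt text forward.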
2. The PPPPPI Psychological Formulation Framework
Central to PsyProbe is its adoption of the PPPPPI clinical formulation framework, an extension of standard psychological models, which allocates six structured slots for each user utterance:
| Slot | Definition | Shorthand |
|---|---|---|
| Presenting | Current expressed problems or complaints | P₁ |
| Predisposing | Longstanding vulnerabilities or background factors | P₂ |
| Precipitating | Recent triggering events or temporal cues | P₃ |
| Perpetuating | Factors (e.g., cognitive distortions) that maintain the problem | P₄ |
| Protective | Client strengths or support systems | P₅ |
| Impact | Functional impairment across life domains | P₆ |
Given an utterance $u_t$, the slot-filling function for each slot $P_i$ returns the set of evidence spans, $f_{P_i}(u_t) = \{\, s \mid s \text{ is a span of } u_t \text{ evidencing } P_i \,\}$.
If no spans are present, $f_{P_i}(u_t) = \varnothing$. Model outputs are structured as JSON, ensuring machine-readable, auditable psychological profiles that drive downstream reasoning (Park et al., 27 Jan 2026).
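A minimal sketch of the slot-filling interface, assuming hypothetical keyword cues in place of the actual LLM call; only the JSON output shape mirrors the description above, and an empty list marks a slot with no evidence:

```python
import json

SLOTS = ["Presenting", "Predisposing", "Precipitating",
         "Perpetuating", "Protective", "Impact"]

# Hypothetical lexical cues standing in for the LLM slot-filler.
CUES = {
    "Presenting": ["anxious", "worried", "depressed"],
    "Precipitating": ["since", "after", "recently"],
}

def fill_slots(utterance: str) -> str:
    """Return a machine-readable PPPPPI profile as JSON;
    an empty list for a slot means no evidence spans were found."""
    low = utterance.lower()
    profile = {s: [w for w in CUES.get(s, []) if w in low] for s in SLOTS}
    return json.dumps(profile)
```

Keeping the output as structured JSON, as the paper describes, is what makes each turn's profile auditable by the downstream modules.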
3. Cognitive Error Detection
In advance of PPPPPI slot alignment, PsyProbe deploys explicit detection and extraction of cognitive errors across four categories motivated by Beck (1976) and Lefebvre (1981): catastrophizing, overgeneralization, personalization, and selective abstraction. The detection process is operationalized as a multi-label classification function $g(u_t) = (p_1, p_2, p_3, p_4) \in [0,1]^4$.
Each $p_k$ reflects the LLM-estimated probability of the corresponding cognitive error. For any $p_k$ exceeding a threshold (typically $0.5$), the system employs chain-of-thought prompting to extract explicit evidence spans. The resulting cognitive-error flags inform state-modeling fidelity and determine where deeper PPPPPI or Theory of Mind (ToM) reasoning is appropriate (Park et al., 27 Jan 2026). This suggests the system introduces a systematic mechanism for diagnostic error tracking, enhancing interpretability in LLM-guided counseling.
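The thresholding step admits a direct illustration; here the per-error probabilities are supplied as a plain dict rather than LLM-estimated, and the label strings are illustrative:

```python
ERRORS = ["catastrophizing", "overgeneralization",
          "personalization", "selective_abstraction"]

def flag_errors(probs: dict, tau: float = 0.5) -> list:
    """Multi-label thresholding: return every cognitive-error label
    whose estimated probability p_k exceeds the threshold tau."""
    return [e for e in ERRORS if probs.get(e, 0.0) > tau]
```

In the full system, each flagged label would then trigger chain-of-thought extraction of the evidence span that justified it.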
4. Memory Construction and Information Gap Quantification
Memory Construction maintains the internal conversation state, $M_t$, as a combination of a turn-history buffer (keywords, event–context, and emotion–trigger pairs) and a succinct overall summary (core narrative, dominant emotion, recurring themes, PPPPPI slot values). Critical to proactive exploration, the system quantifies an "information gap score" per slot $P_i$, formally $G_i = \sum_k w_k\, b_{i,k}$,
where the $b_{i,k} \in \{0, 1\}$ are binary signals indicating missing content, weak provenance, outdatedness, and so on, combined with weights $w_k$. The resulting ranking over $G_i$ directly informs which dimensions are targeted for further elicitation in the dialogue (Park et al., 27 Jan 2026). A plausible implication is the system's ability to dynamically prioritize inquiry for underexplored psychological domains, thereby facilitating more comprehensive assessments.
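The gap-score aggregation can be sketched directly; the slot names, signal tuples, and weight values below are hypothetical, since the paper's actual weights are not reproduced here:

```python
def gap_score(signals, weights):
    """G_i = sum_k w_k * b_{i,k} over binary deficiency signals b_{i,k}
    (e.g. missing content, weak provenance, outdatedness)."""
    return sum(w * b for w, b in zip(weights, signals))

def rank_slots(slot_signals, weights):
    """Rank PPPPPI slots by descending information-gap score."""
    scores = {slot: gap_score(b, weights) for slot, b in slot_signals.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

The highest-ranked slots are then the ones Question Ideation targets for the next elicitation.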
5. Motivational Interviewing Strategy Planning
The Strategy Planner module bridges evidence-tracked psychological gaps and conversational acts by selecting two MI speech-act codes in sequence. Selection is made via few-shot prompting over the Korean KMI corpus, with a two-round process:
- Round 1: the highest-probability code is selected, $c_1 = \arg\max_c P(c \mid \text{context})$.
- Round 2: labels in the same category as $c_1$ are masked, and a complementary code is selected, $c_2 = \arg\max_{c:\, \mathrm{cat}(c) \neq \mathrm{cat}(c_1)} P(c \mid \text{context})$.
Each label (Simple/Complex Reflection, Open/Closed Question, Affirm, Give Information, Advise, General) is output alongside a rationale in the interaction language. A plan is then structured for each act, comprising therapeutic goals, key content points, and style cues, operationalizing the MI code into concrete action constraints for LLM-controlled language generation (Park et al., 27 Jan 2026).
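The two-round, category-masked selection can be sketched as follows, assuming code scores already produced by the few-shot prompt; the category grouping is an assumption for illustration:

```python
# Assumed grouping of MI codes into coarse categories for masking.
MI_CATEGORY = {
    "Simple Reflection": "reflection", "Complex Reflection": "reflection",
    "Open Question": "question", "Closed Question": "question",
    "Affirm": "other", "Give Information": "other",
    "Advise": "other", "General": "other",
}

def select_codes(scores: dict) -> tuple:
    """Round 1: pick the top-scoring MI code. Round 2: mask all codes
    sharing its category, then pick the best remaining code."""
    c1 = max(scores, key=scores.get)
    remaining = {c: s for c, s in scores.items()
                 if MI_CATEGORY[c] != MI_CATEGORY[c1]}
    c2 = max(remaining, key=remaining.get)
    return c1, c2
```

Masking the first code's category is what forces the planned turn to pair, e.g., a reflection with a question rather than two near-duplicate acts.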
6. Proactive Response Generation: Question Ideation and Iterative Critique
The Response Generator employs a three-step algorithmic process:
- Question Ideation: For the PPPPPI slots with maximal gap scores $G_i$, candidate questions are proposed, each tagged with its slot $P_i$, an intent label, and a confidence score.
- Draft Generation: Synthesizes a draft utterance by combining reflection acts (as guided by the MI plan) and the top-scoring candidate question, ensuring length (≤4 sentences) and empathy constraints.
- Critic & Revision: Applies a binary accept/revise decision and makes structural edits if warranted. The revision stage selects a contextually appropriate replacement question from the candidate pool, finalizing the response before delivery.
This mechanism systematically reduces redundancy and maintains alignment with both psycho-diagnostic and conversational objectives (Park et al., 27 Jan 2026).
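A compact sketch of the draft-and-critique loop, with a simple duplicate-question check standing in for the LLM critic; the function name and candidate format are hypothetical:

```python
def draft_and_critique(reflection: str, candidates: list, history: list) -> str:
    """candidates: list of (question, slot, confidence) tuples.
    Draft = reflection + top-scoring question; the critic rejects a draft
    whose question was already asked and falls back to the next candidate."""
    ranked = sorted(candidates, key=lambda c: c[2], reverse=True)
    for question, slot, conf in ranked:
        if question not in history:       # critic: accept vs. revise
            return f"{reflection} {question}"
    return reflection                     # no fresh question survives critique
```

The real critic is itself LLM-driven and edits structure as well, but the control flow is the same: propose, check, and swap in a replacement question from the candidate pool before delivery.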
7. Empirical Evaluation and Comparative Performance
PsyProbe underwent a multi-pronged evaluation procedure:
- Automatic Evaluation: Lexical overlap (ROUGE-1/2/L, BLEU-1/2/3/4) and embedding similarity (BERTScore-F1) were computed against human-counselor references across 27 sessions. Ablation of system components (State Builder, Strategy Planner, Question Ideation/Critic) provided sensitivity analysis.
- Full system achieved ROUGE-1 = 0.2277 (vs. baseline 0.2127, +7.0%) and BERT-F1 = 0.5540 (vs. 0.5428).
- Removal of State Builder or Question Ideation/Critic components led to the most substantial performance degradation (e.g., ROUGE-1 drops to 0.2181 or 0.2223).
- User Evaluation: 27 participants rated the system on Coherence, Fluency, Naturalness, Empathy, Safety (1–5 scales) and Engagement Intention (0–2).
- Marked improvements in Naturalness (3.72 vs. 2.87) and Engagement Intention (1.19 vs. 0.33), with system ratings approaching human performance benchmarks.
- Expert Evaluation: A certified counselor rated Core Issue Understanding (CIU, 1–5), Probing Question Quality (PQQ, 0–2), and Question Rate (QR) for GPT baseline, PsyProbe, and human reference.
- PsyProbe's question rate (0.815) matched human levels (0.830), substantially exceeding GPT (0.263).
- CIU (3.37 vs. 1.15) and PQQ (1.44 vs. 0.95) significantly improved (p < 0.01, Wilcoxon signed-rank).
The empirical evidence validates that explicit user-state modeling, information gap-driven planning, and iterative question generation substantially enhance conversational proactivity and interpretability in therapeutic dialogue settings compared to reactive LLM systems (Park et al., 27 Jan 2026).
8. Illustrative Dialogue Comparison
A distilled dialogue excerpt demonstrates the qualitative distinction:
- User: "Everyone else seems settled in jobs, and I feel anxious I’m behind."
- GPT Baseline: "I understand that can be stressful. Have you tried talking to a friend or counselor?"
- PsyProbe: "It sounds like comparing yourself with others intensifies your anxiety. Was there a particular event or background that led you to start feeling this way?"
This comparison exemplifies PsyProbe's focus on targeted exploration of under-specified psychological dimensions, specifically background and triggers, as opposed to the more generic and reactive baseline queries (Park et al., 27 Jan 2026). The system's evidence-based utterance construction underscores its interpretability and alignment with clinical interviewing best practices.
In summary, PsyProbe integrates structured psychological profiling, cognitive error tracking, MI-based dialogue planning, and iterative, evidence-driven generation to advance the state of proactive and interpretable counseling dialogue systems. Its multilayered, modular architecture and empirically validated impact mark a significant methodological advance over prior reactive LLM-based approaches (Park et al., 27 Jan 2026).