MindCorpus: Synthetic Counseling Data

Updated 12 January 2026
  • MindCorpus is a synthetic multi-turn counseling corpus that simulates realistic therapeutic sessions between seekers and supporters.
  • It employs a dual-loop, multi-agent system with specialized roles to refine counseling responses through turn-level critique and session-level strategy updates.
  • Designed with strict differential privacy safeguards and reproducible protocols, it benchmarks privacy-preserving techniques in mental healthcare LLMs.

MindCorpus is a synthetic multi-turn counseling corpus designed to support the development and evaluation of privacy-preserving LLMs for mental health support. It comprises 5,700 sessions, each a simulated psychological consultation between a “seeker” (client) and a “supporter” (counselor). Sessions are produced by a dual closed-loop, multi-agent role-playing framework that integrates psychological expertise at both the turn and session levels, while the surrounding pipeline is built to maintain differential privacy guarantees (Xue et al., 5 Jan 2026).

1. Dual-Loop Multi-Agent Role-Playing Construction

The creation of MindCorpus centers on a cooperative dialogue-generation system involving six specialized agents, each tasked with a distinct functional role:

  • Extractor: Processes raw “seed” scenario descriptions (sourced from public counseling forums) and yields structured triplets of the form (Character, Plight, Demand), providing a precise “who, what, how” initialization for the Seeker agent.
  • Seeker: Emulates a client in psychological distress, proceeding through a six-stage self-disclosure regimen: self-introduction, situation description, emotional expression, reminiscence, requesting assistance, and asking for advice. This protocol is designed to reflect authentic help-seeking progression.
  • Supporter: Acts as the counselor, implementing therapeutic strategies parameterized by a session-level strategy vector $\theta^s$, which is refined iteratively across sessions.
  • Evaluator: At each dialogue turn $t$, assesses the Supporter’s response $u^s_t$ on nine professional dimensions (confidentiality, objectivity, sympathy, specialization, feasibility, listening, collaboration, tenderness, and respect), returning either a “pass” or a feedback vector $f_t \in \mathbb{R}^9$ identifying required improvements.
  • Corrector: When flagged by the Evaluator, revises $u^s_t$ using $f_t$ to generate an improved utterance $u^{\prime s}_t$.
  • Manager: Collects per-turn feedback $\{f_1, \dots, f_T\}$ over a session and computes a session-level update $\Delta \theta^s = M(\{f_1, \dots, f_T\})$ to refine counseling strategies for subsequent sessions.
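
The structured objects these agents exchange can be written down as plain data types. The following is a minimal sketch, not the authors' implementation: the names `SeedTriplet`, `SeekerStage`, `EVAL_DIMENSIONS`, and `next_stage` are hypothetical, and the strictly linear stage progression is an assumption made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class SeekerStage(Enum):
    """The six self-disclosure stages listed for the Seeker agent."""
    SELF_INTRODUCTION = 1
    SITUATION_DESCRIPTION = 2
    EMOTIONAL_EXPRESSION = 3
    REMINISCENCE = 4
    REQUESTING_ASSISTANCE = 5
    ASKING_FOR_ADVICE = 6


# The nine dimensions scored by the Evaluator at every turn.
EVAL_DIMENSIONS = (
    "confidentiality", "objectivity", "sympathy", "specialization",
    "feasibility", "listening", "collaboration", "tenderness", "respect",
)


@dataclass(frozen=True)
class SeedTriplet:
    """(Character, Plight, Demand) triplet produced by the Extractor."""
    character: str
    plight: str
    demand: str


def next_stage(stage: SeekerStage) -> SeekerStage:
    """Advance one stage; assumption: stages are visited in order
    and the final stage is absorbing."""
    if stage is SeekerStage.ASKING_FOR_ADVICE:
        return stage
    return SeekerStage(stage.value + 1)
```

A feedback vector $f_t$ would then hold one entry per element of `EVAL_DIMENSIONS`, in that order.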

These agents are organized into two nested feedback loops: (i) a turn-level critique-and-revision cycle that ensures per-utterance coherence and clinical appropriateness, and (ii) a session-level strategy refinement step that adapts counselor behavior by aggregating feedback into updates for the session-level controller. The formal session-level update follows:

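$$\theta^s_{\text{new}} = \theta^s_{\text{old}} + \eta \cdot g(\{f_t\}_{t=1}^T)$$

where $g$ aggregates the per-turn feedback into a global session refinement signal and $\eta$ scales the update, so that the Manager's update corresponds to $\Delta \theta^s = \eta \cdot g(\{f_t\}_{t=1}^T)$.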
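
The two nested loops can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the published pipeline: `run_turn`, `session_update`, and the revision budget `max_revisions` are hypothetical names, $g$ is assumed to be an element-wise mean, and $\theta^s$ is assumed to have the same dimension as the feedback vectors. The `evaluate` and `correct` callables stand in for the LLM-backed Evaluator and Corrector.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Feedback = List[float]  # one entry per evaluation dimension


def run_turn(
    draft: str,
    evaluate: Callable[[str], Optional[Feedback]],
    correct: Callable[[str, Feedback], str],
    max_revisions: int = 2,  # hypothetical budget, not from the source
) -> Tuple[str, List[Feedback]]:
    """Turn-level loop: revise the draft until the evaluator returns
    None (a "pass") or the revision budget is spent."""
    utterance = draft
    feedback_log: List[Feedback] = []
    for _ in range(max_revisions):
        feedback = evaluate(utterance)
        if feedback is None:
            break
        feedback_log.append(feedback)
        utterance = correct(utterance, feedback)
    return utterance, feedback_log


def session_update(
    theta: Sequence[float],
    feedbacks: Sequence[Feedback],
    eta: float,
) -> List[float]:
    """Session-level loop: theta_new = theta_old + eta * g(feedbacks),
    with g assumed to be the element-wise mean."""
    if not feedbacks:
        return list(theta)  # nothing was flagged, keep the strategy
    n = len(feedbacks)
    g = [sum(column) / n for column in zip(*feedbacks)]
    return [t + eta * gi for t, gi in zip(theta, g)]
```

In use, the feedback logs returned by `run_turn` for every turn of a session would be concatenated and passed to `session_update` once the session ends.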

2. Corpus Statistics and Thematic Coverage

MindCorpus’s scale and diversity are engineered to surpass prior synthetic and real-world counseling datasets. Key statistics are summarized below:

| Metric | MindCorpus | Additional information |
|---|---|---|
| Number of sessions | 5,700 | |
| Avg. turns per session | 12.0 | $\sigma \approx 2.1$ |
| Avg. words per utterance | 84 | $\sigma \approx 15$ |
| Number of client themes | 10 | Emotional management, growth, anxiety relief, maintenance, workplace, etc. |
| Data split (train/val/test) | 80% / 10% / 10% | Random session sampling |

The corpus achieves balanced representation across ten thematic domains; Figure 1 of Xue et al. (5 Jan 2026) visualizes its depth and breadth over common mental health concerns. Sampling is stratified to support robust benchmarking and analysis.
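
As a worked check of the split sizes, an 80/10/10 partition of 5,700 sessions yields 4,560, 570, and 570 sessions. The sketch below shows one way to produce such a split reproducibly; `split_sessions`, the integer session IDs, and the seed are assumptions for illustration, and the source does not specify the actual procedure. A theme-stratified variant would apply the same function to each theme's sessions separately.

```python
import random
from typing import Iterable, List, Tuple


def split_sessions(
    session_ids: Iterable[int],
    seed: int = 0,
    train_pct: int = 80,
    val_pct: int = 10,
) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle session IDs with a fixed seed and cut them into
    train/val/test parts; the test part receives the remainder."""
    ids = list(session_ids)
    random.Random(seed).shuffle(ids)
    # Integer arithmetic avoids floating-point rounding in the cut points.
    n_train = len(ids) * train_pct // 100
    n_val = len(ids) * val_pct // 100
    return (
        ids[:n_train],
        ids[n_train:n_train + n_val],
        ids[n_train + n_val:],
    )


train, val, test = split_sessions(range(5700))
print(len(train), len(val), len(test))  # 4560 570 570
```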

3. Quality Control Protocols

Quality assurance in MindCorpus is applied at multiple points in the pipeline:

  • Automated, In-Loop Validation: The Evaluator’s nine-dimension scoring prevents incoherent, unprofessional, or ethically questionable Supporter responses. An explicit maximum single-turn length constraint of 100 words enforces realistic conversational pacing.
  • Post-hoc Evaluation: Fifty randomly selected sessions are rated by GPT-4o using five metrics—Professionalism (Pro.), Helpfulness (Hel.), Guidance (Gui.), Emotion regulation (Emo.), and Trust (Tru.)—from both counselor and client viewpoints. Parallel assessments are performed by four master’s-level psychology students. Spearman’s ρ between GPT-4o and human ratings generally exceeds 0.65, confirming substantial correspondence between automated and expert judgment.
  • Privacy-Preserving Annotations: Extracted triplets contain no residual real-world personal identifiers. During federated fine-tuning, client updates are clipped and noise-perturbed to satisfy $(\epsilon, \delta)$-differential privacy; under membership inference attacks (MIA), ROC AUC and PR AUC values are near 0.7 (for $\epsilon = 1$, $\delta = 10^{-5}$), which the source interprets as evidence of effective privacy defense.
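
The agreement statistic used in the post-hoc evaluation, Spearman's ρ, is the Pearson correlation of the two rank vectors. The sketch below computes it in pure Python with average ranks for ties; the function names are hypothetical and the source does not say which implementation was used.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of the rank-transformed inputs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5
```

Here `x` would hold the GPT-4o ratings and `y` the human ratings for the same sessions on one metric.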

4. Exemplary Dialogue and Feedback Dynamics

MindCorpus explicitly annotates the operation of its feedback loops in multi-turn sessions. For example, in a session themed on anxiety relief:

  • The Seeker introduces concerns (“constant worry about minor tasks… hard to sleep”).
  • Supporter-v1’s initial empathetic but vague response triggers the Evaluator (“Feasibility: Low; Listening: Medium”), prompting the Corrector to generate a more targeted follow-up.
  • Further interactions follow, with each Supporter utterance scrutinized for adherence to clinical and communicative standards. The Manager subsequently aggregates feedback (“vague task framing”) and incorporates strategic improvements (“use guided progressive relaxation”) into the counselor’s policy vector for future sessions.

Such exemplars illustrate the iterative enhancement of counselor behavior, driven by simulated but tightly regulated professional feedback.

5. Privacy-Preserving Corpus Generation

MindCorpus implements federated fine-tuning for downstream LLMs, using parameter-efficient LoRA adapters to minimize communication and computational overhead: $\Delta W_i^t = B_i^t A_i^t$, with rank $r \ll \min(d, k)$. Differential privacy is realized via the Gaussian mechanism, with client-level update perturbation:

$$\Delta\tilde\theta_i^t = \text{Clip}_C(\Delta\theta_i^t) + \mathcal{N}(0, \sigma^2 I)$$

where the noise level is $\sigma = \tfrac{S\sqrt{2\ln(1.25/\delta)}}{\epsilon}$ and $S$ denotes the sensitivity of the clipped update. Formal privacy guarantees are derived from the composition theorem for Gaussian DP, ensuring aggregate privacy expenditure remains within the prescribed $(\epsilon, \delta)$ budget over $T = 100$ communication rounds.
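
The clipping and noising step can be illustrated with a short sketch. It assumes the sensitivity $S$ equals the clipping norm $C$, which the source does not state, and it covers a single application of the mechanism; accounting for composition over the 100 rounds is not shown. The function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def gaussian_sigma(sensitivity: float, epsilon: float, delta: float) -> float:
    """sigma = S * sqrt(2 * ln(1.25 / delta)) / epsilon."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def clip_l2(update: Sequence[float], clip_norm: float) -> List[float]:
    """Scale the update down so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(v * v for v in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in update]


def privatize(
    update: Sequence[float],
    clip_norm: float,
    epsilon: float,
    delta: float,
    rng: random.Random,
) -> List[float]:
    """Clip, then add N(0, sigma^2) noise to each coordinate.
    Assumption: sensitivity S is taken to be clip_norm C."""
    sigma = gaussian_sigma(clip_norm, epsilon, delta)
    return [v + rng.gauss(0.0, sigma) for v in clip_l2(update, clip_norm)]


# With S = 1, epsilon = 1, delta = 1e-5 the formula gives sigma of about 4.845.
print(round(gaussian_sigma(1.0, 1.0, 1e-5), 3))  # 4.845
```

At these settings the noise standard deviation is several times the clipping norm, which shows how strongly each client update is perturbed.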

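Final MIA evaluations (LOSS, Min-k, Zlib) report ROC AUC converging to approximately 0.72 at $\epsilon = 1$. Since random guessing corresponds to an AUC of 0.5, a value of 0.72 leaves the attacker some advantage over chance; the source nonetheless presents the result as evidence of robust protection against privacy leakage.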
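
To make the AUC scale concrete, the sketch below scores a loss-based membership inference attack, in which lower loss is taken as evidence of membership. It is a generic illustration of the metric, not the evaluation code from the source; the function names and the toy loss values are hypothetical.

```python
from typing import Sequence


def roc_auc(member_scores: Sequence[float],
            nonmember_scores: Sequence[float]) -> float:
    """Probability that a random member outscores a random non-member,
    counting ties as one half (the rank-statistic form of ROC AUC)."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


def loss_attack_auc(member_losses: Sequence[float],
                    nonmember_losses: Sequence[float]) -> float:
    """LOSS-style attack: the membership score is the negated loss."""
    return roc_auc([-l for l in member_losses],
                   [-l for l in nonmember_losses])


# Perfectly separable losses give AUC 1.0; identical losses give 0.5.
print(loss_attack_auc([0.1, 0.2], [0.9, 1.0]))  # 1.0
print(loss_attack_auc([0.5, 0.5], [0.5, 0.5]))  # 0.5
```

An attack that cannot distinguish members from non-members therefore sits at 0.5, which is the reference point for reading the reported values.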

6. Distinctive Properties and Research Significance

MindCorpus’s principal distinguishing features include:

  • Synthetic yet High-Fidelity and Scalable: Generation of 5,700 realistic counseling sessions without dependence on limited or sensitive real-world dialogues distinguishes MindCorpus as a scalable resource.
  • Integrated Expert Oversight: Dual feedback loops—turn-level and session-level—anchor each synthetic session in established counseling principles, enabling per-utterance and strategic corrections.
  • Rigorous Privacy Safeguards: The end-to-end design underscores privacy by combining federated LoRA fine-tuning and tight $(\epsilon, \delta)$-differential privacy constraints, features salient for the ethical development of mental health–oriented LLMs.
  • Reproducibility: Accompanying codebooks, detailed agent prompts, and openly defined data splits foster transparent and reproducible benchmarking for therapeutic dialogue generation under privacy constraints.

These properties enable MindCorpus to serve as a unique, methodologically rigorous, and ethically vetted dataset for research on confidential conversational agents and privacy-aware modeling in digital mental health domains (Xue et al., 5 Jan 2026).
