- The paper introduces an on-device multi-LLM platform that enforces zero data egress for privacy-preserving psychiatric decision support.
- It implements a layered architecture with quantized models, agentic orchestration, and tailored clinical UIs, with diagnostic output cross-checked against DSM-5 criteria.
- Preliminary demo results show sub-10s latency and output reported as comparable to cloud-based systems, at a latency cost paid for privacy; formal accuracy validation is still pending.
Zero-Egress LLM Deployment for Privacy-Preserving Psychiatric Decision Support
Context and Motivation
The psychiatric diagnostic process is characterized by subjectivity, marked inter-rater variability, and an absence of objective biomarkers. In operationally sensitive domains (military, corrections, and remote medicine), the privacy risks of cloud-based clinical AI are a major barrier to adoption. The paper "Toward Zero-Egress Psychiatric AI: On-Device LLM Deployment for Privacy-Preserving Mental Health Decision Support" (2604.18302) addresses these constraints with a zero-egress, cross-platform mobile psychiatric AI system built on an ensemble of fine-tuned LLMs that run entirely on the device. Its privacy guarantees are architectural: no patient data is ever transmitted, which also sidesteps institutional “shadow IT” and the associated legal risk vectors.
System Architecture and Security Model
The system implements a layered on-device architecture (Figure 1):
Figure 1: Zero-egress architecture for psychiatric AI integrating a mobile LLM ensemble, a model-agnostic orchestration layer, and UI modes that enforce clinical safeguards and session isolation.
- Layer 1 (Model Layer): Three open-source LLMs (Gemma, Phi-3.5-mini, Qwen2), each fine-tuned with QLoRA on psychiatrist-patient conversation datasets aligned with DSM-5 criteria, are quantized to 4 bits and statically packaged for platform-native execution.
- Layer 2 (Orchestration): A device-resident agentic orchestration layer coordinates prompt specialization and concurrent or sequential model inference, and implements consensus-based, confidence-weighted diagnostic aggregation with DSM-5 cross-validation enforcement.
- Layer 3 (Clinical UI): Distinct UI interaction flows are presented for clinicians (full differential output, criterion mapping, confidence display) and patients (guided self-screen, escalation/safety routing), with session-level isolation and explicit on-device-only record management.
- Security and Privacy: The zero-egress guarantee rules out egress for AI inference, storage, analytics, and updates at the process and OS level. Persistent data is encrypted with device-bound keys. The app contains no telemetry or analytics subsystems, and no sockets are opened for inference at runtime.
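The "no sockets at runtime" property can be checked in a test harness. The following is a minimal Python sketch of that idea only: it is not the paper's enforcement mechanism, which operates at the process and OS level on the device. The names `zero_egress_guard`, `EgressBlocked`, and `run_private_inference` are hypothetical.

```python
import socket
from contextlib import contextmanager


class EgressBlocked(RuntimeError):
    """Raised when code inside the guard tries to create a network socket."""


@contextmanager
def zero_egress_guard():
    """Temporarily replace socket.socket so that any socket creation fails.

    Assumption: this is a Python-level test aid, not OS-level enforcement.
    """
    original_socket = socket.socket

    def blocked(*args, **kwargs):
        raise EgressBlocked("network sockets are disabled inside the guard")

    socket.socket = blocked
    try:
        yield
    finally:
        # Always restore the real socket class, even if the body raised.
        socket.socket = original_socket


def run_private_inference(prompt: str) -> str:
    """Hypothetical stand-in for a local model call; it never touches the network."""
    return f"local-result:{len(prompt)}"
```

Wrapping a call as `with zero_egress_guard(): run_private_inference(text)` makes any accidental network use inside the block fail loudly instead of silently transmitting data.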
End-to-End On-Device Model Adaptation and Deployment
Every model passes through an explicit pipeline in which fine-tuning, quantization, and export for deployment all occur off-device; only static weights are shipped in the mobile application bundle.
Figure 2: End-to-end offline fine-tuning/quantization/export/deployment protocol—clinical data never leaves the device during inference.
- Psychiatrist-patient conversational datasets (DSM-5 aligned, multi-condition) are formatted for instruction-style fine-tuning with the Unsloth framework using 4-bit QLoRA; the LoRA adapters are merged into the base weights before quantization.
- Models are exported for device-native runtimes: Gemma and Qwen2 as GGUF (llama.cpp-compatible) and Phi-3.5-mini as INT4 ONNX.
- The quantized ensemble executes fully on mobile CPUs/NPUs; combined disk and RAM requirements for the full stack are about 5 GB, which is reported as well within consumer device capabilities.
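The footprint figure can be sanity-checked with simple arithmetic. The sketch below estimates only the raw weight size of a quantized model; it ignores quantization scales, file metadata, and runtime memory such as the KV cache, so it is a lower bound rather than a reproduction of the ~5GB figure. The function names are hypothetical; the 3.8B parameter count is the publicly listed size of Phi-3.5-mini.

```python
def quantized_weight_bytes(n_params: int, bits_per_weight: float) -> float:
    """Lower-bound size of the weights alone, in bytes."""
    return n_params * bits_per_weight / 8


def fits_budget(model_sizes_gb: dict[str, float], budget_gb: float = 5.0) -> bool:
    """Check whether the summed model sizes stay within a storage budget (in GB)."""
    return sum(model_sizes_gb.values()) <= budget_gb


# Phi-3.5-mini has about 3.8 billion parameters.
PHI_3_5_MINI_PARAMS = 3_800_000_000

# 3.8e9 parameters * 4 bits / 8 bits-per-byte = 1.9e9 bytes, i.e. 1.9 GB.
phi_gb = quantized_weight_bytes(PHI_3_5_MINI_PARAMS, 4) / 1e9
```

The source does not state the exact Gemma and Qwen2 variants in the ensemble, so their sizes would have to be supplied by the reader before calling `fits_budget`.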
Ensemble Inference, Consensus, and Explainability
Clinical conversations are processed using a deterministic inference workflow:
Figure 3: Ensemble inference flow—prompt engineering, parallel model invocation, and consensus-based DSM-5 diagnostic reasoning are performed strictly on-device.
- Each model receives a conversation prompt engineered with embedded DSM-5 context.
- Outputs are required to follow a strict structured schema (diagnosis, DSM-5 code, confidence, supporting symptoms, and differential list).
- The consensus reasoner applies a confidence-weighted voting protocol with cross-validation against an app-embedded DSM-5 symptom checklist. Only diagnoses supported by two or more models and passing criterion validation are surfaced.
- Non-conforming model outputs are isolated and retried, subject to hard cutoffs; session outputs are labeled in the UI with their provenance ("Cloud AI" vs. "Private AI") for legal and clinical auditability.
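The schema check and consensus rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the field names, the scoring rule (summed confidence divided by the number of models), the pooling of symptoms across agreeing models, and the checklist format are all assumptions, and the codes and symptoms are abstract placeholders rather than clinical content.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

REQUIRED_KEYS = {"diagnosis", "dsm5_code", "confidence",
                 "supporting_symptoms", "differential"}


@dataclass(frozen=True)
class ModelOutput:
    model: str
    diagnosis: str
    dsm5_code: str
    confidence: float                      # expected in [0, 1]
    supporting_symptoms: tuple
    differential: tuple = ()


def parse_output(model: str, raw: dict) -> Optional[ModelOutput]:
    """Return a ModelOutput, or None when the raw dict violates the schema."""
    if not REQUIRED_KEYS.issubset(raw):
        return None
    conf = raw["confidence"]
    if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        return None
    return ModelOutput(model, str(raw["diagnosis"]), str(raw["dsm5_code"]),
                       float(conf), tuple(raw["supporting_symptoms"]),
                       tuple(raw["differential"]))


def passes_checklist(code: str, symptoms: set, checklist: dict) -> bool:
    """A code passes when enough of its checklist symptoms were reported."""
    entry = checklist.get(code)
    if entry is None:
        return False
    return len(entry["symptoms"] & symptoms) >= entry["min_count"]


def consensus(outputs: list, checklist: dict, min_models: int = 2) -> list:
    """Confidence-weighted vote; keep codes backed by >= min_models models
    that also pass the checklist. Returns (code, score) pairs, best first."""
    groups = defaultdict(list)
    for out in outputs:
        groups[out.dsm5_code].append(out)

    surfaced = []
    for code, members in groups.items():
        if len({m.model for m in members}) < min_models:
            continue  # not enough independent support
        # Assumption: symptoms are pooled across the agreeing models.
        pooled = set().union(*(m.supporting_symptoms for m in members))
        if not passes_checklist(code, pooled, checklist):
            continue
        score = sum(m.confidence for m in members) / len(outputs)
        surfaced.append((code, round(score, 3)))
    return sorted(surfaced, key=lambda pair: pair[1], reverse=True)
```

In this sketch, a caller would treat a `None` from `parse_output` as a non-conforming output to retry or discard. Normalizing by the total number of models means that a diagnosis supported by only two of three models scores lower than one supported unanimously at the same confidence.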
Clinical Functionality, Task Flows, and Modes
The platform is surfaced as a mobile clinical decision support application with explicit modal selection:
Figure 4: Settings panel: Selection between three inference modes—with "Private AI" enforcing on-device-only execution and explicit privacy guarantees.
- Model selection: Cloud AI (server-side), Private AI (on-device multi-model), BYOK (user-supplied API key). All psychiatric decision support and self-screening functions are enabled only in Private AI mode, so that they always run egress-free.
- UI Flows & Integration: The UI supports both document-based and conversational input (Figure 5), session isolation, and rapid contextual prompt selection (Figure 6). SOAP note workflows require four-fold clinical input and return an attribution for each AI-generated record (Figure 7), reinforcing transparency and audit trails.
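The mode gating and provenance labeling described in this list can be sketched as a small policy check. The feature identifiers and function names below are hypothetical; only the three mode names and the rule that psychiatric decision support and self-screening require Private AI come from the source.

```python
from enum import Enum


class InferenceMode(Enum):
    CLOUD_AI = "Cloud AI"
    PRIVATE_AI = "Private AI"
    BYOK = "BYOK"


# Hypothetical feature identifiers for the functions restricted to on-device use.
PRIVATE_ONLY_FEATURES = frozenset({"psychiatric_decision_support", "self_screening"})


def is_feature_enabled(feature: str, mode: InferenceMode) -> bool:
    """Private-only features are available only when the mode is Private AI."""
    if feature in PRIVATE_ONLY_FEATURES:
        return mode is InferenceMode.PRIVATE_AI
    return True


def label_output(text: str, mode: InferenceMode) -> dict:
    """Attach the provenance label shown in the UI to an AI-generated record."""
    return {"text": text, "provenance": mode.value}
```

Keeping the restricted features in one immutable set makes the policy auditable in a single place, rather than scattering mode checks across UI handlers.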
Figure 5: Home screen—integrated document and clinical function cards, supporting real-time diagnosis, coding, and research tasks.
Figure 6: Contextual conversational interface for real-time clinical queries.
Figure 7: SOAP note generation task—structured request and provenance attribution for every AI output, with persistent mode labeling.
Preliminary demo-based latency measurements for on-device inference (Gemma Fast variant, ~529MB) yield a Time-to-First-Visible-Response (TTFVR) of 7.5–8.0s versus 2.5–3.0s for the cloud reference. This is a latency cost paid for privacy, but TTFVR remains within the sub-10s tolerance cited for clinical workflows. The on-device model stack is reported to deliver diagnostic accuracy and structure comparable to prior server-based fine-tuned LLM+OpenAI-oss platforms, with comparable alignment with DSM-5 diagnostic boundaries and clinical coherence. That comparison is preliminary: formal accuracy metrics vs. expert ground truth are deferred to forthcoming validation rounds.
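Taken at their extremes, the reported ranges imply an on-device slowdown of roughly 2.5x (7.5s vs. 3.0s) to 3.2x (8.0s vs. 2.5s) relative to the cloud reference. One way such a measurement could be taken is sketched below, assuming the model exposes its output as a stream of text chunks and that "first visible response" means the first non-whitespace chunk. Both are assumptions, since the source does not define the measurement procedure, and `time_to_first_response` is a hypothetical name.

```python
import time
from typing import Callable, Iterable


def time_to_first_response(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> tuple:
    """Return (seconds elapsed, first visible chunk) for a streamed response.

    The clock is injectable so the function can be tested deterministically.
    """
    start = clock()
    for chunk in stream:
        if chunk.strip():          # skip empty or whitespace-only chunks
            return clock() - start, chunk
    raise ValueError("stream ended without producing visible output")
```

Because the timer starts before the stream is first consumed, prompt processing time is included when `stream` is a lazy generator, which is the dominant cost before the first token on most devices.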
Theoretical and Practical Implications
The explicit architectural decoupling of inference from all forms of data egress puts the “privacy-by-design” paradigm into practice, consistent with HIPAA, GDPR, and NIST SP 800-207 Zero Trust guidance (2604.18302). This hardens psychiatric AI deployment against both accidental (shadow IT, misconfiguration, third-party platform leaks) and intentional (telemetry, analytics, update channels) privacy breaches. Consensus-based multi-model ensemble reasoning is intended to produce robust, repeatable outputs and to mitigate the idiosyncratic bias present in single-model settings.
The inclusion of agentic orchestration and consensus logic aligns with recent work on robust multi-agent LLM governance and consensus-driven XAI for clinical decision support [towards-rai-xai, agentic-ai, LLM-ensemble, reconcile]. Enforcing deterministic input/output schemas, session isolation, and explicit attribution/label display establishes a technical foundation for prospective policy and regulatory compliance, including ISO/IEC 42001:2023 and upcoming EU AI Act (AIA) requirements.
Outlook and Future Extensions
Key future work tracks include: large-scale expansion of the fine-tuning dataset for broader diagnostic generality, longitudinal clinical performance validation under blinded expert review, systematic latency/memory profiling across device classes, integration of federated learning for privacy-consistent model updates [FedMentalCare2025, Rieke2020], and exploration of voice/multimodal input. These will be essential for operational deployment in military, correctional, and remote-medicine settings, where both clinical fidelity and data sovereignty are non-negotiable requirements.
Conclusion
This work delivers a deployable, on-device multi-LLM psychiatric decision support platform that enforces zero-egress as a system property, not a policy overlay. Diagnostic reasoning is architected to be explainable, robust, and transparent, aligning with clinician workflow and regulatory mandates for privacy preservation in high-stakes psychiatric care. The architectural choices exemplify the formalization of agentic AI in practical clinical contexts and position this approach for extension to additional sensitive domains where privacy and explainability are strictly mandated.