ChatCLIDS: Benchmark for Persuasive Dialogue in Diabetes
- ChatCLIDS is a benchmark simulation framework that assesses LLM-powered persuasive dialogues to promote closed-loop insulin delivery adoption in type 1 diabetes.
- It leverages expert-validated virtual patient profiles and dual dialogue paradigms to systematically evaluate personalized, adaptive, and longitudinal persuasive strategies.
- Empirical findings reveal that while larger LLMs and chain-of-strategy protocols increase persuasion, all models struggle under adversarial social influences.
ChatCLIDS is a benchmark and simulation framework developed to evaluate the effectiveness of LLM-driven persuasive dialogues in promoting the adoption of closed-loop insulin delivery systems (CLIDS) for type 1 diabetes care. Distinct from technical or algorithmic evaluations, ChatCLIDS targets behavior change dialogues, modeling the interplay of psychosocial, behavioral, and social resistance barriers through expert-validated virtual patients and multi-turn, agent-driven simulations of nurse–patient interaction. The framework provides the first high-fidelity, scalable testbed for methodically assessing AI-driven persuasive strategies and their limitations in health intervention contexts (Yao et al., 31 Aug 2025).
1. Purpose and Focus
Real-world adoption of closed-loop insulin delivery systems in type 1 diabetes is hampered by non-technical factors: patient skepticism, anxiety about device management, lifestyle compatibility concerns, and adversarial social influence. The objective of ChatCLIDS is to operationalize these barriers in a simulation environment that enables critical study of how persuasive AI, modeled as LLM-based nurse agents, can increase CLIDS adoption.
The benchmark is distinguished by its focus on both individualized resistance (as encoded in virtual patient profiles) and dynamic, longitudinal counseling scenarios. The intent is to provide rigorous, systematic evaluation that is grounded in clinical and behavioral theory, with support for adversarial and social resistance scenarios that mirror real-life complexities of uptake in diabetes self-management.
2. Framework and Simulation Methodology
The ChatCLIDS simulation consists of two main agent types: patient agents and nurse (AI) agents.
- Patient Agents are instantiated from a library of expert-validated virtual patient profiles. Each profile is constructed using real-world, de-identified narratives and augmented through expert curation and feature engineering. Variables encoded for each agent include demographics, clinical attributes, psychosocial factors, adoption barriers, and a grading of resistance calibrated as "Easy," "Medium," or "Hard."
- Nurse Agents function in two dialogue paradigms:
  - Direct Prompting (DR): The agent generates responses using a catalog of 31 evidence-based persuasive strategies (e.g., Evidence-based Persuasion, Logical Appeal, Social Proof, Foot-in-the-door).
  - Chain-of-Strategy (CoS): The agent must first identify and justify, via natural language, its intended persuasive strategies before composing the response, enforcing explicit strategy reflection and increasing transparency in reasoning.
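The variables listed for patient agents can be pictured as a small record type. The following is a minimal sketch, not the benchmark's actual schema: the class name `PatientProfile`, its field names, the `to_system_prompt` helper, and the example values are all assumptions made for illustration. Only the variable categories and the three resistance grades come from the description above.

```python
from dataclasses import dataclass
from enum import Enum


class Resistance(Enum):
    """Resistance grades named in the text."""
    EASY = "Easy"
    MEDIUM = "Medium"
    HARD = "Hard"


@dataclass
class PatientProfile:
    """Hypothetical container for one virtual patient (field names are assumptions)."""
    demographics: dict[str, str]
    clinical_attributes: dict[str, str]
    psychosocial_factors: list[str]
    adoption_barriers: list[str]
    resistance: Resistance

    def to_system_prompt(self) -> str:
        """Render the profile as a role-play instruction for a patient agent."""
        def fmt(d: dict[str, str]) -> str:
            return ", ".join(f"{k}: {v}" for k, v in d.items())

        return "\n".join([
            "You are role-playing a person with type 1 diabetes.",
            f"Demographics: {fmt(self.demographics)}",
            f"Clinical attributes: {fmt(self.clinical_attributes)}",
            f"Psychosocial factors: {'; '.join(self.psychosocial_factors)}",
            f"Barriers to CLIDS adoption: {'; '.join(self.adoption_barriers)}",
            f"Resistance level: {self.resistance.value}",
        ])


# Invented example values, not taken from the benchmark's profile library.
example = PatientProfile(
    demographics={"age": "34", "occupation": "teacher"},
    clinical_attributes={"current therapy": "multiple daily injections"},
    psychosocial_factors=["anxiety about device management"],
    adoption_barriers=["skepticism about automation", "lifestyle compatibility concerns"],
    resistance=Resistance.HARD,
)
print(example.to_system_prompt())
```

Keeping the resistance grade as an enum rather than a free-text field makes it straightforward to stratify results by difficulty later.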
Three experimental configurations are enabled:
- Single-Visit: Multi-turn exchanges analogous to a single clinic visit.
- Multi-Visit: Longitudinal counseling with between-visit memory and opportunity for strategy self-critique and adaptation.
- Social Resistance: Incorporates adversarial social agents (simulating peer pressure and misinformation), enabling evaluation under realistic social context and exposure to misinformation.
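The three configurations differ mainly in how many visits are simulated, whether anything is carried between visits, and whether an adversarial social agent speaks. The sketch below illustrates that control flow under simplifying assumptions: agents are plain callables from a transcript to an utterance, between-visit memory is reduced to a one-line recap of the patient's last utterance, and the self-critique step is omitted. The names `SessionConfig`, `run_sessions`, and the speaker labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# An agent maps the transcript so far to its next utterance (assumed interface).
Agent = Callable[[list[str]], str]


@dataclass
class SessionConfig:
    """Hypothetical settings covering the three configurations."""
    turns_per_visit: int = 3
    num_visits: int = 1                   # 1 = Single-Visit, >1 = Multi-Visit
    social_agent: Optional[Agent] = None  # set for Social Resistance


def run_sessions(nurse: Agent, patient: Agent, config: SessionConfig) -> list[list[str]]:
    """Run every visit and return one transcript per visit."""
    memory: list[str] = []            # recaps carried between visits
    transcripts: list[list[str]] = []
    for visit in range(config.num_visits):
        transcript = list(memory)     # each visit starts from earlier recaps
        for _ in range(config.turns_per_visit):
            transcript.append("Nurse: " + nurse(transcript))
            if config.social_agent is not None:
                # The adversarial peer speaks before the patient answers.
                transcript.append("Peer: " + config.social_agent(transcript))
            transcript.append("Patient: " + patient(transcript))
        transcripts.append(transcript)
        memory.append(f"[Visit {visit + 1} recap] {transcript[-1]}")
    return transcripts


# Stub agents standing in for LLM calls.
def stub_nurse(transcript: list[str]) -> str:
    return "Have you considered a closed-loop system?"


def stub_patient(transcript: list[str]) -> str:
    return "I am not sure it fits my routine."


def stub_peer(transcript: list[str]) -> str:
    return "I heard those devices fail a lot."


multi_visit = run_sessions(stub_nurse, stub_patient, SessionConfig(turns_per_visit=2, num_visits=2))
print(len(multi_visit), "visits simulated")
```

In a fuller implementation the recap would more plausibly be a generated summary plus the nurse agent's critique of its own strategy; the single-line recap here only shows where that information would enter the next visit.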
A central evaluation metric is the Normalized Persuasion Rating (NPR), which quantifies how effectively a dialogue session shifts a patient's attitude toward CLIDS adoption. The metric is computed from two quantities: the patient's initial persuasion rating and the closing persuasion rating after the dialogue.
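As an illustration only, a before/after rating pair can be turned into a normalized score as sketched below. The 1 to 5 rating scale, the normalization by scale width, and the function name `normalized_rating_change` are all assumptions made for this sketch; they should not be read as the benchmark's published definition of NPR.

```python
def normalized_rating_change(
    initial: float,
    final: float,
    scale_min: float = 1.0,  # assumed lower bound of the rating scale
    scale_max: float = 5.0,  # assumed upper bound of the rating scale
) -> float:
    """Return the rating shift as a fraction of the scale width, in [-1, 1]."""
    for name, value in (("initial", initial), ("final", final)):
        if not scale_min <= value <= scale_max:
            raise ValueError(f"{name} rating {value} is outside the scale")
    return (final - initial) / (scale_max - scale_min)


# A patient moving from 2 to 4 on the assumed 1-5 scale.
print(normalized_rating_change(2, 4))  # 0.5
```

Positive values indicate a shift toward adoption, negative values a shift away from it, and zero no change.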
3. Agent Initialization, Dialogue Strategies, and Scenario Design
Each patient agent is initialized with parameters reflecting real-world, clinically grounded adoption barriers. Feature engineering places each patient on a resistance spectrum. Encoding multiple kinds of variables allows the simulation to produce heterogeneous responses and adaptive needs across clinically relevant scenarios.
Nurse agents, depending on the scenario, leverage an explicit arsenal of persuasive strategies. The direct prompting paradigm selects and deploys strategies as needed per interaction turn. The chain-of-strategy paradigm introduces an intermediate step in which the agent must rationalize its choice of strategies, thereby supporting studies of model self-reflection and adaptation mechanisms.
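The difference between the two paradigms can be sketched as one model call versus two. This is a minimal illustration under stated assumptions: the `LLM` callable type, both function names, and all prompt wording are hypothetical, and whether the benchmark implements Chain-of-Strategy as separate calls or within a single prompt is not specified here. The four strategies listed are the ones named in the text, out of the catalog of 31.

```python
from typing import Callable

# Any function that maps a prompt string to generated text (assumed interface).
LLM = Callable[[str], str]

# Four of the 31 catalog strategies, as named in the text.
STRATEGIES = [
    "Evidence-based Persuasion",
    "Logical Appeal",
    "Social Proof",
    "Foot-in-the-door",
]


def direct_prompting_turn(llm: LLM, dialogue: str) -> str:
    """Direct Prompting: a single call that produces the nurse's reply."""
    prompt = (
        "You are a diabetes nurse discussing closed-loop insulin delivery.\n"
        f"Available strategies: {', '.join(STRATEGIES)}\n"
        f"Dialogue so far:\n{dialogue}\n"
        "Write the nurse's next reply."
    )
    return llm(prompt)


def chain_of_strategy_turn(llm: LLM, dialogue: str) -> tuple[str, str]:
    """Chain-of-Strategy: justify the chosen strategies first, then reply."""
    plan = llm(
        f"Available strategies: {', '.join(STRATEGIES)}\n"
        f"Dialogue so far:\n{dialogue}\n"
        "Name the strategies you will use next and justify each choice."
    )
    reply = llm(
        f"Dialogue so far:\n{dialogue}\n"
        f"Strategy plan:\n{plan}\n"
        "Write the nurse's next reply, following the plan."
    )
    return plan, reply  # the plan is kept so the reasoning can be inspected


# Stub model that echoes the last line of its prompt, standing in for a real LLM.
def echo_llm(prompt: str) -> str:
    return "[generated for: " + prompt.splitlines()[-1] + "]"


plan, reply = chain_of_strategy_turn(echo_llm, "Patient: I do not trust automation.")
print(plan)
print(reply)
```

Returning the plan alongside the reply is what makes the strategy choice auditable, which is the transparency benefit attributed to the chain-of-strategy paradigm.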
Three scenario types reflect core challenges of real-world counseling:
- Single-Visit: Model performance under health counseling time constraints.
- Multi-Visit: The degree to which repeated, memory-aware interventions can accumulate efficacy.
- Social Resistance: The impact of simulated adverse social context on model persuasive efficacy.
4. Experimental Results and Empirical Findings
Empirical evaluation was performed across more than a dozen LLMs spanning a range of capacities and architectures, in both single- and multi-visit settings. Crucial findings include:
- Scaling with Model Size: Larger and more advanced LLMs achieved higher persuasion, but all models plateaued when facing "Medium" and "Hard" patient profiles.
- Reasoning and Reflection: The chain-of-strategy protocol (requiring explicit justification of strategy) reliably enhanced the adaptivity and effectiveness of agent responses, especially across extended sessions.
- Social Context Sensitivity: Introduction of adversarial social agent influence markedly decreased model persuasion success, with all LLMs displaying limited capacity to adapt or counteract misinformation and peer pressure.
- Evaluation Robustness: Automated (LLM-as-judge) and human expert evaluations corroborate the limitations of current LLMs, particularly their struggle with persistent or complex resistance, despite superficial displays of responsiveness, empathy, and clinical relevance.
These results collectively highlight that even state-of-the-art models are not yet robust against the nuanced and adaptive social factors present in real-world health behavior change contexts.
5. Limitations and Methodological Constraints
Several limitations are inherent in the current ChatCLIDS design:
- Model Robustness: No current LLM demonstrated the ability to systematically overcome deeply ingrained resistance, particularly under adversarial social conditions.
- Simulation Fidelity: While patient agents are expert-validated, certain aspects of real-world patient emotion, variability, and context are not fully captured.
- Scope: The benchmark is bound to English-language dialogue and North American sociocultural norms, so its findings may not generalize to all global diabetes populations or health systems.
- Translation to Clinic: As a simulated evaluation, the direct clinical effectiveness and patient safety considerations of LLM-driven persuasive AI require further empirical validation and ethical review.
6. Directions for Future Research
The authors identify several forward paths:
- Patient Agent Granularity: More nuanced models of patient affect, state-dependent reaction, and cultural context.
- Multi-Modal Cue Integration: Expanding beyond text by incorporating voice tone or facial expression.
- Advanced Adaptive Strategies: Enhancing nurse agent capacity for early recognition and change of ineffective strategies, improving situational awareness.
- Real-World Validation: Extending simulation findings to pilot studies and standardized patient encounters in clinical settings.
- Ethical Safeguards: Building transparency, autonomy protections, and safety checks to ensure responsible persuasive AI use in health.
A plausible implication is that, while ChatCLIDS reveals a promising methodology for systematic, scalable evaluation of health behavior change dialogue systems, it simultaneously underscores the acute challenge of robust, context-aware persuasive AI—especially in adversarial real-world settings.
7. Significance for Persuasive AI and Healthcare
ChatCLIDS establishes a benchmark for future work at the intersection of persuasive AI, patient simulation, and health informatics. Its contributions include a first-of-its-kind protocol for systematic evaluation of multi-turn, agent-driven, personalized, and adaptive persuasive dialogues targeting both tractable and recalcitrant behavioral barriers. The insight gained into the limitations of current LLMs, especially in the face of social resistance and complex behavioral dynamics, provides critical direction for research toward clinical-grade, trustworthy persuasive dialogue systems in healthcare and beyond.