- The paper proposes a Tri-Retrieval RAG architecture fusing sparse, dense, and live web retrieval to significantly enhance factual accuracy and retrieval metrics.
- It employs dual-tier memory retention that integrates short-term context with long-term user history to achieve personalized, coherent, and empathetic interactions.
- A dedicated safety filter combined with a multimodal interface ensures robust moderation and culturally adaptive delivery for scalable mental wellbeing support.
Empathetic and Personalized LLM-Driven Conversational Agents for Wellbeing Support
Motivation and Problem Landscape
The global rise in mental health and wellbeing challenges has outpaced traditional support services, which are limited by resource constraints and accessibility barriers. Advances in large language models (LLMs) and retrieval-augmented generation (RAG) architectures have enabled conversational AI for wellbeing support. However, most existing solutions show only superficial empathy, lack personalization, and suffer from hallucination and factual drift. Many rely on static scripted flows or address personalization and grounding in isolation. There is a clear demand for scalable virtual agents that provide contextually grounded, empathetic, and culturally adaptive support with robust safety mechanisms.
System Architecture and Methodology
The proposed framework integrates multimodal dialogue processing, a Tri-Retrieval RAG module, dual-tier memory retention, and a safety-filtered LLM pipeline for generating expressive and context-aware virtual agent responses.
Figure 1: System architecture depicting the pipeline from user speech to personalized and safe multimodal virtual agent response.
User engagement begins with speech input, which is transcribed on-device by a Whisper-based automatic speech recognition (ASR) model to preserve privacy and enable real-time operation. Downstream, a Tri-Retrieval module fuses BM25 sparse retrieval, dense semantic vector search with Qwen3 embeddings, and dynamic web retrieval. The combination provides robust lexical recall, conceptual nuance, and access to live factual knowledge, optimizing both the sensitivity and the specificity of the retrieved evidence.
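The fusion step can be illustrated with a minimal sketch. The summary does not say how the three rankings are combined, so reciprocal rank fusion (RRF) is used here purely as an assumed stand-in. The function name, the constant k=60, and the document IDs are all hypothetical.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Assumption: RRF is a common fusion rule, not necessarily the paper's.
    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so documents returned by several retrievers rise toward the top.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical top hits from the three retrievers for one query.
bm25_hits = ["d1", "d2", "d3"]   # sparse (BM25)
dense_hits = ["d2", "d4", "d1"]  # dense embedding search
web_hits = ["d5", "d2"]          # live web retrieval

fused = reciprocal_rank_fusion([bm25_hits, dense_hits, web_hits])
print(fused)
```

In this toy case the document found by all three retrievers (d2) ranks first, which is the behavior a fusion of distinct relevance signals is meant to produce.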
Contextual retention is achieved via a dual-tier memory system: a short-term component capturing the 5 most recent dialogue turns for session coherence, and a long-term vector database encoding all prior user interactions for personalization. On each turn, both memories are fused and supplied with retrieved evidence to the LLM generator.
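The dual-tier design described above can be sketched as follows. This is an illustrative assumption rather than the authors' implementation: the class name `DualTierMemory`, the in-memory list standing in for the vector database, and the toy two-dimensional embeddings are all hypothetical.

```python
import math
from collections import deque

class DualTierMemory:
    """Short-term window of recent turns plus a long-term similarity store."""

    def __init__(self, short_term_turns=5):
        # Short-term tier: a deque with maxlen drops the oldest turn itself.
        self.short_term = deque(maxlen=short_term_turns)
        # Long-term tier: (embedding, text) pairs; a real system would use
        # a vector database here (assumption: a plain list is the stand-in).
        self.long_term = []

    def add_turn(self, text, embedding):
        self.short_term.append(text)
        self.long_term.append((embedding, text))

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def recall(self, query_embedding, top_k=2):
        # Rank all stored turns by cosine similarity to the query.
        ranked = sorted(
            self.long_term,
            key=lambda item: self._cosine(query_embedding, item[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:top_k]]

    def build_context(self, query_embedding, evidence, top_k=2):
        # Fuse both tiers with the retrieved evidence for the generator.
        return {
            "recent_turns": list(self.short_term),
            "long_term": self.recall(query_embedding, top_k),
            "evidence": list(evidence),
        }

memory = DualTierMemory()
for i in range(7):
    memory.add_turn(f"turn {i}", [1.0, float(i)])  # toy embeddings

context = memory.build_context([1.0, 0.0], evidence=["retrieved passage"])
print(context)
```

After seven turns, the short-term tier holds only turns 2 to 6, while the long-term tier can still surface turns 0 and 1 when they are the closest match to the query.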
Prior to output, a dedicated Safety Filter LLM enforces content moderation. Only responses classified as safe are rendered through a multimodal virtual agent interface, including natural speech synthesis and affective 3D avatar animation.
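A minimal sketch of the gating logic follows, assuming the filter returns a label and that anything other than "safe" is withheld. The summary does not say what the agent does with a blocked response, so the fallback message is an assumption, and `keyword_classifier` is a toy stand-in for the Safety Filter LLM, not a real moderation model.

```python
from typing import Callable

# Hypothetical fallback text; the source does not specify this behavior.
FALLBACK = "Sorry, I can't help with that request."

def gate_response(candidate: str,
                  classify: Callable[[str], str],
                  fallback: str = FALLBACK) -> str:
    """Return the candidate only if the classifier labels it 'safe'."""
    return candidate if classify(candidate) == "safe" else fallback

def keyword_classifier(text: str) -> str:
    """Toy stand-in for the Safety Filter LLM (assumption, not the real filter)."""
    blocked = {"dosage", "diagnosis"}
    words = {w.strip(".,!?").lower() for w in text.split()}
    return "unsafe" if words & blocked else "safe"

print(gate_response("Try a short breathing exercise.", keyword_classifier))
print(gate_response("Increase your dosage tonight.", keyword_classifier))
```

Passing the classifier in as a callable keeps the gate independent of whichever moderation model is actually deployed.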
Figure 2: Virtual agent interface with real-time generation, memory, and multimodal rendering capabilities.
Empirical evaluation on the CLAP NQ long-form QA benchmark demonstrates that the Tri-Retrieval strategy outperforms both lexical and single-stage semantic retrievers on precision and recall at cutoffs 3 and 5 (P@3, P@5, R@3, R@5). In particular, the fusion approach maximizes recall without sacrificing precision, because each retriever captures distinct relevance signals.
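For reference, precision and recall at a cutoff k can be computed as in the sketch below. These are the standard definitions; the document IDs and relevance labels are made up for illustration and are not taken from the benchmark.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

# Hypothetical ranking and gold relevance set for a single query.
retrieved = ["d2", "d1", "d5", "d4", "d3"]
relevant = {"d1", "d2", "d3"}

for k in (3, 5):
    p = precision_at_k(retrieved, relevant, k)
    r = recall_at_k(retrieved, relevant, k)
    print(f"P@{k}={p:.3f}  R@{k}={r:.3f}")
```

Raising k from 3 to 5 in this toy example lifts recall to 1.0 while precision drops to 0.6, which is the trade-off that a fused ranking aims to soften.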
For end-to-end answer quality on SQuAD, retrieval augmentation yields substantial gains across F1, ROUGE-L, and BERTScore, especially for smaller models such as LLaMA-3.2, indicating that external knowledge is essential for effective generation when parametric memory is limited. For instance, the F1 of LLaMA-3.2 increases from 0.0666 (zero-shot) to 0.4850 (Tri-Retrieval), an absolute gain of 0.4184, and GPT-4o achieves a state-of-the-art 0.7181 F1.
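The F1 figures above are presumably token-overlap scores of the kind commonly reported on SQuAD, although the summary does not confirm the exact variant. The sketch below shows a simplified version under that assumption. It lowercases and splits on whitespace only, whereas the official SQuAD script also strips punctuation and articles.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Simplified token-level F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Hypothetical answer pair: one of three predicted tokens matches.
score = token_f1("in Paris France", "Paris")
print(score)
```

A verbose answer that contains the reference is penalized through precision, which is one reason ungrounded zero-shot generations can score very low on this metric.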
Subjective and Cross-Cultural Evaluation
A human-subjects study (n=21, university students; Vietnam n=11, Australia n=10) was conducted to benchmark system usability, empathy, and coherence. The experiment contrasts the proposed RAG system with an LLM-only baseline in a counterbalanced setup, measuring Effectiveness, Coherence, and User Perception.
Figure 3: Example of participant interaction during the subjective evaluation session.
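Statistical analysis using Wilcoxon signed-rank tests shows statistically significant improvements for the RAG system over the LLM-only baseline in coherence and perceived empathy.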
Cross-cultural subgroup analysis further supports the robustness of these findings: both cultural groups show significant gains in coherence and user perception, indicating that the benefits of retrieval augmentation and memory retention generalize across the two cultural groups studied.
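To make the paired comparison concrete, the sketch below implements a Wilcoxon signed-rank test with a normal approximation and no tie or continuity correction. The ratings are invented for illustration and are not the study's data. For small samples such as this one, an exact test from a statistics library would be preferable.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Return (W, two-sided p) for paired samples x and y.

    Assumptions: zero differences are dropped, tied absolute differences
    get average ranks, and p uses a plain normal approximation.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of 1-based ranks i+1..j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w, p

# Hypothetical 5-point ratings from eight participants (NOT study data).
rag_ratings = [5, 4, 5, 4, 5, 3, 4, 5]
baseline_ratings = [3, 4, 2, 3, 4, 4, 2, 3]

w, p = wilcoxon_signed_rank(rag_ratings, baseline_ratings)
print(f"W={w}, p={p:.4f}")
```

The test suits this design because each participant rates both systems, so the ratings are paired and ordinal rather than independent and normally distributed.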
Theoretical and Practical Implications
The integration of Tri-Retrieval RAG with dual-tier memory consolidation directly addresses the primary challenges of hallucination, shallow empathy, and generic dialogue that have limited prior wellbeing-focused agents. Notably, the approach demonstrates that advanced RAG pipelines are crucial not only for factuality but also for sustained personalization and context-aware empathy. Furthermore, the Safety Filter LLM allows deployments in sensitive clinical or educational environments to enforce robust content moderation without sacrificing autonomy.
The cross-cultural consistency of the enhancements points toward practical viability for globally scalable agents, provided that legal, privacy, and localization requirements (especially data encryption and adversarial robustness) are met.
Future Research Directions
Longitudinal trials are needed to investigate retention and therapeutic impact over timescales well beyond the 10-15 minute sessions explored in this study. Additionally, robustness of the safety filter to adversarial attacks, dynamic adaptation to emotional trajectories, and comprehensive data privacy (including encryption and differential privacy for long-term user memory) remain to be addressed. Extending the agent to more languages and demographic groups will further test its generality.
Anticipated future developments include adaptive retrieval policies conditioned on both conversational context and real-time emotional cues, as well as integration of self-reflective architectures to improve memory fusion and response accuracy dynamically.
Conclusion
The proposed virtual agent framework demonstrates significant advances in delivering empathetic, personalized, and robustly factual conversational support for mental wellbeing applications. Its combined Tri-Retrieval, personalized memory, and multimodal rendering architecture achieves statistically significant improvements in coherence and perceived empathy and is preferred by the vast majority of users in both Vietnamese and Australian cohorts. The results indicate that knowledge-grounded dialogue agents can provide scalable, culturally inclusive, and engaging wellbeing support, though further development is required to ensure privacy, safety, and longitudinal efficacy.