Can Virtual Agents Care? Designing an Empathetic and Personalized LLM-Driven Conversational Agent

Published 22 Apr 2026 in cs.HC | (2604.20948v1)

Abstract: Mental health challenges are rising globally, while traditional support services face limited availability and high costs. LLMs offer potential for conversational support, but often lack personalization, empathy, and factual grounding. A virtual agent framework is introduced to provide empathetic, personalized, and reliable wellbeing support through retrieval-augmented architecture, structured memory, and multimodal interaction. Objective benchmarks demonstrate improved retrieval and response quality, particularly for smaller models. A cross-cultural study with university students from Vietnam and Australia shows the system outperforms LLM-only baselines in coherence, perceived accuracy, and empathy, with most participants clearly preferring the proposed approach.


Authors (4)

Summary

The paper proposes a Tri-Retrieval RAG architecture fusing sparse, dense, and live web retrieval to significantly enhance factual accuracy and retrieval metrics.
It employs dual-tier memory retention that integrates short-term context with long-term user history to achieve personalized, coherent, and empathetic interactions.
A dedicated safety filter combined with a multimodal interface ensures robust moderation and culturally adaptive delivery for scalable mental wellbeing support.

Empathetic and Personalized LLM-Driven Conversational Agents for Wellbeing Support

Motivation and Problem Landscape

The global rise in mental health and wellbeing challenges has outpaced traditional support services due to resource constraints and accessibility barriers. Advances in LLMs and RAG architectures have enabled the development of conversational AI for wellbeing support. However, most existing solutions demonstrate superficial empathy, lack personalization, and suffer from hallucination and factual drift. Many are grounded in static scripted flows or address personalization and grounding in isolation. There is a clear demand for scalable virtual agents that exhibit contextually grounded, empathetic, and culturally adaptive support with robust safety mechanisms.

System Architecture and Methodology

The proposed framework integrates multimodal dialogue processing, a Tri-Retrieval RAG module, dual-tier memory retention, and a safety-filtered LLM pipeline for generating expressive and context-aware virtual agent responses.

Figure 1: System architecture depicting the pipeline from user speech to personalized and safe multimodal virtual agent response.

User engagement begins with speech input, which a Whisper-based ASR model transcribes on-device to preserve privacy and enable real-time operation. Downstream, a Tri-Retrieval module fuses BM25 sparse retrieval, dense semantic vector search with Qwen3 embeddings, and dynamic web retrieval. The combination provides robust lexical recall, conceptual nuance, and access to live factual knowledge, with the aim of improving both the recall and the precision of the retrieved evidence.
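The summary does not say how the three result lists are merged. As a minimal sketch, assuming reciprocal rank fusion (a common choice, not confirmed as the authors' method), the fusion step could look like the following. The function name, the constant `k=60`, and the document ids are all illustrative.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    RRF is an assumed fusion rule; the source does not name one.
    """
    scores = defaultdict(float)
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical outputs of the three retrievers for one query.
bm25_hits = ["doc_sleep", "doc_stress", "doc_exam"]
dense_hits = ["doc_stress", "doc_mindful", "doc_sleep"]
web_hits = ["doc_stress", "doc_news"]

fused = reciprocal_rank_fusion([bm25_hits, dense_hits, web_hits])
print(fused)
```

A document that ranks well in several lists (here `doc_stress`) rises to the top, which is one way each retriever's distinct relevance signal can contribute to the final ordering.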

Contextual retention is achieved via a dual-tier memory system: a short-term component capturing the 5 most recent dialogue turns for session coherence, and a long-term vector database encoding all prior user interactions for personalization. On each turn, both memories are fused and supplied with retrieved evidence to the LLM generator.
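The two memory tiers can be illustrated with a small sketch. It assumes a toy bag-of-words embedding in place of a real embedding model and an in-memory list in place of a vector database. The class and method names are hypothetical. Only the five-turn short-term window comes from the description above.

```python
import math
from collections import Counter, deque

def embed(text):
    """Toy bag-of-words vector; stands in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class DualTierMemory:
    def __init__(self, short_term_turns=5):
        # Short-term tier: only the most recent turns are kept.
        self.short_term = deque(maxlen=short_term_turns)
        # Long-term tier: (vector, text) for every turn seen so far.
        self.long_term = []

    def add_turn(self, text):
        self.short_term.append(text)
        self.long_term.append((embed(text), text))

    def build_context(self, query, top_k=2):
        """Return recent turns plus the most similar older turns."""
        recent = list(self.short_term)
        q = embed(query)
        older = [(cosine(q, vec), text)
                 for vec, text in self.long_term if text not in recent]
        older.sort(key=lambda pair: pair[0], reverse=True)
        recalled = [text for score, text in older[:top_k] if score > 0]
        return {"recent": recent, "recalled": recalled}

memory = DualTierMemory()
for turn in [
    "I have an exam next week",
    "My sleep has been poor",
    "I like running in the morning",
    "Work was busy today",
    "I had coffee with a friend",
    "The weather was nice",
    "I watched a film",
]:
    memory.add_turn(turn)

context = memory.build_context("How can I prepare for my exam")
print(context)
```

In this sketch the exam turn has already dropped out of the five-turn window, but it is recalled from the long-term tier because it overlaps with the query.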

Prior to output, a dedicated Safety Filter LLM enforces content moderation. Only responses classified as safe are rendered through a multimodal virtual agent interface, including natural speech synthesis and affective 3D avatar animation.
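The gating logic can be sketched as a function that forwards a candidate response to the renderer only when a classifier labels it safe. The keyword-based `classify_safety` below is a hypothetical stand-in for the Safety Filter LLM. The fallback message is also an assumption, since the summary does not say what happens to responses that fail the check.

```python
# Assumed fallback text; the source does not specify one.
FALLBACK = "I'm not able to respond to that right now."

def classify_safety(text):
    """Hypothetical stand-in for the Safety Filter LLM.

    Returns 'safe' or 'unsafe'. A real system would call a model here.
    """
    blocked_markers = {"unsafe_marker"}  # placeholder, not a real policy
    lowered = text.lower()
    return "unsafe" if any(m in lowered for m in blocked_markers) else "safe"

def gate_response(candidate, classifier=classify_safety, fallback=FALLBACK):
    """Pass the candidate to the renderer only if it is classified as safe."""
    if classifier(candidate) == "safe":
        return {"render": True, "text": candidate}
    return {"render": False, "text": fallback}

print(gate_response("Try a short breathing exercise."))
print(gate_response("This contains UNSAFE_MARKER content."))
```

Keeping the classifier as an injectable argument lets the moderation model be swapped without touching the rendering path.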

Figure 2: Virtual agent interface with real-time generation, memory, and multimodal rendering capabilities.

Retrieval and Generation Performance

Empirical evaluation on the CLAP NQ long-form QA benchmark shows that the Tri-Retrieval strategy outperforms both the lexical retriever and the single-stage semantic retriever on precision and recall at cutoffs 3 and 5 (P@3, P@5, R@3, R@5). In particular, the fusion approach raises recall without sacrificing precision, as each retriever captures distinct relevance signals.
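For reference, precision and recall at a cutoff k can be computed as in the sketch below. The document ids and relevance labels are made up for illustration and are not taken from the benchmark.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved ids that are relevant."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant ids that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

# Hypothetical ranking and gold labels for one query.
retrieved = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d3", "d8"}

for k in (3, 5):
    print(k, precision_at_k(retrieved, relevant, k), recall_at_k(retrieved, relevant, k))
```

In this toy case, extending the cutoff from 3 to 5 adds no new relevant documents, so precision drops while recall stays flat. A fused retriever aims to avoid that pattern by placing more relevant documents within the cutoff.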

For end-to-end answer quality on SQuAD, RAG augmentation yields substantial gains across F1, ROUGE-L, and BERTScore, especially with smaller models such as LLaMA-3.2, suggesting that external knowledge is essential for effective generation when parametric memory is limited. For instance, LLaMA-3.2 F1 increases from 0.0666 (zero-shot) to 0.4850 (Tri-Retrieval), while GPT-4o reaches an F1 of 0.7181, the highest figure cited here.
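The F1 figures above are token-overlap scores. The sketch below follows the normalization used by the standard SQuAD evaluation (lowercasing, stripping punctuation and articles). Whether the authors used exactly this script is an assumption, and the example strings are invented.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

score = token_f1("The capital is Paris, France.", "Paris")
print(round(score, 4))  # 0.4
```

The example shows why verbose zero-shot answers can score poorly even when they contain the right answer: recall is 1.0 but precision is only 0.25, giving an F1 of 0.4.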

Subjective and Cross-Cultural Evaluation

A human-subjects study (n=21, university students; Vietnam n=11, Australia n=10) was conducted to benchmark system usability, empathy, and coherence. The experiment contrasts the proposed RAG system with an LLM-only baseline in a counterbalanced setup, measuring Effectiveness, Coherence, and User Perception.
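Counterbalancing means that roughly half of the participants meet each system first, so order effects do not favor one condition. A minimal sketch of such an assignment is shown below. The alternation rule and the condition labels are assumptions, since the summary does not describe the exact procedure.

```python
def assign_orders(participant_ids):
    """Alternate the two presentation orders across participants."""
    orders = [("RAG", "LLM-only"), ("LLM-only", "RAG")]
    return {pid: orders[i % 2] for i, pid in enumerate(participant_ids)}

# 21 participants, matching the reported sample size.
assignment = assign_orders(range(1, 22))
rag_first = sum(1 for order in assignment.values() if order[0] == "RAG")
print(rag_first, len(assignment) - rag_first)  # 11 10
```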

Figure 3: Example of participant interaction during the subjective evaluation session.

Statistical analysis using Wilcoxon signed-rank tests yields the following:

Coherence: The RAG system significantly outperforms the baseline ($p = .0018$, $r = .52$), confirming that context fidelity benefits from retrieval and memory fusion.
User Perception: Ratings for the RAG system are significantly higher ($p = .0007$, $r = .57$), attributed to consistent personalization, factual accuracy, and displays of empathy.
Effectiveness: The improvement is positive but does not reach significance at the conventional .05 level ($p = .089$), which is attributed to ceiling effects and short interaction duration.
System Preference: 90.5% of users express a clear preference for the RAG-augmented agent ($p < .001$).

Figure 4: Cross-cultural analysis establishes RAG-driven gains in Coherence and User Perception across Vietnamese and Australian cohorts.
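The preference figure can be checked with simple arithmetic. With 21 participants, 90.5% corresponds to 19 people. The sketch below runs an exact two-sided binomial test against a 50/50 null. The summary does not say which test produced the reported $p < .001$, so the choice of a binomial test is an assumption. The count of 19 is inferred from the percentage and sample size.

```python
from math import comb

def binomial_two_sided_p(successes, n):
    """Exact two-sided p-value under H0: preference probability = 0.5."""
    tail = max(successes, n - successes)
    upper = sum(comb(n, i) for i in range(tail, n + 1)) / 2 ** n
    # The null distribution is symmetric, so double the one-sided tail.
    return min(1.0, 2 * upper)

n_participants = 21
n_prefer_rag = 19  # inferred: 19 / 21 rounds to 90.5%

share = n_prefer_rag / n_participants
p_value = binomial_two_sided_p(n_prefer_rag, n_participants)

print(round(100 * share, 1))  # 90.5
print(p_value < 0.001)        # True
```

Under these assumptions the result is consistent with the reported significance level.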

Cross-cultural subgroup analysis further supports the robustness of these findings: both cultural groups show significant gains in coherence and user perception, indicating that the benefits of RAG augmentation and memory retention generalize across the two cultural groups studied.

Theoretical and Practical Implications

The integration of Tri-Retrieval RAG with dual-tier memory consolidation directly addresses the primary challenges of hallucination, shallow empathy, and generic dialogue that have limited prior wellbeing-focused agents. Notably, the approach demonstrates that advanced RAG pipelines are crucial not only for factuality but also for sustained personalization and context-aware empathy. Furthermore, the safety LLM is designed so that deployments in sensitive clinical or educational environments can enforce content moderation without sacrificing autonomy.

The cross-cultural consistency of the enhancements points toward practical viability for globally scalable agents, provided that legal, privacy, and localization requirements (especially data encryption and adversarial robustness) are met.

Future Research Directions

Longitudinal trials are needed to investigate retention and therapeutic impact over extended timescales beyond the 10-15 minute sessions explored in this study. Additionally, adversarial attacks on the safety filter, dynamic adaptation to emotional trajectories, and comprehensive data privacy (including encryption and differential privacy for long-term user memory) remain to be addressed. Extending the agent to diverse languages and demographic groups will further evaluate its generality.

Anticipated future developments include adaptive retrieval policies conditioned on both conversational context and real-time emotional cues, as well as integration of self-reflective architectures to improve memory fusion and response accuracy dynamically.

Conclusion

The proposed virtual agent framework demonstrates significant advances in delivering empathetic, personalized, and robustly factual conversational support for mental wellbeing applications. Its combined Tri-Retrieval, personalized memory, and multimodal rendering architecture achieves statistically significant improvements in coherence and perceived empathy and is preferred by the vast majority of users in both Vietnamese and Australian cohorts. The results indicate that knowledge-grounded dialogue agents can provide scalable, culturally inclusive, and engaging wellbeing support, though further development is required to ensure privacy, safety, and longitudinal efficacy.
