- The paper demonstrates that integrating diverse interactive tools with LLMs significantly enhances personalized social support beyond simple text-based interactions.
- The authors introduce ComPASS-Bench, a novel benchmark with profile-based and history-based settings to evaluate tool-augmented digital companionship effectively.
- Results show that the fine-tuned ComPASS-Qwen matches or outperforms larger models in actionable support and preference alignment, suggesting that targeted fine-tuning is a cost-effective route to capable AI companionship.
Motivation and Problem Statement
Building interactive agents capable of substantive, personalized, and context-sensitive social support is a longstanding challenge in AI, with significant implications for digital companionship, mental health, and human-computer interaction. Traditional work on affective dialogue systems, such as empathetic and emotional support conversations, remains inherently limited by a constricted action space centered on text generation. These systems often deliver supportive responses in a monolithic, stylized fashion, failing to tailor responses to users' multifaceted and evolving needs, particularly as defined by psychological taxonomies of social support.
ComPASS addresses these limitations by shifting from pure conversation to a paradigm where LLM-based agents leverage a diverse suite of interactive, user-centric tools to provide substantive, personalized, and context-adaptive social support, transcending mere empathy to include information provision, instrumental assistance, and social companionship. This approach operationalizes social support in a manner more closely aligned with human interaction and psychological theory.
The core innovation is the design and implementation of an extensible tool-invocation environment modeled on real-world multimedia applications. The toolset covers four principal social support categories: emotional, informational, instrumental, and social companionship. Twelve distinct tool prototypes were implemented across five domains (information, communication, entertainment, business, and education systems), including but not limited to psychological knowledge retrieval, music and media recommendations, scheduling, medical assistance, and expressive role-playing.
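As a rough illustration of what an extensible tool-invocation environment could look like, the sketch below registers tools tagged with a domain and the support categories they serve. It is a minimal sketch, not the authors' implementation: the class names, the two tool names (`music_recommend`, `schedule_event`), and the stub handlers are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Tuple


class SupportType(Enum):
    # The four support categories named in the text.
    EMOTIONAL = "emotional"
    INFORMATIONAL = "informational"
    INSTRUMENTAL = "instrumental"
    COMPANIONSHIP = "social companionship"


@dataclass(frozen=True)
class Tool:
    name: str                               # hypothetical identifier
    domain: str                             # e.g. "entertainment", "business"
    support_types: Tuple[SupportType, ...]  # categories the tool is meant to cover
    run: Callable[[dict], dict]             # stub handler: arguments -> result


class ToolRegistry:
    """Holds tools by name so new ones can be added without changing the agent."""

    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        if tool.name in self._tools:
            raise ValueError(f"duplicate tool: {tool.name}")
        self._tools[tool.name] = tool

    def invoke(self, name: str, args: dict) -> dict:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name].run(args)

    def by_support_type(self, kind: SupportType) -> List[str]:
        return sorted(t.name for t in self._tools.values() if kind in t.support_types)


registry = ToolRegistry()
registry.register(Tool(
    name="music_recommend",  # hypothetical tool
    domain="entertainment",
    support_types=(SupportType.EMOTIONAL, SupportType.COMPANIONSHIP),
    run=lambda a: {"tracks": [f"playlist for a {a.get('mood', 'neutral')} mood"]},
))
registry.register(Tool(
    name="schedule_event",  # hypothetical tool
    domain="business",
    support_types=(SupportType.INSTRUMENTAL,),
    run=lambda a: {"scheduled": a["title"]},
))
```

Tagging each tool with its support categories makes it straightforward to check that every category is covered by at least one tool.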
Psychology researchers systematically validated the tool assignment and its coverage of social support dimensions against Cohen's taxonomy, reaching moderate agreement (Fleiss' Kappa = 0.45), which supports the claim of comprehensive and balanced coverage. The tools' operationalization is lightweight yet sufficient for robust model evaluation, balancing retrieval/recommendation, stateful operations, and generation-based responses.
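For readers unfamiliar with the agreement statistic, the following is a minimal sketch of the standard Fleiss' kappa computation. The input layout (one row per item, one column per category, each cell a rater count) and the toy data in the usage note are assumptions for illustration, not the annotation data from the paper.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """counts[i][j] = number of raters who assigned item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_cats = len(counts[0])

    # Share of all assignments that fall into each category.
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]
    # Observed agreement per item: fraction of rater pairs that agree.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items       # mean observed agreement
    p_e = sum(p * p for p in p_j)    # agreement expected by chance
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_bar - p_e) / (1.0 - p_e)
```

With three raters and two categories, `fleiss_kappa([[2, 1], [1, 2], [3, 0], [0, 3]])` evaluates to 1/3: observed agreement is 2/3 against a chance level of 1/2.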
Benchmark Design: ComPASS-Bench
ComPASS-Bench is introduced as the first benchmark dedicated to evaluating personalized social support in LLM-based agents, spanning two distinct interaction modes:
- Profile-based setting: Agents are supplied with complete user profiles, including demographics, background, and fine-grained preferences.
- History-based setting: Agents infer preferences solely from multi-turn interaction histories, simulating cold-start and adaptation scenarios.
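The difference between the two settings comes down to what the agent is shown before it responds. The sketch below illustrates that under assumed data shapes: the function name, the field names, and the bracketed section labels are hypothetical and not taken from the benchmark.

```python
from typing import Dict, List


def build_context(
    setting: str,
    profile: Dict[str, object],
    history: List[Dict[str, str]],
    situation: str,
) -> str:
    """Assemble the agent's input for one situation (hypothetical format)."""
    if setting == "profile":
        # Profile-based: preferences are given explicitly.
        lines = ["[User profile]"]
        lines += [f"{key}: {value}" for key, value in sorted(profile.items())]
    elif setting == "history":
        # History-based: only past turns are shown; preferences must be inferred.
        lines = ["[Interaction history]"]
        lines += [f"{turn['role']}: {turn['text']}" for turn in history]
    else:
        raise ValueError(f"unknown setting: {setting}")
    lines += ["[Current situation]", situation]
    return "\n".join(lines)
```

Keeping both settings behind one function makes it easy to evaluate the same model on the same situation with and without explicit preferences.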
User profiles are synthetically generated via an LLM-based pipeline, ensuring demographic realism (constrained by global statistics and occupation databases) and psychological diversity (Big Five traits, background details, tool affinities). Situational diversity is induced by sampling from a broad emotion space, referenced to EmpatheticDialogues and manually verified for quality and coherence.
The resulting dataset comprises 500 users (400 train, 100 test), each with 15 temporally and affectively diverse interaction scenarios, for a total of 7,500 agent-user situations. Comprehensive automatic and manual checks ensure high fidelity and correctness.
Model Development and Evaluation Protocol
A key experimental contribution is the synthesis of tool-use records and corresponding tool-augmented response samples using GPT-5.1, followed by a rigorous two-stage verification to ensure alignment with user preferences and contextual fit. Both positive (preference-aligned) and adversarial (preference-opposed) instances are constructed, supporting robust supervised fine-tuning.
A task-specialized model, ComPASS-Qwen, is realized by LoRA-based SFT of Qwen3-8B on the curated synthetic data. The evaluation regime includes:
- Execution pass rate for tool invocation (validity of API-like actions)
- Distinct-n metrics for linguistic diversity
- Six subjective axes scored on five-point Likert scales (empathy, helpfulness, preference alignment, informativeness, fluency, safety), rated by both an LLM judge (Kimi-K2.5) and human annotators, whose scores correlate at an average Pearson r = 0.66.
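The automatic metrics in this list can be sketched in a few lines. The versions below are minimal illustrations under stated assumptions: whitespace tokenization for Distinct-n, a caller-supplied `executor` that raises on an invalid tool call, and a plain-Python Pearson correlation. The paper's exact tokenization and execution harness are not specified here.

```python
import math
from typing import Callable, Sequence


def distinct_n(responses: Sequence[str], n: int) -> float:
    """Unique n-grams divided by total n-grams, pooled over all responses."""
    unique = set()
    total = 0
    for text in responses:
        tokens = text.split()  # assumption: whitespace tokenization
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


def execution_pass_rate(calls: Sequence[dict], executor: Callable[[dict], object]) -> float:
    """Fraction of tool calls that the executor runs without raising."""
    passed = 0
    for call in calls:
        try:
            executor(call)
            passed += 1
        except Exception:
            pass  # any error counts as a failed invocation
    return passed / len(calls) if calls else 0.0


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation, e.g. between LLM-judge and human scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)
```

For example, `distinct_n(["a b a b"], 1)` is 0.5 (two unique unigrams out of four), and the same response has a Distinct-2 of 2/3.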
Experimental Findings and Comparison
Tool-use as a catalyst for social support quality: All evaluated LLMs (GPT-5.1, Gemini-3-Pro, Claude-Sonnet-4.5, DeepSeek-V3.2, multiple Qwen variants, Llama3-8B) demonstrated high validity in tool invocation. However, clear gaps appeared in the ultimate supportive quality of responses, with tool-augmented generation yielding markedly better results across most axes than standard empathetic-only approaches. The gains were especially evident in helpfulness, informativeness, and preference alignment, without degradation in empathy or safety.
ComPASS-Qwen performance: Fine-tuned ComPASS-Qwen matched or exceeded the performance of significantly larger models, both open-weight and closed-source (e.g., Qwen3-32B, Gemini-3-Pro), in subjective and objective metrics. Notably, it outperformed larger models in preference alignment under the profile-based setting, demonstrating the effectiveness of targeted SFT on high-quality, interactionally rich data.
Learning from history: In the history-based setting (requiring adaptation to emergent preferences), large closed models showed positive gains from leveraging user history, but most smaller open models did not. ComPASS-Qwen was a notable exception among the smaller models: it improved with access to history, a gain attributed to its specialized exposure during fine-tuning.
Tool vs. content generation: Stage-wise analysis in a decoupled pipeline showed that response quality is more bottlenecked by the sophistication of content generation from tool outputs than by tool selection itself, emphasizing the need for strong integration between action and language modules.
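A decoupled pipeline of this kind can be sketched as two swappable stages, so that the tool selector is held fixed while the content generator varies (or the reverse). Everything below is a hypothetical illustration: the function names, the keyword-based selector, and the stub tools are assumptions, not the paper's setup.

```python
from typing import Callable, Dict, Tuple

ToolCall = Tuple[str, Dict[str, str]]  # (tool name, arguments)


def run_decoupled(
    situation: str,
    select_tool: Callable[[str], ToolCall],
    execute: Callable[[str, Dict[str, str]], Dict[str, str]],
    generate: Callable[[str, ToolCall, Dict[str, str]], str],
) -> Tuple[ToolCall, str]:
    """Stage 1 picks a tool call; stage 2 writes the reply from its output."""
    call = select_tool(situation)
    result = execute(*call)
    return call, generate(situation, call, result)


def keyword_selector(situation: str) -> ToolCall:
    # Stub stage 1: a trivial rule standing in for a model's tool choice.
    if "sleep" in situation.lower():
        return ("knowledge_search", {"query": "sleep hygiene"})
    return ("music_recommend", {"mood": "calm"})


def stub_execute(name: str, args: Dict[str, str]) -> Dict[str, str]:
    # Stub environment: echoes the arguments as the tool payload.
    return {"tool": name, "payload": ", ".join(f"{k}={v}" for k, v in sorted(args.items()))}


def terse_generator(situation: str, call: ToolCall, result: Dict[str, str]) -> str:
    return f"Try this: {result['payload']}"


def grounded_generator(situation: str, call: ToolCall, result: Dict[str, str]) -> str:
    return f"That sounds hard. Based on {call[0]}, here is something concrete: {result['payload']}"
```

Running the same situation through `terse_generator` and `grounded_generator` yields an identical tool call but different replies, which is the kind of comparison that isolates the generation stage from tool selection.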
Comparison with empathetic-only models: Tool-augmented responses by both GPT-5.1 and ComPASS-Qwen achieved higher overall scores than standard empathetic generation (GPT-Emp, Sibyl), particularly in delivering actionable, personalized, and informative support.
Implications and Future Directions
ComPASS provides evidence that substantive, tool-augmented agentic interaction is technically feasible and important for advancing beyond superficial empathy in digital companionship systems. The findings emphasize the importance of multi-modal, context-sensitive action, and the necessity of specialized data to unlock the full potential of smaller, more efficient LLMs in social domains.
Theoretically, the tool-augmented paradigm supports studying AI systems through a comprehensive lens that incorporates not just affect recognition but actionable planning and environment interaction, mirroring human social networks' provision of support. Practically, the results suggest that low-resource, high-quality supervision can yield cost-effective yet high-performing agents, broadening access and research potential.
Future directions may encompass:
- Extending tool integration to real-world, multi-modal, and personalized applications (e.g., social robotics, mobile mental health).
- Continuous adaptation under lifelong or open-ended user modeling.
- Investigation of agent proactivity, long-term relational modeling, and multi-agent cooperation for network-based social support.
- Enhanced safety and ethical oversight, especially as instrumental actions have real-world consequences.
Conclusion
ComPASS marks a substantive advance in interactive AI by proposing, implementing, and rigorously evaluating a paradigm shift towards tool-augmented, personalized social support. Through nuanced benchmark design, robust model training, and comprehensive evaluation, the work demonstrates that agentic tool-use enables significant improvements in both practical and personalized support, while also narrowing the performance gap between smaller, efficient LLMs and their major-scale counterparts. This framework and resource open new avenues for principled and scalable research on intelligent companionship systems.