- The paper introduces a novel benchmark for evaluating persona-sensitive influencing, focusing on implicit user modeling in persuasive dialogues.
- It employs multi-turn evaluations across three persuasive scenarios with metrics like Conversation Quality, Personalized Response Level, and Persuasion Effect.
- The results show that reinforcement learning-based profile analyzers significantly improve persuasion effectiveness, underscoring the importance of adaptive user modeling.
Ψ-Bench: Advancing the Evaluation of Persona-Sensitive Influencing in Persuasive Dialogues
Ψ-Bench introduces a rigorous framework for evaluating LLMs in the persona-sensitive influencing paradigm, positioning LLMs not merely as passive responders to user queries but as proactive agents capable of tailoring persuasive strategies to individual user profiles. Prevailing benchmarks and research efforts in LLM personalization predominantly focus on passive adaptation to user preferences, neglecting proactive, client-targeted influencing within realistic dialogic contexts. As AI language agents are increasingly integrated into decision-making support, therapy, recommendation systems, and other sensitive domains, the need for systematic, scalable benchmarking of persona-grounded influencing capabilities becomes acute. Ψ-Bench addresses this methodological gap by operationalizing and quantifying LLMs' abilities to reason about, infer, and adapt to diverse, latent user characteristics during interactive, persuasive dialogues.
Benchmark Architecture and Evaluation Protocols
Ψ-Bench structures evaluation around three canonical influencing scenarios: (1) Viewpoint Debate (opinion change), (2) Psychological Consultation (mindset shaping), and (3) Everyday Request (behavioral compliance). Each scenario draws on distinct real-world data sources (e.g., Webis-CMV-20 for debate, CounselBench for consultation) supplemented by high-fidelity, profile-grounded client simulators. The benchmark defines a granular persona schema (adapted from PersonaMem-v2), encompassing demographic, psychosocial, and linguistic attributes. Simulated clients role-play their assigned personas throughout the multi-turn interaction, obfuscating profile information from the evaluated LLM, thereby enforcing an implicit profile inference and adaptation protocol representative of real-world task complexity.
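The profile-hidden protocol described above can be illustrated with a minimal sketch. The field names, the `Persona` and `Turn` classes, and the two view functions are hypothetical and chosen only for illustration; the benchmark's actual schema (adapted from PersonaMem-v2) and simulator interface are not specified here. The point is simply that the client simulator receives the persona while the evaluated model receives only the scenario and the dialogue.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Persona:
    # Hypothetical attribute groups mirroring the three categories named in the text.
    demographic: dict[str, str] = field(default_factory=dict)
    psychosocial: dict[str, str] = field(default_factory=dict)
    linguistic: dict[str, str] = field(default_factory=dict)


@dataclass
class Turn:
    speaker: str  # "persuader" or "client"
    text: str


def persuader_view(scenario: str, history: list[Turn]) -> dict:
    """What the evaluated model sees: scenario and dialogue, never the persona."""
    return {"scenario": scenario, "history": [(t.speaker, t.text) for t in history]}


def client_view(scenario: str, persona: Persona, history: list[Turn]) -> dict:
    """What the simulated client sees: the same dialogue plus its assigned persona."""
    view = persuader_view(scenario, history)
    view["persona"] = persona
    return view
```

Keeping the two views as separate functions makes it harder to leak profile fields into the persuader's input by accident, which is the property the implicit-inference protocol depends on.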
Evaluation relies on three LLM-as-a-judge metrics, each scored by a capable LLM judge on a 9-point ordinal scale: Conversation Quality (overall coherence and appropriateness), Personalized Response Level (extent of adaptation to the client profile), and Persuasion Effect (actual opinion or behavioral shift). Empirical alignment with human annotation (e.g., ROC-AUC 0.77 for Personalize–Effect correlation) substantiates the reliability of the automated evaluation framework.
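As a rough illustration of how such judge scores might be handled, the sketch below validates and averages the three metrics and computes a rank-based ROC-AUC between binary human labels and judge scores. Several things here are assumptions: the scale is taken to run from 1 to 9, the metric keys are invented, and the source does not describe the exact setup behind its reported ROC-AUC, so `roc_auc` shows only the generic statistic.

```python
from statistics import mean

METRICS = ("quality", "personalize", "effect")  # hypothetical key names


def validate_scores(scores: dict[str, int]) -> dict[str, int]:
    """Check that each metric is an integer on the assumed 1..9 scale."""
    for name in METRICS:
        value = scores[name]
        if not isinstance(value, int) or not 1 <= value <= 9:
            raise ValueError(f"{name} must be an integer in 1..9, got {value!r}")
    return {name: scores[name] for name in METRICS}


def average_scores(dialogues: list[dict[str, int]]) -> dict[str, float]:
    """Average each metric over a list of judged dialogues."""
    checked = [validate_scores(d) for d in dialogues]
    return {name: mean(d[name] for d in checked) for name in METRICS}


def roc_auc(human_labels: list[int], judge_scores: list[float]) -> float:
    """Probability that a positive example outranks a negative one (ties count 0.5)."""
    pos = [s for y, s in zip(human_labels, judge_scores) if y == 1]
    neg = [s for y, s in zip(human_labels, judge_scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else (0.5 if p == n else 0.0) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 would mean the judge ranks examples no better than chance relative to the human labels, and 1.0 would mean perfect ranking agreement.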
Analysis of Experimental Results
A diverse suite of 10 advanced LLMs (Qwen, DeepSeek, Gemini, GPT-5 series, Grok) was evaluated on Ψ-Bench under profile-hidden conditions. All models demonstrated acceptable baseline dialogue quality, with Quality scores above 7 across most settings. However, the ability to leverage implicit user modeling for persuasion was consistently deficient: average Effect scores for leading models (e.g., GPT-5.1) remained below 6, with weaker models clustering around 4. Notably, access to explicit client profiles (the "Oracle" condition) yielded a statistically significant increase: average Effect scores improved by 18.24% and Personalize scores by 41.19%, indicating that effective persona-sensitive persuasion requires accurate user modeling, not only rhetorical proficiency.
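For readers who want to reproduce this kind of comparison, the following sketch computes the percentage gain of an Oracle score over a profile-hidden score. It assumes the reported improvements are relative changes, which the text does not state explicitly, and the model names and scores in the usage lines are made-up placeholders, not results from the paper.

```python
from statistics import mean


def relative_gain(hidden: float, oracle: float) -> float:
    """Percentage change of the Oracle score relative to the profile-hidden score."""
    if hidden <= 0:
        raise ValueError("hidden score must be positive")
    return 100.0 * (oracle - hidden) / hidden


def mean_relative_gain(pairs: dict[str, tuple[float, float]]) -> float:
    """Average the per-model relative gains over (hidden, oracle) score pairs."""
    return mean(relative_gain(hidden, oracle) for hidden, oracle in pairs.values())


# Placeholder scores for two hypothetical models, for illustration only.
example = {"model_a": (5.0, 6.0), "model_b": (4.0, 5.0)}
print(mean_relative_gain(example))
```

Averaging per-model gains and computing the gain of the averaged scores generally give different numbers, so any reported figure should state which aggregation was used.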
The multi-turn analysis showed pronounced stratification by model scale: stronger LLMs (e.g., GPT-5.1, DeepSeek-v4-pro) benefited more from extended interactions via sustained context integration, whereas smaller models exhibited performance saturation and increased argument repetition, degrading their Personalize and Quality scores over time.
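The text does not say how argument repetition was measured. One simple proxy, offered here purely as an assumption, is the share of word n-grams in each persuader turn that already appeared in that persuader's earlier turns. The function names and the whitespace tokenization are illustrative choices.

```python
def ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Lowercased word n-grams of a text, using whitespace tokenization."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def repetition_rate(turns: list[str], n: int = 3) -> list[float]:
    """For each turn, the fraction of its n-grams already used in earlier turns."""
    seen: set[tuple[str, ...]] = set()
    rates = []
    for text in turns:
        grams = ngrams(text, n)
        # Turns shorter than n words have no n-grams and are scored 0.0.
        rates.append(len(grams & seen) / len(grams) if grams else 0.0)
        seen |= grams
    return rates
```

A rate that climbs across turns would be consistent with the saturation pattern described for smaller models, although surface overlap cannot detect paraphrased repetition.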
Profile Analyzer as a Bridge for Implicit Personalization
To address real-world settings where explicit client profiles are unavailable, Ψ-Bench incorporates a profile analyzer module. This module, realized both as zero-shot LLMs and as a lightweight model trained with reinforcement learning (GRPO), Qwen3-4B-RL, infers structured client profiles from partial dialogue histories. Supplying the inferred profiles as contextual input to persuader LLMs yields significant downstream improvements in persuasion: models equipped with RL-trained analyzers obtained up to 9.1% higher Effect scores than profile-agnostic baselines and in some instances approached "Oracle" performance. An ablation with irrelevant (mismatched) profiles yielded negligible benefit, which attributes the improvement to the correctness of the inferred profile rather than to the mere presence of profile text. Generalization to out-of-domain scenarios (i.e., scenarios not seen during training) further supports the robustness of the RL-based profile analysis paradigm.
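To make the GRPO training signal concrete, the sketch below pairs a toy reward with the group-relative advantage that GRPO computes over a group of sampled outputs: each reward minus the group mean, divided by the group standard deviation. The reward itself (exact-match fraction against a gold profile) is a hypothetical stand-in, since the reward used to train the analyzer is not described here, and the use of the population standard deviation is likewise a choice made for this example.

```python
from statistics import mean, pstdev


def profile_reward(predicted: dict[str, str], gold: dict[str, str]) -> float:
    """Hypothetical reward: fraction of gold attributes predicted exactly."""
    if not gold:
        raise ValueError("gold profile must not be empty")
    hits = sum(1 for key, value in gold.items() if predicted.get(key) == value)
    return hits / len(gold)


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Usage: score several sampled profiles for the same dialogue, then normalize.
gold = {"age_group": "30s", "tone": "casual"}
samples = [{"age_group": "30s", "tone": "casual"}, {"age_group": "30s", "tone": "formal"}]
advantages = group_relative_advantages([profile_reward(s, gold) for s in samples])
```

Because advantages are computed relative to the group, a sampled profile is reinforced only when it beats its siblings for the same dialogue, and a group with identical rewards contributes no gradient signal.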
Implications, Theoretical and Practical
The study delineates clear boundaries for current LLMs: strong syntactic and rhetorical fluency does not suffice for effective, personalized influence. The central bottleneck lies in the ability—both architectural and algorithmic—to model, infer, and exploit latent psychosocial characteristics dynamically within dialogic contexts. This has implications for the development of adaptive assistants, negotiation agents, and therapeutic AI, where static, generic responses are insufficient or even counterproductive.
Moreover, Ψ-Bench's multi-faceted evaluation—grounded in simulated but high-fidelity profiles—offers a reproducible and scalable alternative to costly human annotation, without critical loss of metric fidelity. The adoption of profile analyzers (especially those based on reinforcement learning for latent attribute extraction) outlines a promising trajectory for fusing user modeling research with dialogic LLM architectures, suggesting a dual-stack pipeline: profile inference as an online, continual subroutine and persuasion as a downstream controlled generation task.
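The dual-stack idea can be sketched as a loop in which the profile estimate is refreshed before every persuader turn. Everything below is a toy illustration: `run_dialogue`, the type aliases, and the keyword-based stubs are hypothetical, and in practice the analyzer and persuader would be model calls rather than string rules.

```python
from typing import Callable

Profile = dict[str, str]
Analyzer = Callable[[list[str]], Profile]
Persuader = Callable[[list[str], Profile], str]
Client = Callable[[list[str]], str]


def run_dialogue(client: Client, analyzer: Analyzer, persuader: Persuader,
                 opening: str, max_turns: int = 3) -> tuple[list[str], Profile]:
    """Alternate persuader and client turns, re-inferring the profile each round."""
    history = [opening]
    profile: Profile = {}
    for _ in range(max_turns):
        profile = {**profile, **analyzer(history)}  # online, continual inference
        history.append(persuader(history, profile))  # profile-conditioned generation
        history.append(client(history))
    return history, profile


def toy_analyzer(history: list[str]) -> Profile:
    text = " ".join(history).lower()
    return {"concern": "cost"} if "expensive" in text else {}


def toy_persuader(history: list[str], profile: Profile) -> str:
    if profile.get("concern") == "cost":
        return "There is a cheaper plan that may fit your budget."
    return "Could you tell me more about what is holding you back?"


def toy_client(history: list[str]) -> str:
    return "It still sounds expensive to me."


history, profile = run_dialogue(toy_client, toy_analyzer, toy_persuader,
                                "I am not sure I want a gym membership.", max_turns=2)
```

Separating the analyzer from the persuader lets each be swapped or trained independently, which matches the pairing of a small RL-trained analyzer with larger persuader models described earlier.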
Limitations and Future Directions
While Ψ-Bench achieves wide persona diversity, coverage limitations remain (e.g., under-representation of users with limited digital literacy or culturally atypical users), and the quadratic scaling of scenario–persona combinations is computationally intensive. Further, explicit profile access is seldom realistic in deployment contexts, which places increased emphasis on advancing implicit, continual user modeling.
Promising avenues for future research include: (1) adaptation of Ψ-Bench for spoken or multimodal persuasion settings, (2) extension to complex, multi-agent conversational environments, and (3) further integration of explicit theory-of-mind (ToM) reasoning mechanisms within dialogue LLMs (see also related work on ToM-based persuaders in Han et al., 29 May 2025).
Conclusion
Ψ-Bench represents a substantive advance in the benchmarking, analysis, and understanding of persona-sensitive influencing by LLM agents (2606.02754). The results highlight the inadequacy of purely generic or context-agnostic architectures for real dialogue-based persuasion and urge greater focus on explicit user modeling—either via access to structured profiles or strong online inference. The benchmark, supporting code, and analytic paradigm set a robust foundation for developing and measuring the next generation of adaptive, user-aware conversational agents.