RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents

Published 3 Jul 2025 in cs.CL, cs.AI, and cs.CY | (2507.03112v1)

Abstract: LLMs excel at logical and algorithmic reasoning, yet their emotional intelligence (EQ) still lags far behind their cognitive prowess. While reinforcement learning from verifiable rewards (RLVR) has advanced in other domains, its application to dialogue, especially for emotional intelligence, remains underexplored. In this work, we introduce RLVER, the first end-to-end reinforcement learning framework that leverages verifiable emotion rewards from simulated users to cultivate higher-order empathetic abilities in LLMs. Within this framework, self-consistent affective simulated users engage in dialogue rollouts and produce deterministic emotion scores during conversations, serving as reward signals to guide the LLM's learning. Fine-tuning the publicly available Qwen2.5-7B-Instruct model with PPO boosts its Sentient-Benchmark score from 13.3 to 79.2 while largely preserving mathematical and coding competence. Extensive experiments reveal that: (i) RLVER consistently improves multiple dialogue capabilities; (ii) Thinking and non-thinking models show distinct trends: thinking models excel in empathy and insight, while non-thinking models favor action; (iii) GRPO often yields stable gains, while PPO can push certain capabilities to a higher ceiling; (iv) More challenging environments are not always better; moderate ones can yield stronger outcomes. Our results show that RLVER is a practical route toward emotionally intelligent and broadly capable language agents.

Authors (16)

Summary

The paper introduces RLVER, a reinforcement learning framework that leverages verifiable emotion rewards to train LLMs for empathetic dialogue.
It employs self-consistent affective user simulators and a 'Think-Then-Say' approach to improve reasoning and empathetic response quality.
Empirical results demonstrate significant advances in empathy, insight, and dialogue capability, achieving scores comparable to proprietary models.

Introduction

The paper introduces RLVER, a reinforcement learning framework designed to improve the emotional intelligence of LLMs with verifiable emotion rewards. It aims to address the limitations of LLMs in empathetic dialogue by enhancing their ability to understand and respond to human emotions. The framework employs self-consistent affective user simulators to generate deterministic emotion scores that serve as reward signals for training LLMs.

Figure 1: Framework of the reinforcement learning with verifiable emotion rewards (RLVER).

Reinforcement Learning Framework

RLVER trains LLMs with policy-gradient algorithms, Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), inside an environment powered by affective user simulators. A simulator converses with the LLM and updates its emotional state after each LLM reply. The resulting emotion scores are the reward signal that steers the LLM toward empathetic responses.
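The interaction loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Policy` and `Simulator` interfaces, the 0-100 emotion scale, the turn limit, the keyword rule in `toy_simulator`, and the choice to use the final emotion score divided by 100 as the episode reward are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
# a policy maps the dialogue so far to the agent's next reply;
# a simulator maps (current emotion, agent reply) to (new emotion, next user message).
Policy = Callable[[List[str]], str]
Simulator = Callable[[int, str], Tuple[int, str]]


def run_episode(
    policy: Policy,
    simulator: Simulator,
    opening: str,
    start_emotion: int = 50,
    max_turns: int = 4,
) -> Tuple[List[str], float]:
    """Roll out one dialogue and return (dialogue, scalar reward)."""
    dialogue = [opening]
    emotion = start_emotion
    for _ in range(max_turns):
        reply = policy(dialogue)
        # The simulated user updates its emotional state after each reply.
        emotion, user_msg = simulator(emotion, reply)
        dialogue += [reply, user_msg]
        if emotion <= 0 or emotion >= 100:
            break  # assumed early stop once the score saturates
    # Assumed reward: final emotion score rescaled to [0, 1].
    return dialogue, emotion / 100.0


def toy_simulator(emotion: int, reply: str) -> Tuple[int, str]:
    """Toy stand-in for an affective simulator: a keyword rule, for illustration only."""
    delta = 10 if "sounds" in reply.lower() else -10
    new_emotion = max(0, min(100, emotion + delta))
    return new_emotion, f"[emotion {new_emotion}] Let me tell you more."


if __name__ == "__main__":
    _, reward = run_episode(lambda d: "That sounds exhausting.", toy_simulator, "I had an awful week.")
    print(reward)
```

In an actual training run the scalar returned by `run_episode` would be handed to the PPO or GRPO update; the toy simulator only stands in for the LLM-driven user so the loop can be run in isolation.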

Emotion Rewards and User Simulation

At the core of RLVER is the Sentient Agent framework, which simulates human-like emotional responses and reasoning based on predefined personas and goals. The agent's responses and emotional states evolve throughout the conversation, and the process is structured to support realistic dialogue simulations. By providing deterministic and verifiable emotion scores, RLVER circumvents the problem of reward hacking often encountered in neural reward models.
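To make the idea of a persona-conditioned, deterministic emotion score concrete, here is a small sketch. Every name in it (`SimulatedUser`, `VALIDATING_CUES`, `replay`) is hypothetical, and the keyword rule is only a placeholder for whatever reasoning the Sentient Agent actually performs. The point it illustrates is narrower: if the emotion update is a fixed function of the persona, goal, and dialogue, then replaying the same dialogue reproduces the same score trace, which is what makes the reward checkable.

```python
from dataclasses import dataclass, field
from typing import List

# Placeholder cues (assumption); a real simulator would not rely on keywords.
VALIDATING_CUES = ("i hear you", "that must be", "it makes sense")


@dataclass
class SimulatedUser:
    """Hypothetical simulator state: a persona, a conversational goal, an emotion score."""
    persona: str
    goal: str
    emotion: int = 50  # assumed 0-100 scale
    trace: List[int] = field(default_factory=list)

    def react(self, agent_reply: str) -> int:
        """Update the emotion score deterministically from the agent's reply."""
        text = agent_reply.lower()
        delta = 0
        if any(cue in text for cue in VALIDATING_CUES):
            delta += 8
        # The same reply can land differently depending on the user's goal.
        if self.goal == "be heard" and "you should" in text:
            delta -= 12
        self.emotion = max(0, min(100, self.emotion + delta))
        self.trace.append(self.emotion)
        return self.emotion


def replay(persona: str, goal: str, replies: List[str]) -> List[int]:
    """Re-run a logged dialogue and return the emotion trace for auditing."""
    user = SimulatedUser(persona=persona, goal=goal)
    for reply in replies:
        user.react(reply)
    return user.trace


if __name__ == "__main__":
    replies = ["I hear you, that must be draining.", "You should make a schedule."]
    print(replay("overworked nurse", "be heard", replies))
    print(replay("overworked nurse", "get advice", replies))
```

Because the trace depends only on the inputs, two replays of the same logged dialogue can be compared directly, which is harder to guarantee with a learned neural reward model.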

Impact of Training Templates

The framework assesses LLM behavior using two distinct prompting formats: "Think-Then-Say" and direct reply templates. The "Think-Then-Say" format encourages the model to engage in explicit reasoning before generating a response, which has been shown to enhance higher-order empathetic skills.
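As an illustration of how the two formats might be handled downstream, the sketch below separates a "Think-Then-Say" output into its reasoning and its spoken reply, and passes direct replies through unchanged. The `<think>...</think>` tag names and the idea that only the spoken part reaches the simulated user are assumptions for this example; the summary does not specify the template's exact markup.

```python
import re
from typing import Tuple

# Assumed markup: reasoning wrapped in <think>...</think>, followed by the reply.
THINK_PATTERN = re.compile(r"^\s*<think>(.*?)</think>\s*(.*)$", re.DOTALL)


def split_think_then_say(output: str) -> Tuple[str, str]:
    """Return (reasoning, reply). Direct replies yield an empty reasoning string."""
    match = THINK_PATTERN.match(output)
    if match is None:
        # Direct-reply template: the whole output is what the user sees.
        return "", output.strip()
    return match.group(1).strip(), match.group(2).strip()


if __name__ == "__main__":
    raw = "<think>She wants to vent, not to be fixed.</think> That sounds really hard."
    reasoning, reply = split_think_then_say(raw)
    print(reasoning)
    print(reply)
```

Keeping the reasoning separate from the reply means the same rollout code can serve both templates, with only the prompt differing.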

Performance Evaluation

The empirical evaluation reports improvements across several dimensions of dialogue capability, including empathy, insight, and action orientation. On the Sentient Benchmark, PPO fine-tuning raises Qwen2.5-7B-Instruct from 13.3 to 79.2, a score comparable to those of proprietary LLMs, while largely preserving mathematical and coding competence.

Figure 2: Qualitative analysis of five core capabilities of the trained models.

Strategy Utilization

Analyzing the frequency and contribution of empathetic strategies reveals that models trained with RLVER gradually adopt more sophisticated approaches. These strategies are crucial for achieving genuine emotional understanding and providing tailored responses.

Figure 3: Frequency of empathetic strategies during the training.

Stability and Trade-offs

GRPO often yields stable gains, while PPO's more exploratory updates can push certain capabilities to a higher ceiling, so the choice between them is a trade-off between stability and peak performance. Under both algorithms, training grows complex empathetic strategies while largely preserving core linguistic competencies.
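One way to see where the two algorithms differ is in how they turn rewards into a learning signal. The sketch below shows the standard group-relative advantage used by GRPO (each rollout's reward is normalized against the other rollouts sampled for the same prompt) and the standard PPO clipped surrogate term for a single action. The reward values, the use of the population standard deviation, and the `eps` and `clip` defaults are assumptions for illustration, not settings taken from the paper.

```python
import statistics
from typing import List


def grpo_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: normalize rewards within one group of rollouts."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary on this choice
    return [(r - mean) / (std + eps) for r in rewards]


def ppo_clipped_term(ratio: float, advantage: float, clip: float = 0.2) -> float:
    """PPO clipped surrogate for one action: min(r * A, clip(r, 1 - c, 1 + c) * A)."""
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Hypothetical emotion rewards in [0, 1] for four rollouts of the same scenario.
    group_rewards = [0.2, 0.4, 0.6, 0.8]
    print(grpo_advantages(group_rewards))
    # A large probability ratio with a positive advantage is capped at 1 + clip.
    print(ppo_clipped_term(ratio=1.5, advantage=1.0))
```

GRPO needs no learned value function because the group mean acts as the baseline, whereas PPO in its usual form estimates advantages with a critic; the summary attributes the stability difference to the algorithms without giving a mechanism, so this contrast is offered only as background.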

RLVER models undergo a distinct shift from solution-centric to empathy-oriented interactions. This transformation in behavior is evaluated using the Social Cognition Coordinate, which provides insight into the model's style and empathy orientation after training.

Figure 4: Learning curves in the Social Cognition Coordinate (SCC).

Conclusion

RLVER represents a practical and scalable approach for enhancing the emotional intelligence of LLMs. By grounding reward signals in verifiable emotion scores from sophisticated user simulators, the framework achieves balanced growth across empathetic competencies while still addressing practical dialogue requirements. Future research could explore further integration of multimodal affective cues and adaptive persona switching to achieve holistic conversational intelligence.
